This performs a gradient descent update on each individual sample in the dataset, rather than on the full batch:

    θ = θ − η · ∇_θ J(θ; x^(i), y^(i))

Where:

  1. (x^(i), y^(i)) = the i’th input sample and its label in the dataset
  2. all other symbols and parameters are as defined in (batch) gradient descent
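The per-sample update above can be sketched as follows. This is a minimal illustration on synthetic linear-regression data; the specific names (`lr`, `n_epochs`), the decay factor, and the squared-loss objective are assumptions for the example, not taken from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=200)

theta = np.zeros(2)  # parameters: [slope, intercept]
lr = 0.1             # learning rate (eta)
n_epochs = 20

for epoch in range(n_epochs):
    # Shuffle so each epoch visits the samples in a fresh random order
    order = rng.permutation(len(X))
    for i in order:
        x_i, y_i = X[i, 0], y[i]
        # Gradient of the squared loss on this single sample (x_i, y_i)
        err = theta[0] * x_i + theta[1] - y_i
        grad = np.array([err * x_i, err])
        # The SGD step: theta = theta - eta * grad J(theta; x_i, y_i)
        theta -= lr * grad
    lr *= 0.95  # anneal the learning rate over time

print(theta)  # should approach [2.0, 1.0]
```

Note that each step uses the gradient of the loss on one sample only; the learning-rate decay at the end of each epoch is the annealing mentioned under the convergence point below.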

Pros:

  1. Fast: each update uses only a single sample, so individual updates are cheap
  2. Offers convergence guarantees similar to batch gradient descent, provided the learning rate is gradually annealed (decayed) over time
  3. Supports online learning, since parameters can be updated as new samples arrive

Cons:

  1. Produces high-variance parameter updates, since each step is based on a single sample; the objective can fluctuate heavily from step to step