This performs a gradient descent update on each individual sample in the dataset, rather than on the full batch:

    θ = θ − η · ∇_θ J(θ; x^(i), y^(i))

Where:

  1. (x^(i), y^(i)) = the i’th input sample and its label in the dataset
  2. all other symbols and parameters are as defined in (batch) gradient descent
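The per-sample update above can be sketched as follows. This is a minimal illustration on synthetic linear-regression data; the specific names (`lr`, `n_epochs`), the decay factor, and the squared-loss objective are assumptions for the example, not taken from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=200)

theta = np.zeros(2)  # parameters: [slope, intercept]
lr = 0.1             # learning rate (eta)
n_epochs = 20

for epoch in range(n_epochs):
    # Shuffle so each epoch visits the samples in a fresh random order
    order = rng.permutation(len(X))
    for i in order:
        x_i, y_i = X[i, 0], y[i]
        # Gradient of the squared loss on this single sample (x_i, y_i)
        err = theta[0] * x_i + theta[1] - y_i
        grad = np.array([err * x_i, err])
        # The SGD step: theta = theta - eta * grad J(theta; x_i, y_i)
        theta -= lr * grad
    lr *= 0.95  # anneal the learning rate over time

print(theta)  # should approach [2.0, 1.0]
```

Note that each step uses the gradient of the loss on one sample only; the learning-rate decay at the end of each epoch is the annealing mentioned under the convergence point below.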

Pros:

  1. Fast: each update uses only a single sample, so individual updates are cheap
  2. Offers convergence guarantees similar to batch gradient descent, provided the learning rate is gradually annealed (decayed) over time
  3. Supports online learning, since parameters can be updated as new samples arrive

Cons:

  1. Produces high-variance parameter updates, since each step is based on a single sample; the objective can fluctuate heavily from step to step