This performs gradient descent on a smaller subset of samples (a mini-batch) from the dataset at each update:

$$\theta = \theta - \eta \cdot \nabla_\theta J\left(\theta;\; x^{(i:i+n)},\; y^{(i:i+n)}\right)$$

Where:

  1. $n$ = batch size
  2. $x^{(i:i+n)}, y^{(i:i+n)}$ = the i'th mini-batch of inputs and targets from the dataset
  3. all other equations and parameters are as explained in gradient descent
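
A minimal NumPy sketch of this update rule, assuming a linear model with squared-error loss; the function name, the learning rate `lr`, and the default values are illustrative, not taken from the original.

```python
import numpy as np

def minibatch_gradient_descent(X, y, n=32, lr=0.01, n_epochs=100):
    """Mini-batch gradient descent for a linear least-squares model.

    n is the batch size; theta is updated once per mini-batch using the
    gradient of the squared error computed on that batch only.
    """
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Reshuffle each epoch so successive mini-batches differ.
        order = rng.permutation(len(X))
        for start in range(0, len(X), n):
            idx = order[start:start + n]
            X_b, y_b = X[idx], y[idx]
            # Gradient of 0.5 * mean((X_b @ theta - y_b) ** 2) w.r.t. theta
            grad = X_b.T @ (X_b @ theta - y_b) / len(idx)
            theta -= lr * grad
    return theta
```

With `n = len(X)` this reduces to batch gradient descent, and with `n = 1` to stochastic gradient descent.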

Pros:

  1. Reduces the high-variance updates seen in stochastic gradient descent
  2. Can convert some level-2 BLAS operations (matrix-vector) into level-3 (matrix-matrix), which are executed much more efficiently (see the sketch after this list)
  3. Similar convergence guarantees to batch gradient descent
  4. Supports online learning
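
As a small illustration of the BLAS point in item 2, processing a mini-batch turns many matrix-vector products (level-2, `gemv`) into a single matrix-matrix product (level-3, `gemm`); the layer and batch shapes below are arbitrary examples, not from the original.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))        # weights of a linear layer
X_batch = rng.standard_normal((64, 512))   # mini-batch of 64 inputs

# One sample at a time: 64 matrix-vector products (level-2 BLAS, gemv).
out_loop = np.stack([x @ W for x in X_batch])

# Whole mini-batch at once: one matrix-matrix product (level-3 BLAS, gemm).
out_batched = X_batch @ W

assert np.allclose(out_loop, out_batched)
```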

Cons:

  1. The batch size $n$ is now an additional hyperparameter to be tuned for effect, e.g. by adjusting it over the course of training.

Mini-batch gradient descent is one of the most commonly used minimization procedures in machine learning.