This performs gradient descent on a smaller subset of samples (a mini-batch) from the dataset at each update:

$$\theta = \theta - \eta \cdot \nabla_\theta J\left(\theta;\; x^{(i:i+n)},\; y^{(i:i+n)}\right)$$

Where:

  1. $n$ = batch size
  2. $x^{(i:i+n)}, y^{(i:i+n)}$ = the i'th mini-batch of inputs and targets from the dataset
  3. all other equations and parameters are as explained in gradient descent
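
A minimal NumPy sketch of this update rule, assuming a linear model with squared-error loss; the function name, the learning rate `lr`, and the default values are illustrative, not taken from the original.

```python
import numpy as np

def minibatch_gradient_descent(X, y, n=32, lr=0.01, n_epochs=100):
    """Mini-batch gradient descent for a linear least-squares model.

    n is the batch size; theta is updated once per mini-batch using the
    gradient of the squared error computed on that batch only.
    """
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Reshuffle each epoch so successive mini-batches differ.
        order = rng.permutation(len(X))
        for start in range(0, len(X), n):
            idx = order[start:start + n]
            X_b, y_b = X[idx], y[idx]
            # Gradient of 0.5 * mean((X_b @ theta - y_b) ** 2) w.r.t. theta
            grad = X_b.T @ (X_b @ theta - y_b) / len(idx)
            theta -= lr * grad
    return theta
```

With `n = len(X)` this reduces to batch gradient descent, and with `n = 1` to stochastic gradient descent.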

Pros:

  1. Reduces the high-variance updates seen in stochastic gradient descent
  2. Can convert some level-2 BLAS operations (matrix-vector) into level-3 (matrix-matrix), which are executed much more efficiently (see the sketch after this list)
  3. Similar convergence guarantees to batch gradient descent
  4. Supports online learning
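
As a small illustration of the BLAS point in item 2, processing a mini-batch turns many matrix-vector products (level-2, `gemv`) into a single matrix-matrix product (level-3, `gemm`); the layer and batch shapes below are arbitrary examples, not from the original.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))        # weights of a linear layer
X_batch = rng.standard_normal((64, 512))   # mini-batch of 64 inputs

# One sample at a time: 64 matrix-vector products (level-2 BLAS, gemv).
out_loop = np.stack([x @ W for x in X_batch])

# Whole mini-batch at once: one matrix-matrix product (level-3 BLAS, gemm).
out_batched = X_batch @ W

assert np.allclose(out_loop, out_batched)
```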

Cons:

  1. The batch size $n$ is now an additional hyperparameter to be tuned for effect, e.g. by adjusting it over the course of training.

Mini-batch gradient descent is one of the most commonly used minimization procedures in machine learning.