• AdaDelta

    Same as Adagrad, but the running average of past squared gradients is restricted to a fixed window, so the learning rate does not keep shrinking indefinitely.
  • Adagrad

    This method adapts the learning rate to each parameter: large updates for infrequently updated parameters, small updates for frequently updated ones.
  • Adam

    Same as AdaDelta and RMSProp, but it additionally stores a running average of past gradients and uses it as a momentum-like first moment.
  • Adamax

    Same as Adam, but uses the infinity norm (the maximum) of past gradients instead of the L2-based second moment.
  • AMSgrad

    Similar to Adam, but keeps the maximum of past squared gradients (instead of their exponential average) to avoid converging to suboptimal solutions.
  • bagging

    An ensemble ML technique (short for bootstrap aggregating) that reduces overfitting and variance by training models on bootstrap samples of the data and averaging their predictions.
  • Batch Gradient Descent

    This performs each gradient descent update using the entire dataset.
  • bias

    Bias is the strength of the simplifying assumptions made by the ML algorithm in order to learn a target function from training data.
  • CART

    Short for Classification And Regression Tree, an umbrella term for both classification and regression decision trees.
  • Cholesky Decomp

    Cholesky decomposition is a factorization technique for Hermitian positive-definite matrices.
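    As a sketch, for a Hermitian positive-definite matrix $A$ the factorization has the form

        $A = L L^*$

    where $L$ is a lower-triangular matrix with real, positive diagonal entries and $L^*$ is its conjugate transpose.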
  • classification tree

    A special case of decision trees where the outcome is the class of the input data.
  • conjugate transpose

    The conjugate transpose of a complex matrix M is obtained by taking the transpose of M and then the complex conjugate of every element.
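    In symbols, the conjugate transpose $M^*$ satisfies

        $(M^*)_{ij} = \overline{M_{ji}}$

    i.e., entry $(i, j)$ of $M^*$ is the complex conjugate of entry $(j, i)$ of $M$.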
  • covariance

    Covariance is a measure of the joint variability of two random variables.
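    For random variables $X$ and $Y$, it is defined as

        $\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$

    which reduces to the variance of $X$ when $Y = X$.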
  • decision trees

    A decision tree is a flowchart-like structure in which each internal node performs a condition check on an attribute of the input data. This check is then performed recursively on the child nodes, until a leaf node is reached.
  • Evidence Lower Bound

    ELBO is the objective maximized in Variational Bayes; it is a lower bound on the (log) evidence.
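    In symbols, for observed data $x$, latent variables $z$, and a variational distribution $q(z)$, the bound is

        $\log p(x) \ge E_{q(z)}\big[\log p(x, z) - \log q(z)\big] = \mathrm{ELBO}(q)$

    so maximizing the ELBO pushes $q(z)$ towards the true posterior $p(z \mid x)$.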
  • factor analysis

    Factor Analysis is a way of describing observed (and possibly correlated) input variables in terms of a smaller number of unobserved or latent variables.
  • gbt

    Popular gradient boosting method where the weak learners are decision trees.
  • Givens Rotation

    The product of a Givens rotation matrix with a vector represents a counter-clockwise rotation of that vector in a given coordinate plane, by a given angle.
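    In one common convention, a Givens rotation acts as the identity everywhere except on the $(i, j)$ coordinate plane, where it applies the $2 \times 2$ rotation

        $\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$

    rotating that plane counter-clockwise by the angle $\theta$.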
  • gradient boosting

    An ML technique that builds a strong model from an ensemble of weak learners, with each new learner fit to correct the errors of the current ensemble.
  • Gradient Descent

    Gradient descent methods are, in general, iterative (and often stochastic) approximations for minimizing a function of interest. In machine learning, this function is called the objective function or cost function, and minimizing it yields a "learned" model.
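    The basic update is $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$, where $\eta$ is the learning rate. The following is a minimal Python sketch on a toy objective $J(\theta) = (\theta - 3)^2$; the objective, learning rate, and iteration count are illustrative choices, not taken from this glossary.

        # Gradient descent on J(theta) = (theta - 3)^2, whose minimum is at theta = 3.
        def grad(theta):
            return 2.0 * (theta - 3.0)  # dJ/dtheta

        theta = 0.0   # initial parameter value
        lr = 0.1      # learning rate (step size)
        for _ in range(100):
            theta -= lr * grad(theta)   # theta <- theta - lr * dJ/dtheta

        print(theta)  # approaches 3.0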
  • Hermitian matrix

    A matrix is said to be Hermitian if it is equal to its conjugate transpose.
  • hyper-parameter optimization

    The process of figuring out the optimal set of hyper-parameters for the given dataset, model and learning algorithm.
  • hyper-parameters

    Hyper-parameters are the parameters that define the model itself and how it is learnt from data; they are set before training rather than learnt during it.
  • instance segmentation

    Instance segmentation is a segmentation problem where one has to identify individual objects, even if they are overlapping and are of the same “type” of object.
  • lasso

    Lasso is a regression analysis technique to perform both variable selection as well as model regularization.
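    Up to scaling conventions, the lasso estimate solves

        $\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$

    where the $L_1$ penalty $\lambda \lVert \beta \rVert_1$ both regularizes the model and drives some coefficients exactly to zero (variable selection).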
  • latent variable

    Latent variables are variables that are not directly observed, but are inferred from observed variables through a mathematical model.
  • learning rate

    It is a hyper-parameter that controls the step-size during gradient-based updates for the parameters of the model.
  • local response normalization

    A normalization layer used in deep (convolutional) networks; it normalizes each activation using the activations of neighbouring channels or positions.
  • LU

    LU decomposition factors a square matrix into the product of a lower-triangular matrix L and an upper-triangular matrix U.
  • Markov Chain Monte Carlo

    MCMC methods help us estimate posterior distributions by sampling from a complicated probability distribution. It is a non-parametric approach to estimating the posterior, and as such, with enough iterations it converges to the expected distribution. As the name suggests, it consists of two parts: Monte Carlo sampling and a Markov chain.
  • matrix-free methods

    Methods for solving linear systems or eigenvalue problems that do not explicitly store the matrix coefficients. Instead, the matrix is accessed implicitly, typically through a matrix-vector product (gemv) operation.
  • Metropolis-Hastings

    MH is one way of doing MCMC: it proposes a new state from a proposal distribution and accepts or rejects it based on the ratio of target densities.
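    The following is a minimal Python sketch of a random-walk Metropolis-Hastings sampler targeting an (unnormalized) standard normal density; the proposal width and iteration count are illustrative choices.

        import math
        import random

        def unnorm_target(x):
            # Unnormalized standard normal density.
            return math.exp(-0.5 * x * x)

        x = 0.0
        samples = []
        for _ in range(10000):
            proposal = x + random.gauss(0.0, 1.0)   # symmetric random-walk proposal
            accept_prob = min(1.0, unnorm_target(proposal) / unnorm_target(x))
            if random.random() < accept_prob:
                x = proposal                        # accept; otherwise keep the current state
            samples.append(x)

        print(sum(samples) / len(samples))  # close to 0 for a standard normal target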
  • Minibatch Gradient Descent

    This performs each gradient descent update on a small subset (mini-batch) of samples from the dataset.
  • Momentum Gradient Descent

    To mitigate gradient descent's tendency to get stuck in local optima and to dampen oscillations, one can add a momentum term that accumulates a fraction of past updates.
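    One common formulation of the momentum update is

        $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - v_t$

    where $\gamma$ (often around 0.9) controls how much of the past update is carried forward.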
  • Nadam

    Combines NAG and Adam.
  • Nesterov Accelerated Gradient

    Momentum gradient descent can overshoot because of its accumulated momentum. NAG mitigates this by first making the big momentum jump, then evaluating the gradient at that look-ahead position and applying a correction.
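    Compared with plain momentum, NAG evaluates the gradient at the look-ahead position:

        $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta_{t-1} - \gamma v_{t-1}), \qquad \theta_t = \theta_{t-1} - v_t$

    so the correction is computed after the momentum step rather than before it.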
  • Non-parametric Bayesian

    A non-parametric Bayesian model is a Bayesian model with an infinite (or unbounded) number of parameters.
  • overfitting

    Overfitting is a scenario in ML where the model has essentially memorized the training data, including its noise, and therefore generalizes poorly to unseen data.
  • parameters

    Parameters are the quantities that are learnt from the data during the training process.
  • pca

    A statistical technique to convert a set of (possibly) correlated inputs into a set of uncorrelated variables called principal components.
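    As a sketch, for a centered data matrix $X$ with $n$ samples, the principal components are the eigenvectors of the sample covariance matrix

        $C = \frac{1}{n - 1} X^\top X$

    and projecting onto the top $k$ eigenvectors $W_k$ gives the uncorrelated, reduced representation $Z = X W_k$.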
  • Positive-definite matrix

    A Hermitian matrix $M$ is said to be positive-definite if $x^* M x > 0$ for all non-zero vectors $x$.
  • Posterior Probability

    The posterior probability of a random event is the resulting probability after taking the relevant evidence into account.
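    By Bayes' rule, for parameters $\theta$ and observed data $D$,

        $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$

    where $P(\theta)$ is the prior, $P(D \mid \theta)$ the likelihood, and $P(\theta \mid D)$ the posterior.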
  • QR

    QR decomposition factors a matrix into the product of an orthogonal (or unitary) matrix Q and an upper-triangular matrix R.
  • Quasi Newton Method

    QN methods are a family of optimization methods based on Newton's method. However, they do not require computing the Hessian matrix explicitly; instead, they build up an approximation to it (or to its inverse) from successive gradient evaluations.
  • random forests

    Random forests are a type of bagging in which every tree in the forest is built (potentially in parallel) on a bootstrap sample of the dataset, typically also considering only a random subset of features at each split.
  • RankGauss

    A data pre-processing technique for input normalization that maps the ranked values of a feature onto a Gaussian distribution.
  • regression tree

    A special case of decision trees where the outcome is a real number, e.g., a stock price.
  • RMSProp

    Same as Adagrad, but the running average of past squared gradients is restricted to a fixed window. Developed around the same time as AdaDelta.
  • Row Stochastic Matrix

    A probability matrix is called row stochastic if each of its rows sums to one.
  • segmentation

    Image segmentation is the classification of the pixels in an image according to the objects they are part of.
  • semantic segmentation

    Semantic segmentation is a segmentation problem where every pixel is labelled with the category of object it belongs to, without distinguishing between individual instances of the same category.
  • shapley values

    Shapley Values are a way to explain the contribution of each feature value to the final prediction.
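    In the underlying game-theoretic formulation (with features playing the role of the players $N$, and $v(S)$ the model's value for a feature subset $S$), the Shapley value of feature $i$ is

        $\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}\, \big(v(S \cup \{i\}) - v(S)\big)$

    i.e., the average marginal contribution of the feature over all orders in which it can be added.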
  • Simplex Algorithm

    Simplex algorithm is one of the methods to solve linear programming problems.
  • SMO

    SMO (Sequential Minimal Optimization) is an iterative algorithm for efficiently solving quadratic programming (QP) problems; such a QP most commonly arises when training SVMs.
  • Stochastic Average Gradient Descent

    This is a hybrid between batch and stochastic gradient descent: it keeps a memory of the most recently computed gradient for each sample and updates the parameters using the average of these stored gradients, refreshing only one sample's gradient per iteration.
  • Stochastic Gradient Descent

    This performs a gradient descent update for each individual sample in the dataset.
  • svd

    SVD (singular value decomposition) factors any matrix $M$ as $M = U \Sigma V^*$, where $U$ and $V$ are unitary and $\Sigma$ is a diagonal matrix of non-negative singular values.
  • underfitting

    Underfitting is a scenario in ML where the model is unable to capture the general structure of the data and its patterns.
  • unitary matrix

    A square matrix M is called unitary if $M^* M = M M^* = I$, where:
    • M = a real/complex square matrix
    • M* = conjugate transpose of M
    • I = identity matrix
  • variance

    Variance is the amount by which the model's estimate of the target function would change if it were trained on different training data.
  • Variational Inference

    VI is another way of approximating the posterior distribution (or likelihood), but using a parametric family of distributions. It is thus computationally simpler, but some approximation error remains in the final distribution.
  • weights

    Weights are the learnable parameters of a model, e.g., the coefficients of a linear model or the connection strengths in a neural network layer.