AdaDelta: Same as Adagrad, but the running average is limited to a fixed window.
Adagrad: This method adapts the learning rate to each parameter: large updates for infrequent parameters, small updates for frequent ones.
Adam: Same as AdaDelta and RMSProp, it stores a running average of past gradients and uses it as the moment.
Adamax: Same as Adam, but uses the infinity norm for the running average of past gradients.
AMSGrad: Similar to Adam, but keeps the maximum of the past gradient averages to avoid converging to poor local optima.
Bagging: An ensemble ML technique to reduce overfitting and variance.
Batch gradient descent: Performs gradient descent using the entire dataset for every update.
Bias: The amount of assumptions made by the ML algorithm in order to learn the target function from the training data.
CART: Short for Classification And Regression Tree, an umbrella term for both classification and regression decision trees.
Cholesky decomposition: A factorization technique for Hermitian positive-definite matrices.
Classification tree: A special case of decision tree where the outcome is the class of the input data.
Conjugate transpose: The conjugate transpose of a complex matrix M is obtained by taking the transpose of M, followed by the complex conjugate of every element.
Covariance: A measure of the joint variability of two random variables.
Decision tree: A flowchart in which each node performs a condition check on an attribute of the input data. The check is then performed recursively on the non-leaf child nodes until a leaf node is reached.
ELBO: Short for Evidence Lower BOund, an objective for optimization in Variational Bayes.
Factor Analysis: A way of describing observed (and possibly correlated) input variables in terms of a smaller number of unobserved, or latent, variables.
GBDT (gradient boosted decision trees): A popular gradient boosting method in which the weak learners are decision trees.
Givens rotation: The product of a Givens rotation matrix with a vector represents a counter-clockwise rotation in a given plane by a given angle in degrees/radians.
Gradient boosting: An ML technique to build a model using an ensemble of weak learners.
Gradient descent: Gradient descent methods are, in general, iterative and stochastic approximations for minimizing a function of interest, called the objective function or cost function in machine learning. Minimizing this function yields a "learned" model.
Hermitian matrix: A matrix is said to be Hermitian if it is equal to its conjugate transpose.
Hyper-parameter tuning: The process of figuring out the optimal set of hyper-parameters for the given dataset, model, and learning algorithm.
Hyper-parameters: The parameters that define the model itself and how it is learnt from data.
Instance segmentation: A segmentation problem where one has to identify individual objects, even if they overlap or are of the same "type" of object.
Lasso: A regression analysis technique that performs both variable selection and model regularization.
Latent variables: Variables that are not directly observed but are inferred using some sort of mathematical model.
Learning rate: A hyper-parameter that controls the step size of gradient-based updates to the model parameters.
One of the layers used in deep networks.
LU decomposition: A factorization technique for square matrices.
MCMC (Markov Chain Monte Carlo): MCMC methods help us estimate posterior distributions by sampling from a complicated probability distribution. It is a non-parametric approach to estimating the posterior, and with enough iterations it can converge to the expected distribution. As the name suggests, it consists of two parts: Monte Carlo and Markov Chain.
Matrix-free methods: Methods for solving linear systems or eigenvalue problems that do not explicitly store the matrix coefficients. Instead, the coefficients are typically applied on the fly through a gemv (matrix-vector product) operation; a short sketch is given below.
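To make the gemv idea concrete, here is a minimal Python sketch, not taken from any particular library: it solves a linear system with a conjugate-gradient loop (chosen here only as a representative iterative solver), and the matrix, a 1-D Laplacian, is never stored; the hypothetical `apply_laplacian` routine computes its action on a vector directly.

```python
import numpy as np

def apply_laplacian(x):
    """Matrix-free 'gemv': returns A @ x for the 1-D Laplacian
    A = tridiag(-1, 2, -1), without ever forming A."""
    y = 2.0 * x
    y[:-1] -= x[1:]   # superdiagonal contribution
    y[1:] -= x[:-1]   # subdiagonal contribution
    return y

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b using only the action of A on vectors."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

b = np.ones(50)
x = conjugate_gradient(apply_laplacian, b)
print(np.allclose(apply_laplacian(x), b))  # True
```

Any Krylov-type solver that only needs matrix-vector products can be driven this way; the matrix itself never has to fit in memory.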
Metropolis-Hastings (MH): MH is one way of doing MCMC.
Mini-batch gradient descent: Performs gradient descent on a smaller subset (mini-batch) of samples from the dataset.
Momentum: To mitigate gradient descent's tendency to converge to poor local optima, one can use momentum.
Nadam: Combines NAG and Adam.
NAG (Nesterov accelerated gradient): Momentum-based gradient descent performs big jumps due to the accumulated momentum. To avoid overshooting, NAG first looks ahead along the accumulated momentum, computes the gradient there, and then makes the jump.
Non-parametric Bayesian model: A Bayesian model with an infinite number of parameters.
Overfitting: A scenario in ML where the model has essentially memorized the data and its patterns.
Parameters: The values that are learnt from the data during the training process.
PCA (Principal Component Analysis): A statistical technique to convert a set of (possibly) correlated inputs into a set of uncorrelated variables.
Positive-definite matrix: A Hermitian matrix M is said to be positive-definite if x*Mx > 0 for all non-zero vectors x, where x* denotes the conjugate transpose of x.
Posterior probability: The posterior probability of a random event is the probability obtained after taking the relevant evidence into account.
QR decomposition: A factorization technique for square matrices.
Quasi-Newton (QN) methods: A family of optimization methods based on Newton's method that, however, do not require computing the Hessian matrix.
Random forests: A type of bagging in which every tree in the forest can be constructed in parallel by sampling the dataset with replacement.
A data pre-processing technique for input normalization.
Regression tree: A special case of decision tree where the outcome is a real number, e.g. a stock price.
RMSProp: Same as Adagrad, but the running average is limited to a fixed window. Developed around the same time as AdaDelta.
Row stochastic matrix: A probability matrix is called row stochastic if each of its rows sums to one.
Segmentation: Image segmentation is the classification of the pixels in an image based on which objects they are part of.
Semantic segmentation: A segmentation problem where one has to identify all categories of objects.
Shapley values: A way to explain the contribution of each feature value to the final prediction.
Simplex algorithm: One of the methods for solving linear programming problems.
SMO: Sequential Minimal Optimization is an iterative algorithm for efficiently solving quadratic programming (QP) problems; QP most commonly arises when training SVMs.
Stochastic gradient descent (mini-batch): A hybrid between batch gradient descent and stochastic gradient descent.
Stochastic gradient descent (SGD): Performs gradient descent on each individual sample in the dataset (a toy sketch contrasting the batch, mini-batch, and per-sample variants is given after this glossary).
SVD: Singular Value Decomposition is a matrix factorization technique.
Underfitting: A scenario in ML where the model is unable to capture the general structure of the data and its patterns.
Unitary matrix: A square matrix M is called unitary if M*M = MM* = I, where M is a real/complex square matrix, M* is the conjugate transpose of M, and I is the identity matrix.
Variance: The amount by which the target estimate changes when trained with different training data.
Variational Inference (VI): Another way of approximating the posterior distribution (or likelihood), but using a parametric distribution. It is thus computationally simpler, but errors can linger in the final approximated distribution.
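As referenced in the gradient descent entries above, the following toy numpy sketch (the synthetic data, learning rate, and function names are made up for illustration) contrasts batch, mini-batch, and per-sample (stochastic) gradient descent on a least-squares problem; the only thing that changes between the three variants is the batch size used for each update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + small noise
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

def gradient(w, Xb, yb):
    """Gradient of the mean squared error on the batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

def gradient_descent(batch_size, lr=0.05, epochs=200):
    """batch_size=len(X): batch GD; =1: stochastic GD; in between: mini-batch GD."""
    w = np.zeros(3)
    for _ in range(epochs):
        idx = rng.permutation(len(X))          # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]  # indices of the current batch
            w -= lr * gradient(w, X[b], y[b])
    return w

print(gradient_descent(batch_size=len(X)))  # batch gradient descent
print(gradient_descent(batch_size=32))      # mini-batch gradient descent
print(gradient_descent(batch_size=1))       # stochastic gradient descent
```

All three runs recover weights close to [2.0, -1.0, 0.5]; the loss, learning rate, and data are shared, so the batch size alone determines whether each update uses the whole dataset, a small subset, or a single sample.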