AdaDelta: Same as Adagrad, but the running average is limited to a fixed window.
Adagrad: This method adapts the learning rate to each parameter: large updates for infrequent parameters, small updates for frequent ones.
Adam: Same as AdaDelta and RMSProp, it stores a running average of past gradients and uses it as the moment.
Adamax: Same as Adam, but uses the infinity norm for the running average of past gradients.
AMSGrad: Similar to Adam, but keeps the maximum of the past gradient averages to avoid converging to poor local optima.
Bagging: An ensemble ML technique to reduce overfitting and variance.
Batch gradient descent: Performs gradient descent using the entire dataset for every update.
Bias: The amount of assumptions made by the ML algorithm in order to learn the target function from the training data.
CART: Short for Classification And Regression Tree, an umbrella term for both classification and regression decision trees.
Cholesky decomposition: A factorization technique for Hermitian positive-definite matrices.
Classification tree: A special case of decision tree where the outcome is the class of the input data.
Conjugate transpose: The conjugate transpose of a complex matrix M is obtained by taking the transpose of M, followed by the complex conjugate of every element.
Covariance: A measure of the joint variability of two random variables.
Decision tree: A flowchart in which each node performs a condition check on an attribute of the input data. The check is then performed recursively on the non-leaf child nodes until a leaf node is reached.
ELBO: Short for Evidence Lower BOund, an objective for optimization in Variational Bayes.
Factor Analysis: A way of describing observed (and possibly correlated) input variables in terms of a smaller number of unobserved, or latent, variables.
GBDT (gradient boosted decision trees): A popular gradient boosting method in which the weak learners are decision trees.
Givens rotation: The product of a Givens rotation matrix with a vector represents a counter-clockwise rotation in a given plane by a given angle in degrees/radians.
Gradient boosting: An ML technique to build a model using an ensemble of weak learners.
Gradient descent: Gradient descent methods are, in general, iterative and stochastic approximations for minimizing a function of interest, called the objective function or cost function in machine learning. Minimizing this function yields a "learned" model.
Hermitian matrix: A matrix is said to be Hermitian if it is equal to its conjugate transpose.
Hyper-parameter tuning: The process of figuring out the optimal set of hyper-parameters for the given dataset, model, and learning algorithm.
Hyper-parameters: The parameters that define the model itself and how it is learnt from data.
Instance segmentation: A segmentation problem where one has to identify individual objects, even if they overlap or are of the same "type" of object.
Lasso: A regression analysis technique that performs both variable selection and model regularization.
Latent variables: Variables that are not directly observed but are inferred using some sort of mathematical model.
Learning rate: A hyper-parameter that controls the step size of gradient-based updates to the model parameters.
One of the layers used in deep networks.
LU decomposition: A factorization technique for square matrices.
MCMC (Markov Chain Monte Carlo): MCMC methods help us estimate posterior distributions by sampling from a complicated probability distribution. It is a non-parametric approach to estimating the posterior, and with enough iterations it can converge to the expected distribution. As the name suggests, it consists of two parts: Monte Carlo and Markov Chain.
Matrix-free methods: Methods for solving linear systems or eigenvalue problems that do not explicitly store the matrix coefficients. Instead, the coefficients are typically applied on the fly through a gemv (matrix-vector product) operation; a short sketch is given below.
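To make the gemv idea concrete, here is a minimal Python sketch, not taken from any particular library: it solves a linear system with a conjugate-gradient loop (chosen here only as a representative iterative solver), and the matrix, a 1-D Laplacian, is never stored; the hypothetical `apply_laplacian` routine computes its action on a vector directly.

```python
import numpy as np

def apply_laplacian(x):
    """Matrix-free 'gemv': returns A @ x for the 1-D Laplacian
    A = tridiag(-1, 2, -1), without ever forming A."""
    y = 2.0 * x
    y[:-1] -= x[1:]   # superdiagonal contribution
    y[1:] -= x[:-1]   # subdiagonal contribution
    return y

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b using only the action of A on vectors."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

b = np.ones(50)
x = conjugate_gradient(apply_laplacian, b)
print(np.allclose(apply_laplacian(x), b))  # True
```

Any Krylov-type solver that only needs matrix-vector products can be driven this way; the matrix itself never has to fit in memory.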
Metropolis-Hastings (MH): MH is one way of doing MCMC.
Mini-batch gradient descent: Performs gradient descent on a smaller subset (mini-batch) of samples from the dataset.
Momentum: To mitigate gradient descent's tendency to converge to poor local optima, one can use momentum.
Nadam: Combines NAG and Adam.
NAG (Nesterov accelerated gradient): Momentum-based gradient descent performs big jumps due to the accumulated momentum. To avoid overshooting, NAG first looks ahead along the accumulated momentum, computes the gradient there, and then makes the jump.
Non-parametric Bayesian model: A Bayesian model with an infinite number of parameters.
Overfitting: A scenario in ML where the model has essentially memorized the data and its patterns.
Parameters: The values that are learnt from the data during the training process.
PCA (Principal Component Analysis): A statistical technique to convert a set of (possibly) correlated inputs into a set of uncorrelated variables.
Positive-definite matrix: A Hermitian matrix M is said to be positive-definite if x*Mx > 0 for all non-zero vectors x, where x* denotes the conjugate transpose of x.
Posterior probability: The posterior probability of a random event is the probability obtained after taking the relevant evidence into account.
QR decomposition: A factorization technique for square matrices.
Quasi-Newton (QN) methods: A family of optimization methods based on Newton's method that, however, do not require computing the Hessian matrix.
Random forests: A type of bagging in which every tree in the forest can be constructed in parallel by sampling the dataset with replacement.
A data pre-processing technique for input normalization.
Regression tree: A special case of decision tree where the outcome is a real number, e.g. a stock price.
RMSProp: Same as Adagrad, but the running average is limited to a fixed window. Developed around the same time as AdaDelta.
Row stochastic matrix: A probability matrix is called row stochastic if each of its rows sums to one.
Segmentation: Image segmentation is the classification of the pixels in an image based on which objects they are part of.
Semantic segmentation: A segmentation problem where one has to identify all categories of objects.
Shapley values: A way to explain the contribution of each feature value to the final prediction.
Simplex algorithm: One of the methods for solving linear programming problems.
SMO: Sequential Minimal Optimization is an iterative algorithm for efficiently solving quadratic programming (QP) problems; QP most commonly arises when training SVMs.
Stochastic gradient descent (mini-batch): A hybrid between batch gradient descent and stochastic gradient descent.
Stochastic gradient descent (SGD): Performs gradient descent on each individual sample in the dataset (a toy sketch contrasting the batch, mini-batch, and per-sample variants is given after this glossary).
SVD: Singular Value Decomposition is a matrix factorization technique.
Underfitting: A scenario in ML where the model is unable to capture the general structure of the data and its patterns.
Unitary matrix: A square matrix M is called unitary if M*M = MM* = I, where M is a real/complex square matrix, M* is the conjugate transpose of M, and I is the identity matrix.
Variance: The amount by which the target estimate changes when trained with different training data.
Variational Inference (VI): Another way of approximating the posterior distribution (or likelihood), but using a parametric distribution. It is thus computationally simpler, but errors can linger in the final approximated distribution.
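As referenced in the gradient descent entries above, the following toy numpy sketch (the synthetic data, learning rate, and function names are made up for illustration) contrasts batch, mini-batch, and per-sample (stochastic) gradient descent on a least-squares problem; the only thing that changes between the three variants is the batch size used for each update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + small noise
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

def gradient(w, Xb, yb):
    """Gradient of the mean squared error on the batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

def gradient_descent(batch_size, lr=0.05, epochs=200):
    """batch_size=len(X): batch GD; =1: stochastic GD; in between: mini-batch GD."""
    w = np.zeros(3)
    for _ in range(epochs):
        idx = rng.permutation(len(X))          # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]  # indices of the current batch
            w -= lr * gradient(w, X[b], y[b])
    return w

print(gradient_descent(batch_size=len(X)))  # batch gradient descent
print(gradient_descent(batch_size=32))      # mini-batch gradient descent
print(gradient_descent(batch_size=1))       # stochastic gradient descent
```

All three runs recover weights close to [2.0, -1.0, 0.5]; the loss, learning rate, and data are shared, so the batch size alone determines whether each update uses the whole dataset, a small subset, or a single sample.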