Machine Learning Coursera notes
My notes while taking the ML course from Dr. Andrew Ng on coursera.
Week1
- ML grew out of work in AI
- ML is also used where applications can't be programmed by hand
- NLP, computer vision, etc.
- self-customizing programs
- understanding human learning
- most common ML algos: supervised and unsupervised
- other algos: reinforcement learning, recommender systems, etc
- supervised learning:
- regression = predict continuous output variable
- classification = predict/classify discrete-valued output
- unsupervised learning:
- clustering of unlabeled data
- supervised learning:
- training set = data-set (labelled)
- m = num of training examples
- x = 'input' variable/features
- y = 'output' variable/ 'target' variable
- $$x_i$$, $$y_i$$ - ith training sample
- its flow:
- training-set --> learning algo --> h (hypothesis)
- test/real-world-input --> h --> prediction/classification
- univariate linear regression:
- $$h(x) = t_0 + t_1 x$$
- cost function: $$J(t_0,t_1) = \frac{1}{2m}\sum_{i=1}^{m}(h(x_i) - y_i)^2$$
- $$min_{(t_0,t_1)} J(t_0,t_1)$$
- also called squared-error function
- most commonly used cost function
- gives a nice quadratic surface to minimize
- gradient descent
- start with some initial value
- keep following down the gradient and hopefully will reach minimum
- local minima can trip us here!
- repeat until convergence:
- $$t_i := t_i - \alpha \frac{\partial J(t_0,t_1)}{\partial t_i}$$
- for every parameter $$t_i$$
- these updates are simultaneous for all params!
- $$\alpha$$ = learning rate (controls the step size)
- convex cost functions have a single global optimum, so gradient descent (with a suitable alpha) will converge to it
- the above algo is 'batch' gradient descent
- meaning at each step, we use all of the training samples (see the sketch below)
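A minimal sketch of the batch gradient descent loop above for univariate linear regression (NumPy; all names are my own, not from the course code):

```python
import numpy as np

def batch_gd(x, y, alpha=0.01, iters=1000):
    """Univariate linear regression h(x) = t0 + t1*x via batch GD."""
    m = len(y)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        h = t0 + t1 * x                      # predictions on ALL m samples
        # simultaneous update: both gradients use the same parameter snapshot
        grad0 = (h - y).sum() / m
        grad1 = ((h - y) * x).sum() / m
        t0 -= alpha * grad0
        t1 -= alpha * grad1
    return t0, t1
```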
- linear algebra: notation and set of ops using matrices and vectors
Week2
- multi-variate linear regression
- n = number of features
- x(i) = input features of ith training sample
- x(i)(j) = feature j value in ith training sample
- h(x) = sum(ti * xi), i=0:n
- vectorized version: h(x) = T' * x (x0 is always 1!)
- gradient descent for multivariate linear regression
- J(T) = sum((X*T - Y).^2) / (2*m)
- T = n+1 x 1
- X = m x n+1
- Y = m x 1
- update: T = T - alpha * dJ(T)/dT
- T = T - alpha * ((X*T - Y)' * X)' / m
- such an X is called the design matrix (cost and update sketched below)
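A sketch of the vectorized cost and update, assuming the shapes listed above (X is the m x (n+1) design matrix with a leading column of ones; names are mine):

```python
import numpy as np

def cost_and_step(X, Y, T, alpha):
    """One vectorized gradient descent step for multivariate linear regression."""
    m = X.shape[0]
    err = X @ T - Y                      # m x 1 residuals (h(X) - Y)
    J = (err ** 2).sum() / (2 * m)       # squared-error cost J(T)
    T = T - alpha * (X.T @ err) / m      # T = T - alpha * dJ/dT
    return J, T
```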
- gradient descent - feature scaling
- make sure features are on a similar scale
- convergence is quicker this way!
- otherwise the contour plot of the cost function has incredibly skewed ellipses, and gradient descent takes quite a long time to converge
- generally, get the feature into [-1, 1] range
- Can this be the reason we have sigmoid functions in neural nets!??
- mean normalization brings features to roughly the [-0.5, 0.5] range with mean=0
- (xi - meani) / range
- mean can be the mean of that feature in the training set
- range can be (max - min) or the standard deviation (see the sketch below)
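A sketch of mean normalization as described (my own helper; using the range here, but the standard deviation works too):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature (column) to roughly [-0.5, 0.5] with zero mean."""
    mu = X.mean(axis=0)
    rng = X.max(axis=0) - X.min(axis=0)   # or X.std(axis=0)
    return (X - mu) / rng, mu, rng        # keep mu/rng to scale future inputs
```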
- gradient descent - learning rate
- plot J(T) over iterations, should help debug the learning algo
- J(T) should decrease after every iteration
- this will also tell us when to stop running more iterations
- one other approach is to monitor the changes in J(T) over the iterations and stop when the decrease is 'insignificant'
- but better to not automate this and use plots to confirm
- if J(T) is increasing over iterations, choose a smaller alpha
- but if alpha is too small, convergence might be slower!
- depending on insight into the problem, defining new features based on the training set might give a better model
- looking at the data, we could also decide to use a polynomial model (quadratic, cubic, and so on)
- for such models, the feature scaling becomes increasingly important!
- normal equation:
- solve for T analytically, instead of iterative gradient descent
- T = pinv(X' * X) * X' * Y
- for this method, feature scaling is not required
- thus doesn't need to iterate and no need to choose alpha
- becomes very slow for large values of 'n'
- normal equation non-invertibility
- what if (X' * X) is non-invertible (singular/degenerate)?
- pretty rare case
- however, as long as we use "pinv" for inverse of X'*X, we are OK
- causes for it to be singular:
- redundant features (linearly dependent)
- too many features (m <= n)
- Better to delete some features or use regularization (pinv-based sketch below)
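A sketch of the normal equation (np.linalg.pinv plays the role of Octave's pinv, so the singular X'X case is handled too):

```python
import numpy as np

def normal_equation(X, Y):
    """Closed-form solution: T = pinv(X' * X) * X' * Y. No alpha, no
    iterations, but the (n+1) x (n+1) inverse gets slow for large n."""
    return np.linalg.pinv(X.T @ X) @ X.T @ Y
```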
Week3
- Binary classification: y in {0,1} 0=negative-class, 1=positive-class
- There's multi-class classification problem as well.
- threshold classifier:
- fit linear regression and then apply a threshold on its output
- but this is very sensitive to the position of the inputs!
- example: think about what happens if there's an outlier far away from the current set of points
- output needs to be either 0 or 1, but linear regression outputs continuous values
- that's where logistic regression is applied for classification problem
- hypothesis representation for logistic regression
- want h(x) in [0,1]
- h(x) = g(theta' * x)
- g(z) = 1 / (1 + e^-z)
- g = logistic function or sigmoid function
- one interpretation of the above h(x) is probability that y=1 on input x
- ie. h(x) = P(y=1|x;theta)
- this basically means y=1 whenever theta' * x >= 0, and y=0 otherwise
- this is the decision boundary for logistic regression hypothesis
- in this particular case this boundary is linear
- if we need a non-linear decision boundary, then use non-linear terms in the logistic regression equation (similar to what was discussed in linear regression lecture)
- if we use the quadratic cost function from the linear regression model, it now becomes non-convex (because of the sigmoid)
- cost function for logistic regression:
- cost(h(x),y) = -log(h(x)) if y=1, -log(1-h(x)) if y=0
- intuition is that for correct classification there is no penalty and for incorrect, there's a very large cost
- this gives us a convex function with only one (global) minimum
- simplified cost function for the above:
- cost(h(x),y) = -y*log(h(x)) - (1-y)*log(1-h(x))
- gradient descent on this will now be:
- T = T - alpha * ((h(X) - Y)' * X)' / m
- looks identical to the linear regression update (but h is now the sigmoid)!
- feature scaling also works here (cost and gradient sketched below).
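A sketch of the logistic regression cost and gradient above (unregularized; names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_grad(T, X, Y):
    """Convex cost: mean of -y*log(h) - (1-y)*log(1-h), plus its gradient."""
    m = X.shape[0]
    h = sigmoid(X @ T)
    J = (-Y * np.log(h) - (1 - Y) * np.log(1 - h)).sum() / m
    grad = X.T @ (h - Y) / m   # same form as the linear regression gradient
    return J, grad
```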
- optimization algos:
- gradient descent
- conjugate gradient
- BFGS
- LBFGS
- these only require J(theta) and d(J(theta))/d(theta)
- advantages of the other 3 algos
- no need to manually pick alpha
- faster than gradient descent in terms of convergence
- disadvantages of these:
- more complex
- use 'fminunc' (function minimization unconstrained) in Octave for optimization using these different optimization techniques (a rough Python analogue is sketched below)
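'fminunc' is Octave/MATLAB; a rough Python analogue is scipy.optimize.minimize, sketched here reusing the hypothetical logistic_cost_grad helper from above:

```python
import numpy as np
from scipy.optimize import minimize

# X: m x (n+1) design matrix, Y: length-m 0/1 labels (assumed defined)
res = minimize(logistic_cost_grad, x0=np.zeros(X.shape[1]), args=(X, Y),
               jac=True,           # our function returns (cost, gradient)
               method="L-BFGS-B")  # or "BFGS", or "CG" (conjugate gradient)
theta = res.x                      # no alpha to pick manually
```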
- logistic regression on multi-class classification:
- one-vs-all classification (one-vs-rest)
- group all other classes into a single class, against the current class
- apply logistic regression for binary classification
- repeat this for all classes
- pick the class that has the highest probability (max value of h(x)); a sketch follows
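A sketch of one-vs-all prediction, assuming one theta vector per class was already trained as described (sigmoid from the earlier sketch):

```python
import numpy as np

def one_vs_all_predict(X, thetas):
    """thetas: list of per-class parameter vectors (class i trained with
    y=1 for class i, y=0 for all the rest)."""
    scores = np.column_stack([sigmoid(X @ t) for t in thetas])
    return scores.argmax(axis=1)   # class with the highest h(x) wins
```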
- Regularization (problem of overfitting)
- underfitting (high bias)
- overfitting (high variance)
- overfitting - if the learned hypothesis has too many features, it may fit the training set very well, but fail to generalize to new examples
- lots of features and very little training data can also lead to this
- to address overfitting:
- reduce number of features
- manually select which features to keep
- model selection algo
- regularization
- keep all features, but reduce the magnitude/values of params
- works well when there are a lot of features
- having small values for params
- 'simpler' hypothesis
- less prone to overfitting
- J(theta) = (sum((h(x)-y)^2) + lambda * T(1:)' * T(1:)) / (2*m)
- theta(0) is not part of the above sum!
- lambda - regularization param
- controls the trade-off between fitting the training data well and keeping the params small
- if it is large, then we will see underfitting
- the gradient for the descent also changes according to this regularizer
- for normal equation with regularization
- M = eye(n+1, n+1) (calling it M to avoid clashing with m = num of examples)
- M[0,0] = 0
- T = pinv((X' * X) + (lambda * M)) * X' * Y
- as long as lambda > 0, the above matrix is invertible! (see the sketch below)
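A sketch of the regularized normal equation (M is my name for the modified identity, as above):

```python
import numpy as np

def normal_equation_reg(X, Y, lam):
    """T = pinv(X'X + lambda*M) X'Y, where M = identity with M[0,0] = 0
    so the bias term theta_0 is not regularized."""
    M = np.eye(X.shape[1])
    M[0, 0] = 0.0
    return np.linalg.pinv(X.T @ X + lam * M) @ X.T @ Y
```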
Week4
- non-linear hypotheses (neural networks representation)
- using quadratic features leads to a huge number of features
- can lead to overfitting of the data as well
- this is especially difficult with image applications!
- neural networks are algos designed to mimic brain
- neuron model and logistic unit
- bias unit
- inputs
- input wires/params/weights
- sigmoid (logistic) activation function
- neural network is now a collection of these neurons stacked into layers
- first layer is input layer
- last layer is output layer
- in between layers are hidden layers
- a[i][j] = 'activation' of unit i in layer j
- theta[j] = matrix of weights from layer j to j+1
- if net has s[j] units in layer j and s[j+1] in j+1, then theta[j] is of dimension s[j+1]x(s[j]+1). +1 for bias unit
- vectorization of this computation is basically matrix-vector multiplication
- a[i+1] = g(theta[i] * a[i]) (add the bias unit at each layer!)
- a[1] = input layer
- this process is called forward propagation
- forward propagation of input activation to output activations
- neural net learning its own features
- the hidden and output layers look like logistic regression
- each unit in a hidden layer is a new feature learned from the input!
- ANN can use this hidden layer to learn more complex features to feed to the output layer
- neural nets for multi-class classification
- by using multiple units in the output layer (forward propagation sketched below)
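A sketch of forward propagation through the stacked layers described above (sigmoid from the earlier sketch; theta shapes follow the s[j+1] x (s[j]+1) convention):

```python
import numpy as np

def forward_prop(x, thetas):
    """x: input features (no bias unit); thetas[j]: weights from layer
    j to j+1. Returns the output layer activations."""
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.concatenate(([1.0], a))   # add the bias unit at each layer
        a = sigmoid(theta @ a)           # activations of the next layer
    return a                             # multi-class: one unit per class
```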
Week5
- nomenclature
- L = total no. of layers in network
- s(l) = no. of units (w/o bias unit) in layer l
- neural net cost function will be (using NLL with L2-norm)
- NLL will be the same except that h(x) will now be the entire MLP!
- regularization will be summed across all weights on all layers
- backpropagation algo
- del[l][j] = 'error' of node j in layer l
- then propagate back these 'del' till the input layer is reached
- del[l] = (theta[l]' * del[l+1]) .* g'(z[l])
- g' is derivative of g
- theta' is matrix transpose
- dJ/d(theta[l][i][j]) = a[l][j] * del[l+1][i] + lambda * theta[l][i][j]
- gradient checking
- can catch bugs in the forward/backward prop implementation
- estimate slope using theta-eps and theta+eps
- such a two-sided difference gives a better estimate than the one-sided (theta, theta+eps) version.
- call your implementation for evaluating gradients
- compare them!
- for vectors, partial derivatives can be obtained using the same principle
- just work on each dimension, one at a time! (sketched below)
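A sketch of the two-sided numerical gradient check (J is any function returning the cost for a flattened parameter vector; names are mine):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Estimate dJ/dtheta one dimension at a time; compare against the
    backprop gradients -- they should agree to several decimal places."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad
```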
- initial value of theta
- cannot initialize all theta in a net to zero
- otherwise, after each gradient update, the params feeding into every hidden unit would be identical
- this is the problem of symmetric weights
- thus for symmetry-breaking, we'll have to perform random init.
- init each theta between [-eps, eps] (see the sketch below)
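A sketch of symmetry-breaking random initialization for one weight matrix (eps=0.12 here is just a common small default, not a prescribed value):

```python
import numpy as np

def random_init(s_out, s_in, eps=0.12):
    """Weights for one layer, uniform in [-eps, eps]; shape s_out x (s_in+1),
    the +1 column being for the bias unit."""
    return np.random.uniform(-eps, eps, size=(s_out, s_in + 1))
```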
- putting it all together
- number of inputs = dimension of input features
- number of outputs = number of classes (For a classification problem)
- reasonable default: have the same number of hidden units in each layer
- can usually start with 1 hidden layer
- usually, more hidden units/layers give a better network, at a higher computational cost
Week6
- debugging a learning algo
- if the system shows huge prediction errors on new data
- get more training data
- try smaller set of features
- try to get additional features
- try adding polynomial features
- play with lambda value
- machine learning diagnostic:
- a test to gain insight into what is/isn't working with a learning algo, and how best to improve its performance.
- evaluating a hypothesis
- split dataset into training/test set (70:30 split)
- train using training set and measure its performance using test set
- another way can be to just measure the misclassification rate
- training error is usually lower than the generalization error
- model selection
- select those hyper-params for which we get the lowest test error
- but we cannot estimate the generalization error this way, since we selected the model based on the test error itself!
- hence, split dataset into training/cross-validation/test sets. (split ratio: 60:20:20)
- now measure the model's performance on the CV set.
- pick the hypothesis with the lowest CV error
- diagnosing bias vs variance (under vs over fitting)
- both high training and high CV error = bias problem (simple model)
- low training and high CV error = variance problem (complex model)
- choosing lambda
- in this case, the training/CV errors are computed without the regularization term
- only the cost function used while training the model includes it!
- to choose an optimal lambda, use a similar approach as in model selection above
- both high training and high CV error = bias problem (large lambda)
- low training and high CV error = variance problem (small lambda)
- learning curves
- plot training/CV errors as a function of dataset size
- here, average training error grows with dataset size
- average CV error decays with dataset size
- for large data if training error ~= CV error => bias problem (with high error value)
- high bias problem cannot be solved using more data!
- for large data if training error != CV error and there's a large difference between them => variance problem (training error smaller than CV error)
- a high variance problem CAN usually be reduced with more data
- What to do next?
- getting more training examples (helps for high variance case)
- try smaller set of features (high variance)
- try getting additional features (helps for high bias problem)
- try adding polynomial features (high bias)
- try decreasing lambda (high bias)
- try increasing lambda (high variance)
- neural nets
- smaller neural net (fewer params) = more prone to underfitting
- larger neural net = more prone to overfitting
- use regularization to address overfitting
- ML system design
- prioritizing what to work on: collecting more data vs. planning what features to use
- have a single evaluation metric (eg: error percentage) to evaluate different ideas/approaches
- for skewed classes (rare classes in the dataset)
- usual classification accuracy is bad!
- in such cases, use confusion matrix to measure error rates in classification
- can also use precision/recall technique
- precision/recall
- precision = #true positives / #positives predicted
- recall = #true positives / #actual positives
- trading off between precision and recall
- lower precision; higher recall (avoid false negatives)
- higher precision; lower recall (avoid false positives)
- comparing precision/recall (F score): F = 2*P*R / (P + R) (sketch below)
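A sketch of precision/recall/F computed from confusion-matrix counts:

```python
def f_score(tp, fp, fn):
    """tp/fp/fn: true positive, false positive, false negative counts."""
    P = tp / (tp + fp)          # of predicted positives, fraction correct
    R = tp / (tp + fn)          # of actual positives, fraction caught
    return 2 * P * R / (P + R)  # harmonic mean balances P and R
```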
- how much data to train on?
- in ML, it is not who has the best system that wins, but the one who has the most data!
Week7
- logistic regression optimization objective
- for y=1, we need theta' * x >> 0, and for y=0, theta' * x << 0
- instead of putting the lambda hyper-parameter on the regularization term, with SVMs people put its inverse onto the cost term itself.
- and by convention they drop the '1/m' multiplier
- Support Vector Machines (SVMs)
- instead of giving out probabilities like in logistic regression, SVM hypothesis directly gives out y=0/1.
- meaning, h(x) = 1 if theta' * x >= 0, and 0 otherwise
- they are also called large margin classifiers
- they create a more robust decision boundary
- but these classifiers can be sensitive to outliers! (in which case, the regularization parameter 'C' plays vital role, similar to lambda)
- kernels (non-linear decision boundary) - to classify complex non-linear functions
- gaussian kernel
- fi = exp(-||x-li||^2 / (2*sigma^2))
- these features are used in place of the raw 'x' vector in: theta' * x
- li = landmarks (manually chosen)
- i = 1,...n
- we could use other similarity functions as well
- how to get these landmarks?
- to begin with, make each training example a landmark
- but the number of features with this approach = number of training samples!
- bias/variance trade-off:
- C = 1/lambda
- large C = lower bias, higher variance
- small C = higher bias, lower variance
- in gaussian kernel
- large sigma = features vary smoothly = higher bias, lower variance
- small sigma = features vary sharply = lower bias, higher variance (kernel sketched below)
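A sketch of the gaussian kernel feature from above (x and landmark l as vectors):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity of x to landmark l: 1 when x == l, -> 0 far away.
    Larger sigma = smoother/broader features (higher bias)."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(l)) ** 2)
                  / (2 * sigma ** 2))
```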
- Using SVM
- choice of C
- choice of kernel (and its parameter values)
- no kernel = linear kernel
- if using gaussian kernel, do perform feature scaling
- not all kernels are valid kernels!
- they need to satisfy 'Mercer's theorem' to be valid
- this will make sure that the SVM optimizations run correctly to convergence
- other kernels are
- polynomial kernel: (x' * l + c)^d. Usually performs worse than the gaussian kernel
- string kernel
- chi-square kernel
- histogram intersection kernel
- multi-class classification
- one vs all method
- some SVMs have multi-class classification built-in
- logistic regression vs SVMs
- n = number of features, m = number of training samples
- if n is large compared to m, use logistic regression or linear kernel SVMs
- if n is small and m is intermediate, use gaussian kernel SVMs
- if n is small and m is large, create more features and then use logistic regression or linear kernel SVMs
- neural nets will work well in all these cases, but for some of these problems they can be slower to train
- SVM optimization is a convex optimization problem
Week8
- unsupervised learning
- work with unlabelled data
- ask it to find some structure in the data
- k-means algo
- most popular algo
- randomly initialize 'k' cluster centroids
- cluster assignment - each datapoint is assigned to its nearest centroid
- centroid update - move each centroid to the average of its newly assigned cluster
- repeat until convergence/max-iterations
- optimization objective
- J = (1/m) * sum(||xi - u_ci||^2)
- m = dataset size
- u_ci = cluster centroid to which xi is assigned
- cluster centroid initialization
- random works ok, but not good
- randomly pick K training examples and set centroids to these
- this can avoid getting stuck at local optimum
- one better approach to avoid local optima
- run K-means with random initializations and repeat this multiple times
- 50 to 1000 restarts is a typical number
- pick the initialization that results in the lowest cost function
- this works for K in [2,10]; for higher values of K, you might not need this (sketch below)
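A minimal K-means sketch with random-example initialization and multiple restarts (empty clusters are not handled here; names are mine):

```python
import numpy as np

def kmeans(X, K, iters=100):
    """One run: centroids start at K random training examples."""
    C = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # cluster assignment: nearest centroid for every point
        z = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # centroid update: mean of each newly assigned cluster
        C = np.array([X[z == k].mean(axis=0) for k in range(K)])
    J = ((X - C[z]) ** 2).sum() / len(X)   # distortion cost
    return C, z, J

def kmeans_restarts(X, K, restarts=50):
    """Repeat with different random inits; keep the lowest-cost run."""
    return min((kmeans(X, K) for _ in range(restarts)), key=lambda r: r[2])
```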
- how to choose K?
- by far the best way is to choose it manually!
- one other method is 'elbow method'
- basically pick that value of K which results in an 'elbow' in the plot of cost function vs K
- but sometimes, if there are no elbows, this becomes ambiguous!
- evaluate K-means on a downstream metric based on its application
- dimensionality reduction
- in case we have highly correlated features, we don't need all of them
- used for data compression too
- it is also used to visualize data (specifically for input data in many ML applications)
- Principal Component Analysis - PCA
- tries to find directions (vectors) onto which to project the data, minimizing the projection error
- PCA is not linear regression!
- we try to minimize squared error in linear regression
- we try to minimize projection error in PCA
- Step1: data preprocessing or feature-scaling or mean normalization
- perform x[i] = (x[i] - u[i]), where u[i] is mean over all x[i]'s
- if different features are on different scales, scale them to be on similar ranges
- Step2: calculate covariance matrix
- Step3: compute eigen vectors of this matrix (using 'svd')
- Step4: if you take the first 'k' columns of the output U matrix from svd, you've got k vectors on which to project the data
- Step5: to reduce the dim of input data, perform: z = Ured' * x
- choosing 'k' for PCA
- typically choose 'k' to be the smallest value such that
- sum(|| x[i] - xapprox[i] ||^2) / sum(|| x[i] ||^2) <= 0.01
- ie, 99% of the data variance is retained
- 'S' diagonal matrix given as output by 'svd' can be used for this
- 1 - sum_1:k(S[i][i]) / sum_1:n(S[i][i]) <= 0.01. This is the same condition as above!
- reconstructing the original data: Xapprox = Ured * z (full PCA sketch below)
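A sketch of the PCA steps above (X assumed already mean-normalized; Ured is my name for the reduced U matrix):

```python
import numpy as np

def pca(X, retain=0.99):
    """Return the smallest-k projection basis retaining `retain` variance."""
    m = X.shape[0]
    Sigma = X.T @ X / m                    # covariance matrix (n x n)
    U, S, _ = np.linalg.svd(Sigma)         # columns of U: principal directions
    k = int(np.searchsorted(np.cumsum(S) / S.sum(), retain)) + 1
    return U[:, :k]

# Z = X @ Ured          -> compressed data (m x k)
# X_approx = Z @ Ured.T -> reconstruction, with some projection error
```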
- applying PCA
- used to speed up a supervised learning problem. But always make sure you've tried the learning problem without PCA too!
- but the Ureduce should be calculated using only the training set
- then apply the mapping on all of training/cv/test sets
- PCA should not be used to address overfitting. Use regularization instead!
Week9
- anomaly detection
- somewhat like unsupervised learning
- perform density estimation and build a model 'p' based on the input feature vectors
- now for a test case, if p(x) < eps, flag it as an anomaly
- eg: fraud detection, manufacturing
- gaussian distribution (normal distribution)
- p(x) = exp(-(x-u)^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma)
- parameter estimation
- u = mean(x)
- sigma^2 = variance(x)
- density estimation
- p(x) = prod(p(xi)) for each element xi in the feature vector x
- each p(xi) is a gaussian distribution
- this assumes that the elements xi are independent of each other
- then use the above gaussian parameter estimation equation on all the feature elements
- now to detect an anomaly, apply the above density equation with an 'eps' threshold to flag the output as anomalous or not (sketch below)
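A sketch of the per-feature density estimation and the eps threshold (names mine; eps itself is tuned on the CV set, as noted below):

```python
import numpy as np

def fit_gaussians(X):
    """Parameter estimation per feature on (non-anomalous) training data."""
    return X.mean(axis=0), X.var(axis=0)          # u and sigma^2 vectors

def density(x, u, var):
    """p(x) = product of independent per-feature gaussians."""
    return np.prod(np.exp(-(x - u) ** 2 / (2 * var))
                   / np.sqrt(2 * np.pi * var))

def is_anomaly(x, u, var, eps):
    return density(x, u, var) < eps
```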
- it is always better to have a way to evaluate the learning algorithm
- becomes easy to fine-tune hyper-parameters
- evaluation
- let's assume now that these examples are labelled as anomalous and non-anomalous
- split this into training/cross-validation/test sets, as usual
- typically, no anomalous examples are kept in the training set!
- then perform density estimation on the training set
- predict its performance on cv and test sets
- can use f1-score, precision/recall, ...
- because the data set is very skewed
- use cv set to estimate a good value for 'eps'
- choosing features
- try plotting the histogram of data to confirm whether it could fit a bell-shaped curve or not
- if it doesn't fit gaussian, try using other distributions
- try to see if 'log' can convert this into gaussian, or any other transformations
- use cv set extensively to choose features
- multi-variate gaussian
- in case features are correlated with each other, the previous per-feature gaussian approach might miss some anomalies
- u = mean vector of all features, cov = covariance matrix
- p(x) = exp(-(x-u)' * cov^-1 * (x-u) / 2) / ((2*pi)^(n/2) * sqrt(det(cov)))
- u = mean(x)
- cov = mean((x-u)*(x-u)') (sketch below)
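A sketch of the multivariate gaussian density above (mean vector u and covariance cov estimated as just described):

```python
import numpy as np

def multivariate_density(x, u, cov):
    """Full gaussian; captures correlations the per-feature model misses."""
    n = len(u)
    d = x - u
    return (np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)
            / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))))
```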
- recommender systems:
- nu = number of users
- nm = number of items to be rated
- y(i,j) = rating by user j to item i
- predict ratings for all items and recommend items to the users
- content based recommendations (CBR)
- assume certain features for the items
- use linear regression on each user to predict his rating
- can also apply regularization while learning these params
- collaborative filtering (CF)
- it's hard to get features for all the items!
- this approach can self-learn these features!
- but now let's say that each user has told us how much they like each of these features
- so using these per-user preferences as params, we can learn the features for each item!
- for each item, repeat this process
- one can also apply regularization here (not on the params, but now on the features!)
- initially guess the theta values, then using CF, estimate x's.
- then, using CBR, get better values for theta.
- CF algo
- we don't need to go back and forth between theta and x
- we could just put both optimizations in a single equation and optimize for both simultaneously
- however, there's no intercept term used here
- vectorization
- low rank matrix factorization
- predict user ratings using: X*Theta'
- using these learned features/params, we can also find related items (just measure feature vector distance between the 2 items)
- mean normalization
- what if a user has not rated any items?!
- this leads that user's params/features to converge to zero
- in this case, calculate the mean rating for each row (item)
- subtract this mean from every element in the Y matrix
- then while predicting the rating, add back the mean
- so for this user, we'll predict the mean rating for each item!
- same thing applies to items which might not have any ratings (but better to not recommend it since it has no ratings!)
Week10
- its not who has the best algo that wins, it's who has the most data
- with large datasets, we can't update the params using batch gradient descent, because 'm' is too large! computationally prohibitive!
- SGD: Stochastic Gradient Descent
- GD (or BGD = Batch Gradient Descent) becomes very expensive for large datasets
- SGD works as follows:
- randomly shuffle the dataset
- perform the params update stage, one sample at a time! (as opposed to BGD)
- but it performs somewhat like a random walk in the 'J' space
- Mini BGD
- sometimes can be even faster than SGD
- it basically uses 'b' examples in each iteration to update params
- b = batch size
- in-between BGD and SGD
- compared to SGD, MBGD allows vectorizing the training, so it can be faster (more parallel work per step!)
- checking for SGD convergence:
- for every P iterations, calculate the average 'cost' of all the previous P examples
- plot the same
- but the plot can be a bit noisy compared to GD
- increasing P can give smoother curve, but after longer duration
- can also decrease alpha over epochs to make theta converge (mini-batch sketch below)
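A mini-batch gradient descent sketch for linear regression (b=1 recovers SGD, b=m recovers batch GD; names are mine):

```python
import numpy as np

def minibatch_gd(X, Y, alpha=0.01, b=32, epochs=10):
    m, n1 = X.shape
    T = np.zeros(n1)
    for _ in range(epochs):
        idx = np.random.permutation(m)     # randomly shuffle the dataset
        for s in range(0, m, b):
            B = idx[s:s + b]               # next mini-batch of b examples
            err = X[B] @ T - Y[B]
            T -= alpha * X[B].T @ err / len(B)
    return T
```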
- online learning
- learning on a stream of data
- one way is to apply SGD on the current data to learn the params
- estimate predicted CTR (Click Through Rate)
- map-reduce and data parallelism
- can be used to split training session on machines
- ML pipeline:
- image -> object detection -> object segmentation -> object recognition
- to detect, create a rectangular BB (bounding box) and slide it across the image, using a classifier to detect the object
- the sliding speed over the image is determined by the stride in the vertical and horizontal directions
- can also apply an expansion (dilation) operator on the output image
- then use connected components and draw BB
- data augmentation
- creating artificial data
- put artificial distortions (or warping)
- change contrast, color, rotate the image, crop the image
- adding noise (especially audio)
- but does NOT help to add meaningless noise to the data!
- ceiling analysis
- deciding what part of pipeline to work on next
- first create a measure of performance (accuracy, cost, ...)
- in the pipeline, replace each stage and manually pass the 'correct' inputs to its next stage.
- this way measure the performance of the system
- this tells us which part of the pipeline is lagging behind, so we can spend more effort there.