W3 - 《Machine Learning》

• Linear regression
A prediction method for regression tasks whose predicted variable is numeric.
• Logistic regression
An extension of linear regression for classification tasks whose predicted variable is nominal.
• Overfitting and regularization
• Ridge and Lasso regression

Simple(bivariate) regression
x: independent variable
y: dependent variable
Goal: 1) Prediction 2) Descriptive analysis - assess the relationship between x and y
W3 - 图2
b0 is intercept(截断)
b1 is slope(斜率)
The difference between actual value and predicted value is prediction error or residual .
SSE is the sum of squared prediction errors W3 - 图3
the least squares method (最小二乘法)

W3 - 图4
This solution is obtained by minimizing SSE using differential calculus
R2 measures the goodness of fit of the regression line found by theleast squares method:
W3 - 图5 SST: SSR: W3 - 图6
W3 - 图7
• Hence, SST measures the prediction error when the predicted value is the mean value
• SST is a function of the variance of y (variance = standard deviation^2) => SST is a measure of the variability of y, without considering x and can be used as a baseline -predicting y without knowing x
Values between 0 and 1; the higher the better
• = 1: the regression line fits perfectly the training data
• close to 0: poor fit

If there is an outstanding value far higher or lower than others, how does the linear regression perform?
If a value is super good, the value may cause the regression line entirely moving up, and lead to a bad fitting effect. Otherwise, the regression line will entirely move down, and lead to a bad fitting effect.

Multiple regression
W3 - 图10 %22%20aria-hidden%3D%22true%22%3E%0A%20%3Cuse%20xlink%3Ahref%3D%22%23E1-MJMATHI-52%22%20x%3D%220%22%20y%3D%220%22%3E%3C%2Fuse%3E%0A%20%3Cuse%20transform%3D%22scale(0.707)%22%20xlink%3Ahref%3D%22%23E1-MJMAIN-32%22%20x%3D%221074%22%20y%3D%22583%22%3E%3C%2Fuse%3E%0A%3C%2Fg%3E%0A%3C%2Fsvg%3E#card=math&code=R%5E2&id=IBCiy): multiple coefficient of determination

Logistic Regression

• It gives a value between 0 and 1 that is interpreted as the probability for class membership:
• p is the probability for class 1 and 1-p is the probability for class 0
• It uses the maximum likelihood method (极大似然估计) to find the parameters b0 and b1 - the curve that best fits the data

Overfitting and Regularization
• Overfitting:
• Small error on the training set but high error on test set (new examples)
• The classifier has memorized the training examples but has not learned to generalize to new examples!
• It occurs when
• we fit a model too closely to the particularities of the training set – the resulting model is too specific, works well on the training data but doesn’t work well on new data
• Various reasons, e.g.
• Issues with the data
• Noise in the training data
• Too small training set – does not contain enough representative examples
• How the algorithm operates
• Some algorithms are more susceptible to overfitting than others
• Different algorithms have different strategies to deal with overfitting, e.g.
• Decision tree – prune the tree
• Neural networks – early stopping of the training
• …
• Underfitting:
• The model is too simple and doesn’t capture all important aspectsof the data
• It performs badly on both training and test data

Regularization
• Regularization means explicitly restricting a model to avoid overfitting
• It is used in some regression models (e.g. Ridge and Lasso regression) and in some neural
networks
Ridge Regression
• A regularized version of the standard Linear Regression (LR)
• Also called Tikhonov regularization
• Uses the same equation as LR to make predictions
• However, the regression coefficients w are chosen so that they not only fit well the training data (as in LR) but also satisfy an additional constraint:
• the magnitude of the coefficients is as small as possible, i.e. close to 0
• Small values of the coefficients means
• each feature will have little effect on the outcome
• small slope of the regression line
• Rationale: a more restricted model (less complex) is less likely to overfit
• Ridge regression uses the so called L2 regularization (L2 norm of the weight vector)
W3 - 图16
MSE regularization term
Goal: high accuracy on training data (low MSE)
• Parameter α controls the trade-off between the performance on training set and model complexity
low complexity model – w close to 0
• Parameter α controls the trade-off between the performance ontraining set and model complexity
• Increasing α makes the coefficientssmaller (close to 0); this typically decreases the performance on the raining set but may improve the performance on the test set
• Decreasing α means less restricted coefficients. For very small α , Ridge Regression will behave similarly to LR
Lasso regression
W3 - 图17
• Consequence of using L1 – some w will become exactly 0
• => some features will be completely ignored by the model – a form of automatic feature selection
• Less features – simpler model, easier to interpret