• Linear regression
    A prediction method for regression tasks, where the predicted variable is numeric.
    • Logistic regression
    An extension of linear regression for classification tasks, where the predicted variable is nominal.
    • Overfitting and regularization
    • Ridge and Lasso regression


    Simple (bivariate) regression
    x: independent variable
    y: dependent variable
    Goal: 1) Prediction 2) Descriptive analysis - assess the relationship between x and y
    ŷ = b0 + b1·x
    b0 is the intercept
    b1 is the slope
    The difference between the actual value and the predicted value is the prediction error, or residual.
    SSE is the sum of squared prediction errors: SSE = Σ (yᵢ − ŷᵢ)²
    The least squares method

    b1 = Σ (xᵢ − x̄)(yᵢ − ȳ) / Σ (xᵢ − x̄)² ,  b0 = ȳ − b1·x̄
    This solution is obtained by minimizing SSE using differential calculus
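    A minimal sketch (in Python/NumPy, with made-up toy values for x and y) of the closed-form least-squares fit above:
```python
import numpy as np

# Hypothetical toy data (not from the lecture notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares solution for b1 (slope) and b0 (intercept)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_pred = b0 + b1 * x
sse = np.sum((y - y_pred) ** 2)   # sum of squared prediction errors (residuals)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.3f}")
```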
    R² measures the goodness of fit of the regression line found by the least squares method:
    R² = SSR / SST = 1 − SSE / SST,  where SST = Σ (yᵢ − ȳ)² and SSR = Σ (ŷᵢ − ȳ)²  (SST = SSR + SSE)
    • Hence, SST measures the prediction error when the predicted value is the mean value
    • SST is a function of the variance of y (variance = standard deviation²) => SST is a measure of the variability of y without considering x, and can be used as a baseline - predicting y without knowing x
    R² takes values between 0 and 1; the higher the better
    • R² = 1: the regression line fits the training data perfectly
    • R² close to 0: poor fit
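    A short sketch checking the R² definition on toy data; the values and the use of scikit-learn's LinearRegression here are illustrative assumptions, not part of the notes:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

sst = np.sum((y - y.mean()) ** 2)   # baseline: predicting the mean of y
sse = np.sum((y - y_pred) ** 2)     # prediction error of the fitted line
print("R^2 from the formula:", 1 - sse / sst)
print("R^2 from model.score:", model.score(x, y))   # same value
```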

    • If there is a value far higher or lower than the others (an outlier), how does linear regression perform?
      An unusually high value pulls the whole regression line upward, and an unusually low value pulls it downward; in both cases the line fits the remaining data poorly, as the sketch below illustrates.
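    A quick sketch of this effect, using made-up numbers: the same least-squares fit is repeated with and without one extreme point.
```python
import numpy as np

def least_squares(x, y):
    """Closed-form fit; returns (intercept, slope)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])         # clean linear trend: y = 2x
print("without outlier:", least_squares(x, y))    # slope ~2, intercept ~0

x_out = np.append(x, 6.0)
y_out = np.append(y, 40.0)                        # one value far higher than the others
print("with outlier:   ", least_squares(x_out, y_out))  # the line is dragged upward
```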

    Multiple regression
    ŷ = b0 + b1·x1 + b2·x2 + … + bk·xk
    R²: multiple coefficient of determination
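    A minimal sketch of multiple regression with scikit-learn; the two-feature matrix X and targets y below are hypothetical toy values:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 2.5], [4.0, 3.0], [5.0, 4.5]])
y = np.array([3.0, 5.5, 9.0, 11.0, 14.5])

model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)
print("coefficients b1, b2:", model.coef_)
print("R^2 on training data:", model.score(X, y))  # multiple coefficient of determination
```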
    Logistic Regression
    p = 1 / (1 + e^−(b0 + b1·x))   (the logistic function)
    • It gives a value between 0 and 1 that is interpreted as the probability for class membership:
    • p is the probability for class 1 and 1-p is the probability for class 0
    • It uses the maximum likelihood method to find the parameters b0 and b1 - the curve that best fits the data
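    A minimal sketch of logistic regression on a made-up single-feature dataset, showing the predicted probabilities p and 1-p:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical single feature x and binary labels y (class 0 or 1)
x = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(x, y)

# predict_proba returns [1-p, p] per example: probability of class 0 and class 1
print(clf.predict_proba([[1.2], [3.2]]))
print(clf.predict([[1.2], [3.2]]))   # class labels (threshold 0.5)
```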
    Overfitting and Regularization
    Overfitting:
    • Small error on the training set but high error on test set (new examples)
    • The classifier has memorized the training examples but has not learned to generalize to new examples!
    • It occurs when
    • we fit a model too closely to the particularities of the training set – the resulting model is too specific, works well on the training data but doesn’t work well on new data
    • Various reasons, e.g.
    • Issues with the data
    • Noise in the training data
    • Too small training set – does not contain enough representative examples
    • How the algorithm operates
    • Some algorithms are more susceptible to overfitting than others
    • Different algorithms have different strategies to deal with overfitting, e.g.
    • Decision tree – prune the tree
    • Neural networks – early stopping of the training
    • …
    Underfitting:
    • The model is too simple and doesn’t capture all important aspects of the data
    • It performs badly on both training and test data
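    A minimal sketch (hypothetical noisy data) contrasting underfitting, a reasonable fit and overfitting, by varying the polynomial degree of a regression model and comparing training vs. test R²:
```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical noisy data: y follows sin(x) plus noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 3, 30)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):   # too simple / reasonable / too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          "train R^2:", round(model.score(X_tr, y_tr), 3),
          "test R^2:", round(model.score(X_te, y_te), 3))
```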
    Regularization
    • Regularization means explicitly restricting a model to avoid overfitting
    • It is used in some regression models (e.g. Ridge and Lasso regression) and in some neural networks
    Ridge Regression
    • A regularized version of the standard Linear Regression (LR)
    • Also called Tikhonov regularization
    • Uses the same equation as LR to make predictions
    • However, the regression coefficients w are chosen so that they not only fit well the training data (as in LR) but also satisfy an additional constraint:
    • the magnitude of the coefficients is as small as possible, i.e. close to 0
    • Small values of the coefficients mean
    • each feature will have little effect on the outcome
    • small slope of the regression line
    • Rationale: a more restricted model (less complex) is less likely to overfit
    • Ridge regression uses the so-called L2 regularization (L2 norm of the weight vector)
    J(w) = MSE + α · Σ wⱼ²   (first term: MSE on the training data; second term: L2 regularization)
    • Goal: high accuracy on the training data (low MSE) and a low-complexity model (w close to 0)
    • Parameter α controls the trade-off between the performance on the training set and model complexity
    • Increasing α makes the coefficients smaller (closer to 0); this typically decreases the performance on the training set but may improve the performance on the test set
    • Decreasing α means less restricted coefficients. For very small α, Ridge Regression will behave similarly to LR
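    A minimal sketch of how α shrinks the Ridge coefficients, on a made-up dataset with many noisy features:
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data: 20 examples, 10 features, only the first feature matters
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=20)

print("LR  sum|w| =", round(np.abs(LinearRegression().fit(X, y).coef_).sum(), 3))
for alpha in (0.01, 1.0, 100.0):
    w = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"Ridge alpha={alpha}: sum|w| =", round(np.abs(w).sum(), 3))  # shrinks as alpha grows
```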
    Lasso regression
    J(w) = MSE + α · Σ |wⱼ|   (L1 regularization: the sum of the absolute values of the coefficients)
    • Consequence of using L1 – some w will become exactly 0
    • => some features will be completely ignored by the model – a form of automatic feature selection
    • Fewer features – simpler model, easier to interpret
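    A minimal sketch of Lasso on made-up data, showing that with L1 regularization some coefficients become exactly 0:
```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 8 features, only the first two actually influence y
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 8))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", np.round(lasso.coef_, 3))           # most are exactly 0
print("features kept:", int(np.sum(lasso.coef_ != 0)), "of", X.shape[1])
```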