How to choose between two models?

https://pubs.rsna.org/doi/pdf/10.1148/radiology.148.3.6878708

How to conduct feature selection?

Highest mutual information: pick the feature that most reduces the uncertainty (entropy) in the label.
Multi-collinear features: detect them with the fraction of variance unexplained (FVU). Build a linear regression model that predicts feature K from features A, B, C, and D; a low FVU means K is largely redundant.
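As a sketch of the FVU idea: compute 1 − R² of a regression that predicts one feature from the others (the feature names and data below are synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fvu(X_others, x_k):
    """Fraction of variance in x_k unexplained by the other features.
    FVU = 1 - R^2; a value near 0 signals multi-collinearity."""
    reg = LinearRegression().fit(X_others, x_k)
    return 1.0 - reg.score(X_others, x_k)

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))                                     # features A, B, C
K = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)   # K is nearly linear in A, B, C
print(fvu(A, K))   # close to 0, so K is almost redundant
```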

  1. Pearson correlation: check the absolute value of the Pearson correlation between the target and each numerical feature in the dataset.
  2. Chi-square test between each categorical feature and the target.
  3. RFE (recursive feature elimination).
  4. Lasso with SelectFromModel.
  5. Categorical encodings: one-hot, OrdinalEncoder, LabelEncoder.
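The listed techniques map onto scikit-learn roughly as below; this is a sketch on the Iris data, and the hyperparameters (k=2, alpha=0.01) are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (SelectKBest, mutual_info_classif,
                                       chi2, RFE, SelectFromModel)
from sklearn.linear_model import LogisticRegression, Lasso

X, y = load_iris(return_X_y=True)

# Filter: keep the 2 features with the highest mutual information with the label
mi = SelectKBest(mutual_info_classif, k=2).fit(X, y)

# Filter: chi-square test (requires non-negative feature values)
ch = SelectKBest(chi2, k=2).fit(X, y)

# Wrapper: recursive feature elimination driven by a linear model's coefficients
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)

# Embedded: Lasso zeroes out weak coefficients; SelectFromModel reads them off
lasso = SelectFromModel(Lasso(alpha=0.01)).fit(X, y)

print(mi.get_support(), ch.get_support(), rfe.support_, lasso.get_support())
```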

    What’s log loss?

    Log loss measures how close the predicted probability is to the actual/true value (0 or 1 in binary classification). The further the predicted probability diverges from the actual label, the higher the log loss.
    Both log loss and MSE indicate how far the predictions are from the actual values.
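A sketch of the definition above, in NumPy with scikit-learn's `log_loss` as a cross-check (the sample labels and probabilities are made up):

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_log_loss(y_true, p):
    """-(y*log(p) + (1-y)*log(1-p)), averaged over samples."""
    p = np.clip(p, 1e-15, 1 - 1e-15)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.3])     # predicting 0.3 for a true 1 is penalized hardest
print(binary_log_loss(y, p))
print(log_loss(y, p))                   # sklearn agrees
```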

    Relative entropy, cross entropy, KL divergence

    Cross-entropy is the same quantity as log loss.
    Intuition: cross-entropy is the average message length (in bits) needed to encode samples from the true probability distribution p using a code optimized for the predicted probability distribution q.
    If our predictions are perfect, i.e. the predicted distribution equals the true distribution, the cross-entropy equals the entropy. If the distributions differ, the cross-entropy exceeds the entropy by some number of bits; this excess is called the relative entropy, more commonly known as the Kullback-Leibler (KL) divergence. In short: H(p, q) = H(p) + D_KL(p || q).
    Use softmax as the last layer and compute the loss with cross-entropy.
    SparseCategoricalCrossentropy and CategoricalCrossentropy both compute categorical cross-entropy; the only difference is how the targets/labels are encoded (integer class indices vs. one-hot vectors).
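The identity H(p, q) = H(p) + D_KL(p || q) can be checked numerically; the distributions `p` and `q` below are arbitrary examples:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true distribution
q = np.array([0.5, 0.3, 0.2])   # predicted distribution

entropy = -np.sum(p * np.log2(p))         # H(p)
cross_entropy = -np.sum(p * np.log2(q))   # H(p, q)
kl = np.sum(p * np.log2(p / q))           # D_KL(p || q)

# Cross-entropy exceeds entropy by exactly the KL divergence
print(entropy, cross_entropy, kl)
```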

forward v.s. backward

forward pass: loss computation
backward propagation: gradient computation
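A minimal NumPy sketch of one forward/backward step for a single logistic neuron; the inputs, weights, and label are made up, and the gradient uses the standard sigmoid-plus-log-loss simplification (p - y) * x:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0])   # input
w = np.array([0.5, -0.3])  # weights
y = 1.0                    # true label

# Forward pass: run the network and compute the loss
z = w @ x
p = sigmoid(z)
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Backward pass: chain rule gives dL/dw = (p - y) * x for this loss
grad_w = (p - y) * x
print(loss, grad_w)   # the optimizer would update w with grad_w
```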

softmax v.s. sigmoid

| Softmax function | Sigmoid function |
| --- | --- |
| Used for multi-class classification (multinomial logistic regression). | Used for binary classification (logistic regression). |
| The output probabilities sum to 1. | The output probabilities need not sum to 1. |
| Typically used in the output layer of a neural network. | Used as an activation function when building neural networks. |
| The largest input receives the highest probability, at the expense of the others. | A large input receives a high probability, independently of the other outputs. |

If you use the softmax function in a machine learning model, you should be careful before interpreting it as a true probability, since it has a tendency to produce values very close to 0 or 1. If a neural network had output scores of [8, 5, 0], like in this example, then the softmax function would have assigned 95% probability to the first class, when in reality there could have been more uncertainty in the neural network’s predictions. This could give the impression that the neural network prediction had a high confidence when that was not the case.

"max" because it amplifies the probability of the largest xi;
"soft" because it still assigns some probability to the smaller xi.
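A minimal softmax sketch, reproducing the [8, 5, 0] example above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(np.array([8.0, 5.0, 0.0]))
print(probs.round(3))   # "max": the first class dominates; "soft": the others keep a little mass
```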

Why does a DNN need a bias term, and what is the intuition behind it?

https://stackoverflow.com/questions/2480650/what-is-the-role-of-the-bias-in-neural-networks
A bias value allows you to shift the activation function to the left or right.
That's useful, but what if you want the network to output 0 when x is 2? Just changing the steepness of the sigmoid won't work; you need to shift the entire curve to the right.
With wx alone, w can only scale x; it cannot shift the curve.
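A small sketch of this: with w = 1 and no bias, sigmoid(w * 2) stays far from 0; adding a bias of -5 (an arbitrary illustrative value) shifts the curve right so that x = 2 maps near 0:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, w = 2.0, 1.0
no_bias = sigmoid(w * x)          # ~0.88: scaling w alone cannot push this toward 0
with_bias = sigmoid(w * x - 5.0)  # ~0.05: the bias shifts the whole curve to the right
print(no_bias, with_bias)
```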