
The curse of dimensionality

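  • Properties of high-dimensional problems
  • As the number of dimensions increases, classifier performance improves until an optimal number of features is reached. Increasing the dimensionality further without also increasing the number of training samples degrades classifier performance.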
  • High-dimensional features cause several problems:

  • Causes

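    • Sparsity is introduced. As the number of dimensions grows, the feature space expands and the density of training samples falls exponentially, so the data become increasingly sparse [concrete (low-dimensional) → abstract, independent (high-dimensional)].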

    • W3 Sparse Regression - Figure 9
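
A rough way to see the sparsity effect described above: assuming samples are spread uniformly over a unit hypercube (an assumption made only for this illustration), the sketch below computes the edge length of a sub-cube needed to cover a given fraction of the data. The helper name `edge_length` is hypothetical.

```python
def edge_length(fraction: float, dims: int) -> float:
    """Edge of a sub-cube that covers `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / dims)


# Covering just 1% of uniformly spread data needs an ever larger
# share of each axis as the number of dimensions grows.
for dims in (1, 2, 10, 100):
    print(dims, round(edge_length(0.01, dims), 3))
```

In 10 dimensions the sub-cube already has to span well over half of every axis to capture 1% of the data, so "local" neighbourhoods stop being local.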

Reference:

  1. https://cloud.tencent.com/developer/article/1178591?from=article.detail.1178593

Blessings of dimensionality

  1. Several features will be correlated and we can average over them
  2. The underlying distribution will be finite; informative data will lie on a low-dimensional manifold
  3. Underlying structure in the data (samples from continuous processes, images, etc.) will give an approximately finite dimensionality.

Dimension reduction


Combinatoric search

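A minimal sketch of combinatoric (best-subset) search, assuming the criterion is the residual sum of squares of a least-squares fit; the notes do not state the criterion, and the names `rss` and `best_subset` and the toy data are illustrative only.

```python
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on `cols` plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def best_subset(X, y, k):
    """Try every subset of size k and keep the one with the lowest RSS."""
    return min(combinations(range(X.shape[1]), k), key=lambda cols: rss(X, y, cols))


# Toy data: only features 2 and 5 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 2] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=200)
best = best_subset(X, y, k=2)
print(best)
```

The number of subsets grows combinatorially with the number of features, which is why the greedy and penalized methods in the following sections are used instead.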

Forward selection

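A minimal sketch of greedy forward selection, assuming features are added one at a time by largest reduction in residual sum of squares and that the search stops after `k` features; both choices are assumptions, as are the helper names and toy data.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on `cols` plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def forward_selection(X, y, k):
    """Start from the empty model and greedily add the best feature k times."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: only features 2 and 5 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 2] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=200)
chosen = forward_selection(X, y, k=2)
print(chosen)
```

Backward elimination follows the same pattern in reverse: start from all features and repeatedly drop the one whose removal increases the RSS least.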

Backward elimination


Regularization

Models and algorithms


Shrink methods


Ridge


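A minimal sketch of ridge shrinkage, assuming the usual closed-form solution (X'X + λI)⁻¹X'y on centered data with an unpenalized intercept. The function name `ridge_fit`, the symbol `lam` for λ, and the toy data are illustrative and not taken from the notes.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)

# The coefficient norm shrinks as the penalty grows.
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in (0.0, 10.0, 1000.0)]
print(norms)
```

With `lam=0.0` this reduces to ordinary least squares; larger values pull all coefficients toward zero without setting any of them exactly to zero.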

The Lasso


Algorithms for Lasso

LARS

https://learn.inside.dtu.dk/d2l/le/content/103753/viewContent/409418/View


Cyclical coordinate descent

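A minimal sketch of cyclical coordinate descent for the lasso, assuming the objective (1/2n)·||y − Xβ||² + λ·||β||₁ with no intercept and a fixed number of sweeps instead of a convergence check. The scaling of the objective, the names `soft_threshold` and `lasso_cd`, and the toy data are assumptions for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_sweeps=100):
    """Update one coefficient at a time, cycling over all features each sweep."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()        # equals y - X @ beta for beta = 0
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]    # partial residual without feature j
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]    # put the updated contribution back
    return beta


# Toy data: only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=500)
beta = lasso_cd(X, y, lam=0.2)
print(beta)
```

The soft-thresholding step is what produces exact zeros: any feature whose correlation with the partial residual stays below `lam` in absolute value is dropped from the model.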

The elastic net (coordinate descent)

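The lasso penalty is somewhat indifferent to the choice among a set of strongly correlated variables; the ridge penalty, on the other hand, tends to shrink the coefficients of correlated variables toward each other. The elastic net penalty is a compromise between the two: its second (ridge) term encourages highly correlated features to be averaged, while its first (lasso) term encourages a sparse solution in the coefficients of these averaged features.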
The elastic net penalty can be applied to any linear model, in particular for regression or classification.
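
The penalty formula itself did not survive in these notes. One common way of writing it, consistent with the first (sparsity) and second (averaging) terms described above, is sketched below. The symbols α (mixing weight), β_j (coefficients) and p (number of features) are assumptions, not taken from the notes.

```latex
\sum_{j=1}^{p} \left( \alpha \lvert \beta_j \rvert + (1 - \alpha)\, \beta_j^{2} \right),
\qquad 0 \le \alpha \le 1
```

Setting α = 1 recovers the lasso penalty and α = 0 the ridge penalty.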


The family-wise error rate (FWER)

FWER is the probability of at least one false rejection, and is a commonly used overall measure of error.
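
To get a feel for how quickly this probability grows, the sketch below assumes m independent tests, each run at level α, with all null hypotheses true. The independence assumption and the helper name `fwer_independent` are mine; the notes give only the definition.

```python
def fwer_independent(alpha: float, m: int) -> float:
    """P(at least one false rejection) for m independent true nulls tested at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 5, 20, 100):
    print(m, round(fwer_independent(0.05, m), 3))
```

With 20 such tests at α = 0.05 the chance of at least one false rejection is already well above one half, which motivates the corrections in the following sections.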

Multiple hypothesis testing

Find p-value (significance) in scikit-learn LinearRegression
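
scikit-learn's LinearRegression does not report p-values itself. A minimal sketch of one way to obtain them, assuming classical OLS t-tests and using NumPy and SciPy directly instead of scikit-learn; the function name `ols_p_values` and the toy data are illustrative.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Two-sided t-test p-values for OLS coefficients (index 0 is the intercept)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof              # unbiased noise variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)     # covariance of the coefficient estimates
    t_stat = coef / np.sqrt(np.diag(cov))
    p_values = 2.0 * stats.t.sf(np.abs(t_stat), dof)
    return coef, p_values


# Toy data: only the first feature has a real effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=200)
coef, p_values = ols_p_values(X, y)
print(p_values)
```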

Feature assessment


Bonferroni correction: Controls FWER

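A minimal sketch of the Bonferroni correction, assuming the standard rule of rejecting a hypothesis when its p-value is at most α/m for m tests. The function name `bonferroni` and the example p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m, where m is the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


p_values = [0.3, 0.001, 0.045, 0.012, 0.008]
rejected = bonferroni(p_values)
print(rejected)
```

Here the per-test threshold is 0.05 / 5 = 0.01, so only the two smallest p-values lead to a rejection.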

False Discovery Rate (FDR): Controls FDR

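A minimal sketch of FDR control, assuming the Benjamini-Hochberg step-up procedure. The notes do not name the procedure, so this choice, the function name `benjamini_hochberg`, the target level `q` and the example p-values are all assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= k/m * q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k = rank                      # keep the largest rank that passes
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


p_values = [0.3, 0.001, 0.045, 0.012, 0.008]
rejected = benjamini_hochberg(p_values)
print(rejected)
```

On these p-values the procedure rejects three hypotheses, whereas a Bonferroni threshold of 0.05 / 5 = 0.01 would reject only two, which illustrates why FDR control is less conservative than FWER control.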