W1 - Machine Learning

Two kinds of task:
- Supervised task: classification and regression
  - Classification: predicted variable is categorical
  - Regression: predicted variable is numerical
- Unsupervised task: clustering
Others:
- Association rule mining
- Reinforcement learning
- Outlier detection, where outliers are treated in one of two ways:
  - Outliers are noise and should be removed before data analysis
  - Outliers are the goal of the DM analysis: detecting unusual behavior, e.g. credit card fraud detection or intrusion detection
DM process:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment
Nominal and numeric attributes
Data cleaning
- Noise
  1. distortion of values
  2. addition of spurious examples
  3. inconsistent and duplicate data
  - Solutions for types 1 and 2
    - Using signal and image processing and outlier detection techniques before DM
    - Using ML algorithms that are more robust to noise, i.e. that give acceptable results in the presence of noise
- Missing values
Data preprocessing
- Data aggregation
- Feature extraction
- Feature subset selection
- Converting features from one type to another
- Normalization of feature values
Similarity measures
- Euclidean, Manhattan, Minkowski
- Hamming, SMC, Jaccard coefficient
- Cosine similarity
- Correlation
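
The numeric measures listed above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming equal-length numeric vectors with non-zero norm and non-zero variance; the function names are chosen here for illustration and do not come from any particular library.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """Manhattan distance: Minkowski with p = 1."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """Euclidean distance: Minkowski with p = 2."""
    return minkowski(x, y, 2)

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (assumes neither is the zero vector)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson_correlation(x, y):
    """Pearson correlation coefficient (assumes both vectors have non-zero variance)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)
```

Manhattan and Euclidean are the p = 1 and p = 2 cases of Minkowski, which is why both are written here as thin wrappers around the same function.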

the one with minimum total entropy
Normalization and Standardization
Used to avoid the dominance of attributes with large values over attributes with small values.
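
As a rough sketch of the two rescalings, assuming normalization means min-max scaling to [0, 1] and standardization means z-scores with the population standard deviation (the notes do not pin down either definition, so both are assumptions here, as are the function names):

```python
import math

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: every value maps to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # Constant attribute: every value maps to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Example: an attribute with large raw values ends up on the same scale
# as one with small raw values.
incomes = [20000, 50000, 80000]
ages = [20, 35, 50]
print(min_max_normalize(incomes))  # [0.0, 0.5, 1.0]
print(min_max_normalize(ages))     # [0.0, 0.5, 1.0]
```

After rescaling, a distance computed over both attributes is no longer dominated by the attribute with the larger raw range.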

Measuring similarity

Distance
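- Hamming distance: the number of positions at which two vectors differ; for binary vectors (e.g. one-hot encoded attributes) it equals the Manhattan distance
- Similarity coefficients
  - Simple Matching Coefficient (SMC) = (number of 1-1 and 0-0 matches) / (number of attributes)
  - Jaccard coefficient = SMC with the 0-0 matches excluded: (number of 1-1 matches) / (number of attributes that are not 0-0)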
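
The binary measures in this list can be illustrated with a short sketch. It assumes equal-length vectors of 0/1 values and uses the standard definitions (SMC counts 1-1 and 0-0 matches; Jaccard ignores 0-0 matches); the function names are illustrative only.

```python
def hamming(x, y):
    """Number of positions at which the two vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def smc(x, y):
    """Simple Matching Coefficient: (1-1 matches + 0-0 matches) / number of attributes."""
    matches = sum(a == b for a, b in zip(x, y))
    return matches / len(x)

def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches / attributes that are not 0-0."""
    both_one = sum(a == 1 and b == 1 for a, b in zip(x, y))
    not_both_zero = sum(a == 1 or b == 1 for a, b in zip(x, y))
    # If every position is 0-0 the ratio is undefined; 0.0 is returned by convention here.
    return both_one / not_both_zero if not_both_zero else 0.0

x = [1, 1, 0, 0]
y = [1, 0, 0, 1]
print(hamming(x, y))  # 2
print(smc(x, y))      # 0.5
```

For sparse binary data, where most attributes are 0 in both vectors, SMC is inflated by the many 0-0 matches, which is the usual reason for preferring Jaccard in that setting.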
Correlation
