A baseline is what you compare against while iterating on your own algorithm and hyperparameters; the goal is steady improvement against yourself. Once performance beats the benchmark, the work is publishable, and if it even surpasses the SOTA, congratulations: consider submitting to a top conference or journal.

Comparing 8 common machine learning algorithms

A reference for algorithm selection:
I have translated some articles from abroad before, and one of them gave a simple trick for choosing an algorithm:
1. Start with logistic regression. Even if it does not perform well, you can use its results as a baseline for comparison with other algorithms;
2. Then try decision trees (random forests) and see whether they give a big boost to model performance. Even if you do not end up using one as the final model, a random forest can remove noisy variables and do feature selection (see the sketch after this list);
3. If the number of features and observations is very large, then, provided resources and time are sufficient (an important precondition), SVM is also an option.
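A minimal sketch of this workflow in scikit-learn, using synthetic data as a stand-in for a real dataset (all names here are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for a real dataset
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

    # Step 1: logistic regression as the baseline
    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print('baseline acc:', accuracy_score(y_valid, baseline.predict(X_valid)))

    # Step 2: random forest; its importances can also drive feature selection
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print('forest acc:', accuracy_score(y_valid, forest.predict(X_valid)))
    print('importances:', forest.feature_importances_)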

In general:
GBDT >= SVM >= RF >= AdaBoost >= others. Deep learning, built on neural networks, is very popular right now and used in many fields; I am still learning it myself and my theoretical foundation is not solid enough, so I will not cover it here.
Algorithms certainly matter, but good data beats a good algorithm, and designing good features pays off handsomely. If you have a very large dataset, the choice of algorithm may barely affect classification performance (in that case, choose based on speed and ease of use).

P01 Construct features from timestamps

Hint: With a timestamp column in a dataframe, you can access datetime attributes and functions through the .dt accessor. For example, tscolumn.dt.day will extract the day of the month from a timestamp column.

    # Split up the times
    click_times = click_data['click_time']
    clicks['day'] = click_times.dt.day.astype('uint8')
    clicks['hour'] = click_times.dt.hour.astype('uint8')
    clicks['minute'] = click_times.dt.minute.astype('uint8')
    clicks['second'] = click_times.dt.second.astype('uint8')
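For the .dt accessor to work, click_time must already be a datetime column. One way to ensure that when loading the data (the file path here is a placeholder):

    import pandas as pd

    # Parse the timestamp column while reading the CSV; the path is hypothetical
    click_data = pd.read_csv('train_sample.csv', parse_dates=['click_time'])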

P02 Label Encoding

Hint:
Try looping through each of the categorical features and using LabelEncoder's .fit_transform method.

    from sklearn import preprocessing

    cat_features = ['ip', 'app', 'device', 'os', 'channel']

    # Create new columns in clicks using preprocessing.LabelEncoder()
    label_encoder = preprocessing.LabelEncoder()
    for feature in cat_features:
        encoded = label_encoder.fit_transform(clicks[feature])
        clicks[feature + '_labels'] = encoded
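Note that a single LabelEncoder can be reused across the loop because .fit_transform refits it on each column. Keep in mind, though, that fitting on the full dataset this way assumes every category seen at prediction time was also seen here; in a stricter pipeline you would fit on the training split only and decide how to handle unseen categories.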

P03 One-Hot Encoding

Would it have also made sense to instead use one-hot encoding for the categorical variables 'ip', 'app', 'device', 'os', or 'channel'?

Solution:
The ip column has 58,000 values, which means it will create an extremely sparse matrix with 58,000 columns. This many columns will make your model run very slow, so in general you want to avoid one-hot encoding features with many levels. LightGBM models work with label encoded features, so you don’t actually need to one-hot encode the categorical features.
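For comparison, a minimal sketch of what one-hot encoding a single column would look like with pandas; 'device' is picked purely for illustration:

    import pandas as pd

    # One-hot encode one categorical column (illustrative only;
    # avoid this for very high-cardinality columns like 'ip')
    device_dummies = pd.get_dummies(clicks['device'], prefix='device')
    clicks_onehot = clicks.join(device_dummies)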

P04 Train / Test splits with time series data

Are there any special considerations when creating train/test splits for time series?

Solution:
Since our model is meant to predict events in the future, we must also validate the model on events in the future. If the data is mixed up between the training and test sets, then future data will leak into the model and our validation results will overestimate the performance on new data.

Create train / validation / test splits
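A minimal sketch of a chronological split, holding out the most recent 10% of rows each for validation and test (the fraction is just one common choice):

    valid_fraction = 0.1
    clicks_sorted = clicks.sort_values('click_time')
    valid_rows = int(len(clicks_sorted) * valid_fraction)

    # Earliest events for training, the most recent ones for validation and test
    train = clicks_sorted[:-valid_rows * 2]
    valid = clicks_sorted[-valid_rows * 2:-valid_rows]
    test = clicks_sorted[-valid_rows:]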

Train with LightGBM
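A minimal training sketch with the LightGBM Python API, assuming the target column is named is_attributed and using the encoded features from above (the hyperparameters are placeholders):

    import lightgbm as lgb

    feature_cols = ['day', 'hour', 'minute', 'second',
                    'ip_labels', 'app_labels', 'device_labels',
                    'os_labels', 'channel_labels']

    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])

    params = {'objective': 'binary', 'metric': 'auc', 'num_leaves': 64}
    # Stop early once the validation AUC stops improving
    bst = lgb.train(params, dtrain, num_boost_round=1000,
                    valid_sets=[dvalid],
                    callbacks=[lgb.early_stopping(stopping_rounds=10)])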

Evaluate the model
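And a matching evaluation sketch on the held-out test split, scoring with ROC AUC (again assuming the is_attributed target):

    from sklearn import metrics

    # Predicted probabilities for the positive class
    ypred = bst.predict(test[feature_cols])
    score = metrics.roc_auc_score(test['is_attributed'], ypred)
    print(f'Test AUC: {score:.4f}')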