

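1. Start with logistic regression. Even if it does not perform well, its results can serve as a baseline against which other algorithms are compared;
2. Then try decision trees (random forests) to see whether they substantially improve model performance. Even if you do not end up using a random forest as the final model, you can still use it to remove noisy variables and do feature selection;
3. If the number of features and observations is very large, SVM is a reasonable option provided that resources and time are plentiful (this precondition matters).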
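
The first two steps above can be sketched roughly as follows. This is only an illustration on synthetic data, assuming scikit-learn is available; the make_classification dataset, the variable names, the AUC metric, and the "median" importance threshold are all assumptions chosen for the example, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 20 features, 5 informative
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 1: logistic regression as the baseline
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])

# Step 2: random forest, compared against the baseline
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
forest_auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

# Feature selection: keep features whose importance is at or above the median
selector = SelectFromModel(forest, prefit=True, threshold="median")
X_train_selected = selector.transform(X_train)

print(f"baseline AUC: {baseline_auc:.3f}, forest AUC: {forest_auc:.3f}")
print(f"features kept: {X_train_selected.shape[1]} of {X_train.shape[1]}")
```

Even when the forest does not beat the baseline, the reduced feature matrix can be fed back into the simpler model.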


P01 Construct features from timestamps

Hint: With a timestamp column in a dataframe, you can access datetime attributes and functions through the .dt attribute. For example, click_times.dt.day converts a timestamp column to the day of the month.

    # Split up the times
    # Assumption: clicks starts out as a copy of click_data
    clicks = click_data.copy()
    click_times = click_data['click_time']
    clicks['day'] = click_times.dt.day.astype('uint8')
    clicks['hour'] = click_times.dt.hour.astype('uint8')
    clicks['minute'] = click_times.dt.minute.astype('uint8')
    clicks['second'] = click_times.dt.second.astype('uint8')

P02 Label Encoding

Try looping through each of the categorical features and applying LabelEncoder's .fit_transform method to it.

    from sklearn import preprocessing

    cat_features = ['ip', 'app', 'device', 'os', 'channel']

    # Create new columns in clicks using preprocessing.LabelEncoder()
    label_encoder = preprocessing.LabelEncoder()
    for feature in cat_features:
        # fit_transform refits the encoder on each column in turn
        encoded = label_encoder.fit_transform(clicks[feature])
        clicks[feature + '_labels'] = encoded

P03 One-Hot Encoding

Would it also have made sense to use one-hot encoding instead for the categorical variables 'ip', 'app', 'device', 'os', or 'channel'?

The ip column has 58,000 distinct values, so one-hot encoding it would create an extremely sparse matrix with 58,000 columns. That many columns will make your model run very slowly, so in general you want to avoid one-hot encoding features with many levels. LightGBM models work with label-encoded features, so you don't actually need to one-hot encode the categorical features.
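
To make the column blow-up concrete, here is a small sketch on toy data. The 1,000-row frame and the range of ip values are assumptions chosen to keep the example quick; with the real ip column, the one-hot result would have one column per distinct ip.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (an assumption for illustration): 1,000 rows of integer ip ids
rng = np.random.default_rng(0)
toy = pd.DataFrame({"ip": rng.integers(0, 500, size=1000)})

# One-hot encoding: one column per distinct value
one_hot = pd.get_dummies(toy["ip"], prefix="ip")

# Label encoding: a single integer column, whatever the number of levels
labels = LabelEncoder().fit_transform(toy["ip"])

n_levels = toy["ip"].nunique()
print(f"distinct ip values: {n_levels}")
print(f"one-hot columns: {one_hot.shape[1]}, label-encoded columns: 1")
```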

P04 Train / Test splits with time series data

Are there any special considerations when creating train/test splits for time series?

Since our model is meant to predict events in the future, we must also validate it on events that come after the training data. If the data is mixed up between the training and test sets, future data will leak into the model and our validation results will overestimate its performance on new data.

Create train / validation / test splits
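
One possible way to build time-ordered splits is sketched below. The toy click log, the 10% fraction used for each of the validation and test sets, and the names clicks_sorted, train, valid and test are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy click log (an assumption for illustration), deliberately shuffled in time
rng = np.random.default_rng(0)
clicks = pd.DataFrame({
    "click_time": pd.date_range("2017-11-06", periods=1000, freq="min"),
    "ip": rng.integers(0, 500, size=1000),
}).sample(frac=1.0, random_state=0)

valid_fraction = 0.1  # assumed share for each of validation and test

# Sort chronologically so that later rows are always later events
clicks_sorted = clicks.sort_values("click_time")
valid_rows = int(len(clicks_sorted) * valid_fraction)

# Earliest 80% for training, next 10% for validation, latest 10% for testing
train = clicks_sorted[:-valid_rows * 2]
valid = clicks_sorted[-valid_rows * 2:-valid_rows]
test = clicks_sorted[-valid_rows:]

print(len(train), len(valid), len(test))
```

Because the rows are sorted before slicing, every validation event is later than every training event, and every test event is later still.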

Train with LightGBM

Evaluate the model
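
As a sketch of the evaluation step, the snippet below scores predictions on a held-out set with ROC AUC from scikit-learn. The choice of AUC as the metric is an assumption, and the labels and scores are made-up placeholders, not results from the click data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical held-out labels and predicted probabilities (not real results)
y_test = np.array([0, 0, 1, 1, 0, 1])
test_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# AUC is the fraction of (positive, negative) pairs ranked in the right order
test_auc = roc_auc_score(y_test, test_scores)
print(f"Test AUC: {test_auc:.3f}")
```

In practice, y_test would be the target column of the latest time slice and test_scores the model's predictions for those rows.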