https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.6.6b17329eVBTIwN&postId=333161

    Time series problems - Figure 1

    • Time features
      Year, month, day, hour, day of week, quarter, whether the date is a workday, and so on.

      import pandas as pd

      def get_time_feature(df, col, keep=False):
          """
          Add time feature columns to df: year, month, day, hour, weekofyear,
          dayofweek, is_wknd, quarter, is_month_start, is_month_end.
          :param df: input DataFrame
          :param col: name of the time column
          :param keep: whether to keep the original time column
          :return: a copy of df with the new feature columns
          """
          df_copy = df.copy()
          prefix = col + "_"
          df_copy[col] = pd.to_datetime(df_copy[col])
          df_copy[prefix + 'year'] = df_copy[col].dt.year
          df_copy[prefix + 'month'] = df_copy[col].dt.month
          df_copy[prefix + 'day'] = df_copy[col].dt.day
          df_copy[prefix + 'hour'] = df_copy[col].dt.hour
          # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
          df_copy[prefix + 'weekofyear'] = df_copy[col].dt.isocalendar().week
          df_copy[prefix + 'dayofweek'] = df_copy[col].dt.dayofweek
          # dayofweek // 4 is 1 for Friday through Sunday (4, 5, 6) and 0 otherwise;
          # use (dayofweek >= 5) instead if only Saturday and Sunday should count
          df_copy[prefix + 'is_wknd'] = df_copy[col].dt.dayofweek // 4
          df_copy[prefix + 'quarter'] = df_copy[col].dt.quarter
          df_copy[prefix + 'is_month_start'] = df_copy[col].dt.is_month_start.astype(int)
          df_copy[prefix + 'is_month_end'] = df_copy[col].dt.is_month_end.astype(int)
          if keep:
              return df_copy
          else:
              return df_copy.drop([col], axis=1)

      df = get_time_feature(df, "time_col")
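The feature list above mentions a workday flag, which the function does not compute. A minimal sketch of one way to add it, assuming a hand-maintained holiday list: `HOLIDAYS` and `add_workday_flag` are hypothetical names introduced here for illustration, and weekend make-up workdays are not handled.

```python
import pandas as pd

# Hypothetical holiday list; replace it with the real calendar for your data.
HOLIDAYS = pd.to_datetime(["2021-01-01", "2021-10-01"])

def add_workday_flag(df, col, holidays=HOLIDAYS):
    """Add <col>_is_workday: 1 for Monday-Friday dates not in `holidays`, else 0."""
    out = df.copy()
    ts = pd.to_datetime(out[col])
    is_weekday = ts.dt.dayofweek < 5               # Monday=0 ... Friday=4
    is_holiday = ts.dt.normalize().isin(holidays)  # compare on the date part only
    out[col + "_is_workday"] = (is_weekday & ~is_holiday).astype(int)
    return out

demo = pd.DataFrame(
    {"time_col": ["2021-01-01 08:00", "2021-01-02 08:00", "2021-01-04 08:00"]}
)
demo = add_workday_flag(demo, "time_col")
```

In this toy frame the first row is a Friday that appears in the holiday list, the second is a Saturday, and the third is an ordinary Monday.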
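    • Lag features
      The value of the same entity at earlier time points (in this competition, for example, the usage of the same unit yesterday, the day before yesterday, and one week ago).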

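      keys = ['unit']
      val = 'qty'
      lag = 1
      # Assumes df is sorted by time within each unit, so shift(lag) looks back `lag` rows.
      # The result is a Series aligned with df; assign it to a new column to use it.
      df.groupby(keys)[val].transform(lambda x: x.shift(lag))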
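As an illustration of the "yesterday, the day before yesterday, last week" lags described above, here is a small sketch on made-up data. The frame `df_demo`, its `date` column, and the `qty_lag_*` column names are assumptions for this example, and a row shift only equals a day lag when every unit has exactly one row per day with no gaps.

```python
import pandas as pd

# Toy data: two units, eight consecutive days each (made-up values).
df_demo = pd.DataFrame({
    "unit": ["a"] * 8 + ["b"] * 8,
    "date": list(pd.date_range("2021-01-01", periods=8)) * 2,
    "qty": list(range(1, 9)) + list(range(10, 90, 10)),
})

# Sort by time within each unit so that shifting rows means shifting days.
df_demo = df_demo.sort_values(["unit", "date"]).reset_index(drop=True)

# Lags of 1, 2 and 7 rows: yesterday, the day before yesterday, one week ago.
for lag in (1, 2, 7):
    df_demo[f"qty_lag_{lag}"] = df_demo.groupby("unit")["qty"].shift(lag)
```

Because the shift is done per group, the first rows of unit `b` get NaN rather than values carried over from unit `a`.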
    • Sliding-window statistics: summary statistics over a historical time window

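      keys = ['unit']
      val = 'qty'
      window = 7
      # Note: each window includes the current row; shift first if val is the prediction target.
      # Weighted rolling mean with a triangular window (win_type requires SciPy)
      df.groupby(keys)[val].transform(
          lambda x: x.rolling(window=window, min_periods=3, win_type="triang").mean())
      # Rolling standard deviation over the same window length
      df.groupby(keys)[val].transform(
          lambda x: x.rolling(window=window, min_periods=3).std())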
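When the rolled column is also the prediction target, a window that contains the current row leaks the label into the feature. One possible variant, sketched here on made-up data, shifts by one row before rolling, in the same way the exponentially weighted example in this post does. The frame `df_demo`, the window of 3, and the column name `qty_roll_mean_3` are assumptions for this example.

```python
import pandas as pd

# Toy data for a single unit (made-up values).
df_demo = pd.DataFrame({
    "unit": ["a"] * 6,
    "qty": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

window = 3
# shift(1) drops the current row from its own window, so each feature value
# is computed from strictly earlier observations of the same unit.
df_demo["qty_roll_mean_3"] = df_demo.groupby("unit")["qty"].transform(
    lambda x: x.shift(1).rolling(window=window, min_periods=2).mean()
)
```

The first two rows stay NaN because fewer than `min_periods` past values are available for them.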
    • Exponentially weighted moving average

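      keys = ['unit']
      val = 'qty'
      lag = 1
      alpha = 0.95
      # shift(lag) first so that each average uses only values from earlier rows
      # (df here; the original snippet used df_temp, which is not defined above)
      df.groupby(keys)[val].transform(lambda x: x.shift(lag).ewm(alpha=alpha).mean())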
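To make the weighting concrete, the sketch below checks what `ewm(alpha=...).mean()` computes with pandas' default `adjust=True`: a weighted average in which the observation i steps back gets weight (1 - alpha)^i. The series values are made up for illustration.

```python
import numpy as np
import pandas as pd

alpha = 0.95
x = pd.Series([10.0, 20.0, 30.0])

# pandas result (adjust=True is the default)
ewma = x.ewm(alpha=alpha).mean()

# Manual computation for the last point: weights 1, (1-alpha), (1-alpha)^2
# applied to x[2], x[1], x[0], then normalised by the sum of the weights.
weights = (1 - alpha) ** np.arange(len(x))
manual_last = np.sum(weights * x.values[::-1]) / weights.sum()
```

With alpha = 0.95 almost all of the weight falls on the most recent value, so the feature behaves much like a lightly smoothed lag.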
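    • Label transformation
      When the label spans a wide range of values, try training on a transformed label, then apply the inverse transform to the test-set predictions after training.
      A common choice is log1p, which applies when all label values are non-negative. Training on the transformed label tends to be more stable.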

      import numpy as np

      # Transform the label before training
      assert train[target].min() >= 0
      train[target] = np.log1p(train[target])
      # Predictions must be mapped back with the inverse transform
      pred = np.expm1(pred)
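A quick round-trip check, on made-up label values, of why log1p and expm1 pair up and why a label of zero is fine:

```python
import numpy as np

# Made-up labels spanning several orders of magnitude, including zero.
y = np.array([0.0, 1.0, 10.0, 1000.0])

y_log = np.log1p(y)       # log(1 + y): defined at y = 0 and compresses large values
y_back = np.expm1(y_log)  # exp(y_log) - 1: the exact inverse of log1p
```

The transformed values cover a much narrower range than the raw ones (from 0 up to roughly 6.9 here), and `y_back` recovers the originals up to floating-point error.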
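    • Validation split
      For time series forecasting, the validation set is usually split by time; sklearn's TimeSeriesSplit can be used for this. After finding the model's best number of boosting iterations on the validation set, retrain on the full data. When retraining, the number of iterations can be k times the earlier best iteration count, where a reference value for k is (number of samples in the full data) / (number of samples excluding the validation set).
      See the following code for reference: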

      import gc
      import lightgbm as lgb
      import numpy as np
      from sklearn.metrics import mean_absolute_error
      from sklearn.model_selection import TimeSeriesSplit

      # train, test, used_features, target, target_log, params, N_round, Verbose and
      # Early_Stopping_Rounds are assumed to be defined earlier.
      MAEs = []
      ts_folds = TimeSeriesSplit(n_splits=5)
      for fold_n, (train_index, valid_index) in enumerate(ts_folds.split(train[used_features])):
          # Only the last fold (the most recent validation window) is used
          if fold_n in [0, 1, 2, 3]:
              continue
          print('Training with validation')
          trn_data = lgb.Dataset(train[used_features].iloc[train_index], label=train[target].iloc[train_index],
                                 categorical_feature="")
          val_data = lgb.Dataset(train[used_features].iloc[valid_index], label=train[target].iloc[valid_index],
                                 categorical_feature="")
          # LightGBM 4.0 removed the verbose_eval and early_stopping_rounds arguments
          # of lgb.train; the callbacks below are the replacement.
          clf = lgb.train(params, trn_data, num_boost_round=N_round, valid_sets=[trn_data, val_data],
                          callbacks=[lgb.early_stopping(Early_Stopping_Rounds), lgb.log_evaluation(Verbose)])
          val = clf.predict(train[used_features].iloc[valid_index])
          if target_log:
              # Score on the original scale when the label was log1p-transformed
              mae_ = mean_absolute_error(np.expm1(train.iloc[valid_index][target]), np.expm1(val))
          else:
              mae_ = mean_absolute_error(train.iloc[valid_index][target], val)
          print('MAE: {}'.format(mae_))
          MAEs.append(mae_)

      print("ReTraining on all data")
      del trn_data, val_data
      gc.collect()
      Best_iteration = clf.best_iteration
      print("Best_iteration: ", Best_iteration)
      trn_data = lgb.Dataset(train[used_features], label=train[target], categorical_feature="")
      # 1.2 is k from the text: with n_splits=5 the last fold trains on about 5/6 of the data, so k is about 6/5
      clf = lgb.train(params, trn_data, num_boost_round=int(Best_iteration * 1.2),
                      valid_sets=[trn_data], callbacks=[lgb.log_evaluation(Verbose)])
      pred = clf.predict(test[used_features])
      if target_log:
          pred = np.expm1(pred)  # map predictions back to the original scale
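The factor 1.2 in the retraining step can be derived from the split itself. The sketch below, using a hypothetical sample count of 600, computes k as the full sample count divided by the number of training samples in the last fold of `TimeSeriesSplit(n_splits=5)`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 600                 # hypothetical dataset size
X = np.zeros((n_samples, 1))    # placeholder features; only the row count matters

# Take the last fold: the largest training set, validated on the latest rows.
splits = list(TimeSeriesSplit(n_splits=5).split(X))
train_index, valid_index = splits[-1]

# k = full sample count / sample count excluding the validation set
k = n_samples / len(train_index)
```

With these numbers the last fold trains on 500 rows and validates on the final 100, so k = 600 / 500 = 1.2. The training indices always precede the validation indices, which is what keeps the split free of look-ahead.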