Dataset Overview

  1. The New York City Taxi and Limousine Commission provides over two million taxi trip records
  2. Based on the trip information, build an accurate regression model to predict the duration of each ride
  3. Predicting NYC taxi trip duration helps estimate the fare before a trip starts (interestingly, Didi computes the fare after the ride ends)

Analysis Steps

  1. Data loading

    1. Load the training and test sets
    2. Load the trip route data
    3. Load the holiday data
    4. Load the weather data
  2. Data visualization: KMeans clustering + Matplotlib

  3. Feature engineering

    1. Time features
    2. Distance features
    3. Weather features
    4. Area congestion features
    5. One-hot encoding / get_dummies
  4. Run the model, XGBoost (eXtreme Gradient Boosting)

    1. One-hot encode the categorical data
    2. Split into training, validation, and test sets
    3. Set parameters and evaluate the score
    4. Write out the results and submit to Kaggle

Data Loading

  %matplotlib inline
  import numpy as np
  import pandas as pd
  from datetime import datetime, date
  from sklearn.model_selection import train_test_split
  import xgboost as xgb
  from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge
  from sklearn.cluster import MiniBatchKMeans
  from sklearn.metrics import mean_squared_error
  from math import radians, cos, sin, asin, sqrt
  import seaborn as sns
  import matplotlib.pyplot as plt
  plt.rcParams['figure.figsize'] = [60, 30]
  from sklearn.cluster import KMeans
  from sklearn.model_selection import GridSearchCV
  from sklearn.model_selection import cross_val_score, KFold

xgboost is not included in the Anaconda distribution; install it separately (e.g. pip install xgboost, or conda install -c conda-forge xgboost).

  train = pd.read_csv('./kaggle纽约出租车原始文件/train.csv', parse_dates = ['pickup_datetime'])
  test = pd.read_csv('./kaggle纽约出租车原始文件/test.csv', parse_dates = ['pickup_datetime'])

  train.head()
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435
  print('train shape:', train.shape)
  print('test shape:', test.shape)

  train shape: (1458644, 11)
  test shape: (625134, 9)

The test set has two fewer columns than the training set:

  print([i for i in train.columns if i not in test.columns])

  ['dropoff_datetime', 'trip_duration']

Holiday

  holiday = pd.read_csv('./kaggle纽约出租车原始文件/NYC_2016Holidays.csv', sep = ';')
  holiday.head()
Day Date Holiday
0 Friday January 01 New Years Day
1 Monday January 18 Martin Luther King Jr. Day
2 Friday February 12 Lincoln’s Birthday
3 Monday February 15 Presidents’ Day
4 Sunday May 08 Mother’s Day
  holiday['Date'] = holiday['Date'].apply(lambda x : x + ' 2016')
  holiday.head()
Day Date Holiday
0 Friday January 01 2016 New Years Day
1 Monday January 18 2016 Martin Luther King Jr. Day
2 Friday February 12 2016 Lincoln’s Birthday
3 Monday February 15 2016 Presidents’ Day
4 Sunday May 08 2016 Mother’s Day

The difference between .apply(), .applymap(), and .map()

  1. .apply() applies a function along a column or row
  2. .applymap() applies a function to every element of a DataFrame
  3. .map() applies a function to every element of a Series
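
A minimal illustration (note: in recent pandas versions .applymap() has been renamed to DataFrame.map(), but the behavior is the same):

  import pandas as pd

  df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
  df.apply(sum)                  # per column: a -> 3, b -> 7
  df.apply(sum, axis = 1)        # per row: 4, 6
  df.applymap(lambda x: x * 10)  # every element of the DataFrame
  df['a'].map(lambda x: x * 10)  # every element of a Series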
  holidays = [datetime.strptime(holiday.loc[i, 'Date'], '%B %d %Y').date() for i in range(len(holiday))]
  holidays
  [datetime.date(2016, 1, 1),
   datetime.date(2016, 1, 18),
   datetime.date(2016, 2, 12),
   datetime.date(2016, 2, 15),
   datetime.date(2016, 5, 8),
   datetime.date(2016, 5, 30),
   datetime.date(2016, 6, 19),
   datetime.date(2016, 7, 4),
   datetime.date(2016, 9, 5),
   datetime.date(2016, 10, 10),
   datetime.date(2016, 11, 11),
   datetime.date(2016, 11, 24),
   datetime.date(2016, 12, 26),
   datetime.date(2016, 7, 4),
   datetime.date(2016, 11, 8)]
  1. datetime.strptime(string, format) parses a time string according to the given format and returns a datetime object
    1. string — the time string
    2. format — the format string
  2. .loc[] — label-based indexing
  3. .date() drops the hour/minute/second information, leaving only the date
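
For example, parsing the first holiday row (%B = full month name, %d = day of month, %Y = four-digit year):

  from datetime import datetime
  datetime.strptime('January 01 2016', '%B %d %Y').date()
  # datetime.date(2016, 1, 1)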

Routes from the Open Source Routing Machine (OSRM)

Training set preprocessing

  fastrout1 = pd.read_csv('./kaggle纽约出租车原始文件/fastest_routes_train_part_1.csv',
                          usecols = ['id', 'total_distance', 'total_travel_time', 'number_of_steps', 'step_direction'])
  fastrout2 = pd.read_csv('./kaggle纽约出租车原始文件/fastest_routes_train_part_2.csv',
                          usecols = ['id', 'total_distance', 'total_travel_time', 'number_of_steps', 'step_direction'])
  fastrout = pd.concat((fastrout1, fastrout2))
  fastrout.head()
id total_distance total_travel_time number_of_steps step_direction
0 id2875421 2009.1 164.9 5 left|straight|right|straight|arrive
1 id2377394 2513.2 332.0 6 none|right|left|right|left|arrive
2 id3504673 1779.4 235.8 4 left|left|right|arrive
3 id2181028 1614.9 140.1 5 right|left|right|left|arrive
4 id0801584 1393.5 189.4 5 right|right|right|left|arrive

More turns generally mean a longer trip, partly because each turn may involve waiting at a light.

Most of the United States allows a right turn on red; New York City is a notable exception.
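
The subtraction in the code below is needed because str.count('left') also matches the 'left' inside 'slight left'; a quick check on a made-up direction string:

  s = 'left|slight left|right|arrive'
  s.count('left')                            # 2, because 'slight left' contains 'left'
  s.count('left') - s.count('slight left')   # 1 true left turn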

  left_turn = []
  right_turn = []
  left_turn += list(map(lambda x : x.count('left') - x.count('slight left'), fastrout.step_direction))
  right_turn += list(map(lambda x : x.count('right') - x.count('slight right'), fastrout.step_direction))
  osrm_data = fastrout[['id', 'total_distance', 'total_travel_time', 'number_of_steps']]
  osrm_data = osrm_data.assign(left_steps = left_turn)
  osrm_data = osrm_data.assign(right_steps = right_turn)
  osrm_data.head()
id total_distance total_travel_time number_of_steps left_steps right_steps
0 id2875421 2009.1 164.9 5 1 1
1 id2377394 2513.2 332.0 6 2 2
2 id3504673 1779.4 235.8 4 2 1
3 id2181028 1614.9 140.1 5 2 2
4 id0801584 1393.5 189.4 5 1 3

Test set preprocessing

  left_turn_test = []
  right_turn_test = []
  fastrout_test = pd.read_csv('./kaggle纽约出租车原始文件/fastest_routes_test.csv')
  left_turn_test += list(map(lambda x : x.count('left') - x.count('slight left'), fastrout_test.step_direction))
  right_turn_test += list(map(lambda x : x.count('right') - x.count('slight right'), fastrout_test.step_direction))
  osrm_test = fastrout_test[['id', 'total_distance', 'total_travel_time', 'number_of_steps']]
  osrm_test = osrm_test.assign(left_steps = left_turn_test)
  osrm_test = osrm_test.assign(right_steps = right_turn_test)
  osrm_test.head()
id total_distance total_travel_time number_of_steps left_steps right_steps
0 id0771704 1497.1 200.2 7 3 2
1 id3274209 1427.1 141.5 2 0 0
2 id2756455 2312.3 324.6 9 4 4
3 id3684027 931.8 84.2 4 1 2
4 id3101285 2501.7 294.7 8 3 3

Weather

  weather = pd.read_csv('./kaggle纽约出租车原始文件/KNYC_Metars.csv', parse_dates = ['Time'])
  weather.head()
Time Temp. Windchill Heat Index Humidity Pressure Dew Point Visibility Wind Dir Wind Speed Gust Speed Precip Events Conditions
0 2015-12-31 02:00:00 7.8 7.1 NaN 0.89 1017.0 6.1 8.0 NNE 5.6 0.0 0.8 None Overcast
1 2015-12-31 03:00:00 7.2 5.9 NaN 0.90 1016.5 5.6 12.9 Variable 7.4 0.0 0.3 None Overcast
2 2015-12-31 04:00:00 7.2 NaN NaN 0.90 1016.7 5.6 12.9 Calm 0.0 0.0 0.0 None Overcast
3 2015-12-31 05:00:00 7.2 5.9 NaN 0.86 1015.9 5.0 14.5 NW 7.4 0.0 0.0 None Overcast
4 2015-12-31 06:00:00 7.2 6.4 NaN 0.90 1016.2 5.6 11.3 West 5.6 0.0 0.0 None Overcast

Visualization

  longitude = list(train.pickup_longitude) + list(train.dropoff_longitude)
  latitude = list(train.pickup_latitude) + list(train.dropoff_latitude)
  print(len(train.pickup_longitude), len(train.dropoff_longitude), len(longitude))
  print(len(train.pickup_latitude), len(train.dropoff_latitude), len(latitude))
  loc_df = pd.DataFrame()
  loc_df['longitude'] = longitude
  loc_df['latitude'] = latitude
  loc_df.head()

  1458644 1458644 2917288
  1458644 1458644 2917288
longitude latitude
0 -73.982155 40.767937
1 -73.980415 40.738564
2 -73.979027 40.763939
3 -74.010040 40.719971
4 -73.973053 40.793209
  xlim = [-74.03, -73.77]
  ylim = [40.63, 40.85]
  print(loc_df.shape)
  loc_df = loc_df[(loc_df.longitude > xlim[0]) & (loc_df.longitude < xlim[1])]
  loc_df = loc_df[(loc_df.latitude > ylim[0]) & (loc_df.latitude < ylim[1])]
  print(loc_df.shape)

  (2917288, 2)
  (2895748, 2)
  kmeans = KMeans(n_clusters = 15, random_state = 2, n_init = 10)
  kmeans.fit(loc_df)
  loc_df['label'] = kmeans.labels_
  loc_df.head()
longitude latitude label
0 -73.982155 40.767937 12
1 -73.980415 40.738564 7
2 -73.979027 40.763939 12
3 -74.010040 40.719971 5
4 -73.973053 40.793209 9
  plt.figure(figsize = (10, 10))
  for label in loc_df.label.unique():
      plt.plot(loc_df.longitude[loc_df.label == label], loc_df.latitude[loc_df.label == label],
               '.', markersize = 0.3, alpha = 0.3)
  plt.title('Clusters of New York')

  Text(0.5, 1.0, 'Clusters of New York')

[Figure output_18_0.png: 'Clusters of New York' — pickup/dropoff locations colored by KMeans cluster]

Feature Engineering

Time features: date features, including holidays

  print('train shape:', train.shape)
  print('test shape:', test.shape)

  train shape: (1458644, 11)
  test shape: (625134, 9)

Break the pickup timestamp into components for further categorical processing.

  for df in (train, test):
      df['year'] = df['pickup_datetime'].dt.year
      df['month'] = df['pickup_datetime'].dt.month
      df['day'] = df['pickup_datetime'].dt.day
      df['hour'] = df['pickup_datetime'].dt.hour
      df['minute'] = df['pickup_datetime'].dt.minute
      df['store_and_fwd_flag'] = 1 * (df.store_and_fwd_flag.values == 'Y')

  train = train.assign(log_trip_duration = np.log(train.trip_duration + 1))
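
Predicting log(trip_duration + 1) and scoring it with RMSE is exactly the competition's RMSLE metric on the raw durations:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(\hat{y}_i + 1) - \log(y_i + 1)\bigr)^2}$$

so minimizing RMSE on log_trip_duration minimizes RMSLE on trip_duration; this is why the xgboost log below reports rmse while the final print line labels it RMSLE.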

Separate rest days (weekends and holidays) from workdays

  def restday(year, month, day, holidays):
      is_rest = [None] * len(year)
      is_weekend = [None] * len(year)
      i = 0
      for yy, mm, dd in zip(year, month, day):
          is_weekend[i] = date(yy, mm, dd).isoweekday() in (6, 7)
          is_rest[i] = is_weekend[i] or date(yy, mm, dd) in holidays
          i += 1
      return is_rest, is_weekend
  1. date.weekday(): returns 0 for Monday, 1 for Tuesday, and so on
  2. date.isoweekday(): returns 1 for Monday, 2 for Tuesday, and so on
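
A quick check on the first pickup date in train.head(), 2016-03-14, which is a Monday:

  from datetime import date
  d = date(2016, 3, 14)
  d.weekday()                # 0
  d.isoweekday()             # 1
  d.isoweekday() in (6, 7)   # False: not a weekend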
  rest_day, weekend = restday(train.year, train.month, train.day, holidays)  # pass the list of dates, not the holiday DataFrame
  train['rest_day'] = rest_day
  train['weekend'] = weekend
  rest_day, weekend = restday(test.year, test.month, test.day, holidays)
  test['rest_day'] = rest_day
  test['weekend'] = weekend
  # compact time-of-day representation
  train['pickup_time'] = train.hour + train.minute/60
  test['pickup_time'] = test.hour + test.minute/60

Split hours into rush hour, daytime, and night

  # note: & binds tighter than |, so the (hour < 7) | (hour > 18) groups need their own parentheses
  for df in (train, test):
      df['hr_categori'] = np.nan
      df.loc[(df.rest_day == False) & (df.hour <= 9) & (df.hour >= 7), 'hr_categori'] = 'rush'
      df.loc[(df.rest_day == False) & (df.hour <= 18) & (df.hour >= 16), 'hr_categori'] = 'rush'
      df.loc[(df.rest_day == False) & (df.hour < 16) & (df.hour > 9), 'hr_categori'] = 'day'
      df.loc[(df.rest_day == False) & ((df.hour < 7) | (df.hour > 18)), 'hr_categori'] = 'night'
      df.loc[(df.rest_day == True) & (df.hour < 18) & (df.hour > 7), 'hr_categori'] = 'day'
      df.loc[(df.rest_day == True) & ((df.hour <= 7) | (df.hour >= 18)), 'hr_categori'] = 'night'
  print('train shape:', train.shape)
  print('test shape:', test.shape)

  train shape: (1458644, 21)
  test shape: (625134, 18)

  print([i for i in train.columns if i not in test.columns])

  ['dropoff_datetime', 'trip_duration', 'log_trip_duration']

Distance Features

  train_join = train.join(osrm_data.set_index('id'), on = 'id')
  test_join = test.join(osrm_test.set_index('id'), on = 'id')
  print('train_join shape:', train_join.shape)
  print('test_join shape:', test_join.shape)

  train_join shape: (1458644, 26)
  test_join shape: (625134, 23)
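
The earlier imports (radians, cos, sin, asin, sqrt) suggest a great-circle distance; a minimal haversine sketch that could add a straight-line distance feature (the helper name and the commented-out column are my own illustration, not part of the code above):

  def haversine(lon1, lat1, lon2, lat2):
      # great-circle distance in km between two (lon, lat) points
      lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
      dlon, dlat = lon2 - lon1, lat2 - lat1
      a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
      return 2 * 6371 * asin(sqrt(a))

  # hypothetical extra feature:
  # train_join['haversine_km'] = [haversine(a, b, c, d) for a, b, c, d in zip(
  #     train_join.pickup_longitude, train_join.pickup_latitude,
  #     train_join.dropoff_longitude, train_join.dropoff_latitude)]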

Weather Features

  weather['snow'] = 1 * (weather.Events == 'Snow') + 1 * (weather.Events == 'Fog\n\t,\nSnow')
  weather['year'] = weather['Time'].dt.year
  weather['month'] = weather['Time'].dt.month
  weather['day'] = weather['Time'].dt.day
  weather['hour'] = weather['Time'].dt.hour
  weather = weather[weather.year == 2016][['month', 'day', 'hour', 'Temp.', 'Precip', 'snow', 'Visibility']]
  print(weather.shape)
  weather.head()

  (8739, 7)
month day hour Temp. Precip snow Visibility
22 1 1 0 5.6 0.0 0 16.1
23 1 1 1 5.6 0.0 0 16.1
24 1 1 2 5.6 0.0 0 16.1
25 1 1 3 5.0 0.0 0 16.1
26 1 1 4 5.0 0.0 0 16.1
  train = pd.merge(train_join, weather, on = ['month', 'day', 'hour'], how = 'left')
  test = pd.merge(test_join, weather, on = ['month', 'day', 'hour'], how = 'left')
  print('train shape:', train.shape)
  print('test shape:', test.shape)

  train shape: (1458644, 30)
  test shape: (625134, 27)

Cluster and find speed

  coords = np.vstack((train[['pickup_longitude', 'pickup_latitude']].values,
                      train[['dropoff_longitude', 'dropoff_latitude']].values,
                      test[['pickup_longitude', 'pickup_latitude']].values,
                      test[['dropoff_longitude', 'dropoff_latitude']].values))
  coords

  array([[-73.98215485, 40.76793671],
         [-73.98041534, 40.73856354],
         [-73.97902679, 40.7639389 ],
         ...,
         [-73.87660217, 40.74866486],
         [-73.85426331, 40.89178848],
         [-73.96932983, 40.76937866]])
  kmeans = MiniBatchKMeans(n_clusters = 8, batch_size = 10000).fit(coords)
  train.loc[:, 'pickup_cluster'] = kmeans.predict(train[['pickup_longitude', 'pickup_latitude']])
  train.loc[:, 'dropoff_cluster'] = kmeans.predict(train[['dropoff_longitude', 'dropoff_latitude']])
  test.loc[:, 'pickup_cluster'] = kmeans.predict(test[['pickup_longitude', 'pickup_latitude']])
  test.loc[:, 'dropoff_cluster'] = kmeans.predict(test[['dropoff_longitude', 'dropoff_latitude']])
  train[['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'pickup_cluster', 'dropoff_cluster']].head()
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude pickup_cluster dropoff_cluster
0 -73.982155 40.767937 -73.964630 40.765602 2 5
1 -73.980415 40.738564 -73.999481 40.731152 4 4
2 -73.979027 40.763939 -74.005333 40.710087 2 6
3 -74.010040 40.719971 -74.012268 40.706718 6 6
4 -73.973053 40.793209 -73.972923 40.782520 5 5
  plt.figure(figsize = (10, 10))
  for label in np.unique(kmeans.labels_):
      # note: loc_df.label still holds the earlier 15-cluster KMeans labels, so only labels 0-7 are drawn here
      plt.plot(loc_df.longitude[loc_df.label == label], loc_df.latitude[loc_df.label == label],
               '.', markersize = 0.3, alpha = 0.3)

[Figure output_52_0.png: pickup/dropoff locations colored by cluster label]

  print('train shape:', train.shape)
  print('test shape:', test.shape)

  train shape: (1458644, 32)
  test shape: (625134, 29)

Count features

  a = pd.concat([train, test]).groupby(['pickup_cluster']).size().reset_index()
  b = pd.concat([train, test]).groupby(['dropoff_cluster']).size().reset_index()

  C:\anaconda\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
  of pandas will change to not sort by default.
  To accept the future behavior, pass 'sort=False'.
  To retain the current behavior and silence the warning, pass 'sort=True'.
  """Entry point for launching an IPython kernel.
  C:\anaconda\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
  of pandas will change to not sort by default.
  To accept the future behavior, pass 'sort=False'.
  To retain the current behavior and silence the warning, pass 'sort=True'.
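
As the warning itself suggests, passing sort explicitly silences it:

  a = pd.concat([train, test], sort = False).groupby(['pickup_cluster']).size().reset_index()
  b = pd.concat([train, test], sort = False).groupby(['dropoff_cluster']).size().reset_index()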
  train = pd.merge(train, a, on = 'pickup_cluster', how = 'left')
  train = pd.merge(train, b, on = 'dropoff_cluster', how = 'left')
  test = pd.merge(test, a, on = 'pickup_cluster', how = 'left')
  test = pd.merge(test, b, on = 'dropoff_cluster', how = 'left')

  train['speed'] = train['total_distance'] / train['trip_duration']
  train[['speed', 'total_distance', 'trip_duration']].head()
speed total_distance trip_duration
0 4.415604 2009.1 455
1 3.790649 2513.2 663
2 5.207533 11060.8 2124
3 4.147786 1779.4 429
4 3.712414 1614.9 435
  pickup_speed = train[['speed', 'pickup_cluster']].groupby('pickup_cluster').mean().reset_index()
  pickup_speed = pickup_speed.rename(columns = {'speed' : 'ave_pickup_speed'})
  dropoff_speed = train[['speed', 'dropoff_cluster']].groupby('dropoff_cluster').mean().reset_index()
  dropoff_speed = dropoff_speed.rename(columns = {'speed' : 'ave_dropoff_speed'})

  print(pickup_speed)
  print(dropoff_speed)
     pickup_cluster  ave_pickup_speed
  0               0          5.093225
  1               1         11.282694
  2               2          5.072040
  3               3          6.075631
  4               4          4.826975
  5               5          5.393436
  6               6          5.450792
  7               7          8.413295
     dropoff_cluster  ave_dropoff_speed
  0                0           4.738659
  1                1          12.016480
  2                2           4.788073
  3                3           7.571144
  4                4           4.717927
  5                5           5.437323
  6                6           5.675050
  7                7           8.394009
  train = pd.merge(train, pickup_speed, on = 'pickup_cluster', how = 'left')
  train = pd.merge(train, dropoff_speed, on = 'dropoff_cluster', how = 'left')
  test = pd.merge(test, pickup_speed, on = 'pickup_cluster', how = 'left')
  test = pd.merge(test, dropoff_speed, on = 'dropoff_cluster', how = 'left')

  train = train.drop('speed', axis = 1)

  print('train shape:', train.shape)
  print('test shape:', test.shape)

  train shape: (1458644, 36)
  test shape: (625134, 33)

  print([i for i in train.columns if i not in test.columns])

  ['dropoff_datetime', 'trip_duration', 'log_trip_duration']

Dummy variables: one-hot encode all categorical features

Categorical features are stored as integers, but the ordinal relationships among those integers would influence the prediction in ways we don't want.

  vendor_train = pd.get_dummies(train['vendor_id'], prefix = 'vi', prefix_sep = '_')
  pickup_cluster_train = pd.get_dummies(train['pickup_cluster'], prefix = 'p', prefix_sep = '_')
  dropoff_cluster_train = pd.get_dummies(train['dropoff_cluster'], prefix = 'd', prefix_sep = '_')
  store_fwd_flag_train = pd.get_dummies(train['store_and_fwd_flag'], prefix = 's', prefix_sep = '_')
  vendor_test = pd.get_dummies(test['vendor_id'], prefix = 'vi', prefix_sep = '_')
  pickup_cluster_test = pd.get_dummies(test['pickup_cluster'], prefix = 'p', prefix_sep = '_')
  dropoff_cluster_test = pd.get_dummies(test['dropoff_cluster'], prefix = 'd', prefix_sep = '_')
  store_fwd_flag_test = pd.get_dummies(test['store_and_fwd_flag'], prefix = 's', prefix_sep = '_')
  train['dayofweek'] = train['pickup_datetime'].dt.dayofweek
  test['dayofweek'] = test['pickup_datetime'].dt.dayofweek
  month_train = pd.get_dummies(train['month'], prefix = 'm', prefix_sep = '_')
  dam_train = pd.get_dummies(train['day'], prefix = 'dam', prefix_sep = '_')
  daw_train = pd.get_dummies(train['dayofweek'], prefix = 'daw', prefix_sep = '_')
  hr_train = pd.get_dummies(train['hour'], prefix = 'h', prefix_sep = '_')
  hr_cate_train = pd.get_dummies(train['hr_categori'], prefix = 'hc', prefix_sep = '_')
  month_test = pd.get_dummies(test['month'], prefix = 'm', prefix_sep = '_')
  dam_test = pd.get_dummies(test['day'], prefix = 'dam', prefix_sep = '_')
  daw_test = pd.get_dummies(test['dayofweek'], prefix = 'daw', prefix_sep = '_')
  hr_test = pd.get_dummies(test['hour'], prefix = 'h', prefix_sep = '_')
  hr_cate_test = pd.get_dummies(test['hr_categori'], prefix = 'hc', prefix_sep = '_')
  month_train.head()
m_1 m_2 m_3 m_4 m_5 m_6
0 0 0 1 0 0 0
1 0 0 0 0 0 1
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 0 0 1 0 0 0
  train = train.drop(['id', 'vendor_id', 'pickup_cluster', 'dropoff_cluster',
                      'store_and_fwd_flag', 'month', 'day', 'dayofweek', 'hour', 'hr_categori', 'dropoff_datetime', 'trip_duration'], axis = 1)
  Test_id = test['id']
  test = test.drop(['id', 'vendor_id', 'pickup_cluster', 'dropoff_cluster',
                    'store_and_fwd_flag', 'month', 'day', 'dayofweek', 'hour', 'hr_categori'], axis = 1)

Modeling

  1. Train_Master and Test_Master are the final modeling tables; predictions on Test_Master are uploaded to Kaggle
  2. Train_Master is built from the train dataset
  3. Test_Master is built from the test dataset
  Train_Master = pd.concat([train, vendor_train, pickup_cluster_train, dropoff_cluster_train, store_fwd_flag_train,
                            month_train, dam_train, daw_train, hr_train, hr_cate_train], axis = 1)
  Test_Master = pd.concat([test, vendor_test, pickup_cluster_test, dropoff_cluster_test, store_fwd_flag_test,
                           month_test, dam_test, daw_test, hr_test, hr_cate_test], axis = 1)

Split Train_Master into Train and Test (an internal validation split); Test_Master is not involved.

  Train_Master = Train_Master.drop('pickup_datetime', axis = 1)
  Test_Master = Test_Master.drop('pickup_datetime', axis = 1)
  Train, Test = train_test_split(Train_Master, test_size = 0.01)
  X_Train = Train.drop('log_trip_duration', axis = 1)
  Y_Train = Train['log_trip_duration']
  X_Test = Test.drop('log_trip_duration', axis = 1)
  Y_Test = Test['log_trip_duration']
  Y_Train = Y_Train.reset_index().drop('index', axis = 1)
  Y_Test = Y_Test.reset_index().drop('index', axis = 1)

Use xgboost's own DMatrix data structure, which is faster.

  dtrain = xgb.DMatrix(X_Train, label = Y_Train)
  dvalid = xgb.DMatrix(X_Test, label = Y_Test)
  dtest = xgb.DMatrix(Test_Master)
  watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
  # ideally the model performs equally well on both sets in the watchlist
  xgb_pars = {'objective' : 'reg:linear',
              'learning_rate' : 0.05,
              'max_depth' : 7,
              'subsample' : 0.8,
              'colsample_bytree' : 0.7,
              'colsample_bylevel' : 0.7,
              'silent' : 1,
              'reg_alpha' : 1}
  model = xgb.train(xgb_pars, dtrain, 100, watchlist, early_stopping_rounds = 5,
                    maximize = False, verbose_eval = 1)
  print('Modeling RMSLE %.5f' % model.best_score)
  [0] train-rmse:5.72071 valid-rmse:5.71359
  Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.
  Will train until valid-rmse hasn't improved in 5 rounds.
  [1] train-rmse:5.43652 valid-rmse:5.42959
  [2] train-rmse:5.16664 valid-rmse:5.15993
  [3] train-rmse:4.91048 valid-rmse:4.90398
  [4] train-rmse:4.66798 valid-rmse:4.66166
  [5] train-rmse:4.43687 valid-rmse:4.4307
  [6] train-rmse:4.21742 valid-rmse:4.21139
  [7] train-rmse:4.00899 valid-rmse:4.0031
  [8] train-rmse:3.81135 valid-rmse:3.80559
  [9] train-rmse:3.62356 valid-rmse:3.61793
  [10] train-rmse:3.44519 valid-rmse:3.43965
  [11] train-rmse:3.27661 valid-rmse:3.27129
  [12] train-rmse:3.11585 valid-rmse:3.11064
  [13] train-rmse:2.96326 valid-rmse:2.95812
  [14] train-rmse:2.81842 valid-rmse:2.8134
  [15] train-rmse:2.68097 valid-rmse:2.67604
  [16] train-rmse:2.55061 valid-rmse:2.54573
  [17] train-rmse:2.42696 valid-rmse:2.42213
  [18] train-rmse:2.30962 valid-rmse:2.30485
  [19] train-rmse:2.1983 valid-rmse:2.19364
  [20] train-rmse:2.09279 valid-rmse:2.08819
  [21] train-rmse:1.99277 valid-rmse:1.98825
  [22] train-rmse:1.89802 valid-rmse:1.89356
  [23] train-rmse:1.80818 valid-rmse:1.80382
  [24] train-rmse:1.72373 valid-rmse:1.71942
  [25] train-rmse:1.64307 valid-rmse:1.63881
  [26] train-rmse:1.56664 valid-rmse:1.56242
  [27] train-rmse:1.49434 valid-rmse:1.49023
  [28] train-rmse:1.42606 valid-rmse:1.42196
  [29] train-rmse:1.36146 valid-rmse:1.35739
  [30] train-rmse:1.30082 valid-rmse:1.29687
  [31] train-rmse:1.24305 valid-rmse:1.23917
  [32] train-rmse:1.18844 valid-rmse:1.18463
  [33] train-rmse:1.13687 valid-rmse:1.13312
  [34] train-rmse:1.08825 valid-rmse:1.08456
  [35] train-rmse:1.04279 valid-rmse:1.03918
  [36] train-rmse:0.999537 valid-rmse:0.995962
  [37] train-rmse:0.958848 valid-rmse:0.955314
  [38] train-rmse:0.920506 valid-rmse:0.917043
  [39] train-rmse:0.884371 valid-rmse:0.880961
  [40] train-rmse:0.850502 valid-rmse:0.847148
  [41] train-rmse:0.818737 valid-rmse:0.815504
  [42] train-rmse:0.788936 valid-rmse:0.785738
  [43] train-rmse:0.760988 valid-rmse:0.757844
  [44] train-rmse:0.73487 valid-rmse:0.731798
  [45] train-rmse:0.710508 valid-rmse:0.7075
  [46] train-rmse:0.687611 valid-rmse:0.684679
  [47] train-rmse:0.666258 valid-rmse:0.663418
  [48] train-rmse:0.646386 valid-rmse:0.643666
  [49] train-rmse:0.627901 valid-rmse:0.625226
  [50] train-rmse:0.610705 valid-rmse:0.608164
  [51] train-rmse:0.594841 valid-rmse:0.592374
  [52] train-rmse:0.580073 valid-rmse:0.577685
  [53] train-rmse:0.566276 valid-rmse:0.563964
  [54] train-rmse:0.553586 valid-rmse:0.551364
  [55] train-rmse:0.541802 valid-rmse:0.539665
  [56] train-rmse:0.530938 valid-rmse:0.528837
  [57] train-rmse:0.520891 valid-rmse:0.518795
  [58] train-rmse:0.511724 valid-rmse:0.509689
  [59] train-rmse:0.503191 valid-rmse:0.501274
  [60] train-rmse:0.495394 valid-rmse:0.493556
  [61] train-rmse:0.488261 valid-rmse:0.486498
  [62] train-rmse:0.481713 valid-rmse:0.480063
  [63] train-rmse:0.475621 valid-rmse:0.474073
  [64] train-rmse:0.470123 valid-rmse:0.468653
  [65] train-rmse:0.465024 valid-rmse:0.463596
  [66] train-rmse:0.460254 valid-rmse:0.458901
  [67] train-rmse:0.455962 valid-rmse:0.4547
  [68] train-rmse:0.452054 valid-rmse:0.450878
  [69] train-rmse:0.448391 valid-rmse:0.44725
  [70] train-rmse:0.445074 valid-rmse:0.443996
  [71] train-rmse:0.442037 valid-rmse:0.441076
  [72] train-rmse:0.439282 valid-rmse:0.438398
  [73] train-rmse:0.436759 valid-rmse:0.435955
  [74] train-rmse:0.434434 valid-rmse:0.433737
  [75] train-rmse:0.43231 valid-rmse:0.43164
  [76] train-rmse:0.430316 valid-rmse:0.429741
  [77] train-rmse:0.428579 valid-rmse:0.428089
  [78] train-rmse:0.426804 valid-rmse:0.42643
  [79] train-rmse:0.425351 valid-rmse:0.425114
  [80] train-rmse:0.424028 valid-rmse:0.423859
  [81] train-rmse:0.422741 valid-rmse:0.422651
  [82] train-rmse:0.421421 valid-rmse:0.421356
  [83] train-rmse:0.420333 valid-rmse:0.420332
  [84] train-rmse:0.419373 valid-rmse:0.419435
  [85] train-rmse:0.418452 valid-rmse:0.418557
  [86] train-rmse:0.417631 valid-rmse:0.417816
  [87] train-rmse:0.416853 valid-rmse:0.417082
  [88] train-rmse:0.416167 valid-rmse:0.416446
  [89] train-rmse:0.415507 valid-rmse:0.415831
  [90] train-rmse:0.414853 valid-rmse:0.415315
  [91] train-rmse:0.414216 valid-rmse:0.414727
  [92] train-rmse:0.413691 valid-rmse:0.41419
  [93] train-rmse:0.413165 valid-rmse:0.41368
  [94] train-rmse:0.412663 valid-rmse:0.413233
  [95] train-rmse:0.412157 valid-rmse:0.412737
  [96] train-rmse:0.411703 valid-rmse:0.412313
  [97] train-rmse:0.411311 valid-rmse:0.411921
  [98] train-rmse:0.410933 valid-rmse:0.411623
  [99] train-rmse:0.410626 valid-rmse:0.411271
  Modeling RMSLE 0.41127
  ax = xgb.plot_importance(model, max_num_features = 70, height = 0.9)
  ax.figure.set_size_inches(20, 30)

[Figure output_71_1.png: xgboost feature importance plot]

How do you enlarge the xgboost importance plot in Jupyter?

  fig, ax = plt.subplots(figsize = (10, 8))
  xgb.plot_importance(model, ax = ax)

Alternatively, let the plotting function create the figure, then change its size:

  ax = xgb.plot_importance(model)
  ax.figure.set_size_inches(10, 8)
  pred = model.predict(dtest)
  pred = np.exp(pred) - 1  # invert log(trip_duration + 1)
  submission = pd.concat([Test_id, pd.DataFrame(pred)], axis = 1)
  submission.columns = ['id', 'trip_duration']
  submission['trip_duration'] = submission.apply(lambda x : 1 if (x['trip_duration'] <= 0) else x['trip_duration'], axis = 1)
  submission.to_csv('submission.csv', index = False)
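
The row-wise apply above works but is slow for 600k+ rows; an exactly equivalent vectorized form would be:

  submission['trip_duration'] = np.where(submission['trip_duration'] <= 0, 1, submission['trip_duration'])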