数据集简介
- 纽约出租车协会提供了两百万超出租车行车记录
- 根据载客情况,准确回归模型预测毎一单出车用时
- 预测纽约出租车用时,有助于在行程开始前对车费的预测(有意思的是,滴滴是在行车结束后计算车费的)
分析步骤
数据读取
- 读取训练集和测试集
- 读取行程数据
- 读取节假日数据
- 读取天气数据
数据可视化 KMeans Clustering + Matplotlib
特征工程
- 时间特征
- 距离特征
- 天气特征
- 区域拥挤特征
- One hot encode/get dummies
运行模型 XGBoost: eXtreme Gradient Boosting
- One-Hot-Enconde类别型数据
- 区分 训练集,验证集和测试集
- 设置参数,评估得分
- 写出结果,上传提交
数据读取
%matplotlib inline
import numpy as np
import pandas as pd
from datetime import datetime, date
from sklearn .model_selection import train_test_split
import xgboost as xgb
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import mean_squared_error
from math import radians, cos, sin, asin, sqrt
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [60, 30]
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, KFold
xgboost没有被包含在Anaconda的环境中,另行下载安装
train = pd.read_csv('./kaggle纽约出租车原始文件/train.csv', parse_dates = ['pickup_datetime'])
test = pd.read_csv('./kaggle纽约出租车原始文件/test.csv', parse_dates = ['pickup_datetime'])
train.head()
id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.964630 | 40.765602 | N | 455 |
1 | id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |
2 | id3858529 | 2 | 2016-01-19 11:35:24 | 2016-01-19 12:10:48 | 1 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | N | 2124 |
3 | id3504673 | 2 | 2016-04-06 19:32:31 | 2016-04-06 19:39:40 | 1 | -74.010040 | 40.719971 | -74.012268 | 40.706718 | N | 429 |
4 | id2181028 | 2 | 2016-03-26 13:30:55 | 2016-03-26 13:38:10 | 1 | -73.973053 | 40.793209 | -73.972923 | 40.782520 | N | 435 |
print('train shape:', train.shape)
print('test shape:', test.shape)
train shape: (1458644, 11)
test shape: (625134, 9)
测试集的特征比训练集少两个
print([i for i in train.columns if i not in test.columns])
['dropoff_datetime', 'trip_duration']
Holiday
holiday = pd.read_csv('./kaggle纽约出租车原始文件/NYC_2016Holidays.csv', sep = ';')
holiday.head()
Day | Date | Holiday | |
---|---|---|---|
0 | Friday | January 01 | New Years Day |
1 | Monday | January 18 | Martin Luther King Jr. Day |
2 | Friday | February 12 | Lincoln’s Birthday |
3 | Monday | February 15 | Presidents’ Day |
4 | Sunday | May 08 | Mother’s Day |
holiday['Date'] = holiday['Date'].apply(lambda x : x + ' 2016')
holiday.head()
Day | Date | Holiday | |
---|---|---|---|
0 | Friday | January 01 2016 | New Years Day |
1 | Monday | January 18 2016 | Martin Luther King Jr. Day |
2 | Friday | February 12 2016 | Lincoln’s Birthday |
3 | Monday | February 15 2016 | Presidents’ Day |
4 | Sunday | May 08 2016 | Mother’s Day |
.apply()、.map()和.applymap()的区别
- .apply()让函数作用于列或者行
- .applymap()让函数作用于DataFrame每一个元素
- .map()让函数作用于Series每一个元素
holidays = [datetime.strptime(holiday.loc[i, 'Date'], '%B %d %Y').date() for i in range(len(holiday))]
holidays
[datetime.date(2016, 1, 1),
datetime.date(2016, 1, 18),
datetime.date(2016, 2, 12),
datetime.date(2016, 2, 15),
datetime.date(2016, 5, 8),
datetime.date(2016, 5, 30),
datetime.date(2016, 6, 19),
datetime.date(2016, 7, 4),
datetime.date(2016, 9, 5),
datetime.date(2016, 10, 10),
datetime.date(2016, 11, 11),
datetime.date(2016, 11, 24),
datetime.date(2016, 12, 26),
datetime.date(2016, 7, 4),
datetime.date(2016, 11, 8)]
- time.strptime(string[, format])根据指定的格式把一个时间字符串解析为时间元组,返回struct_time对象
- string — 时间字符串
- format — 格式化字符串
- .local[]索引
- struct_time.date()返回日期,去除时分秒信息
Routes from open source routing machine
训练集数据预处理
fastrout1 = pd.read_csv('./kaggle纽约出租车原始文件/fastest_routes_train_part_1.csv',
usecols = ['id', 'total_distance', 'total_travel_time', 'number_of_steps', 'step_direction'])
fastrout2 = pd.read_csv('./kaggle纽约出租车原始文件/fastest_routes_train_part_2.csv',
usecols = ['id', 'total_distance', 'total_travel_time', 'number_of_steps', 'step_direction'])
fastrout = pd.concat((fastrout1, fastrout2))
fastrout.head()
id | total_distance | total_travel_time | number_of_steps | step_direction | |
---|---|---|---|---|---|
0 | id2875421 | 2009.1 | 164.9 | 5 | left|straight|right|straight|arrive |
1 | id2377394 | 2513.2 | 332.0 | 6 | none|right|left|right|left|arrive |
2 | id3504673 | 1779.4 | 235.8 | 4 | left|left|right|arrive |
3 | id2181028 | 1614.9 | 140.1 | 5 | right|left|right|left|arrive |
4 | id0801584 | 1393.5 | 189.4 | 5 | right|right|right|left|arrive |
转弯多,花的时间也多,因为可能还要等红灯。
美国大部分地方允许等红灯时右转,纽约除外。
left_turn = []
right_turn = []
left_turn += list(map(lambda x : x.count('left') - x.count('slight left'), fastrout.step_direction))
right_turn += list(map(lambda x : x.count('right') - x.count('slight right'), fastrout.step_direction))
osrm_data = fastrout[['id', 'total_distance', 'total_travel_time', 'number_of_steps']]
osrm_data = osrm_data.assign(left_steps = left_turn)
osrm_data = osrm_data.assign(right_steps = right_turn)
osrm_data.head()
id | total_distance | total_travel_time | number_of_steps | left_steps | right_steps | |
---|---|---|---|---|---|---|
0 | id2875421 | 2009.1 | 164.9 | 5 | 1 | 1 |
1 | id2377394 | 2513.2 | 332.0 | 6 | 2 | 2 |
2 | id3504673 | 1779.4 | 235.8 | 4 | 2 | 1 |
3 | id2181028 | 1614.9 | 140.1 | 5 | 2 | 2 |
4 | id0801584 | 1393.5 | 189.4 | 5 | 1 | 3 |
测试集数据预处理
left_turn_test = []
right_turn_test = []
fastrout_test = pd.read_csv('./kaggle纽约出租车原始文件/fastest_routes_test.csv')
left_turn_test += list(map(lambda x : x.count('left') - x.count('slight left'), fastrout_test.step_direction))
right_turn_test += list(map(lambda x : x.count('right') - x.count('slight right'), fastrout_test.step_direction))
osrm_test = fastrout_test[['id', 'total_distance', 'total_travel_time', 'number_of_steps']]
osrm_test = osrm_test.assign(left_steps = left_turn_test)
osrm_test = osrm_test.assign(right_steps = right_turn_test)
osrm_test.head()
id | total_distance | total_travel_time | number_of_steps | left_steps | right_steps | |
---|---|---|---|---|---|---|
0 | id0771704 | 1497.1 | 200.2 | 7 | 3 | 2 |
1 | id3274209 | 1427.1 | 141.5 | 2 | 0 | 0 |
2 | id2756455 | 2312.3 | 324.6 | 9 | 4 | 4 |
3 | id3684027 | 931.8 | 84.2 | 4 | 1 | 2 |
4 | id3101285 | 2501.7 | 294.7 | 8 | 3 | 3 |
Weather
weather = pd.read_csv('./kaggle纽约出租车原始文件/KNYC_Metars.csv', parse_dates = ['Time'])
weather.head()
Time | Temp. | Windchill | Heat Index | Humidity | Pressure | Dew Point | Visibility | Wind Dir | Wind Speed | Gust Speed | Precip | Events | Conditions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-12-31 02:00:00 | 7.8 | 7.1 | NaN | 0.89 | 1017.0 | 6.1 | 8.0 | NNE | 5.6 | 0.0 | 0.8 | None | Overcast |
1 | 2015-12-31 03:00:00 | 7.2 | 5.9 | NaN | 0.90 | 1016.5 | 5.6 | 12.9 | Variable | 7.4 | 0.0 | 0.3 | None | Overcast |
2 | 2015-12-31 04:00:00 | 7.2 | NaN | NaN | 0.90 | 1016.7 | 5.6 | 12.9 | Calm | 0.0 | 0.0 | 0.0 | None | Overcast |
3 | 2015-12-31 05:00:00 | 7.2 | 5.9 | NaN | 0.86 | 1015.9 | 5.0 | 14.5 | NW | 7.4 | 0.0 | 0.0 | None | Overcast |
4 | 2015-12-31 06:00:00 | 7.2 | 6.4 | NaN | 0.90 | 1016.2 | 5.6 | 11.3 | West | 5.6 | 0.0 | 0.0 | None | Overcast |
可视化 Visualization
longitude = list(train.pickup_longitude) + list(train.dropoff_longitude)
latitude = list(train.pickup_latitude) + list(train.dropoff_latitude)
print(len(train.pickup_longitude), len(train.dropoff_longitude), len(longitude))
print(len(train.pickup_latitude), len(train.dropoff_latitude), len(latitude))
loc_df = pd.DataFrame()
loc_df['longitude'] = longitude
loc_df['latitude'] = latitude
loc_df.head()
1458644 1458644 2917288
1458644 1458644 2917288
longitude | latitude | |
---|---|---|
0 | -73.982155 | 40.767937 |
1 | -73.980415 | 40.738564 |
2 | -73.979027 | 40.763939 |
3 | -74.010040 | 40.719971 |
4 | -73.973053 | 40.793209 |
xlim = [-74.03, -73.77]
ylim = [40.63, 40.85]
print(loc_df.shape)
loc_df = loc_df[(loc_df.longitude > xlim[0]) & (loc_df.longitude < xlim[1])]
loc_df = loc_df[(loc_df.latitude > ylim[0]) & (loc_df.latitude < ylim[1])]
print(loc_df.shape)
(2917288, 2)
(2895748, 2)
kmeans = KMeans(n_clusters = 15, random_state = 2, n_init = 10)
kmeans.fit(loc_df)
loc_df['label'] = kmeans.labels_
loc_df.head()
longitude | latitude | label | |
---|---|---|---|
0 | -73.982155 | 40.767937 | 12 |
1 | -73.980415 | 40.738564 | 7 |
2 | -73.979027 | 40.763939 | 12 |
3 | -74.010040 | 40.719971 | 5 |
4 | -73.973053 | 40.793209 | 9 |
plt.figure(figsize = (10, 10))
for label in loc_df.label.unique():
plt.plot(loc_df.longitude[loc_df.label == label], loc_df.latitude[loc_df.label == label],
'.', markersize = 0.3, alpha = 0.3)
plt.title('Clusters of New York')
Text(0.5, 1.0, 'Clusters of New York')
特征工程 Feature Engineering
时间特征 Data feature, including holidays
print('train shape:', train.shape)
print('test shape:', test.shape)
train shape: (1458644, 11)
test shape: (625134, 9)
细分时间信息,以便进一步分类处理
for df in (train, test):
df['year'] = df['pickup_datetime'].dt.year
df['month'] = df['pickup_datetime'].dt.month
df['day'] = df['pickup_datetime'].dt.day
df['hour'] = df['pickup_datetime'].dt.hour
df['minute'] = df['pickup_datetime'].dt.minute
df['store_and_fwd_flag'] = 1 * (df.store_and_fwd_flag.values == 'Y')
train = train.assign(log_trip_duration = np.log(train.trip_duration + 1))
分割休息日与工作日
def restday(year, month, day, holidays):
is_rest = [None] * len(year)
is_weekend = [None] * len(year)
i = 0
for yy, mm, dd in zip(year, month, day):
is_weekend[i] = date(yy, mm, dd).isoweekday() in (6, 7)
is_rest[i] = is_weekend[i] or date(yy, mm, dd) in holidays
i += 1
return is_rest, is_weekend
- date.weekday():返回weekday,如果是星期一,返回0;如果是星期2,返回1,以此类推;
- data.isoweekday():返回工作日,如果是星期一,返回1;如果是星期2,返回2,以此类推
rest_day, weekend = restday(train.year, train.month, train.day, holiday)
train['rest_day'] = rest_day
train['weekend'] = weekend
rest_day, weekend = restday(test.year, test.month, test.day, holiday)
test['rest_day'] = rest_day
test['weekend'] = weekend
# 简化时间信息
train['pickup_time'] = train.hour + train.minute/60
test['pickup_time'] = test.hour + test.minute/60
分割早晚高峰rush hour、白天day和夜晚night
for df in (train, test):
df['hr_categori'] = np.nan
df.loc[(df.rest_day == False) & (df.hour <= 9) & (df.hour >= 7), 'hr_categori'] = 'rush'
df.loc[(df.rest_day == False) & (df.hour <= 18) & (df.hour >= 16), 'hr_categori'] = 'rush'
df.loc[(df.rest_day == False) & (df.hour < 16) & (df.hour > 9), 'hr_categori'] = 'day'
df.loc[(df.rest_day == False) & (df.hour < 7) | (df.hour > 18), 'hr_categori'] = 'night'
df.loc[(df.rest_day == True) & (df.hour < 18) & (df.hour > 7), 'hr_categori'] = 'day'
df.loc[(df.rest_day == True) & (df.hour <= 7) | (df.hour >= 18), 'hr_categori'] = 'night'
print('train shape:', train.shape)
print('test shape:', test.shape)
train shape: (1458644, 21)
test shape: (625134, 18)
print([i for i in train.columns if i not in test.columns])
['dropoff_datetime', 'trip_duration', 'log_trip_duration']
距离特征 Distance feature
train_join = train.join(osrm_data.set_index('id'), on = 'id')
test_join = test.join(osrm_test.set_index('id'), on = 'id')
print('train_join shape:', train_join.shape)
print('test_join shape:', test_join.shape)
train_join shape: (1458644, 26)
test_join shape: (625134, 23)
天气特征 Weather feature
weather['snow'] = 1 * (weather.Events == 'Snow') + 1 * (weather.Events == 'Fog\n\t,\nSnow')
weather['year'] = weather['Time'].dt.year
weather['month'] = weather['Time'].dt.month
weather['day'] = weather['Time'].dt.day
weather['hour'] = weather['Time'].dt.hour
weather = weather[weather.year == 2016][['month', 'day', 'hour', 'Temp.', 'Precip', 'snow', 'Visibility']]
print(weather.shape)
weather.head()
(8739, 7)
month | day | hour | Temp. | Precip | snow | Visibility | |
---|---|---|---|---|---|---|---|
22 | 1 | 1 | 0 | 5.6 | 0.0 | 0 | 16.1 |
23 | 1 | 1 | 1 | 5.6 | 0.0 | 0 | 16.1 |
24 | 1 | 1 | 2 | 5.6 | 0.0 | 0 | 16.1 |
25 | 1 | 1 | 3 | 5.0 | 0.0 | 0 | 16.1 |
26 | 1 | 1 | 4 | 5.0 | 0.0 | 0 | 16.1 |
train = pd.merge(train_join, weather, on = ['month', 'day', 'hour'], how = 'left')
test = pd.merge(test_join, weather, on = ['month', 'day', 'hour'], how = 'left')
print('train shape:', train.shape)
print('test shape:', test.shape)
train shape: (1458644, 30)
test shape: (625134, 27)
Cluster and find speed
coords = np.vstack((train[['pickup_longitude', 'pickup_latitude']].values,
train[['dropoff_longitude', 'dropoff_latitude']].values,
test[['pickup_longitude', 'pickup_latitude']].values,
test[['dropoff_longitude', 'dropoff_latitude']].values))
coords
array([[-73.98215485, 40.76793671],
[-73.98041534, 40.73856354],
[-73.97902679, 40.7639389 ],
...,
[-73.87660217, 40.74866486],
[-73.85426331, 40.89178848],
[-73.96932983, 40.76937866]])
kmeans = MiniBatchKMeans(n_clusters = 8, batch_size = 10000).fit(coords)
train.loc[:, 'pickup_cluster'] = kmeans.predict(train[['pickup_longitude', 'pickup_latitude']])
train.loc[:, 'dropoff_cluster'] = kmeans.predict(train[['dropoff_longitude', 'dropoff_latitude']])
test.loc[:, 'pickup_cluster'] = kmeans.predict(test[['pickup_longitude', 'pickup_latitude']])
test.loc[:, 'dropoff_cluster'] = kmeans.predict(test[['dropoff_longitude', 'dropoff_latitude']])
train[['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'pickup_cluster', 'dropoff_cluster']].head()
pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | pickup_cluster | dropoff_cluster | |
---|---|---|---|---|---|---|
0 | -73.982155 | 40.767937 | -73.964630 | 40.765602 | 2 | 5 |
1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | 4 | 4 |
2 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | 2 | 6 |
3 | -74.010040 | 40.719971 | -74.012268 | 40.706718 | 6 | 6 |
4 | -73.973053 | 40.793209 | -73.972923 | 40.782520 | 5 | 5 |
plt.figure(figsize = (10, 10))
for label in np.unique(kmeans.labels_):
plt.plot(loc_df.longitude[loc_df.label == label], loc_df.latitude[loc_df.label == label],
'.', markersize = 0.3, alpha = 0.3)
print('train shape:', train.shape)
print('test shape:', test.shape)
train shape: (1458644, 32)
test shape: (625134, 29)
Count features
a = pd.concat([train, test]).groupby(['pickup_cluster']).size().reset_index()
b = pd.concat([train, test]).groupby(['dropoff_cluster']).size().reset_index()
C:\anaconda\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
"""Entry point for launching an IPython kernel.
C:\anaconda\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
train = pd.merge(train, a, on = 'pickup_cluster', how = 'left')
train = pd.merge(train, b, on = 'dropoff_cluster', how = 'left')
test = pd.merge(test, a, on = 'pickup_cluster', how = 'left')
test = pd.merge(test, b, on = 'dropoff_cluster', how = 'left')
train['speed'] = train['total_distance'] / train['trip_duration']
train[['speed', 'total_distance', 'trip_duration']].head()
speed | total_distance | trip_duration | |
---|---|---|---|
0 | 4.415604 | 2009.1 | 455 |
1 | 3.790649 | 2513.2 | 663 |
2 | 5.207533 | 11060.8 | 2124 |
3 | 4.147786 | 1779.4 | 429 |
4 | 3.712414 | 1614.9 | 435 |
pickup_speed = train[['speed', 'pickup_cluster']].groupby('pickup_cluster').mean().reset_index()
pickup_speed = pickup_speed.rename(columns = {'speed' : 'ave_pickup_speed'})
dropoff_speed = train[['speed', 'dropoff_cluster']].groupby('dropoff_cluster').mean().reset_index()
dropoff_speed = dropoff_speed.rename(columns = {'speed' : 'ave_dropoff_speed'})
print(pickup_speed)
print(dropoff_speed)
pickup_cluster ave_pickup_speed
0 0 5.093225
1 1 11.282694
2 2 5.072040
3 3 6.075631
4 4 4.826975
5 5 5.393436
6 6 5.450792
7 7 8.413295
dropoff_cluster ave_dropoff_speed
0 0 4.738659
1 1 12.016480
2 2 4.788073
3 3 7.571144
4 4 4.717927
5 5 5.437323
6 6 5.675050
7 7 8.394009
train = pd.merge(train, pickup_speed, on = 'pickup_cluster', how = 'left')
train = pd.merge(train, dropoff_speed, on = 'dropoff_cluster', how = 'left')
test = pd.merge(test, pickup_speed, on = 'pickup_cluster', how = 'left')
test = pd.merge(test, dropoff_speed, on = 'dropoff_cluster', how = 'left')
train = train.drop('speed', axis = 1)
print('train shape:', train.shape)
print('test shape:', test.shape)
train shape: (1458644, 36)
test shape: (625134, 33)
print([i for i in train.columns if i not in test.columns])
['dropoff_datetime', 'trip_duration', 'log_trip_duration']
Dummy variables: One-hot encode 处理所有类型特征
类型特征用数字标记,但数字间的大小关系会对预测产生我们不希望的影响
vendor_train = pd.get_dummies(train['vendor_id'], prefix = 'vi', prefix_sep = '_')
pickup_cluster_train = pd.get_dummies(train['pickup_cluster'], prefix = 'p', prefix_sep = '_')
dropoff_cluster_train = pd.get_dummies(train['dropoff_cluster'], prefix = 'd', prefix_sep = '_')
store_fwd_flag_train = pd.get_dummies(train['store_and_fwd_flag'], prefix = 's', prefix_sep = '_')
vendor_test = pd.get_dummies(test['vendor_id'], prefix = 'vi', prefix_sep = '_')
pickup_cluster_test = pd.get_dummies(test['pickup_cluster'], prefix = 'p', prefix_sep = '_')
dropoff_cluster_test = pd.get_dummies(test['dropoff_cluster'], prefix = 'd', prefix_sep = '_')
store_fwd_flag_test = pd.get_dummies(test['store_and_fwd_flag'], prefix = 's', prefix_sep = '_')
train['dayofweek'] = train['pickup_datetime'].dt.dayofweek
test['dayofweek'] = test['pickup_datetime'].dt.dayofweek
month_train = pd.get_dummies(train['month'], prefix = 'm', prefix_sep = '_')
dam_train = pd.get_dummies(train['day'], prefix = 'dam', prefix_sep = '_')
daw_train = pd.get_dummies(train['dayofweek'], prefix = 'daw', prefix_sep = '_')
hr_train = pd.get_dummies(train['hour'], prefix = 'h', prefix_sep = '_')
hr_cate_train = pd.get_dummies(train['hr_categori'], prefix = 'hc', prefix_sep = '_')
month_test = pd.get_dummies(test['month'], prefix = 'm', prefix_sep = '_')
dam_test = pd.get_dummies(test['day'], prefix = 'dam', prefix_sep = '_')
daw_test = pd.get_dummies(test['dayofweek'], prefix = 'daw', prefix_sep = '_')
hr_test = pd.get_dummies(test['hour'], prefix = 'h', prefix_sep = '_')
hr_cate_test = pd.get_dummies(test['hr_categori'], prefix = 'hc', prefix_sep = '_')
month_train.head()
m_1 | m_2 | m_3 | m_4 | m_5 | m_6 | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 0 |
train = train.drop(['id', 'vendor_id', 'pickup_cluster', 'dropoff_cluster',
'store_and_fwd_flag', 'month', 'day', 'dayofweek', 'hour', 'hr_categori', 'dropoff_datetime', 'trip_duration'], axis = 1)
Test_id = test['id']
test = test.drop(['id', 'vendor_id', 'pickup_cluster', 'dropoff_cluster',
'store_and_fwd_flag', 'month', 'day', 'dayofweek', 'hour', 'hr_categori'], axis = 1)
Modeling
- Train_Master和Test_Master作为预测目标,上传Kaggle
- Train_Master源自train数据集
- Test_Master源自test数据集
Train_Master = pd.concat([train, vendor_train, pickup_cluster_train, dropoff_cluster_train, store_fwd_flag_train,
month_train, dam_train, daw_train, hr_train, hr_cate_train], axis = 1)
Test_Master = pd.concat([test, vendor_test, pickup_cluster_test, dropoff_cluster_test, store_fwd_flag_test,
month_test, dam_test, daw_test, hr_test, hr_cate_test], axis = 1)
将Train_Master分为Train和Test,不涉及Test_Master
Train_Master = Train_Master.drop('pickup_datetime', axis = 1)
Test_Master = Test_Master.drop('pickup_datetime', axis = 1)
Train, Test = train_test_split(Train_Master, test_size = 0.01)
X_Train = Train.drop('log_trip_duration', axis = 1)
Y_Train = Train['log_trip_duration']
X_Test = Test.drop('log_trip_duration', axis = 1)
Y_Test = Test['log_trip_duration']
Y_Train = Y_Train.reset_index().drop('index', axis = 1)
Y_Test = Y_Test.reset_index().drop('index', axis = 1)
使用xgb特有的数据结构DMatrix,速度更快
dtrain = xgb.DMatrix(X_Train, label = Y_Train)
dvalid = xgb.DMatrix(X_Test, label = Y_Test)
dtest = xgb.DMatrix(Test_Master)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
# 理想的结果是模型在watchlist中的两个set上表现一样地好
xgb_pars = {'objective' : 'reg:linear',
'learning_rate' : 0.05,
'max_depth' : 7,
'subsample' : 0.8,
'colsample_bytree' : 0.7,
'colsample_bylevel' : 0.7,
'silent' : 1,
'reg_alpha' : 1}
model = xgb.train(xgb_pars, dtrain, 100, watchlist, early_stopping_rounds = 5,
maximize = False, verbose_eval = 1)
print('Modeling RMSLE % .5f' %model.best_score)
[0] train-rmse:5.72071 valid-rmse:5.71359
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.
Will train until valid-rmse hasn't improved in 5 rounds.
[1] train-rmse:5.43652 valid-rmse:5.42959
[2] train-rmse:5.16664 valid-rmse:5.15993
[3] train-rmse:4.91048 valid-rmse:4.90398
[4] train-rmse:4.66798 valid-rmse:4.66166
[5] train-rmse:4.43687 valid-rmse:4.4307
[6] train-rmse:4.21742 valid-rmse:4.21139
[7] train-rmse:4.00899 valid-rmse:4.0031
[8] train-rmse:3.81135 valid-rmse:3.80559
[9] train-rmse:3.62356 valid-rmse:3.61793
[10] train-rmse:3.44519 valid-rmse:3.43965
[11] train-rmse:3.27661 valid-rmse:3.27129
[12] train-rmse:3.11585 valid-rmse:3.11064
[13] train-rmse:2.96326 valid-rmse:2.95812
[14] train-rmse:2.81842 valid-rmse:2.8134
[15] train-rmse:2.68097 valid-rmse:2.67604
[16] train-rmse:2.55061 valid-rmse:2.54573
[17] train-rmse:2.42696 valid-rmse:2.42213
[18] train-rmse:2.30962 valid-rmse:2.30485
[19] train-rmse:2.1983 valid-rmse:2.19364
[20] train-rmse:2.09279 valid-rmse:2.08819
[21] train-rmse:1.99277 valid-rmse:1.98825
[22] train-rmse:1.89802 valid-rmse:1.89356
[23] train-rmse:1.80818 valid-rmse:1.80382
[24] train-rmse:1.72373 valid-rmse:1.71942
[25] train-rmse:1.64307 valid-rmse:1.63881
[26] train-rmse:1.56664 valid-rmse:1.56242
[27] train-rmse:1.49434 valid-rmse:1.49023
[28] train-rmse:1.42606 valid-rmse:1.42196
[29] train-rmse:1.36146 valid-rmse:1.35739
[30] train-rmse:1.30082 valid-rmse:1.29687
[31] train-rmse:1.24305 valid-rmse:1.23917
[32] train-rmse:1.18844 valid-rmse:1.18463
[33] train-rmse:1.13687 valid-rmse:1.13312
[34] train-rmse:1.08825 valid-rmse:1.08456
[35] train-rmse:1.04279 valid-rmse:1.03918
[36] train-rmse:0.999537 valid-rmse:0.995962
[37] train-rmse:0.958848 valid-rmse:0.955314
[38] train-rmse:0.920506 valid-rmse:0.917043
[39] train-rmse:0.884371 valid-rmse:0.880961
[40] train-rmse:0.850502 valid-rmse:0.847148
[41] train-rmse:0.818737 valid-rmse:0.815504
[42] train-rmse:0.788936 valid-rmse:0.785738
[43] train-rmse:0.760988 valid-rmse:0.757844
[44] train-rmse:0.73487 valid-rmse:0.731798
[45] train-rmse:0.710508 valid-rmse:0.7075
[46] train-rmse:0.687611 valid-rmse:0.684679
[47] train-rmse:0.666258 valid-rmse:0.663418
[48] train-rmse:0.646386 valid-rmse:0.643666
[49] train-rmse:0.627901 valid-rmse:0.625226
[50] train-rmse:0.610705 valid-rmse:0.608164
[51] train-rmse:0.594841 valid-rmse:0.592374
[52] train-rmse:0.580073 valid-rmse:0.577685
[53] train-rmse:0.566276 valid-rmse:0.563964
[54] train-rmse:0.553586 valid-rmse:0.551364
[55] train-rmse:0.541802 valid-rmse:0.539665
[56] train-rmse:0.530938 valid-rmse:0.528837
[57] train-rmse:0.520891 valid-rmse:0.518795
[58] train-rmse:0.511724 valid-rmse:0.509689
[59] train-rmse:0.503191 valid-rmse:0.501274
[60] train-rmse:0.495394 valid-rmse:0.493556
[61] train-rmse:0.488261 valid-rmse:0.486498
[62] train-rmse:0.481713 valid-rmse:0.480063
[63] train-rmse:0.475621 valid-rmse:0.474073
[64] train-rmse:0.470123 valid-rmse:0.468653
[65] train-rmse:0.465024 valid-rmse:0.463596
[66] train-rmse:0.460254 valid-rmse:0.458901
[67] train-rmse:0.455962 valid-rmse:0.4547
[68] train-rmse:0.452054 valid-rmse:0.450878
[69] train-rmse:0.448391 valid-rmse:0.44725
[70] train-rmse:0.445074 valid-rmse:0.443996
[71] train-rmse:0.442037 valid-rmse:0.441076
[72] train-rmse:0.439282 valid-rmse:0.438398
[73] train-rmse:0.436759 valid-rmse:0.435955
[74] train-rmse:0.434434 valid-rmse:0.433737
[75] train-rmse:0.43231 valid-rmse:0.43164
[76] train-rmse:0.430316 valid-rmse:0.429741
[77] train-rmse:0.428579 valid-rmse:0.428089
[78] train-rmse:0.426804 valid-rmse:0.42643
[79] train-rmse:0.425351 valid-rmse:0.425114
[80] train-rmse:0.424028 valid-rmse:0.423859
[81] train-rmse:0.422741 valid-rmse:0.422651
[82] train-rmse:0.421421 valid-rmse:0.421356
[83] train-rmse:0.420333 valid-rmse:0.420332
[84] train-rmse:0.419373 valid-rmse:0.419435
[85] train-rmse:0.418452 valid-rmse:0.418557
[86] train-rmse:0.417631 valid-rmse:0.417816
[87] train-rmse:0.416853 valid-rmse:0.417082
[88] train-rmse:0.416167 valid-rmse:0.416446
[89] train-rmse:0.415507 valid-rmse:0.415831
[90] train-rmse:0.414853 valid-rmse:0.415315
[91] train-rmse:0.414216 valid-rmse:0.414727
[92] train-rmse:0.413691 valid-rmse:0.41419
[93] train-rmse:0.413165 valid-rmse:0.41368
[94] train-rmse:0.412663 valid-rmse:0.413233
[95] train-rmse:0.412157 valid-rmse:0.412737
[96] train-rmse:0.411703 valid-rmse:0.412313
[97] train-rmse:0.411311 valid-rmse:0.411921
[98] train-rmse:0.410933 valid-rmse:0.411623
[99] train-rmse:0.410626 valid-rmse:0.411271
Modeling RMSLE 0.41127
ax = xgb.plot_importance(model, max_num_features = 70, height = 0.9)
ax.figure.set_size_inches(20, 30)
如何在Jupyter放大XGB的绘图尺寸?
- fig, ax = plt.subplots(figsize=(10,8))
- plot_importance(model, ax=ax)
或者,你可以让绘图函数创建图形,然后更改其大小。 - ax = plot_importance(model)
- ax.figure.set_size_inches(10,8)
pred = model.predict(dtest)
pred = np.exp(pred) - 1
submission = pd.concat([Test_id, pd.DataFrame(pred)], axis = 1)
submission.columns = ['id', 'trip_duration']
submission['trip_duration'] = submission.apply(lambda x : 1 if (x['trip_duration'] <= 0) else x['trip_duration'], axis = 1)
submission.to_csv('submission.csv', index = False)