数据集简介

纽约出租车协会提供了两百万超出租车行车记录
根据载客情况，准确回归模型预测毎一单出车用时
预测纽约出租车用时，有助于在行程开始前对车费的预测（有意思的是，滴滴是在行车结束后计算车费的）

分析步骤

数据读取
1. 读取训练集和测试集
2. 读取行程数据
3. 读取节假日数据
4. 读取天气数据
数据可视化 KMeans Clustering + Matplotlib
特征工程
1. 时间特征
2. 距离特征
3. 天气特征
4. 区域拥挤特征
5. One hot encode/get dummies
运行模型 XGBoost: eXtreme Gradient Boosting
1. One-Hot-Enconde类别型数据
2. 区分训练集，验证集和测试集
3. 设置参数，评估得分
4. 写出结果，上传提交

数据读取

%matplotlib inline
import numpy as np
import pandas as pd
from datetime import datetime, date
from sklearn .model_selection import train_test_split
import xgboost as xgb
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import mean_squared_error
from math import radians, cos, sin, asin, sqrt
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [60, 30]
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, KFold

xgboost没有被包含在Anaconda的环境中，另行下载安装

train = pd.read_csv('./kaggle纽约出租车原始文件/train.csv', parse_dates = ['pickup_datetime'])
test = pd.read_csv('./kaggle纽约出租车原始文件/test.csv', parse_dates = ['pickup_datetime'])

train.head()

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	455
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	663
2	id3858529	2	2016-01-19 11:35:24	2016-01-19 12:10:48	1	-73.979027	40.763939	-74.005333	40.710087	N	2124
3	id3504673	2	2016-04-06 19:32:31	2016-04-06 19:39:40	1	-74.010040	40.719971	-74.012268	40.706718	N	429
4	id2181028	2	2016-03-26 13:30:55	2016-03-26 13:38:10	1	-73.973053	40.793209	-73.972923	40.782520	N	435

print('train shape:', train.shape)
print('test shape:', test.shape)

train shape: (1458644, 11)
test shape: (625134, 9)

测试集的特征比训练集少两个

print([i for i in train.columns if i not in test.columns])

['dropoff_datetime', 'trip_duration']

Holiday

holiday = pd.read_csv('./kaggle纽约出租车原始文件/NYC_2016Holidays.csv', sep = ';')
holiday.head()

	Day	Date	Holiday
0	Friday	January 01	New Years Day
1	Monday	January 18	Martin Luther King Jr. Day
2	Friday	February 12	Lincoln’s Birthday
3	Monday	February 15	Presidents’ Day
4	Sunday	May 08	Mother’s Day

holiday['Date'] = holiday['Date'].apply(lambda x : x + ' 2016')
holiday.head()

	Day	Date	Holiday
0	Friday	January 01 2016	New Years Day
1	Monday	January 18 2016	Martin Luther King Jr. Day
2	Friday	February 12 2016	Lincoln’s Birthday
3	Monday	February 15 2016	Presidents’ Day
4	Sunday	May 08 2016	Mother’s Day

.apply()、.map()和.applymap()的区别

.apply()让函数作用于列或者行
.applymap()让函数作用于DataFrame每一个元素
.map()让函数作用于Series每一个元素

holidays = [datetime.strptime(holiday.loc[i, 'Date'], '%B %d %Y').date() for i in range(len(holiday))]
holidays

[datetime.date(2016, 1, 1),
 datetime.date(2016, 1, 18),
 datetime.date(2016, 2, 12),
 datetime.date(2016, 2, 15),
 datetime.date(2016, 5, 8),
 datetime.date(2016, 5, 30),
 datetime.date(2016, 6, 19),
 datetime.date(2016, 7, 4),
 datetime.date(2016, 9, 5),
 datetime.date(2016, 10, 10),
 datetime.date(2016, 11, 11),
 datetime.date(2016, 11, 24),
 datetime.date(2016, 12, 26),
 datetime.date(2016, 7, 4),
 datetime.date(2016, 11, 8)]

time.strptime(string[, format])根据指定的格式把一个时间字符串解析为时间元组，返回struct_time对象
1. string — 时间字符串
2. format — 格式化字符串
.local[]索引
struct_time.date()返回日期，去除时分秒信息

Routes from open source routing machine

训练集数据预处理

fastrout1 = pd.read_csv('./kaggle纽约出租车原始文件/fastest_routes_train_part_1.csv', 
                        usecols = ['id', 'total_distance', 'total_travel_time', 'number_of_steps', 'step_direction'])
fastrout2 = pd.read_csv('./kaggle纽约出租车原始文件/fastest_routes_train_part_2.csv', 
                        usecols = ['id', 'total_distance', 'total_travel_time', 'number_of_steps', 'step_direction'])
fastrout = pd.concat((fastrout1, fastrout2))
fastrout.head()

	id	total_distance	total_travel_time	number_of_steps	step_direction
0	id2875421	2009.1	164.9	5	left\|straight\|right\|straight\|arrive
1	id2377394	2513.2	332.0	6	none\|right\|left\|right\|left\|arrive
2	id3504673	1779.4	235.8	4	left\|left\|right\|arrive
3	id2181028	1614.9	140.1	5	right\|left\|right\|left\|arrive
4	id0801584	1393.5	189.4	5	right\|right\|right\|left\|arrive

转弯多，花的时间也多，因为可能还要等红灯。

美国大部分地方允许等红灯时右转，纽约除外。

left_turn = []
right_turn = []
left_turn += list(map(lambda x : x.count('left') - x.count('slight left'), fastrout.step_direction))
right_turn += list(map(lambda x : x.count('right') - x.count('slight right'), fastrout.step_direction))
osrm_data = fastrout[['id', 'total_distance', 'total_travel_time', 'number_of_steps']]
osrm_data = osrm_data.assign(left_steps = left_turn)
osrm_data = osrm_data.assign(right_steps = right_turn)
osrm_data.head()

	id	total_distance	total_travel_time	number_of_steps	left_steps	right_steps
0	id2875421	2009.1	164.9	5	1	1
1	id2377394	2513.2	332.0	6	2	2
2	id3504673	1779.4	235.8	4	2	1
3	id2181028	1614.9	140.1	5	2	2
4	id0801584	1393.5	189.4	5	1	3

测试集数据预处理

left_turn_test = []
right_turn_test = []
fastrout_test = pd.read_csv('./kaggle纽约出租车原始文件/fastest_routes_test.csv')
left_turn_test += list(map(lambda x : x.count('left') - x.count('slight left'), fastrout_test.step_direction))
right_turn_test += list(map(lambda x : x.count('right') - x.count('slight right'), fastrout_test.step_direction))
osrm_test = fastrout_test[['id', 'total_distance', 'total_travel_time', 'number_of_steps']]
osrm_test = osrm_test.assign(left_steps = left_turn_test)
osrm_test = osrm_test.assign(right_steps = right_turn_test)
osrm_test.head()

	id	total_distance	total_travel_time	number_of_steps	left_steps	right_steps
0	id0771704	1497.1	200.2	7	3	2
1	id3274209	1427.1	141.5	2	0	0
2	id2756455	2312.3	324.6	9	4	4
3	id3684027	931.8	84.2	4	1	2
4	id3101285	2501.7	294.7	8	3	3

Weather

weather = pd.read_csv('./kaggle纽约出租车原始文件/KNYC_Metars.csv', parse_dates = ['Time'])
weather.head()

	Time	Temp.	Windchill	Heat Index	Humidity	Pressure	Dew Point	Visibility	Wind Dir	Wind Speed	Precip	Events	Conditions
0	2015-12-31 02:00:00	7.8	7.1	NaN	0.89	1017.0	6.1	8.0	NNE	5.6	0.8	None	Overcast
1	2015-12-31 03:00:00	7.2	5.9	NaN	0.90	1016.5	5.6	12.9	Variable	7.4	0.3	None	Overcast
2	2015-12-31 04:00:00	7.2	NaN	NaN	0.90	1016.7	5.6	12.9	Calm	0.0	0.0	None	Overcast
3	2015-12-31 05:00:00	7.2	5.9	NaN	0.86	1015.9	5.0	14.5	NW	7.4	0.0	None	Overcast
4	2015-12-31 06:00:00	7.2	6.4	NaN	0.90	1016.2	5.6	11.3	West	5.6	0.0	None	Overcast

可视化 Visualization

longitude = list(train.pickup_longitude) + list(train.dropoff_longitude)
latitude = list(train.pickup_latitude) + list(train.dropoff_latitude)
print(len(train.pickup_longitude), len(train.dropoff_longitude), len(longitude))
print(len(train.pickup_latitude), len(train.dropoff_latitude), len(latitude))
loc_df = pd.DataFrame()
loc_df['longitude'] = longitude
loc_df['latitude'] = latitude
loc_df.head()

1458644 1458644 2917288
1458644 1458644 2917288

	longitude	latitude
0	-73.982155	40.767937
1	-73.980415	40.738564
2	-73.979027	40.763939
3	-74.010040	40.719971
4	-73.973053	40.793209

xlim = [-74.03, -73.77]
ylim = [40.63, 40.85]
print(loc_df.shape)
loc_df = loc_df[(loc_df.longitude > xlim[0]) & (loc_df.longitude < xlim[1])]
loc_df = loc_df[(loc_df.latitude > ylim[0]) & (loc_df.latitude < ylim[1])]
print(loc_df.shape)

(2917288, 2)
(2895748, 2)

kmeans = KMeans(n_clusters = 15, random_state = 2, n_init = 10)
kmeans.fit(loc_df)
loc_df['label'] = kmeans.labels_
loc_df.head()

	longitude	latitude	label
0	-73.982155	40.767937	12
1	-73.980415	40.738564	7
2	-73.979027	40.763939	12
3	-74.010040	40.719971	5
4	-73.973053	40.793209	9

plt.figure(figsize = (10, 10))
for label in loc_df.label.unique():
    plt.plot(loc_df.longitude[loc_df.label == label], loc_df.latitude[loc_df.label == label],
            '.', markersize = 0.3, alpha = 0.3)
plt.title('Clusters of New York')

Text(0.5, 1.0, 'Clusters of New York')

特征工程 Feature Engineering

时间特征 Data feature, including holidays

print('train shape:', train.shape)
print('test shape:', test.shape)

train shape: (1458644, 11)
test shape: (625134, 9)

细分时间信息，以便进一步分类处理

for df in (train, test):
    df['year'] = df['pickup_datetime'].dt.year
    df['month'] = df['pickup_datetime'].dt.month
    df['day'] = df['pickup_datetime'].dt.day
    df['hour'] = df['pickup_datetime'].dt.hour
    df['minute'] = df['pickup_datetime'].dt.minute
    df['store_and_fwd_flag'] = 1 * (df.store_and_fwd_flag.values == 'Y')

train = train.assign(log_trip_duration = np.log(train.trip_duration + 1))

分割休息日与工作日

def restday(year, month, day, holidays):
    is_rest = [None] * len(year)
    is_weekend = [None] * len(year)
    i = 0
    for yy, mm, dd in zip(year, month, day):
        is_weekend[i] = date(yy, mm, dd).isoweekday() in (6, 7)
        is_rest[i] = is_weekend[i] or date(yy, mm, dd) in holidays
        i += 1
    return is_rest, is_weekend

date.weekday()：返回weekday，如果是星期一，返回0；如果是星期2，返回1，以此类推；
data.isoweekday()：返回工作日，如果是星期一，返回1；如果是星期2，返回2，以此类推

rest_day, weekend = restday(train.year, train.month, train.day, holiday)
train['rest_day'] = rest_day
train['weekend'] = weekend
rest_day, weekend = restday(test.year, test.month, test.day, holiday)
test['rest_day'] = rest_day
test['weekend'] = weekend
# 简化时间信息
train['pickup_time'] = train.hour + train.minute/60
test['pickup_time'] = test.hour + test.minute/60

分割早晚高峰rush hour、白天day和夜晚night

for df in (train, test):
    df['hr_categori'] = np.nan
    df.loc[(df.rest_day == False) & (df.hour <= 9) & (df.hour >= 7), 'hr_categori'] = 'rush'
    df.loc[(df.rest_day == False) & (df.hour <= 18) & (df.hour >= 16), 'hr_categori'] = 'rush'
    df.loc[(df.rest_day == False) & (df.hour < 16) & (df.hour > 9), 'hr_categori'] = 'day'
    df.loc[(df.rest_day == False) & (df.hour < 7) | (df.hour > 18), 'hr_categori'] = 'night'
    df.loc[(df.rest_day == True) & (df.hour < 18) & (df.hour > 7), 'hr_categori'] = 'day'
    df.loc[(df.rest_day == True) & (df.hour <= 7) | (df.hour >= 18), 'hr_categori'] = 'night'

print('train shape:', train.shape)
print('test shape:', test.shape)

train shape: (1458644, 21)
test shape: (625134, 18)

print([i for i in train.columns if i not in test.columns])

['dropoff_datetime', 'trip_duration', 'log_trip_duration']

距离特征 Distance feature

train_join = train.join(osrm_data.set_index('id'), on = 'id')
test_join = test.join(osrm_test.set_index('id'), on = 'id')
print('train_join shape:', train_join.shape)
print('test_join shape:', test_join.shape)

train_join shape: (1458644, 26)
test_join shape: (625134, 23)

天气特征 Weather feature

weather['snow'] = 1 * (weather.Events == 'Snow') + 1 * (weather.Events == 'Fog\n\t,\nSnow')
weather['year'] = weather['Time'].dt.year
weather['month'] = weather['Time'].dt.month
weather['day'] = weather['Time'].dt.day
weather['hour'] = weather['Time'].dt.hour
weather = weather[weather.year == 2016][['month', 'day', 'hour', 'Temp.', 'Precip', 'snow', 'Visibility']]
print(weather.shape)
weather.head()

(8739, 7)

	month	day	hour	Temp.	Visibility
22	1	1	0	5.6	16.1
23	1	1	1	5.6	16.1
24	1	1	2	5.6	16.1
25	1	1	3	5.0	16.1
26	1	1	4	5.0	16.1

train = pd.merge(train_join, weather, on = ['month', 'day', 'hour'], how = 'left')
test = pd.merge(test_join, weather, on = ['month', 'day', 'hour'], how = 'left')
print('train shape:', train.shape)
print('test shape:', test.shape)

train shape: (1458644, 30)
test shape: (625134, 27)

Cluster and find speed

coords = np.vstack((train[['pickup_longitude', 'pickup_latitude']].values,
                   train[['dropoff_longitude', 'dropoff_latitude']].values,
                   test[['pickup_longitude', 'pickup_latitude']].values,
                   test[['dropoff_longitude', 'dropoff_latitude']].values))
coords

array([[-73.98215485,  40.76793671],
       [-73.98041534,  40.73856354],
       [-73.97902679,  40.7639389 ],
       ...,
       [-73.87660217,  40.74866486],
       [-73.85426331,  40.89178848],
       [-73.96932983,  40.76937866]])

kmeans = MiniBatchKMeans(n_clusters = 8, batch_size = 10000).fit(coords)

train.loc[:, 'pickup_cluster'] = kmeans.predict(train[['pickup_longitude', 'pickup_latitude']])
train.loc[:, 'dropoff_cluster'] = kmeans.predict(train[['dropoff_longitude', 'dropoff_latitude']])
test.loc[:, 'pickup_cluster'] = kmeans.predict(test[['pickup_longitude', 'pickup_latitude']])
test.loc[:, 'dropoff_cluster'] = kmeans.predict(test[['dropoff_longitude', 'dropoff_latitude']])
train[['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'pickup_cluster', 'dropoff_cluster']].head()

	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	pickup_cluster	dropoff_cluster
0	-73.982155	40.767937	-73.964630	40.765602	2	5
1	-73.980415	40.738564	-73.999481	40.731152	4	4
2	-73.979027	40.763939	-74.005333	40.710087	2	6
3	-74.010040	40.719971	-74.012268	40.706718	6	6
4	-73.973053	40.793209	-73.972923	40.782520	5	5

plt.figure(figsize = (10, 10))
for label in np.unique(kmeans.labels_):
    plt.plot(loc_df.longitude[loc_df.label == label], loc_df.latitude[loc_df.label == label],
            '.', markersize = 0.3, alpha = 0.3)

print('train shape:', train.shape)
print('test shape:', test.shape)

train shape: (1458644, 32)
test shape: (625134, 29)

Count features

a = pd.concat([train, test]).groupby(['pickup_cluster']).size().reset_index()
b = pd.concat([train, test]).groupby(['dropoff_cluster']).size().reset_index()

C:\anaconda\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
  """Entry point for launching an IPython kernel.
C:\anaconda\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.

train = pd.merge(train, a, on = 'pickup_cluster', how = 'left')
train = pd.merge(train, b, on = 'dropoff_cluster', how = 'left')
test = pd.merge(test, a, on = 'pickup_cluster', how = 'left')
test = pd.merge(test, b, on = 'dropoff_cluster', how = 'left')

train['speed'] = train['total_distance'] / train['trip_duration']
train[['speed', 'total_distance', 'trip_duration']].head()

	speed	total_distance	trip_duration
0	4.415604	2009.1	455
1	3.790649	2513.2	663
2	5.207533	11060.8	2124
3	4.147786	1779.4	429
4	3.712414	1614.9	435

pickup_speed = train[['speed', 'pickup_cluster']].groupby('pickup_cluster').mean().reset_index()
pickup_speed = pickup_speed.rename(columns = {'speed' : 'ave_pickup_speed'})
dropoff_speed = train[['speed', 'dropoff_cluster']].groupby('dropoff_cluster').mean().reset_index()
dropoff_speed = dropoff_speed.rename(columns = {'speed' : 'ave_dropoff_speed'})

print(pickup_speed)
print(dropoff_speed)

pickup_cluster  ave_pickup_speed
0               0          5.093225
1               1         11.282694
2               2          5.072040
3               3          6.075631
4               4          4.826975
5               5          5.393436
6               6          5.450792
7               7          8.413295
   dropoff_cluster  ave_dropoff_speed
0                0           4.738659
1                1          12.016480
2                2           4.788073
3                3           7.571144
4                4           4.717927
5                5           5.437323
6                6           5.675050
7                7           8.394009

train = pd.merge(train, pickup_speed, on = 'pickup_cluster', how = 'left')
train = pd.merge(train, dropoff_speed, on = 'dropoff_cluster', how = 'left')
test = pd.merge(test, pickup_speed, on = 'pickup_cluster', how = 'left')
test = pd.merge(test, dropoff_speed, on = 'dropoff_cluster', how = 'left')

train = train.drop('speed', axis = 1)

print('train shape:', train.shape)
print('test shape:', test.shape)

train shape: (1458644, 36)
test shape: (625134, 33)

print([i for i in train.columns if i not in test.columns])

['dropoff_datetime', 'trip_duration', 'log_trip_duration']

Dummy variables: One-hot encode 处理所有类型特征

类型特征用数字标记，但数字间的大小关系会对预测产生我们不希望的影响

vendor_train = pd.get_dummies(train['vendor_id'], prefix = 'vi', prefix_sep = '_')
pickup_cluster_train = pd.get_dummies(train['pickup_cluster'], prefix = 'p', prefix_sep = '_')
dropoff_cluster_train = pd.get_dummies(train['dropoff_cluster'], prefix = 'd', prefix_sep = '_')
store_fwd_flag_train = pd.get_dummies(train['store_and_fwd_flag'], prefix = 's', prefix_sep = '_')
vendor_test = pd.get_dummies(test['vendor_id'], prefix = 'vi', prefix_sep = '_')
pickup_cluster_test = pd.get_dummies(test['pickup_cluster'], prefix = 'p', prefix_sep = '_')
dropoff_cluster_test = pd.get_dummies(test['dropoff_cluster'], prefix = 'd', prefix_sep = '_')
store_fwd_flag_test = pd.get_dummies(test['store_and_fwd_flag'], prefix = 's', prefix_sep = '_')
train['dayofweek'] = train['pickup_datetime'].dt.dayofweek
test['dayofweek'] = test['pickup_datetime'].dt.dayofweek
month_train = pd.get_dummies(train['month'], prefix = 'm', prefix_sep = '_')
dam_train = pd.get_dummies(train['day'], prefix = 'dam', prefix_sep = '_')
daw_train = pd.get_dummies(train['dayofweek'], prefix = 'daw', prefix_sep = '_')
hr_train = pd.get_dummies(train['hour'], prefix = 'h', prefix_sep = '_')
hr_cate_train = pd.get_dummies(train['hr_categori'], prefix = 'hc', prefix_sep = '_')
month_test = pd.get_dummies(test['month'], prefix = 'm', prefix_sep = '_')
dam_test = pd.get_dummies(test['day'], prefix = 'dam', prefix_sep = '_')
daw_test = pd.get_dummies(test['dayofweek'], prefix = 'daw', prefix_sep = '_')
hr_test = pd.get_dummies(test['hour'], prefix = 'h', prefix_sep = '_')
hr_cate_test = pd.get_dummies(test['hr_categori'], prefix = 'hc', prefix_sep = '_')

month_train.head()

	m_1	m_3	m_4	m_6
0	0	1	0	0
1	0	0	0	1
2	1	0	0	0
3	0	0	1	0
4	0	1	0	0

train = train.drop(['id', 'vendor_id', 'pickup_cluster', 'dropoff_cluster', 
                    'store_and_fwd_flag', 'month', 'day', 'dayofweek', 'hour', 'hr_categori', 'dropoff_datetime', 'trip_duration'], axis = 1)
Test_id = test['id']
test = test.drop(['id', 'vendor_id', 'pickup_cluster', 'dropoff_cluster', 
                  'store_and_fwd_flag', 'month', 'day', 'dayofweek', 'hour', 'hr_categori'], axis = 1)

Modeling

Train_Master和Test_Master作为预测目标，上传Kaggle
Train_Master源自train数据集
Test_Master源自test数据集

Train_Master = pd.concat([train, vendor_train, pickup_cluster_train, dropoff_cluster_train, store_fwd_flag_train,
                         month_train, dam_train, daw_train, hr_train, hr_cate_train], axis = 1)
Test_Master = pd.concat([test, vendor_test, pickup_cluster_test, dropoff_cluster_test, store_fwd_flag_test, 
                         month_test, dam_test, daw_test, hr_test, hr_cate_test], axis = 1)

将Train_Master分为Train和Test，不涉及Test_Master

Train_Master = Train_Master.drop('pickup_datetime', axis = 1)
Test_Master = Test_Master.drop('pickup_datetime', axis = 1)
Train, Test = train_test_split(Train_Master, test_size = 0.01)
X_Train = Train.drop('log_trip_duration', axis = 1)
Y_Train = Train['log_trip_duration']
X_Test = Test.drop('log_trip_duration', axis = 1)
Y_Test = Test['log_trip_duration']
Y_Train = Y_Train.reset_index().drop('index', axis = 1)
Y_Test = Y_Test.reset_index().drop('index', axis = 1)

使用xgb特有的数据结构DMatrix，速度更快

dtrain = xgb.DMatrix(X_Train, label = Y_Train)
dvalid = xgb.DMatrix(X_Test, label = Y_Test)
dtest = xgb.DMatrix(Test_Master)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
# 理想的结果是模型在watchlist中的两个set上表现一样地好

xgb_pars = {'objective' : 'reg:linear',
           'learning_rate' : 0.05,
           'max_depth' : 7,
           'subsample' : 0.8,
           'colsample_bytree' : 0.7,
           'colsample_bylevel' : 0.7,
           'silent' : 1,
           'reg_alpha' : 1}
model = xgb.train(xgb_pars, dtrain, 100, watchlist, early_stopping_rounds = 5,
                 maximize = False, verbose_eval = 1)
print('Modeling RMSLE % .5f' %model.best_score)

[0]    train-rmse:5.72071    valid-rmse:5.71359
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.
Will train until valid-rmse hasn't improved in 5 rounds.
[1]    train-rmse:5.43652    valid-rmse:5.42959
[2]    train-rmse:5.16664    valid-rmse:5.15993
[3]    train-rmse:4.91048    valid-rmse:4.90398
[4]    train-rmse:4.66798    valid-rmse:4.66166
[5]    train-rmse:4.43687    valid-rmse:4.4307
[6]    train-rmse:4.21742    valid-rmse:4.21139
[7]    train-rmse:4.00899    valid-rmse:4.0031
[8]    train-rmse:3.81135    valid-rmse:3.80559
[9]    train-rmse:3.62356    valid-rmse:3.61793
[10]    train-rmse:3.44519    valid-rmse:3.43965
[11]    train-rmse:3.27661    valid-rmse:3.27129
[12]    train-rmse:3.11585    valid-rmse:3.11064
[13]    train-rmse:2.96326    valid-rmse:2.95812
[14]    train-rmse:2.81842    valid-rmse:2.8134
[15]    train-rmse:2.68097    valid-rmse:2.67604
[16]    train-rmse:2.55061    valid-rmse:2.54573
[17]    train-rmse:2.42696    valid-rmse:2.42213
[18]    train-rmse:2.30962    valid-rmse:2.30485
[19]    train-rmse:2.1983    valid-rmse:2.19364
[20]    train-rmse:2.09279    valid-rmse:2.08819
[21]    train-rmse:1.99277    valid-rmse:1.98825
[22]    train-rmse:1.89802    valid-rmse:1.89356
[23]    train-rmse:1.80818    valid-rmse:1.80382
[24]    train-rmse:1.72373    valid-rmse:1.71942
[25]    train-rmse:1.64307    valid-rmse:1.63881
[26]    train-rmse:1.56664    valid-rmse:1.56242
[27]    train-rmse:1.49434    valid-rmse:1.49023
[28]    train-rmse:1.42606    valid-rmse:1.42196
[29]    train-rmse:1.36146    valid-rmse:1.35739
[30]    train-rmse:1.30082    valid-rmse:1.29687
[31]    train-rmse:1.24305    valid-rmse:1.23917
[32]    train-rmse:1.18844    valid-rmse:1.18463
[33]    train-rmse:1.13687    valid-rmse:1.13312
[34]    train-rmse:1.08825    valid-rmse:1.08456
[35]    train-rmse:1.04279    valid-rmse:1.03918
[36]    train-rmse:0.999537    valid-rmse:0.995962
[37]    train-rmse:0.958848    valid-rmse:0.955314
[38]    train-rmse:0.920506    valid-rmse:0.917043
[39]    train-rmse:0.884371    valid-rmse:0.880961
[40]    train-rmse:0.850502    valid-rmse:0.847148
[41]    train-rmse:0.818737    valid-rmse:0.815504
[42]    train-rmse:0.788936    valid-rmse:0.785738
[43]    train-rmse:0.760988    valid-rmse:0.757844
[44]    train-rmse:0.73487    valid-rmse:0.731798
[45]    train-rmse:0.710508    valid-rmse:0.7075
[46]    train-rmse:0.687611    valid-rmse:0.684679
[47]    train-rmse:0.666258    valid-rmse:0.663418
[48]    train-rmse:0.646386    valid-rmse:0.643666
[49]    train-rmse:0.627901    valid-rmse:0.625226
[50]    train-rmse:0.610705    valid-rmse:0.608164
[51]    train-rmse:0.594841    valid-rmse:0.592374
[52]    train-rmse:0.580073    valid-rmse:0.577685
[53]    train-rmse:0.566276    valid-rmse:0.563964
[54]    train-rmse:0.553586    valid-rmse:0.551364
[55]    train-rmse:0.541802    valid-rmse:0.539665
[56]    train-rmse:0.530938    valid-rmse:0.528837
[57]    train-rmse:0.520891    valid-rmse:0.518795
[58]    train-rmse:0.511724    valid-rmse:0.509689
[59]    train-rmse:0.503191    valid-rmse:0.501274
[60]    train-rmse:0.495394    valid-rmse:0.493556
[61]    train-rmse:0.488261    valid-rmse:0.486498
[62]    train-rmse:0.481713    valid-rmse:0.480063
[63]    train-rmse:0.475621    valid-rmse:0.474073
[64]    train-rmse:0.470123    valid-rmse:0.468653
[65]    train-rmse:0.465024    valid-rmse:0.463596
[66]    train-rmse:0.460254    valid-rmse:0.458901
[67]    train-rmse:0.455962    valid-rmse:0.4547
[68]    train-rmse:0.452054    valid-rmse:0.450878
[69]    train-rmse:0.448391    valid-rmse:0.44725
[70]    train-rmse:0.445074    valid-rmse:0.443996
[71]    train-rmse:0.442037    valid-rmse:0.441076
[72]    train-rmse:0.439282    valid-rmse:0.438398
[73]    train-rmse:0.436759    valid-rmse:0.435955
[74]    train-rmse:0.434434    valid-rmse:0.433737
[75]    train-rmse:0.43231    valid-rmse:0.43164
[76]    train-rmse:0.430316    valid-rmse:0.429741
[77]    train-rmse:0.428579    valid-rmse:0.428089
[78]    train-rmse:0.426804    valid-rmse:0.42643
[79]    train-rmse:0.425351    valid-rmse:0.425114
[80]    train-rmse:0.424028    valid-rmse:0.423859
[81]    train-rmse:0.422741    valid-rmse:0.422651
[82]    train-rmse:0.421421    valid-rmse:0.421356
[83]    train-rmse:0.420333    valid-rmse:0.420332
[84]    train-rmse:0.419373    valid-rmse:0.419435
[85]    train-rmse:0.418452    valid-rmse:0.418557
[86]    train-rmse:0.417631    valid-rmse:0.417816
[87]    train-rmse:0.416853    valid-rmse:0.417082
[88]    train-rmse:0.416167    valid-rmse:0.416446
[89]    train-rmse:0.415507    valid-rmse:0.415831
[90]    train-rmse:0.414853    valid-rmse:0.415315
[91]    train-rmse:0.414216    valid-rmse:0.414727
[92]    train-rmse:0.413691    valid-rmse:0.41419
[93]    train-rmse:0.413165    valid-rmse:0.41368
[94]    train-rmse:0.412663    valid-rmse:0.413233
[95]    train-rmse:0.412157    valid-rmse:0.412737
[96]    train-rmse:0.411703    valid-rmse:0.412313
[97]    train-rmse:0.411311    valid-rmse:0.411921
[98]    train-rmse:0.410933    valid-rmse:0.411623
[99]    train-rmse:0.410626    valid-rmse:0.411271
Modeling RMSLE  0.41127

ax = xgb.plot_importance(model, max_num_features = 70, height = 0.9)
ax.figure.set_size_inches(20, 30)

如何在Jupyter放大XGB的绘图尺寸？

fig, ax = plt.subplots(figsize=(10,8))
plot_importance(model, ax=ax)
或者，你可以让绘图函数创建图形，然后更改其大小。
ax = plot_importance(model)
ax.figure.set_size_inches(10,8)

pred = model.predict(dtest)
pred = np.exp(pred) - 1

submission = pd.concat([Test_id, pd.DataFrame(pred)], axis = 1)
submission.columns = ['id', 'trip_duration']
submission['trip_duration'] = submission.apply(lambda x : 1 if (x['trip_duration'] <= 0) else x['trip_duration'], axis = 1)
submission.to_csv('submission.csv', index = False)

数据分析_学习笔记

纽约出租车车程用时预测