Date: 19 AUG


Preparation and Brief information

import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering.ex3 import *

# Create features from timestamps
click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv',
                         parse_dates=['click_time'])
click_times = click_data['click_time']
clicks = click_data.assign(day=click_times.dt.day.astype('uint8'),
                           hour=click_times.dt.hour.astype('uint8'),
                           minute=click_times.dt.minute.astype('uint8'),
                           second=click_times.dt.second.astype('uint8'))

# Label encoding for categorical features
cat_features = ['ip', 'app', 'device', 'os', 'channel']
for feature in cat_features:
    label_encoder = preprocessing.LabelEncoder()
    clicks[feature] = label_encoder.fit_transform(clicks[feature])

def get_data_splits(dataframe, valid_fraction=0.1):
    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    param = {'num_leaves': 64, 'objective': 'binary',
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model. Hold on a minute to see the validation score")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                    early_stopping_rounds=20, verbose_eval=False)
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    if test is not None:
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score

print("Baseline model score")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)


P01 Add interaction features

Hint: The easiest way to loop through the pairs is with itertools.combinations. Once you have that working, convert each pair of columns to strings and join them with the + operator. It’s usually good to join with a symbol like _ in between to ensure unique values. You should now have a column of new categorical values; you can label encode those and add them to the DataFrame.

import itertools

cat_features = ['ip', 'app', 'device', 'os', 'channel']
interactions = pd.DataFrame(index=clicks.index)
for col1, col2 in itertools.combinations(cat_features, 2):
    new_col_name = '_'.join([col1, col2])
    # Convert to strings and combine
    new_values = clicks[col1].map(str) + "_" + clicks[col2].map(str)
    encoder = preprocessing.LabelEncoder()
    interactions[new_col_name] = encoder.fit_transform(new_values)
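The same pattern can be checked in isolation on a small hypothetical frame (the column names and values below are made up, not the competition data): each pair of categorical columns yields one new label-encoded interaction column.

```python
import itertools
import pandas as pd
from sklearn import preprocessing

# Toy frame standing in for the real click data (hypothetical values)
df = pd.DataFrame({'app': [1, 1, 2], 'os': [10, 20, 10], 'device': [0, 0, 1]})

interactions = pd.DataFrame(index=df.index)
for col1, col2 in itertools.combinations(['app', 'os', 'device'], 2):
    # Concatenate as strings with "_" so (1, 10) and (11, 0) stay distinct
    new_values = df[col1].astype(str) + "_" + df[col2].astype(str)
    encoder = preprocessing.LabelEncoder()
    interactions['_'.join([col1, col2])] = encoder.fit_transform(new_values)

print(list(interactions.columns))  # ['app_os', 'app_device', 'os_device']
```

Three source columns give C(3, 2) = 3 interaction columns; with the five features in the exercise you get C(5, 2) = 10.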


P02 Generating numerical features

Hint: You can get a rolling time window using .rolling(), but first you need to convert the index to a time series. The current row is included in the window, but we want to count all the events before the current row, so be sure to adjust the count.

Number of events in the past six hours

def count_past_events(series, time_window='6H'):
    series = pd.Series(series.index, index=series)
    # Subtract 1 so the current event isn't counted
    past_events = series.rolling(time_window).count() - 1
    return past_events

# Loading in from saved Parquet file
past_events = pd.read_parquet('../input/feature-engineering-data/past_6hr_events.pqt')
clicks['ip_past_6hr_counts'] = past_events
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)
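To see the subtract-1 adjustment at work, here is the same rolling-count idea on a handful of hypothetical timestamps (the dates are invented for illustration):

```python
import pandas as pd

# Toy click timestamps for a single ip (hypothetical values)
times = pd.Series(pd.to_datetime([
    '2017-11-06 00:00', '2017-11-06 01:00',
    '2017-11-06 05:00', '2017-11-06 12:00',
]))

def count_past_events(series, time_window='6h'):
    # Index by timestamp so .rolling() can use a time-based window
    series = pd.Series(series.index, index=series)
    # The window includes the current row, so subtract 1 to count only prior events
    return series.rolling(time_window).count() - 1

counts = count_past_events(times)
print(counts.tolist())  # [0.0, 1.0, 2.0, 0.0]
```

The fourth click at 12:00 gets a count of 0 because the earlier clicks fall outside its six-hour window.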

P03 Features from future information

Should you use future events or not?
Solution:
In general, you shouldn’t use information from the future. When you’re using models like this in a real-world scenario you won’t have data from the future. Your model’s score will likely be higher when training and testing on historical data, but it will overestimate the performance on real data.
I should note that using future data will improve the score on Kaggle competition test data, but avoid it when building machine learning products.

P04 Time since last event

Hint: Try using the .diff() method on a time series.

def time_diff(series):
    """Returns a series with the time since the last timestamp in seconds."""
    return series.diff().dt.total_seconds()

timedeltas = clicks.groupby('ip')['click_time'].transform(time_diff)
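On a small hypothetical series of timestamps, .diff() produces NaT for the first row (there is no previous event) and timedeltas after that, which .dt.total_seconds() converts to floats:

```python
import pandas as pd

# Hypothetical click times for one ip
clicks_one_ip = pd.Series(pd.to_datetime([
    '2017-11-06 00:00:00', '2017-11-06 00:00:30', '2017-11-06 00:02:00',
]))

# First element is NaN since there is no prior click to diff against
deltas = clicks_one_ip.diff().dt.total_seconds()
print(deltas.tolist())  # [nan, 30.0, 90.0]
```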

P05 Number of previous app downloads

Hint:
Here you want a window that always starts at the first row but expands as you get further into the data.
You can use the .expanding method for this.
Also, the current row is included in the window, so you’ll need to subtract that off as well.

def previous_attributions(series):
    """Returns a series with the number of times an app has been downloaded."""
    # Subtracting raw values so I don't count the current event
    sums = series.expanding(min_periods=2).sum() - series
    return sums
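The expanding window can be checked on a toy sequence of hypothetical is_attributed flags: the expanding sum includes the current row, so subtracting the series itself leaves only the count of previous downloads.

```python
import pandas as pd

# Hypothetical is_attributed flags for one (ip, app) group, in time order
downloads = pd.Series([0, 1, 0, 1, 1])

# Expanding sum is [0, 1, 1, 2, 3]; subtracting the current value
# leaves only downloads strictly before each row
prev = downloads.expanding(min_periods=2).sum() - downloads
print(prev.tolist())  # [nan, 0.0, 1.0, 1.0, 2.0]
```

The first row is NaN because min_periods=2 requires at least two observations in the window.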


P06 Tree-based vs Neural Network Models

Solution:
The features themselves will work for either model. However, numerical inputs to neural networks need to be standardized first. That is, the features need to be scaled such that they have 0 mean and a standard deviation of 1. This can be done using sklearn.preprocessing.StandardScaler.
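A minimal sketch of that standardization step, using made-up feature values rather than the competition data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical features (hypothetical values), one column per feature
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has zero mean and unit standard deviation
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```

In practice you would fit the scaler on the training split only and apply the same transform to validation and test, so no information leaks across splits.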