Data From SHIYANLOU 實驗樓

01 Renthop Company Dataset

    # Download the data and unpack it
    !wget -nc "https://labfile.oss.aliyuncs.com/courses/1283/renthop_train.json.gz"
    !gunzip "renthop_train.json.gz"

02 Telecom Churn Data (Telecom Carrier)

    !wget 'https://labfile.oss.aliyuncs.com/courses/1283/telecom_churn.csv'

03 House Price Data

    !wget 'https://labfile.oss.aliyuncs.com/courses/1363/HousePrice.csv'

04 Titanic Data

    !wget 'https://labfile.oss.aliyuncs.com/courses/1363/Titanic.csv'

05 SOCR Dataset

Weights and Heights

    !wget 'https://labfile.oss.aliyuncs.com/courses/1283/weights_heights.csv'  # download the dataset

06 Adult Data

    !wget 'https://labfile.oss.aliyuncs.com/courses/1283/adult.data.csv'

    # Load the file downloaded above, then encode the categorical columns as integers
    import pandas as pd
    data = pd.read_csv('adult.data.csv')
    data['salary'] = pd.factorize(data['salary'])[0]
    string = ['workclass', 'education', 'marital-status', 'relationship',
              'occupation', 'race', 'sex']
    for i in string:
        data[i] = pd.factorize(data[i])[0]
    data.info()

07 Home Credit

    # Download data
    !wget "https://labfile.oss.aliyuncs.com/courses/1363/HomeCredit.csv"
    !ls

    # Import libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    warnings.filterwarnings("ignore")
    %matplotlib inline

    path = 'HomeCredit.csv'
    house = pd.read_csv(path)
    house.head()

    # Main columns
    column = [
        'AMT_CREDIT', 'AMT_INCOME_TOTAL', 'AMT_GOODS_PRICE',
        'NAME_TYPE_SUITE', 'TARGET', 'NAME_CONTRACT_TYPE', 'NAME_INCOME_TYPE',
        'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_HOUSING_TYPE',
        'DAYS_BIRTH', 'DAYS_EMPLOYED'
    ]
    MainHouse = house[column]
    MainHouse.head()

    # Label encoding of the categorical columns
    from sklearn.preprocessing import LabelEncoder
    cat_features = [
        'NAME_TYPE_SUITE', 'NAME_CONTRACT_TYPE', 'NAME_INCOME_TYPE',
        'NAME_FAMILY_STATUS', 'OCCUPATION_TYPE',
        'NAME_HOUSING_TYPE', 'NAME_EDUCATION_TYPE'
    ]
    for col in cat_features:
        encoder = LabelEncoder()
        value = list(MainHouse[col].values.astype('str'))
        encoder.fit(value)
        MainHouse[col] = encoder.transform(value)
    MainHouse.head()

This covers the data preparation up to the point where the MainHouse dataframe is produced.


08 Google Play

    !wget -nc "https://labfile.oss.aliyuncs.com/courses/1363/googleplaystore.csv"
    !ls

LightGBM Encoding

    import lightgbm as lgb

    feature_cols = train.columns.drop('TARGET')
    dtrain = lgb.Dataset(train[feature_cols], label=train['TARGET'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['TARGET'])
    param = {
        'num_leaves': 64,
        'objective': 'binary'
    }
    param['metric'] = 'auc'
    num_round = 1000
    bst = lgb.train(
        param, dtrain, num_round, valid_sets=[dvalid],
        early_stopping_rounds=10,
        verbose_eval=False
    )

Example from Kaggle: click prediction

    from sklearn import metrics

    def train_model(train, valid, test=None, feature_cols=None):
        if feature_cols is None:
            feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                               'is_attributed'])
        dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
        dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
        param = {'num_leaves': 64, 'objective': 'binary',
                 'metric': 'auc', 'seed': 7}
        num_round = 1000
        print("Training model!")
        bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                        early_stopping_rounds=20, verbose_eval=False)
        valid_pred = bst.predict(valid[feature_cols])
        valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
        print(f"Validation AUC score: {valid_score}")
        if test is not None:
            test_pred = bst.predict(test[feature_cols])
            test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
            return bst, valid_score, test_score
        else:
            return bst, valid_score

C01 Machine Learning

The course material below is downloaded from SHIYANLOU 實驗樓.

Telecom Churn Data (Telecom Carrier)

    !wget 'https://labfile.oss.aliyuncs.com/courses/1283/telecom_churn.csv'

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv(
        'https://labfile.oss.aliyuncs.com/courses/1283/telecom_churn.csv')
    df['International plan'] = pd.factorize(df['International plan'])[0]
    df['Voice mail plan'] = pd.factorize(df['Voice mail plan'])[0]
    df['Churn'] = df['Churn'].astype('int')
    states = df['State']
    y = df['Churn']
    df.drop(['State', 'Churn'], axis=1, inplace=True)

    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier

    X_train, X_holdout, y_train, y_holdout = train_test_split(
        df.values, y, test_size=0.3,
        random_state=17
    )
    tree = DecisionTreeClassifier(max_depth=5, random_state=17)
    knn = KNeighborsClassifier(n_neighbors=10)
    tree.fit(X_train, y_train)
    knn.fit(X_train, y_train)

    from sklearn.metrics import accuracy_score
    tree_pred = tree.predict(X_holdout)
    accuracy_score(y_holdout, tree_pred)
    knn_pred = knn.predict(X_holdout)
    accuracy_score(y_holdout, knn_pred)

Credit Scoring Prediction

    !wget 'https://labfile.oss.aliyuncs.com/courses/1283/credit_scoring_sample.csv'

    path = 'credit_scoring_sample.csv'
    data = pd.read_csv(path, sep=';')
    data.head()
| Feature | Variable Type | Value Type | Description |
| --- | --- | --- | --- |
| age | Input Feature | integer | Customer age |
| DebtRatio | Input Feature | real | Total monthly loan payments (loan, alimony, etc.) divided by total monthly income, as a percentage |
| NumberOfTime30-59DaysPastDueNotWorse | Input Feature | integer | Number of cases when the client was 30-59 days overdue (but not worse) on other loans during the last 2 years |
| NumberOfTimes90DaysLate | Input Feature | integer | Number of cases when the customer was 90+ days past due on other credits |
| NumberOfTime60-89DaysPastDueNotWorse | Input Feature | integer | Number of cases when the customer was 60-89 days past due (but not worse) during the last 2 years |
| NumberOfDependents | Input Feature | integer | Number of customer dependents |
| SeriousDlqin2yrs | Target Variable | binary: 0 or 1 | Whether the customer has not paid the loan debt within 90 days |
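
As a quick sanity check (a minimal sketch, not part of the original notes), the target balance and the missing-value counts can be inspected on the `data` frame loaded above; this leads into the Missing Values section.

    # Assumed follow-up to the read_csv above
    data['SeriousDlqin2yrs'].value_counts(normalize=True)  # class balance of the target
    data.isnull().sum()                                     # missing values per column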

Missing Values

01 Impute the preprocessed training and validation features
02 Imputation removes the column names; put them back

    from sklearn.impute import SimpleImputer

    final_imputer = SimpleImputer(strategy='median')
    final_X_train = pd.DataFrame(final_imputer.fit_transform(X_train))
    final_X_valid = pd.DataFrame(final_imputer.transform(X_valid))
    final_X_train.columns = X_train.columns
    final_X_valid.columns = X_valid.columns

Code for preparation

    data['Voice mail plan'] = pd.factorize(data['Voice mail plan'])[0]
    data['Churn'] = data['Churn'].astype('int')
    states = data['State']
    y = data['Churn']
    data.drop(['State', 'Churn'], axis=1, inplace=True)

Convert the column from str/object to int64/float64 after removing the '+' and ',' characters:

    string = ['+', ',']
    for i in string:
        google['Installs'] = google['Installs'].apply(
            lambda x: x.replace(i, '') if i in str(x) else x
        )
    google['Installs'] = google['Installs'].apply(lambda x: int(x))

Time Series Data

    # Example using the Google Play Store data
    google['Last Updated'] = pd.to_datetime(
        google['Last Updated'], errors='coerce'
    )
    # Timestamp features
    google = google.assign(
        day=google['Last Updated'].dt.day,
        month=google['Last Updated'].dt.month,
        year=google['Last Updated'].dt.year
    )

C02 Feature Engineering

LightGBM Encoding

    import lightgbm as lgb
    from sklearn import metrics

    def train_model(train, valid):
        feature_cols = train.columns.drop('TARGET')
        dtrain = lgb.Dataset(
            train[feature_cols],
            label=train['TARGET'])
        dvalid = lgb.Dataset(
            valid[feature_cols],
            label=valid['TARGET'])
        param = {
            'num_leaves': 64,
            'objective': 'binary',
            'metric': 'auc',
            'seed': 7
        }
        bst = lgb.train(
            param, dtrain, num_boost_round=1000,
            valid_sets=[dvalid], early_stopping_rounds=10,
            verbose_eval=False
        )
        valid_pred = bst.predict(valid[feature_cols])
        valid_score = metrics.roc_auc_score(valid['TARGET'], valid_pred)
        print('Validation AUC score: {:.4f}'.format(valid_score))

Data Splits

    def get_data_splits(dataframe, valid_fraction=0.1):
        """
        Split a dataframe into train, validation, and test sets.
        The validation and test set sizes are both controlled by
        the valid_fraction keyword argument.
        """
        valid_size = int(len(dataframe) * valid_fraction)
        train = dataframe[:-valid_size * 2]
        valid = dataframe[-valid_size * 2:-valid_size]
        test = dataframe[-valid_size:]
        return train, valid, test

01 Count Encoding

    # Count Encoding
    import category_encoders as ce

    count_enc = ce.CountEncoder()
    count_encoded = count_enc.fit_transform(house[cat_features])
    MainHouse = MainHouse.join(count_encoded.add_suffix('_count'))
    train, valid, test = get_data_splits(MainHouse)
    train_model(train, valid)

The version that fits the encoder on the training split only, after the data splits:

    # Create the count encoder
    count_enc = ce.CountEncoder(cols=cat_features)
    # Learn the encoding from the training set
    count_enc.fit(train[cat_features])
    # Apply the encoding to the train and validation sets
    train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
    valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))

Question:
Why is count encoding effective?
Rare values tend to have similar counts (with values like 1 or 2), so you can classify rare values together at prediction time. Common values with large counts are unlikely to have the same exact count as other values. So, the common/important values get their own grouping.
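
A tiny toy illustration of this intuition (the series below is made up, not from the notes):

    import pandas as pd

    # Toy categorical column: two common values, three rare ones
    s = pd.Series(['a'] * 50 + ['b'] * 30 + ['c'] * 2 + ['d'] * 1 + ['e'] * 1)
    encoded = s.map(s.value_counts())
    # 'd' and 'e' both encode to 1 and 'c' to 2 (rare values collapse together),
    # while 'a' and 'b' get the distinct counts 50 and 30.
    print(pd.DataFrame({'value': s, 'count_encoded': encoded}).drop_duplicates())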

02 Target Encoding

    # Target Encoding
    # Start typing ce. then press Tab to bring up a list of classes and functions.
    target_enc = ce.TargetEncoder(cols=cat_features)
    # Learn the encoding from the training set, using the 'TARGET' column as the target
    target_enc.fit(train[cat_features], train['TARGET'])
    # Apply the encoding to the train and validation sets as new columns
    # with a `_target` suffix
    train_TE = train.join(
        target_enc.transform(train[cat_features]).add_suffix('_target'))
    valid_TE = valid.join(
        target_enc.transform(valid[cat_features]).add_suffix('_target'))
    train_model(train_TE, valid_TE)

Question:
Try removing the IP encoding.
Why do you think the score is below the baseline when we encode the IP address, but above the baseline when we don't?
Target encoding attempts to measure the population mean of the target for each level of a categorical feature. When there is less data per level, the estimated mean is further from the "true" mean, so there is more variance.
There is little data per IP address, so its estimates are likely much noisier than those for the other features. The model will rely heavily on this feature because it is extremely predictive.

This causes it to make fewer splits on other features, and those features are fit on just the errors left over after accounting for the IP address. So the model will perform very poorly when it sees new IP addresses that weren't in the training data (which is likely most new data).
Going forward, we'll leave out the IP feature when trying different encodings.

03 CatBoost Encoding

    # CatBoost Encoding
    target_enc = ce.CatBoostEncoder(cols=cat_features)
    target_enc.fit(train[cat_features], train['TARGET'])
    train_CBE = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
    valid_CBE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))
    train_model(train_CBE, valid_CBE)

Feature Generation

01 Interaction Features
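
The notes contain no code for this subsection. A minimal sketch, assuming the raw house frame and the MainHouse frame from section 07, is to concatenate two categorical columns and label-encode the combination as one new feature:

    # Hypothetical interaction feature (an assumption, not from the original notes)
    from sklearn.preprocessing import LabelEncoder

    # Combine two raw categorical columns into a single interaction column
    interaction = (house['NAME_CONTRACT_TYPE'].astype(str) + '_' +
                   house['NAME_INCOME_TYPE'].astype(str))
    MainHouse['CONTRACT_INCOME'] = LabelEncoder().fit_transform(interaction)
    MainHouse.head()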

Feature Selection

After data preprocessing is complete, the next step is to select, from the given set of features, those that are useful for the current learning task. This process is called feature selection.

Two criteria for feature selection:
Whether the feature varies:
If a feature does not vary, e.g. its variance is close to 0, the samples are essentially identical on that feature, so it contributes nothing to telling samples apart.

Correlation between the feature and the target:
Prefer features that are highly correlated with the target. A sketch covering both criteria follows.
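
A minimal sketch of both criteria, assuming the label-encoded MainHouse frame from section 07 (the threshold and column names are assumptions):

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    X = MainHouse.drop('TARGET', axis=1)
    y = MainHouse['TARGET']

    # Criterion 1: drop (near-)constant features
    selector = VarianceThreshold(threshold=0.0)   # removes zero-variance columns
    selector.fit(X)
    X_var = X.loc[:, selector.get_support()]

    # Criterion 2: rank the remaining features by absolute correlation with the target
    corr_with_target = X_var.apply(lambda col: col.corr(y)).abs().sort_values(ascending=False)
    print(corr_with_target.head(10))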

01 Univariate Feature Selection

The first argument of SelectKBest() must be a scoring function (f_classif in the example below; f_regression would be used for a regression task). The second argument, k, is the number of highest-scoring features to keep as training material. Once the selector is created, its .fit_transform(X, y) method extracts the selected features.

    from sklearn.feature_selection import SelectKBest, f_classif

    feature_cols = baseline_data.columns.drop('Installs')
    # Create the selector, keeping 5 features
    selector = SelectKBest(f_classif, k=5)
    # Use the selector to retrieve the best features
    X_new = selector.fit_transform(
        baseline_data[feature_cols], baseline_data['Installs']
    )
    X_new

    # Fit the selector on the training split only
    train, valid, _ = get_data_splits(baseline_data)
    selector = SelectKBest(f_classif, k=5)
    X_new = selector.fit_transform(
        train[feature_cols], train['Installs'])
    X_new

    # Get back the kept features as a DataFrame with dropped columns as all 0s
    selected_features = pd.DataFrame(
        selector.inverse_transform(X_new),
        index=train.index,
        columns=feature_cols
    )
    selected_features.head()

    # Dropped columns have values of all 0s, so their variance is 0; drop them
    selected_columns = selected_features.columns[
        selected_features.var() != 0
    ]
    # Find the columns that were dropped
    dropped_columns = selected_features.columns[
        selected_features.var() == 0
    ]
    # Get the validation dataset with the selected features
    valid[selected_columns].head()

Question
The best value of K
How would you find the "best" value of K?

Solution:
To find the best value of K, you can fit multiple models with increasing values of K, then choose the smallest K whose validation score is above some threshold or meets some other criterion.

A good way to do this is to loop over values of K and record the validation score for each iteration, as sketched below.
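
A rough sketch of that loop. The names baseline_data, feature_cols, and get_data_splits are assumed from earlier in these notes, and the scoring step is left as a comment because it depends on which train_model variant you use:

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif

    train, valid, _ = get_data_splits(baseline_data)

    kept_by_k = {}
    for k in range(1, len(feature_cols) + 1):
        selector = SelectKBest(f_classif, k=k)
        X_new = selector.fit_transform(train[feature_cols], train['Installs'])
        # Recover the kept column names (dropped columns come back as all zeros)
        selected = pd.DataFrame(selector.inverse_transform(X_new),
                                index=train.index, columns=feature_cols)
        kept_by_k[k] = list(selected.columns[selected.var() != 0])
        # Here you would train a model on the kept columns, record the validation
        # score, and finally pick the smallest k that meets your threshold.
    print(kept_by_k)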

02 L1 Regularization for Feature Selection

Hint:
First fit the logistic regression model, then pass it to SelectFromModel. That gives you a model with the selected features; you can retrieve them with X_new = model.transform(X).

However, this drops the column labels, so you'll need to get them back. The easiest way is to use model.inverse_transform(X_new) to recover the original X array with the dropped columns as all zeros. Then you can create a new DataFrame with the index and columns of X. From there, keep the columns that aren't all zeros.

Example

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import SelectFromModel

    def select_features_l1(X, y):
        # 'liblinear' is used because the default solver does not support the L1 penalty
        logistic = LogisticRegression(C=0.1, penalty="l1", solver="liblinear",
                                      random_state=7).fit(X, y)
        model = SelectFromModel(logistic, prefit=True)
        X_new = model.transform(X)
        # Get back the kept features as a DataFrame with dropped columns as all 0s
        selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                         index=X.index,
                                         columns=X.columns)
        # Dropped columns have values of all 0s, keep the other columns
        cols_to_keep = selected_features.columns[selected_features.var() != 0]
        return cols_to_keep

03 Feature Selection with Trees

What would you do differently to select the features using a tree-based classifier?

Solution:
You could use something like RandomForestClassifier or ExtraTreesClassifier to find feature importances. SelectFromModel can use the feature importances to find the best features.
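
A hedged sketch of that approach (the estimator, threshold, and function name are assumptions, mirroring the select_features_l1 example above):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    def select_features_trees(X, y):
        # Fit a forest and keep features whose importance is above the mean importance
        forest = RandomForestClassifier(n_estimators=100, random_state=7).fit(X, y)
        model = SelectFromModel(forest, prefit=True, threshold='mean')
        X_new = model.transform(X)
        # Get back the kept features as a DataFrame with dropped columns as all 0s
        selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                         index=X.index, columns=X.columns)
        return selected_features.columns[selected_features.var() != 0]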

04 Top K features with L1 regularization

By setting C you aren't able to choose an exact number of features to keep. What would you do to keep the top K most important features using L1 regularization?

Solution:
To select a certain number of features with L1 regularization, you need to find the regularization parameter that leaves the desired number of features. To do this you can iterate over models with different regularization parameters from low to high and choose the one that leaves K features. Note that for the scikit-learn models C is the inverse of the regularization strength.
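
A hedged sketch of that search (the grid of C values and the helper name are assumptions; it reuses the same pattern as select_features_l1 above):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import SelectFromModel

    def select_top_k_l1(X, y, k):
        """Increase C (i.e. weaken the regularization) until about k features survive."""
        for C in np.logspace(-4, 1, 20):
            logistic = LogisticRegression(C=C, penalty='l1', solver='liblinear',
                                          random_state=7).fit(X, y)
            model = SelectFromModel(logistic, prefit=True)
            selected = pd.DataFrame(model.inverse_transform(model.transform(X)),
                                    index=X.index, columns=X.columns)
            cols = selected.columns[selected.var() != 0]
            if len(cols) >= k:
                return cols
        return cols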

Conclusion

First, a few terms:
Feature Engineering:
The process of using domain knowledge of the data to create features that let machine learning algorithms reach their best performance.

Feature construction:
Manually building new features from the raw data.

Feature extraction:
Automatically constructing new features by transforming the original features into a set of features with clear physical meaning, statistical meaning, or kernel structure.

Feature selection:
Picking the most statistically meaningful subset of features from the feature set, thereby achieving dimensionality reduction.

Feature engineering is a super-set of activities that includes feature extraction, feature construction, and feature selection. Each of the three is an important step and none should be ignored. We can still generalize about their importance: in my experience, the relative importance of the steps is feature construction > feature extraction > feature selection.

And this really is the case: if feature construction is done poorly, it directly affects feature extraction, which in turn affects feature selection, and ultimately the performance of the model.