# Predicting House Prices on Kaggle
:label:`sec_kaggle_house`

Now that we have introduced some basic tools for building and training deep networks and regularizing them with techniques including weight decay and dropout, we are ready to put all this knowledge into practice by participating in a Kaggle competition. The house price prediction competition is a great place to start. The data are fairly generic and do not exhibit exotic structure that might require specialized models (as audio or video might). This dataset, collected by Bart de Cock in 2011 :cite:`De-Cock.2011`, covers house prices in Ames, IA, from 2006 to 2010. It is considerably larger than the famous Boston housing dataset of Harrison and Rubinfeld (1978), boasting both more examples and more features.

In this section, we will walk you through details of data preprocessing, model design, and hyperparameter selection. We hope that through a hands-on approach, you will gain some intuitions that will guide you in your career as a data scientist.

## Downloading and Caching Datasets

Throughout the book, we will train and test models on various downloaded datasets. Here, we implement several utility functions to facilitate data downloading. First, we maintain a dictionary `DATA_HUB` that maps a string (the *name* of the dataset) to a tuple containing both the URL to locate the dataset and the SHA-1 key that verifies the integrity of the file. All such datasets are hosted at the site whose address is `DATA_URL`.

```{.python .input}
#@tab all
import os
import requests
import zipfile
import tarfile
import hashlib

DATA_HUB = dict()  #@save
DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'  #@save
```

The following `download` function downloads a dataset,
caches it in a local directory (`../data` by default),
and returns the name of the downloaded file.
If a file corresponding to this dataset
already exists in the cache directory
and its SHA-1 matches the one stored in `DATA_HUB`,
our code will use the cached file to avoid
clogging up your internet with redundant downloads.

```{.python .input}
#@tab all
def download(name, cache_dir=os.path.join('..', 'data')):  #@save
    """Download a file inserted into DATA_HUB, return the local filename."""
    assert name in DATA_HUB, f"{name} does not exist in {DATA_HUB}."
    url, sha1_hash = DATA_HUB[name]
    d2l.mkdir_if_not_exist(cache_dir)
    fname = os.path.join(cache_dir, url.split('/')[-1])
    if os.path.exists(fname):
        sha1 = hashlib.sha1()
        with open(fname, 'rb') as f:
            while True:
                data = f.read(1048576)
                if not data:
                    break
                sha1.update(data)
        if sha1.hexdigest() == sha1_hash:
            return fname  # Hit cache
    print(f'Downloading {fname} from {url}...')
    r = requests.get(url, stream=True, verify=True)
    with open(fname, 'wb') as f:
        f.write(r.content)
    return fname
```
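The cache check can also be tried in isolation. The following is a minimal standard-library sketch, not part of the book's utilities: the helper names `sha1_of_file` and `is_cached` and the file contents are hypothetical, chosen only to show how a chunked SHA-1 digest decides between a cache hit and a fresh download.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1048576):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()


def is_cached(path, expected_sha1):
    """A file counts as a cache hit only if it exists and its digest matches."""
    return os.path.exists(path) and sha1_of_file(path) == expected_sha1


# Demonstration on a throwaway file with made-up contents.
payload = b'Id,SalePrice\n1,208500\n'
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv')
    with open(path, 'wb') as f:
        f.write(payload)
    expected = hashlib.sha1(payload).hexdigest()
    hit = is_cached(path, expected)                  # digest matches
    miss = is_cached(path, '0' * 40)                 # digest differs
    absent = is_cached(path + '.missing', expected)  # file does not exist
print(hit, miss, absent)
```

Reading in fixed-size chunks keeps memory use bounded even for large archives, which is presumably why `download` reads 1 MiB (1048576 bytes) at a time.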

We also implement two additional utility functions: one is to download and extract a zip or tar file and the other to download all the datasets used in this book from `DATA_HUB` into the cache directory.

```{.python .input}
#@tab all
def download_extract(name, folder=None):  #@save
    """Download and extract a zip/tar file."""
    fname = download(name)
    base_dir = os.path.dirname(fname)
    data_dir, ext = os.path.splitext(fname)
    if ext == '.zip':
        fp = zipfile.ZipFile(fname, 'r')
    elif ext in ('.tar', '.gz'):
        fp = tarfile.open(fname, 'r')
    else:
        assert False, 'Only zip/tar files can be extracted.'
    fp.extractall(base_dir)
    return os.path.join(base_dir, folder) if folder else data_dir

def download_all():  #@save
    """Download all files in the DATA_HUB."""
    for name in DATA_HUB:
        download(name)
```
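To see which directory `download_extract` returns by default, it helps to look at how `os.path.splitext` treats archive names. This small sketch repeats only the path handling. The helper `split_archive_name` and the file names are hypothetical.

```python
import os


def split_archive_name(fname):
    """Repeat the path handling of `download_extract` for a file name."""
    base_dir = os.path.dirname(fname)
    data_dir, ext = os.path.splitext(fname)
    return base_dir, data_dir, ext


# Made-up file names, used only to illustrate the splitting behavior.
zip_parts = split_archive_name(os.path.join('..', 'data', 'example.zip'))
tgz_parts = split_archive_name(os.path.join('..', 'data', 'example.tar.gz'))
print(zip_parts)
print(tgz_parts)
```

Because `splitext` strips only the last suffix, a `.tar.gz` archive yields the extension `.gz` and a default `data_dir` that still ends in `.tar`. In that case passing `folder` explicitly is likely the safer choice.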

## Kaggle

[Kaggle](https://www.kaggle.com) is a popular platform
that hosts machine learning competitions.
Each competition centers on a dataset and many
are sponsored by stakeholders who offer prizes
to the winning solutions.
The platform helps users to interact
via forums and shared code,
fostering both collaboration and competition.
While leaderboard chasing often spirals out of control,
with researchers focusing myopically on preprocessing steps
rather than asking fundamental questions,
there is also tremendous value in the objectivity of a platform
that facilitates direct quantitative comparisons
among competing approaches as well as code sharing
so that everyone can learn what did and did not work.
If you want to participate in a Kaggle competition,
you will first need to register for an account
(see :numref:`fig_kaggle`).

![The Kaggle website.](/uploads/projects/d2l-ai-CN/img/kaggle.png)
:width:`400px`
:label:`fig_kaggle`

On the house price prediction competition page, as illustrated
in :numref:`fig_house_pricing`,
you can find the dataset (under the "Data" tab),
submit predictions, and see your ranking.
The URL is right here:

> https://www.kaggle.com/c/house-prices-advanced-regression-techniques

![The house price prediction competition page.](/uploads/projects/d2l-ai-CN/img/house-pricing.png)
:width:`400px`
:label:`fig_house_pricing`
## Accessing and Reading the Dataset

Note that the competition data is separated
into training and test sets.
Each record includes the property value of the house
and attributes such as street type, year of construction,
roof type, basement condition, etc.
The features consist of various data types.
For example, the year of construction
is represented by an integer,
the roof type by discrete categorical assignments,
and other features by floating point numbers.
And here is where reality complicates things:
for some examples, some data are altogether missing
with the missing value marked simply as "na".
The price of each house is included
for the training set only
(it is a competition after all).
We will want to partition the training set
to create a validation set,
but we only get to evaluate our models on the official test set
after uploading predictions to Kaggle.
The "Data" tab on the competition page
in :numref:`fig_house_pricing`
has links to download the data.

To get started, we will read in and process the data
using `pandas`, which we have introduced in :numref:`sec_pandas`.
So, you will want to make sure that you have `pandas` installed
before proceeding further.
Fortunately, if you are reading in Jupyter,
we can install `pandas` without even leaving the notebook.
```{.python .input}
# If pandas is not installed, please uncomment the following line:
# !pip install pandas

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, autograd, init, np, npx
from mxnet.gluon import nn
import pandas as pd
npx.set_np()
```

```{.python .input}
#@tab pytorch
# If pandas is not installed, please uncomment the following line:
# !pip install pandas

%matplotlib inline
from d2l import torch as d2l
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
```

```{.python .input}
#@tab tensorflow
# If pandas is not installed, please uncomment the following line:
# !pip install pandas

%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf
import pandas as pd
import numpy as np
```

For convenience, we can download and cache the Kaggle housing dataset using the script we defined above.

```{.python .input}
#@tab all
DATA_HUB['kaggle_house_train'] = (  #@save
    DATA_URL + 'kaggle_house_pred_train.csv',
    '585e9cc93e70b39160e7921475f9bcd7d31219ce')

DATA_HUB['kaggle_house_test'] = (  #@save
    DATA_URL + 'kaggle_house_pred_test.csv',
    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90')
```

We use `pandas` to load the two CSV files containing training and test data respectively.

```{.python .input}
#@tab all
train_data = pd.read_csv(download('kaggle_house_train'))
test_data = pd.read_csv(download('kaggle_house_test'))
```

The training dataset includes 1460 examples, 80 features, and 1 label, while the test data contains 1459 examples and 80 features.

```{.python .input}
#@tab all
print(train_data.shape)
print(test_data.shape)
```

Let us take a look at the first four and last two features
as well as the label (SalePrice) from the first four examples.

```{.python .input}
#@tab all
print(train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])
```

We can see that in each example, the first feature is the ID. This helps the model identify each training example. While this is convenient, it does not carry any information for prediction purposes. Hence, we remove it from the dataset before feeding the data into the model.

```{.python .input}
#@tab all
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
```

## Data Preprocessing

As stated above, we have a wide variety of data types.
We will need to preprocess the data before we can start modeling.
Let us start with the numerical features.
First, we apply a heuristic,
replacing all missing values
by the corresponding feature's mean.
Then, to put all features on a common scale,
we *standardize* the data by
rescaling features to zero mean and unit variance:

$$x \leftarrow \frac{x - \mu}{\sigma}.$$

To verify that this indeed transforms
our feature (variable) such that it has zero mean and unit variance,
note that $E[\frac{x-\mu}{\sigma}] = \frac{\mu - \mu}{\sigma} = 0$
and that $E[(x-\mu)^2] = (\sigma^2 + \mu^2) - 2\mu^2+\mu^2 = \sigma^2$.
Intuitively, we standardize the data
for two reasons.
First, it proves convenient for optimization.
Second, because we do not know *a priori*
which features will be relevant,
we do not want to penalize coefficients
assigned to one feature more than those assigned to any other.
```{.python .input}
#@tab all
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardizing the data all means vanish, hence we can set missing
# values to 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)
```
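The two claims above, zero mean and unit variance after standardization and that filling missing entries with 0 amounts to filling with the mean, can be checked on a toy column. This is a plain-Python sketch rather than the `pandas` pipeline. The `standardize` helper and the numbers are hypothetical, and `None` stands in for a missing value. It uses the sample standard deviation (divisor $n-1$), which is also the default of `pandas`' `std`.

```python
import statistics


def standardize(column):
    """Standardize a list of numbers; `None` marks a missing value."""
    observed = [x for x in column if x is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)  # sample standard deviation (n - 1)
    # Missing entries become 0, which is the mean after standardization.
    return [0.0 if x is None else (x - mu) / sigma for x in column]


# Made-up values with one missing entry.
lot_area = [8450.0, 9600.0, None, 11250.0, 9550.0]
scaled = standardize(lot_area)
observed_scaled = [z for z, x in zip(scaled, lot_area) if x is not None]
print(scaled)
print(statistics.mean(scaled), statistics.stdev(observed_scaled))
```

The mean of the filled column stays at zero up to floating point error, and the observed entries have unit standard deviation.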

Next we deal with discrete values. This includes features such as "MSZoning". We replace them by a one-hot encoding in the same way that we previously transformed multiclass labels into vectors (see :numref:`subsec_classification-problem`). For instance, "MSZoning" assumes the values "RL" and "RM". Dropping the "MSZoning" feature, two new indicator features "MSZoning_RL" and "MSZoning_RM" are created with values being either 0 or 1. According to one-hot encoding, if the original value of "MSZoning" is "RL", then "MSZoning_RL" is 1 and "MSZoning_RM" is 0. The `pandas` package does this automatically for us.

```{.python .input}
#@tab all
# `dummy_na=True` considers "na" (missing value) as a valid feature value, and
# creates an indicator feature for it
all_features = pd.get_dummies(all_features, dummy_na=True)
all_features.shape
```
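To make the encoding concrete, here is a plain-Python sketch of what one-hot encoding with a missing-value indicator does to a single column. It is an illustration under stated assumptions, not the `pandas` implementation: the helper `one_hot`, the toy values, and the `_nan` column suffix are chosen here for the example.

```python
def one_hot(values, prefix):
    """One-hot encode a list of categories; `None` gets its own indicator."""
    categories = sorted({v for v in values if v is not None})
    columns = {f'{prefix}_{c}': [int(v == c) for v in values]
               for c in categories}
    # Analogue of `dummy_na=True`: an extra indicator for missing values.
    columns[f'{prefix}_nan'] = [int(v is None) for v in values]
    return columns


zoning = ['RL', 'RM', None, 'RL']  # toy column with one missing value
encoded = one_hot(zoning, 'MSZoning')
for name, column in encoded.items():
    print(name, column)
```

Every example has exactly one indicator set to 1, so a missing value is treated as just another category rather than being dropped.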

You can see that this conversion increases
the number of features from 79 to 331.
Finally, via the `values` attribute,
we can extract the NumPy format from the `pandas` format
and convert it into the tensor
representation for training.

```{.python .input}
#@tab all
n_train = train_data.shape[0]
train_features = d2l.tensor(all_features[:n_train].values, dtype=d2l.float32)
test_features = d2l.tensor(all_features[n_train:].values, dtype=d2l.float32)
train_labels = d2l.tensor(
    train_data.SalePrice.values.reshape(-1, 1), dtype=d2l.float32)
```

## Training

To get started, we train a linear model with squared loss. Not surprisingly, our linear model will not lead to a competition-winning submission, but it provides a sanity check to see whether there is meaningful information in the data. If we cannot do better than random guessing here, then there is a good chance that we have a data processing bug. And if things work, the linear model will serve as a baseline, giving us some intuition about how close the simple model gets to the best reported models and how much gain we should expect from fancier models.

```{.python .input}
loss = gluon.loss.L2Loss()

def get_net():
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize()
    return net
```

```{.python .input}
#@tab pytorch
loss = nn.MSELoss()
in_features = train_features.shape[1]

def get_net():
    net = nn.Sequential(nn.Linear(in_features,1))
    return net
```

```{.python .input}
#@tab tensorflow
loss = tf.keras.losses.MeanSquaredError()

def get_net():
    net = tf.keras.models.Sequential()
    net.add(tf.keras.layers.Dense(
        1, kernel_regularizer=tf.keras.regularizers.l2(weight_decay)))
    return net
```

With house prices, as with stock prices,
we care about relative quantities
more than absolute quantities.
Thus we tend to care more about
the relative error $\frac{y - \hat{y}}{y}$
than about the absolute error $y - \hat{y}$.
For instance, if our prediction is off by 100,000 USD
when estimating the price of a house in rural Ohio,
where the value of a typical house is 125,000 USD,
then we are probably doing a horrible job.
On the other hand, if we err by this amount
in Los Altos Hills, California,
this might represent a stunningly accurate prediction
(there, the median house price exceeds 4 million USD).

One way to address this problem is to
measure the discrepancy in the logarithm of the price estimates.
In fact, this is also the official error measure
used by the competition to evaluate the quality of submissions.
After all, a small value $\delta$ for $|\log y - \log \hat{y}| \leq \delta$
translates into $e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^\delta$.
This leads to the following root-mean-squared-error between the logarithm of the predicted price and the logarithm of the label price:

$$\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log y_i -\log \hat{y}_i\right)^2}.$$
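The effect of the logarithm can be checked with the two houses from the example above. This is a small standard-library sketch. The helper name `log_rmse_scalar` and the predicted prices are hypothetical, picked so that both predictions are off by the same 100,000 USD.

```python
import math


def log_rmse_scalar(labels, preds):
    """Root-mean-squared error between log labels and log predictions."""
    squared = [(math.log(y) - math.log(y_hat)) ** 2
               for y, y_hat in zip(labels, preds)]
    return math.sqrt(sum(squared) / len(squared))


# The same absolute error of 100,000 USD on two very different houses.
cheap = log_rmse_scalar([125_000], [225_000])
expensive = log_rmse_scalar([4_000_000], [4_100_000])
print(f'{cheap:.4f} {expensive:.4f}')

# A log error of at most delta bounds the ratio y_hat / y on both sides.
delta = 0.1
lower, upper = math.exp(-delta), math.exp(delta)
print(f'{lower:.3f} <= y_hat / y <= {upper:.3f}')
```

The error on the cheaper house comes out more than twenty times larger than on the expensive one, which matches the intuition that the metric tracks relative rather than absolute deviations.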
```{.python .input}
def log_rmse(net, features, labels):
    # To further stabilize the value when the logarithm is taken, set the
    # value less than 1 as 1
    clipped_preds = np.clip(net(features), 1, float('inf'))
    return np.sqrt(2 * loss(np.log(clipped_preds), np.log(labels)).mean())
```

```{.python .input}
#@tab pytorch
def log_rmse(net, features, labels):
    # To further stabilize the value when the logarithm is taken, set the
    # value less than 1 as 1
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(torch.mean(loss(torch.log(clipped_preds),
                                      torch.log(labels))))
    return rmse.item()
```

```{.python .input}
#@tab tensorflow
def log_rmse(y_true, y_pred):
    # To further stabilize the value when the logarithm is taken, set the
    # value less than 1 as 1
    clipped_preds = tf.clip_by_value(y_pred, 1, float('inf'))
    return tf.sqrt(tf.reduce_mean(loss(
        tf.math.log(y_true), tf.math.log(clipped_preds))))
```

Unlike in previous sections, our training functions will rely on the Adam optimizer (we will describe it in greater detail later). The main appeal of this optimizer is that, despite doing no better (and sometimes worse) given unlimited resources for hyperparameter optimization, people tend to find that it is significantly less sensitive to the initial learning rate.

```{.python .input}
def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # The Adam optimization algorithm is used here
    trainer = gluon.Trainer(net.collect_params(), 'adam', {
        'learning_rate': learning_rate, 'wd': weight_decay})
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
```

```{.python .input}
#@tab pytorch
def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # The Adam optimization algorithm is used here
    optimizer = torch.optim.Adam(net.parameters(),
                                 lr = learning_rate,
                                 weight_decay = weight_decay)
    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
```

```{.python .input}
#@tab tensorflow
def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # The Adam optimization algorithm is used here
    optimizer = tf.keras.optimizers.Adam(learning_rate)
    net.compile(loss=loss, optimizer=optimizer)
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with tf.GradientTape() as tape:
                y_hat = net(X)
                l = loss(y, y_hat)
            params = net.trainable_variables
            grads = tape.gradient(l, params)
            optimizer.apply_gradients(zip(grads, params))
        train_ls.append(log_rmse(train_labels, net(train_features)))
        if test_labels is not None:
            test_ls.append(log_rmse(test_labels, net(test_features)))
    return train_ls, test_ls
```
## $K$-Fold Cross-Validation

You might recall that we introduced $K$-fold cross-validation
in the section where we discussed how to deal
with model selection (:numref:`sec_model_selection`).
We will put this to good use to select the model design
and to adjust the hyperparameters.
We first need a function that returns
the $i^\mathrm{th}$ fold of the data
in a $K$-fold cross-validation procedure.
It proceeds by slicing out the $i^\mathrm{th}$ segment
as validation data and returning the rest as training data.
Note that this is not the most efficient way of handling data
and we would definitely do something much smarter
if our dataset were considerably larger.
But this added complexity might obfuscate our code unnecessarily
so we can safely omit it here owing to the simplicity of our problem.

```{.python .input}
#@tab all
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = d2l.concat([X_train, X_part], 0)
            y_train = d2l.concat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid
```
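The slicing logic is easier to follow on bare indices. The sketch below reproduces only the index arithmetic of `get_k_fold_data` in plain Python. The helpers `fold_slices` and `split_indices` are hypothetical names introduced for this illustration.

```python
def fold_slices(n, k):
    """Return the slice covered by each fold, as in `get_k_fold_data`."""
    assert k > 1
    fold_size = n // k
    return [slice(j * fold_size, (j + 1) * fold_size) for j in range(k)]


def split_indices(n, k, i):
    """Indices used for training and validation when fold `i` is held out."""
    indices = list(range(n))
    train, valid = [], []
    for j, idx in enumerate(fold_slices(n, k)):
        (valid if j == i else train).extend(indices[idx])
    return train, valid


train_idx, valid_idx = split_indices(n=10, k=3, i=1)
print(train_idx)
print(valid_idx)
# With 1460 training examples and k = 5 the folds divide evenly.
print(1460 // 5, 1460 % 5)
```

Because the fold size uses integer division, any remainder is silently left out of every fold: with 10 examples and 3 folds, index 9 is never used. For the 1460 training examples here and $K=5$, nothing is dropped.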

When we train $K$ times in the $K$-fold cross-validation, the averages of the training and validation errors are returned.

```{.python .input}
#@tab all
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
                     xlabel='epoch', ylabel='rmse',
                     legend=['train', 'valid'], yscale='log')
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k
```

## Model Selection

In this example, we pick an untuned set of hyperparameters
and leave it up to the reader to improve the model.
Finding a good choice can take time,
depending on how many variables one optimizes over.
With a large enough dataset,
and the normal sorts of hyperparameters,
$K$-fold cross-validation tends to be
reasonably resilient against multiple testing.
However, if we try an unreasonably large number of options
we might just get lucky and find that our validation
performance is no longer representative of the true error.

```{.python .input}
#@tab all
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
      f'avg valid log rmse: {float(valid_l):f}')
```

Notice that sometimes the training error for a set of hyperparameters can be very low, even as the error on $K$-fold cross-validation is considerably higher. This indicates that we are overfitting. Throughout training you will want to monitor both numbers. Less overfitting might indicate that our data can support a more powerful model. Massive overfitting might suggest that we can gain by incorporating regularization techniques.
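One simple way to organize the search over hyperparameters is an exhaustive grid. The sketch below is only an illustration under stated assumptions: `grid_search` is a hypothetical helper, and `toy_score` is an arbitrary stand-in for the average validation log RMSE so that the example runs without training anything.

```python
import itertools


def grid_search(score_fn, grid):
    """Return the (score, setting) pair with the lowest score over a grid."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        setting = dict(zip(names, values))
        score = score_fn(**setting)
        if best is None or score < best[0]:
            best = (score, setting)
    return best


def toy_score(lr, weight_decay):
    """Arbitrary toy function standing in for a validation error."""
    return (lr - 5) ** 2 + weight_decay


grid = {'lr': [0.5, 5, 50], 'weight_decay': [0, 0.1, 1]}
best_score, best_setting = grid_search(toy_score, grid)
print(best_score, best_setting)
```

In practice `score_fn` might wrap `k_fold` and return its average validation error. The grid is best kept small, since, as noted above, trying too many options makes the validation estimate less trustworthy.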

## Submitting Predictions on Kaggle

Now that we know what a good choice of hyperparameters should be, we might as well use all the data to train on it (rather than just $1-1/K$ of the data that are used in the cross-validation slices). The model that we obtain in this way can then be applied to the test set. Saving the predictions in a csv file will simplify uploading the results to Kaggle.

```{.python .input}
#@tab all
def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    d2l.plot(np.arange(1, num_epochs + 1), [train_ls], xlabel='epoch',
             ylabel='log rmse', yscale='log')
    print(f'train log rmse {float(train_ls[-1]):f}')
    # Apply the network to the test set
    preds = d2l.numpy(net(test_features))
    # Reformat it to export to Kaggle
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('submission.csv', index=False)
```

One nice sanity check is to see
whether the predictions on the test set
resemble those of the $K$-fold cross-validation process.
If they do, it is time to upload them to Kaggle.
The following code will generate a file called `submission.csv`.

```{.python .input}
#@tab all
train_and_pred(train_features, test_features, train_labels, test_data,
               num_epochs, lr, weight_decay, batch_size)
```
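Before uploading, it can be worth checking the format of the generated file. The sketch below is a hypothetical helper, not something Kaggle or `d2l` provides. It assumes a two-column `Id,SalePrice` layout as written by the code above, and it treats non-positive prices as suspicious because the metric takes logarithms. The sample rows are made up.

```python
import csv
import io


def check_submission(csv_text, expected_rows):
    """Basic format checks for a two-column `Id,SalePrice` submission."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    problems = []
    if header != ['Id', 'SalePrice']:
        problems.append(f'unexpected header: {header}')
    if len(body) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(body)}')
    for row in body:
        # Each row needs an Id and a positive predicted price.
        if len(row) != 2 or float(row[1]) <= 0:
            problems.append(f'bad row: {row}')
    return problems


good = 'Id,SalePrice\n1461,120000.5\n1462,155000.0\n'
bad = 'Id,SalePrice\n1461,-3.0\n'
print(check_submission(good, expected_rows=2))
print(check_submission(bad, expected_rows=2))
```

For the real file one would read `submission.csv` from disk and pass `expected_rows=1459`, the number of test examples reported earlier.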

Next, as demonstrated in :numref:`fig_kaggle_submit2`, we can submit our predictions on Kaggle and see how they compare with the actual house prices (labels) on the test set. The steps are quite simple:

* Log in to the Kaggle website and visit the house price prediction competition page.
* Click the "Submit Predictions" or "Late Submission" button (as of this writing, the button is located on the right).
* Click the "Upload Submission File" button in the dashed box at the bottom of the page and select the prediction file you wish to upload.
* Click the "Make Submission" button at the bottom of the page to view your results.

Submitting data to Kaggle
:width:`400px`
:label:`fig_kaggle_submit2`

## Summary

* Real data often contain a mix of different data types and need to be preprocessed.
* Rescaling real-valued data to zero mean and unit variance is a good default. So is replacing missing values with their mean.
* Transforming categorical features into indicator features allows us to treat them like one-hot vectors.
* We can use $K$-fold cross-validation to select the model and adjust the hyperparameters.
* Logarithms are useful for relative errors.

## Exercises

  1. Submit your predictions for this section to Kaggle. How good are your predictions?
  2. Can you improve your model by minimizing the logarithm of prices directly? What happens if you try to predict the logarithm of the price rather than the price?
  3. Is it always a good idea to replace missing values by their mean? Hint: can you construct a situation where the values are not missing at random?
  4. Improve the score on Kaggle by tuning the hyperparameters through $K$-fold cross-validation.
  5. Improve the score by improving the model (e.g., layers, weight decay, and dropout).
  6. What happens if we do not standardize the continuous numerical features like what we have done in this section?

:begin_tab:`mxnet`
Discussions
:end_tab:

:begin_tab:`pytorch`
Discussions
:end_tab:

:begin_tab:`tensorflow`
Discussions
:end_tab: