Weight Decay

:label:sec_weight_decay

Now that we have characterized the problem of overfitting, we can introduce some standard techniques for regularizing models. Recall that we can always mitigate overfitting by going out and collecting more training data. That can be costly, time consuming, or entirely out of our control, making it impossible in the short run. For now, we can assume that we already have as much high-quality data as our resources permit and focus on regularization techniques.

Recall that in our polynomial regression example (:numref:sec_model_selection) we could limit our model’s capacity simply by tweaking the degree of the fitted polynomial. Indeed, limiting the number of features is a popular technique to mitigate overfitting. However, simply tossing aside features can be too blunt an instrument for the job. Sticking with the polynomial regression example, consider what might happen with high-dimensional inputs. The natural extensions of polynomials to multivariate data are called monomials, which are simply products of powers of variables. The degree of a monomial is the sum of the powers. For example, $x_1^2 x_2$, and $x_3 x_5^2$ are both monomials of degree 3.

Note that the number of terms with degree $d$ blows up rapidly as $d$ grows larger. Given $k$ variables, the number of monomials of degree $d$ (i.e., $k$ multichoose $d$) is ${k - 1 + d} \choose {k - 1}$. Even small changes in degree, say from $2$ to $3$, dramatically increase the complexity of our model. Thus we often need a more fine-grained tool for adjusting function complexity.

Norms and Weight Decay

We have described both the $L_2$ norm and the $L_1$ norm, which are special cases of the more general $L_p$ norm in :numref:subsec_lin-algebra-norms. Weight decay (commonly called $L_2$ regularization), might be the most widely-used technique for regularizing parametric machine learning models. The technique is motivated by the basic intuition that among all functions $f$, the function $f = 0$ (assigning the value $0$ to all inputs) is in some sense the simplest, and that we can measure the complexity of a function by its distance from zero. But how precisely should we measure the distance between a function and zero? There is no single right answer. In fact, entire branches of mathematics, including parts of functional analysis and the theory of Banach spaces, are devoted to answering this issue.

One simple interpretation might be to measure the complexity of a linear function $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ by some norm of its weight vector, e.g., $| \mathbf{w} |^2$. The most common method for ensuring a small weight vector is to add its norm as a penalty term to the problem of minimizing the loss. Thus we replace our original objective, minimizing the prediction loss on the training labels, with new objective, minimizing the sum of the prediction loss and the penalty term. Now, if our weight vector grows too large, our learning algorithm might focus on minimizing the weight norm $| \mathbf{w} |^2$ vs. minimizing the training error. That is exactly what we want. To illustrate things in code, let us revive our previous example from :numref:sec_linear_regression for linear regression. There, our loss was given by

L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.

Recall that $\mathbf{x}^{(i)}$ are the features, $y^{(i)}$ are labels for all data points $i$, and $(\mathbf{w}, b)$ are the weight and bias parameters, respectively. To penalize the size of the weight vector, we must somehow add $| \mathbf{w} |^2$ to the loss function, but how should the model trade off the standard loss for this new additive penalty? In practice, we characterize this tradeoff via the regularization constant $\lambda$, a non-negative hyperparameter that we fit using validation data:

L(\mathbf{w}, b) + \frac{\lambda}{2} |\mathbf{w}|^2,

For $\lambda = 0$, we recover our original loss function. For $\lambda > 0$, we restrict the size of $| \mathbf{w} |$. We divide by $2$ by convention: when we take the derivative of a quadratic function, the $2$ and $1/2$ cancel out, ensuring that the expression for the update looks nice and simple. The astute reader might wonder why we work with the squared norm and not the standard norm (i.e., the Euclidean distance). We do this for computational convenience. By squaring the $L_2$ norm, we remove the square root, leaving the sum of squares of each component of the weight vector. This makes the derivative of the penalty easy to compute: the sum of derivatives equals the derivative of the sum.

Moreover, you might ask why we work with the $L_2$ norm in the first place and not, say, the $L_1$ norm. In fact, other choices are valid and popular throughout statistics. While $L_2$-regularized linear models constitute the classic ridge regression algorithm, $L_1$-regularized linear regression is a similarly fundamental model in statistics, which is popularly known as lasso regression.

One reason to work with the $L_2$ norm is that it places an outsize penalty on large components of the weight vector. This biases our learning algorithm towards models that distribute weight evenly across a larger number of features. In practice, this might make them more robust to measurement error in a single variable. By contrast, $L_1$ penalties lead to models that concentrate weights on a small set of features by clearing the other weights to zero. This is called feature selection, which may be desirable for other reasons.

Using the same notation in :eqref:eq_linreg_batch_update, the minibatch stochastic gradient descent updates for $L_2$-regularized regression follow:

$$ \begin{aligned} \mathbf{w} & \leftarrow \left(1- \eta\lambda \right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned} $$

As before, we update $\mathbf{w}$ based on the amount by which our estimate differs from the observation. However, we also shrink the size of $\mathbf{w}$ towards zero. That is why the method is sometimes called “weight decay”: given the penalty term alone, our optimization algorithm decays the weight at each step of training. In contrast to feature selection, weight decay offers us a continuous mechanism for adjusting the complexity of a function. Smaller values of $\lambda$ correspond to less constrained $\mathbf{w}$, whereas larger values of $\lambda$ constrain $\mathbf{w}$ more considerably.

Whether we include a corresponding bias penalty $b^2$ can vary across implementations, and may vary across layers of a neural network. Often, we do not regularize the bias term of a network’s output layer.

High-Dimensional Linear Regression

We can illustrate the benefits of weight decay through a simple synthetic example.

```{.python .input} %matplotlib inline from d2l import mxnet as d2l from mxnet import autograd, gluon, init, np, npx from mxnet.gluon import nn npx.set_np()

  1. ```{.python .input}
  2. #@tab pytorch
  3. %matplotlib inline
  4. from d2l import torch as d2l
  5. import torch
  6. import torch.nn as nn

```{.python .input}

@tab tensorflow

%matplotlib inline from d2l import tensorflow as d2l import tensorflow as tf

  1. First, we generate some data as before
  2. $$y = 0.05 + \sum_{i = 1}^d 0.01 x_i + \epsilon \text{ where }
  3. \epsilon \sim \mathcal{N}(0, 0.01^2).$$
  4. We choose our label to be a linear function of our inputs,
  5. corrupted by Gaussian noise with zero mean and standard deviation 0.01.
  6. To make the effects of overfitting pronounced,
  7. we can increase the dimensionality of our problem to $d = 200$
  8. and work with a small training set containing only 20 examples.
  9. ```{.python .input}
  10. #@tab all
  11. n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
  12. true_w, true_b = d2l.ones((num_inputs, 1)) * 0.01, 0.05
  13. train_data = d2l.synthetic_data(true_w, true_b, n_train)
  14. train_iter = d2l.load_array(train_data, batch_size)
  15. test_data = d2l.synthetic_data(true_w, true_b, n_test)
  16. test_iter = d2l.load_array(test_data, batch_size, is_train=False)

Implementation from Scratch

In the following, we will implement weight decay from scratch, simply by adding the squared $L_2$ penalty to the original target function.

Initializing Model Parameters

First, we will define a function to randomly initialize our model parameters.

```{.python .input} def init_params(): w = np.random.normal(scale=1, size=(num_inputs, 1)) b = np.zeros(1) w.attach_grad() b.attach_grad() return [w, b]

  1. ```{.python .input}
  2. #@tab pytorch
  3. def init_params():
  4. w = torch.normal(0, 1, size=(num_inputs, 1), requires_grad=True)
  5. b = torch.zeros(1, requires_grad=True)
  6. return [w, b]

```{.python .input}

@tab tensorflow

def init_params(): w = tf.Variable(tf.random.normal(mean=1, shape=(num_inputs, 1))) b = tf.Variable(tf.zeros(shape=(1, ))) return [w, b]

  1. ### Defining $L_2$ Norm Penalty
  2. Perhaps the most convenient way to implement this penalty
  3. is to square all terms in place and sum them up.
  4. ```{.python .input}
  5. def l2_penalty(w):
  6. return (w**2).sum() / 2

```{.python .input}

@tab pytorch

def l2_penalty(w): return torch.sum(w.pow(2)) / 2

  1. ```{.python .input}
  2. #@tab tensorflow
  3. def l2_penalty(w):
  4. return tf.reduce_sum(tf.pow(w, 2)) / 2

Defining the Training Loop

The following code fits a model on the training set and evaluates it on the test set. The linear network and the squared loss have not changed since :numref:chap_linear, so we will just import them via d2l.linreg and d2l.squared_loss. The only change here is that our loss now includes the penalty term.

```{.python .input} def train(lambd): w, b = init_params() net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss num_epochs, lr = 100, 0.003 animator = d2l.Animator(xlabel=’epochs’, ylabel=’loss’, yscale=’log’, xlim=[5, num_epochs], legend=[‘train’, ‘test’]) for epoch in range(num_epochs): for X, y in train_iter: with autograd.record():

  1. # The L2 norm penalty term has been added, and broadcasting
  2. # makes `l2_penalty(w)` a vector whose length is `batch_size`
  3. l = loss(net(X), y) + lambd * l2_penalty(w)
  4. l.backward()
  5. d2l.sgd([w, b], lr, batch_size)
  6. if (epoch + 1) % 5 == 0:
  7. animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
  8. d2l.evaluate_loss(net, test_iter, loss)))
  9. print('L2 norm of w:', np.linalg.norm(w))
  1. ```{.python .input}
  2. #@tab pytorch
  3. def train(lambd):
  4. w, b = init_params()
  5. net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
  6. num_epochs, lr = 100, 0.003
  7. animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
  8. xlim=[5, num_epochs], legend=['train', 'test'])
  9. for epoch in range(num_epochs):
  10. for X, y in train_iter:
  11. with torch.enable_grad():
  12. # The L2 norm penalty term has been added, and broadcasting
  13. # makes `l2_penalty(w)` a vector whose length is `batch_size`
  14. l = loss(net(X), y) + lambd * l2_penalty(w)
  15. l.sum().backward()
  16. d2l.sgd([w, b], lr, batch_size)
  17. if (epoch + 1) % 5 == 0:
  18. animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
  19. d2l.evaluate_loss(net, test_iter, loss)))
  20. print('L2 norm of w:', torch.norm(w).item())

```{.python .input}

@tab tensorflow

def train(lambd): w, b = init_params() net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss num_epochs, lr = 100, 0.003 animator = d2l.Animator(xlabel=’epochs’, ylabel=’loss’, yscale=’log’, xlim=[5, num_epochs], legend=[‘train’, ‘test’]) for epoch in range(num_epochs): for X, y in train_iter: with tf.GradientTape() as tape:

  1. # The L2 norm penalty term has been added, and broadcasting
  2. # makes `l2_penalty(w)` a vector whose length is `batch_size`
  3. l = loss(net(X), y) + lambd * l2_penalty(w)
  4. grads = tape.gradient(l, [w, b])
  5. d2l.sgd([w, b], grads, lr, batch_size)
  6. if (epoch + 1) % 5 == 0:
  7. animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
  8. d2l.evaluate_loss(net, test_iter, loss)))
  9. print('L2 norm of w:', tf.norm(w).numpy())
  1. ### Training without Regularization
  2. We now run this code with `lambd = 0`,
  3. disabling weight decay.
  4. Note that we overfit badly,
  5. decreasing the training error but not the
  6. test error---a textook case of overfitting.
  7. ```{.python .input}
  8. #@tab all
  9. train(lambd=0)

Using Weight Decay

Below, we run with substantial weight decay. Note that the training error increases but the test error decreases. This is precisely the effect we expect from regularization.

```{.python .input}

@tab all

train(lambd=3)

  1. ## Concise Implementation
  2. Because weight decay is ubiquitous
  3. in neural network optimization,
  4. the deep learning framework makes it especially convenient,
  5. integrating weight decay into the optimization algorithm itself
  6. for easy use in combination with any loss function.
  7. Moreover, this integration serves a computational benefit,
  8. allowing implementation tricks to add weight decay to the algorithm,
  9. without any additional computational overhead.
  10. Since the weight decay portion of the update
  11. depends only on the current value of each parameter,
  12. the optimizer must touch each parameter once anyway.
  13. :begin_tab:`mxnet`
  14. In the following code, we specify
  15. the weight decay hyperparameter directly
  16. through `wd` when instantiating our `Trainer`.
  17. By default, Gluon decays both
  18. weights and biases simultaneously.
  19. Note that the hyperparameter `wd`
  20. will be multiplied by `wd_mult`
  21. when updating model parameters.
  22. Thus, if we set `wd_mult` to zero,
  23. the bias parameter $b$ will not decay.
  24. :end_tab:
  25. :begin_tab:`pytorch`
  26. In the following code, we specify
  27. the weight decay hyperparameter directly
  28. through `weight_decay` when instantiating our optimizer.
  29. By default, PyTorch decays both
  30. weights and biases simultaneously. Here we only set `weight_decay` for
  31. the weight, so the bias parameter $b$ will not decay.
  32. :end_tab:
  33. :begin_tab:`tensorflow`
  34. In the following code, we create an $L_2$ regularizer with
  35. the weight decay hyperparameter `wd` and apply it to the layer
  36. through the `kernel_regularizer` argument.
  37. :end_tab:
  38. ```{.python .input}
  39. def train_concise(wd):
  40. net = nn.Sequential()
  41. net.add(nn.Dense(1))
  42. net.initialize(init.Normal(sigma=1))
  43. loss = gluon.loss.L2Loss()
  44. num_epochs, lr = 100, 0.003
  45. trainer = gluon.Trainer(net.collect_params(), 'sgd',
  46. {'learning_rate': lr, 'wd': wd})
  47. # The bias parameter has not decayed. Bias names generally end with "bias"
  48. net.collect_params('.*bias').setattr('wd_mult', 0)
  49. animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
  50. xlim=[5, num_epochs], legend=['train', 'test'])
  51. for epoch in range(num_epochs):
  52. for X, y in train_iter:
  53. with autograd.record():
  54. l = loss(net(X), y)
  55. l.backward()
  56. trainer.step(batch_size)
  57. if (epoch + 1) % 5 == 0:
  58. animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
  59. d2l.evaluate_loss(net, test_iter, loss)))
  60. print('L2 norm of w:', np.linalg.norm(net[0].weight.data()))

```{.python .input}

@tab pytorch

def trainconcise(wd): net = nn.Sequential(nn.Linear(num_inputs, 1)) for param in net.parameters(): param.data.normal() loss = nn.MSELoss() num_epochs, lr = 100, 0.003

  1. # The bias parameter has not decayed
  2. trainer = torch.optim.SGD([
  3. {"params":net[0].weight,'weight_decay': wd},
  4. {"params":net[0].bias}], lr=lr)
  5. animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
  6. xlim=[5, num_epochs], legend=['train', 'test'])
  7. for epoch in range(num_epochs):
  8. for X, y in train_iter:
  9. with torch.enable_grad():
  10. trainer.zero_grad()
  11. l = loss(net(X), y)
  12. l.backward()
  13. trainer.step()
  14. if (epoch + 1) % 5 == 0:
  15. animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
  16. d2l.evaluate_loss(net, test_iter, loss)))
  17. print('L2 norm of w:', net[0].weight.norm().item())
  1. ```{.python .input}
  2. #@tab tensorflow
  3. def train_concise(wd):
  4. net = tf.keras.models.Sequential()
  5. net.add(tf.keras.layers.Dense(
  6. 1, kernel_regularizer=tf.keras.regularizers.l2(wd)))
  7. net.build(input_shape=(1, num_inputs))
  8. w, b = net.trainable_variables
  9. loss = tf.keras.losses.MeanSquaredError()
  10. num_epochs, lr = 100, 0.003
  11. trainer = tf.keras.optimizers.SGD(learning_rate=lr)
  12. animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
  13. xlim=[5, num_epochs], legend=['train', 'test'])
  14. for epoch in range(num_epochs):
  15. for X, y in train_iter:
  16. with tf.GradientTape() as tape:
  17. # `tf.keras` requires retrieving and adding the losses from
  18. # layers manually for custom training loop.
  19. l = loss(net(X), y) + net.losses
  20. grads = tape.gradient(l, net.trainable_variables)
  21. trainer.apply_gradients(zip(grads, net.trainable_variables))
  22. if (epoch + 1) % 5 == 0:
  23. animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
  24. d2l.evaluate_loss(net, test_iter, loss)))
  25. print('L2 norm of w:', tf.norm(net.get_weights()[0]).numpy())

The plots look identical to those when we implemented weight decay from scratch. However, they run appreciably faster and are easier to implement, a benefit that will become more pronounced for larger problems.

```{.python .input}

@tab all

train_concise(0)

  1. ```{.python .input}
  2. #@tab all
  3. train_concise(3)

So far, we only touched upon one notion of what constitutes a simple linear function. Moreover, what constitutes a simple nonlinear function can be an even more complex question. For instance, reproducing kernel Hilbert space (RKHS) allows one to apply tools introduced for linear functions in a nonlinear context. Unfortunately, RKHS-based algorithms tend to scale purely to large, high-dimensional data. In this book we will default to the simple heuristic of applying weight decay on all layers of a deep network.

Summary

  • Regularization is a common method for dealing with overfitting. It adds a penalty term to the loss function on the training set to reduce the complexity of the learned model.
  • One particular choice for keeping the model simple is weight decay using an $L_2$ penalty. This leads to weight decay in the update steps of the learning algorithm.
  • The weight decay functionality is provided in optimizers from deep learning frameworks.
  • Different sets of parameters can have different update behaviors within the same training loop.

Exercises

  1. Experiment with the value of $\lambda$ in the estimation problem in this section. Plot training and test accuracy as a function of $\lambda$. What do you observe?
  2. Use a validation set to find the optimal value of $\lambda$. Is it really the optimal value? Does this matter?
  3. What would the update equations look like if instead of $|\mathbf{w}|^2$ we used $\sum_i |w_i|$ as our penalty of choice ($L_1$ regularization)?
  4. We know that $|\mathbf{w}|^2 = \mathbf{w}^\top \mathbf{w}$. Can you find a similar equation for matrices (see the Frobenius norm in :numref:subsec_lin-algebra-norms)?
  5. Review the relationship between training error and generalization error. In addition to weight decay, increased training, and the use of a model of suitable complexity, what other ways can you think of to deal with overfitting?
  6. In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior via $P(w \mid x) \propto P(x \mid w) P(w)$. How can you identify $P(w)$ with regularization?

:begin_tab:mxnet Discussions :end_tab:

:begin_tab:pytorch Discussions :end_tab:

:begin_tab:tensorflow Discussions :end_tab: