Batch Normalization

:label:sec_batch_norm

Training deep neural networks is difficult, and getting them to converge in a reasonable amount of time can be tricky. In this section, we describe batch normalization, a popular and effective technique that consistently accelerates the convergence of deep networks :cite:Ioffe.Szegedy.2015. Together with residual blocks (covered later in :numref:sec_resnet), batch normalization has made it possible for practitioners to routinely train networks with over 100 layers.

Training Deep Networks

To motivate batch normalization, let us review a few practical challenges that arise when training machine learning models and neural networks in particular.

First, choices regarding data preprocessing often make an enormous difference in the final results. Recall our application of MLPs to predicting house prices (:numref:sec_kaggle_house). Our first step when working with real data was to standardize our input features to each have a mean of zero and variance of one. Intuitively, this standardization plays nicely with our optimizers because it puts the parameters a priori at a similar scale.

Second, for a typical MLP or CNN, as we train, the variables (e.g., affine transformation outputs in MLP) in intermediate layers may take values with widely varying magnitudes: along the layers from the input to the output, across units in the same layer, and over time due to our updates to the model parameters. The inventors of batch normalization postulated informally that this drift in the distribution of such variables could hamper the convergence of the network. Intuitively, we might conjecture that if one layer has variable values that are 100 times those of another layer, this might necessitate compensatory adjustments in the learning rates.

Third, deeper networks are complex and easily capable of overfitting. This means that regularization becomes more critical.

Batch normalization is applied to individual layers (optionally, to all of them) and works as follows: In each training iteration, we first normalize the inputs (of batch normalization) by subtracting their mean and dividing by their standard deviation, where both are estimated based on the statistics of the current minibatch. Next, we apply a scale coefficient and an offset. It is precisely due to this normalization based on batch statistics that batch normalization derives its name.

Note that if we tried to apply batch normalization with minibatches of size 1, we would not be able to learn anything. That is because after subtracting the means, each hidden unit would take value 0! As you might guess, since we are devoting a whole section to batch normalization, with large enough minibatches, the approach proves effective and stable. One takeaway here is that when applying batch normalization, the choice of batch size may be even more significant than without batch normalization.
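To make the size-1 failure mode concrete, here is a minimal NumPy sketch. The helper name `standardize`, the value of `eps`, and the toy numbers are illustrative assumptions rather than part of the implementation developed later in this section. It standardizes one minibatch that holds a single example and one that holds four.

```python
import numpy as np

def standardize(X, eps=1e-5):
    """Standardize each column (hidden unit) of X with minibatch statistics."""
    mean = X.mean(axis=0)
    var = ((X - mean) ** 2).mean(axis=0)
    return (X - mean) / np.sqrt(var + eps)

# A "minibatch" with a single example and three hidden units
single = np.array([[2.0, -1.0, 5.0]])
out_single = standardize(single)

# A minibatch with four examples
batch = np.array([[2.0, -1.0, 5.0],
                  [0.0, 3.0, 1.0],
                  [4.0, 1.0, 3.0],
                  [2.0, 1.0, 3.0]])
out_batch = standardize(batch)

print(out_single)          # every entry is 0: the example equals its own mean
print(out_batch.round(2))  # differences between examples are preserved
```

With a single example the mean equals the example itself, so the output carries no information about the input, whatever the input was.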

Formally, denoting by $\mathbf{x} \in \mathcal{B}$ an input to batch normalization ($\mathrm{BN}$) that is from a minibatch $\mathcal{B}$, batch normalization transforms $\mathbf{x}$ according to the following expression:

$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}.$$
:eqlabel:eq_batchnorm

In :eqref:eq_batchnorm, $\hat{\boldsymbol{\mu}}_\mathcal{B}$ is the sample mean and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ is the sample standard deviation of the minibatch $\mathcal{B}$. After applying standardization, the resulting minibatch has zero mean and unit variance. Because the choice of unit variance (rather than some other magic number) is arbitrary, we commonly include an elementwise scale parameter $\boldsymbol{\gamma}$ and shift parameter $\boldsymbol{\beta}$ that have the same shape as $\mathbf{x}$. Note that $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are parameters that need to be learned jointly with the other model parameters.

Consequently, the variable magnitudes for intermediate layers cannot diverge during training because batch normalization actively centers and rescales them back to a given mean and size (via $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and ${\hat{\boldsymbol{\sigma}}_\mathcal{B}}$). One piece of practitioner’s intuition or wisdom is that batch normalization seems to allow for more aggressive learning rates.

Formally, we calculate $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and ${\hat{\boldsymbol{\sigma}}_\mathcal{B}}$ in :eqref:eq_batchnorm as follows:

$$\begin{aligned} \hat{\boldsymbol{\mu}}_\mathcal{B} &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x},\\
\hat{\boldsymbol{\sigma}}_\mathcal{B}^2 &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}})^2 + \epsilon.\end{aligned}$$

Note that we add a small constant $\epsilon > 0$ to the variance estimate to ensure that we never attempt division by zero, even in cases where the empirical variance estimate might vanish. The estimates $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and ${\hat{\boldsymbol{\sigma}}_\mathcal{B}}$ counteract the scaling issue by using noisy estimates of mean and variance. You might think that this noisiness should be a problem. As it turns out, this is actually beneficial.
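The following NumPy sketch applies the formulas above to a toy minibatch. The function name `batch_norm_train`, the value of `eps`, and the synthetic data are assumptions made for illustration only. One feature is constant across the minibatch, so its empirical variance vanishes and only the added $\epsilon$ keeps the division well defined.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Batch normalization for X of shape (batch_size, num_features)."""
    mu = X.mean(axis=0)                            # sample mean per feature
    sigma2 = ((X - mu) ** 2).mean(axis=0) + eps    # variance estimate plus eps
    X_hat = (X - mu) / np.sqrt(sigma2)             # standardize
    return gamma * X_hat + beta                    # scale and shift

rng = np.random.default_rng(0)
# Two features on very different scales, plus one constant feature
X = np.concatenate([rng.normal(0, 1, (64, 1)),
                    rng.normal(50, 100, (64, 1)),
                    np.full((64, 1), 7.0)], axis=1)
gamma, beta = np.ones(3), np.zeros(3)
Y = batch_norm_train(X, gamma, beta)

print(Y.mean(axis=0).round(6))  # approximately 0 for every feature
print(Y.std(axis=0).round(3))   # approximately 1, except for the constant feature
```

With `gamma` set to one and `beta` to zero, the first two features end up on the same scale despite their very different raw magnitudes, while the constant feature maps to zero instead of producing a division by zero.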

This turns out to be a recurring theme in deep learning. For reasons that are not yet well-characterized theoretically, various sources of noise in optimization often lead to faster training and less overfitting: this variation appears to act as a form of regularization. In some preliminary research, :cite:Teye.Azizpour.Smith.2018 and :cite:Luo.Wang.Shao.ea.2018 relate the properties of batch normalization to Bayesian priors and penalties respectively. In particular, this sheds some light on the puzzle of why batch normalization works best for moderate minibatch sizes in the $50 \sim 100$ range.
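As a rough illustration of how the amount of noise depends on the minibatch size, the sketch below draws many random minibatches from synthetic data and measures how much the minibatch mean fluctuates. The data, the helper `spread_of_batch_means`, and the chosen batch sizes are all hypothetical. The sketch only shows that the injected noise shrinks as the batch grows; it does not by itself explain the regularization effect.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic values standing in for the activations of one hidden unit
data = rng.normal(loc=3.0, scale=2.0, size=100_000)

def spread_of_batch_means(batch_size, num_batches=1000):
    """Standard deviation of the minibatch mean over many random minibatches."""
    batches = rng.choice(data, size=(num_batches, batch_size))
    return batches.mean(axis=1).std()

spreads = {b: spread_of_batch_means(b) for b in (2, 8, 64, 512)}
for b, s in spreads.items():
    print(f"batch size {b:4d}: std of minibatch mean = {s:.3f}")
```

Very small minibatches yield highly variable statistics, whereas very large ones yield almost none, which is consistent with moderate sizes offering a compromise between the two.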

Fixing a trained model, you might think that we would prefer using the entire dataset to estimate the mean and variance. Once training is complete, why would we want the same image to be classified differently, depending on the batch in which it happens to reside? During training, such exact calculation is infeasible because the intermediate variables for all data examples change every time we update our model. However, once the model is trained, we can calculate the means and variances of each layer’s variables based on the entire dataset. Indeed this is standard practice for models employing batch normalization and thus batch normalization layers function differently in training mode (normalizing by minibatch statistics) and in prediction mode (normalizing by dataset statistics).
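The next sketch shows the issue raised above on synthetic data, which stands in for the variables of a single layer. The names and sizes are illustrative assumptions. The same example is standardized differently depending on the minibatch it lands in, but identically once the statistics are fixed.

```python
import numpy as np

def standardize(X, mean, var, eps=1e-5):
    return (X - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
dataset = rng.normal(loc=5.0, scale=3.0, size=(1000, 2))
example = dataset[0]

# Training mode: statistics come from whichever minibatch holds the example
batch_a = np.vstack([example, dataset[1:32]])
batch_b = np.vstack([example, dataset[500:531]])
out_a = standardize(batch_a, batch_a.mean(axis=0), batch_a.var(axis=0))[0]
out_b = standardize(batch_b, batch_b.mean(axis=0), batch_b.var(axis=0))[0]

# Prediction mode: statistics are computed once on the whole dataset and fixed
data_mean, data_var = dataset.mean(axis=0), dataset.var(axis=0)
pred_1 = standardize(example, data_mean, data_var)
pred_2 = standardize(example, data_mean, data_var)

print(out_a, out_b)    # differ, because the two minibatches differ
print(pred_1, pred_2)  # identical
```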

We are now ready to take a look at how batch normalization works in practice.

Batch Normalization Layers

Batch normalization implementations for fully-connected layers and convolutional layers are slightly different. We discuss both cases below. Recall that one key difference between batch normalization and other layers is that, because batch normalization operates on a full minibatch at a time, we cannot just ignore the batch dimension as we did before when introducing other layers.

Fully-Connected Layers

When applying batch normalization to fully-connected layers, the original paper inserts batch normalization after the affine transformation and before the nonlinear activation function (later applications may insert batch normalization right after activation functions) :cite:Ioffe.Szegedy.2015. Denoting the input to the fully-connected layer by $\mathbf{x}$, the affine transformation by $\mathbf{W}\mathbf{x} + \mathbf{b}$ (with the weight parameter $\mathbf{W}$ and the bias parameter $\mathbf{b}$), and the activation function by $\phi$, we can express the computation of a batch-normalization-enabled, fully-connected layer output $\mathbf{h}$ as follows:

$$\mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b}) ).$$

Recall that mean and variance are computed on the same minibatch on which the transformation is applied.
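A minimal NumPy sketch of such a layer follows. The helper names, the shapes, and the deliberately large weights are assumptions for illustration. Examples are stored as rows of `X`, so the affine transformation is written `X @ W + b` rather than $\mathbf{W}\mathbf{x} + \mathbf{b}$.

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Normalize each column of Z with statistics of the current minibatch."""
    mean = Z.mean(axis=0)
    var = ((Z - mean) ** 2).mean(axis=0)
    return gamma * (Z - mean) / np.sqrt(var + eps) + beta

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))         # minibatch of 8 examples with 4 inputs
W = rng.normal(size=(4, 3)) * 10    # deliberately large weights
b = rng.normal(size=3)
gamma, beta = np.ones(3), np.zeros(3)

Z = X @ W + b                              # affine transformation
H = sigmoid(batch_norm(Z, gamma, beta))    # h = phi(BN(Wx + b))

print(Z.std(axis=0).round(2))                           # raw spread per unit
print(batch_norm(Z, gamma, beta).std(axis=0).round(2))  # approximately 1
```

Without the normalization step, pre-activations of this magnitude would push the sigmoid deep into its saturated regions.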

Convolutional Layers

Similarly, with convolutional layers, we can apply batch normalization after the convolution and before the nonlinear activation function. When the convolution has multiple output channels, we need to carry out batch normalization for each of the outputs of these channels, and each channel has its own scale and shift parameters, both of which are scalars. Assume that our minibatches contain $m$ examples and that for each channel, the output of the convolution has height $p$ and width $q$. For convolutional layers, we carry out each batch normalization over the $m \cdot p \cdot q$ elements per output channel simultaneously. Thus, we collect the values over all spatial locations when computing the mean and variance and consequently apply the same mean and variance within a given channel to normalize the value at each spatial location.
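The per-channel computation can be sketched in NumPy as follows. The function name and the toy sizes are assumptions, and the sketch uses a channels-first layout of shape $(m, \textrm{channels}, p, q)$. Each channel's mean and variance are taken over its $m \cdot p \cdot q$ values, and each channel has one scalar scale and one scalar shift.

```python
import numpy as np

def batch_norm_conv(X, gamma, beta, eps=1e-5):
    """X: (m, channels, p, q); gamma, beta: one scalar per channel."""
    # Reduce over the batch and both spatial axes; keep dims for broadcasting
    mean = X.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, channels, 1, 1)
    var = ((X - mean) ** 2).mean(axis=(0, 2, 3), keepdims=True)
    X_hat = (X - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * X_hat + beta.reshape(1, -1, 1, 1)

m, c, p, q = 4, 3, 5, 5
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=4.0, size=(m, c, p, q))
Y = batch_norm_conv(X, gamma=np.ones(c), beta=np.zeros(c))

print(X[:, 0].size)                      # m * p * q = 100 values per channel
print(Y.mean(axis=(0, 2, 3)).round(6))   # approximately 0 for each channel
print(Y.std(axis=(0, 2, 3)).round(3))    # approximately 1 for each channel
```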

Batch Normalization During Prediction

As we mentioned earlier, batch normalization typically behaves differently in training mode and prediction mode. First, the noise in the sample mean and the sample variance arising from estimating each on minibatches is no longer desirable once we have trained the model. Second, we might not have the luxury of computing per-batch normalization statistics. For example, we might need to apply our model to make one prediction at a time.

Typically, after training, we use the entire dataset to compute stable estimates of the variable statistics and then fix them at prediction time. Consequently, batch normalization behaves differently during training and at test time. Recall that dropout also exhibits this characteristic.
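One common way to obtain such fixed statistics without an extra pass over the data is to keep moving averages of the minibatch statistics during training, which is what the from-scratch implementation in this section does. The sketch below uses synthetic data; the momentum value, initial values, and batch size are illustrative assumptions. It accumulates the moving averages and then normalizes a single example with them.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=4.0, scale=2.0, size=(6400, 3))

momentum, eps = 0.9, 1e-5
moving_mean, moving_var = np.zeros(3), np.ones(3)

# "Training": sweep over 100 minibatches of 64 examples, updating the averages
for batch in np.split(data, 100):
    mean, var = batch.mean(axis=0), batch.var(axis=0)
    moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
    moving_var = momentum * moving_var + (1.0 - momentum) * var

# "Prediction": the statistics are frozen, so a single example can be handled
x = data[0]
x_hat = (x - moving_mean) / np.sqrt(moving_var + eps)

print(moving_mean.round(2), data.mean(axis=0).round(2))  # roughly equal
print(moving_var.round(2), data.var(axis=0).round(2))    # roughly equal
print(x_hat.round(2))
```

Because the moving averages weight recent minibatches more heavily, they only approximate the statistics of the entire dataset, but they require no additional computation once training is complete.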

Implementation from Scratch

Below, we implement a batch normalization layer with tensors from scratch.

```{.python .input}
from d2l import mxnet as d2l
from mxnet import autograd, np, npx, init
from mxnet.gluon import nn
npx.set_np()

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use `autograd` to determine whether the current mode is training mode or
    # prediction mode
    if not autograd.is_training():
        # If it is prediction mode, directly use the mean and variance
        # obtained by moving average
        X_hat = (X - moving_mean) / np.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully-connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(axis=0)
            var = ((X - mean) ** 2).mean(axis=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of `X`, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(axis=(0, 2, 3), keepdims=True)
            var = ((X - mean) ** 2).mean(axis=(0, 2, 3), keepdims=True)
        # In training mode, the current mean and variance are used for the
        # standardization
        X_hat = (X - mean) / np.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean, moving_var
```

```{.python .input}
#@tab pytorch
from d2l import torch as d2l
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use `is_grad_enabled` to determine whether the current mode is training
    # mode or prediction mode
    if not torch.is_grad_enabled():
        # If it is prediction mode, directly use the mean and variance
        # obtained by moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully-connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of `X`, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, the current mean and variance are used for the
        # standardization
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data
```

```{.python .input}
#@tab tensorflow
from d2l import tensorflow as d2l
import tensorflow as tf

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps):
    # Compute reciprocal of square root of the moving variance element-wise
    inv = tf.cast(tf.math.rsqrt(moving_var + eps), X.dtype)
    # Scale and shift
    inv *= gamma
    Y = X * inv + (beta - moving_mean * inv)
    return Y
```

We can now create a proper `BatchNorm` layer. Our layer will maintain proper parameters for scale `gamma` and shift `beta`, both of which will be updated in the course of training. Additionally, our layer will maintain moving averages of the means and variances for subsequent use during model prediction.

Putting aside the algorithmic details, note the design pattern underlying our implementation of the layer. Typically, we define the mathematics in a separate function, say `batch_norm`. We then integrate this functionality into a custom layer, whose code mostly addresses bookkeeping matters, such as moving data to the right device context, allocating and initializing any required variables, keeping track of moving averages (here for mean and variance), and so on. This pattern enables a clean separation of mathematics from boilerplate code. Also note that for the sake of convenience we did not worry about automatically inferring the input shape here, thus we need to specify the number of features throughout. Do not worry, the high-level batch normalization APIs in the deep learning framework will take care of this for us and we will demonstrate that later.

```{.python .input}
class BatchNorm(nn.Block):
    # `num_features`: the number of outputs for a fully-connected layer
    # or the number of output channels for a convolutional layer. `num_dims`:
    # 2 for a fully-connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims, **kwargs):
        super().__init__(**kwargs)
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = self.params.get('gamma', shape=shape, init=init.One())
        self.beta = self.params.get('beta', shape=shape, init=init.Zero())
        # The variables that are not model parameters are initialized to 0
        # (mean) and 1 (variance), so that prediction before any update does
        # not divide by a near-zero value
        self.moving_mean = np.zeros(shape)
        self.moving_var = np.ones(shape)

    def forward(self, X):
        # If `X` is not on the main memory, copy `moving_mean` and
        # `moving_var` to the device where `X` is located
        if self.moving_mean.ctx != X.ctx:
            self.moving_mean = self.moving_mean.copyto(X.ctx)
            self.moving_var = self.moving_var.copyto(X.ctx)
        # Save the updated `moving_mean` and `moving_var`
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma.data(), self.beta.data(), self.moving_mean,
            self.moving_var, eps=1e-12, momentum=0.9)
        return Y
```

```{.python .input}
#@tab pytorch
class BatchNorm(nn.Module):
    # `num_features`: the number of outputs for a fully-connected layer
    # or the number of output channels for a convolutional layer. `num_dims`:
    # 2 for a fully-connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # The variables that are not model parameters are initialized to 0
        # (mean) and 1 (variance), so that prediction before any update does
        # not divide by a near-zero value
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If `X` is not on the main memory, copy `moving_mean` and
        # `moving_var` to the device where `X` is located
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated `moving_mean` and `moving_var`
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
```

```{.python .input}
#@tab tensorflow
class BatchNorm(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(BatchNorm, self).__init__(**kwargs)

    def build(self, input_shape):
        weight_shape = [input_shape[-1], ]
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = self.add_weight(name='gamma', shape=weight_shape,
            initializer=tf.initializers.ones, trainable=True)
        self.beta = self.add_weight(name='beta', shape=weight_shape,
            initializer=tf.initializers.zeros, trainable=True)
        # The variables that are not model parameters are initialized to 0
        # (mean) and 1 (variance)
        self.moving_mean = self.add_weight(name='moving_mean',
            shape=weight_shape, initializer=tf.initializers.zeros,
            trainable=False)
        self.moving_variance = self.add_weight(name='moving_variance',
            shape=weight_shape, initializer=tf.initializers.ones,
            trainable=False)
        super(BatchNorm, self).build(input_shape)

    def assign_moving_average(self, variable, value):
        momentum = 0.9
        delta = variable * momentum + value * (1 - momentum)
        return variable.assign(delta)

    @tf.function
    def call(self, inputs, training):
        if training:
            # Reduce over every axis except the last (channel) axis
            axes = list(range(len(inputs.shape) - 1))
            batch_mean = tf.reduce_mean(inputs, axes, keepdims=True)
            batch_variance = tf.reduce_mean(tf.math.squared_difference(
                inputs, tf.stop_gradient(batch_mean)), axes, keepdims=True)
            batch_mean = tf.squeeze(batch_mean, axes)
            batch_variance = tf.squeeze(batch_variance, axes)
            mean_update = self.assign_moving_average(
                self.moving_mean, batch_mean)
            variance_update = self.assign_moving_average(
                self.moving_variance, batch_variance)
            self.add_update(mean_update)
            self.add_update(variance_update)
            mean, variance = batch_mean, batch_variance
        else:
            mean, variance = self.moving_mean, self.moving_variance
        output = batch_norm(inputs, moving_mean=mean, moving_var=variance,
            beta=self.beta, gamma=self.gamma, eps=1e-5)
        return output
```

Applying Batch Normalization in LeNet

To see how to apply BatchNorm in context, below we apply it to a traditional LeNet model (:numref:sec_lenet). Recall that batch normalization is applied after the convolutional layers or fully-connected layers but before the corresponding activation functions.

```{.python .input}
net = nn.Sequential()
net.add(nn.Conv2D(6, kernel_size=5),
        BatchNorm(6, num_dims=4),
        nn.Activation('sigmoid'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Conv2D(16, kernel_size=5),
        BatchNorm(16, num_dims=4),
        nn.Activation('sigmoid'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Dense(120),
        BatchNorm(120, num_dims=2),
        nn.Activation('sigmoid'),
        nn.Dense(84),
        BatchNorm(84, num_dims=2),
        nn.Activation('sigmoid'),
        nn.Dense(10))
```

```{.python .input}
#@tab pytorch
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), BatchNorm(6, num_dims=4), nn.Sigmoid(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), BatchNorm(16, num_dims=4), nn.Sigmoid(),
    nn.MaxPool2d(kernel_size=2, stride=2), nn.Flatten(),
    nn.Linear(16*4*4, 120), BatchNorm(120, num_dims=2), nn.Sigmoid(),
    nn.Linear(120, 84), BatchNorm(84, num_dims=2), nn.Sigmoid(),
    nn.Linear(84, 10))
```

```{.python .input}
#@tab tensorflow
# Recall that this has to be a function that will be passed to `d2l.train_ch6`
# so that model building or compiling need to be within `strategy.scope()` in
# order to utilize the CPU/GPU devices that we have
def net():
    return tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(filters=6, kernel_size=5,
                               input_shape=(28, 28, 1)),
        BatchNorm(),
        tf.keras.layers.Activation('sigmoid'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(filters=16, kernel_size=5),
        BatchNorm(),
        tf.keras.layers.Activation('sigmoid'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120),
        BatchNorm(),
        tf.keras.layers.Activation('sigmoid'),
        tf.keras.layers.Dense(84),
        BatchNorm(),
        tf.keras.layers.Activation('sigmoid'),
        tf.keras.layers.Dense(10)]
    )
```

As before, we will train our network on the Fashion-MNIST dataset. This code is virtually identical to that when we first trained LeNet (:numref:`sec_lenet`). The main difference is the considerably larger learning rate.

```{.python .input}
#@tab mxnet, pytorch
lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)
```

```{.python .input}
#@tab tensorflow
lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
net = d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)
```

Let us have a look at the scale parameter `gamma` and the shift parameter `beta` learned from the first batch normalization layer.

```{.python .input}
net[1].gamma.data().reshape(-1,), net[1].beta.data().reshape(-1,)
```

```{.python .input}
#@tab pytorch
net[1].gamma.reshape((-1,)), net[1].beta.reshape((-1,))
```

```{.python .input}
#@tab tensorflow
tf.reshape(net.layers[1].gamma, (-1,)), tf.reshape(net.layers[1].beta, (-1,))
```

Concise Implementation

Compared with the BatchNorm class, which we just defined ourselves, we can use the BatchNorm class defined in high-level APIs from the deep learning framework directly. The code looks virtually identical to the application of our implementation above.

```{.python .input}
net = nn.Sequential()
net.add(nn.Conv2D(6, kernel_size=5),
        nn.BatchNorm(),
        nn.Activation('sigmoid'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Conv2D(16, kernel_size=5),
        nn.BatchNorm(),
        nn.Activation('sigmoid'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Dense(120),
        nn.BatchNorm(),
        nn.Activation('sigmoid'),
        nn.Dense(84),
        nn.BatchNorm(),
        nn.Activation('sigmoid'),
        nn.Dense(10))
```

```{.python .input}
#@tab pytorch
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.BatchNorm2d(6), nn.Sigmoid(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16), nn.Sigmoid(),
    nn.MaxPool2d(kernel_size=2, stride=2), nn.Flatten(),
    nn.Linear(256, 120), nn.BatchNorm1d(120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.BatchNorm1d(84), nn.Sigmoid(),
    nn.Linear(84, 10))
```

```{.python .input}
#@tab tensorflow
def net():
    return tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(filters=6, kernel_size=5,
                               input_shape=(28, 28, 1)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('sigmoid'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(filters=16, kernel_size=5),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('sigmoid'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('sigmoid'),
        tf.keras.layers.Dense(84),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('sigmoid'),
        tf.keras.layers.Dense(10),
    ])
```

Below, we use the same hyperparameters to train our model. Note that as usual, the high-level API variant runs much faster because its code has been compiled to C++ or CUDA while our custom implementation must be interpreted by Python.

```{.python .input}
#@tab all
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)
```

Controversy

Intuitively, batch normalization is thought to make the optimization landscape smoother. However, we must be careful to distinguish between speculative intuitions and true explanations for the phenomena that we observe when training deep models. Recall that we do not even know why simpler deep neural networks (MLPs and conventional CNNs) generalize well in the first place. Even with dropout and weight decay, they remain so flexible that their ability to generalize to unseen data cannot be explained via conventional learning-theoretic generalization guarantees.

In the original paper proposing batch normalization, the authors, in addition to introducing a powerful and useful tool, offered an explanation for why it works: by reducing internal covariate shift. Presumably by internal covariate shift the authors meant something like the intuition expressed above, namely the notion that the distribution of variable values changes over the course of training. However, there were two problems with this explanation: i) This drift is very different from covariate shift, rendering the name a misnomer. ii) The explanation offers an under-specified intuition and leaves open the question of why precisely this technique works, which still awaits a rigorous explanation. Throughout this book, we aim to convey the intuitions that practitioners use to guide their development of deep neural networks. However, we believe that it is important to separate these guiding intuitions from established scientific fact. Eventually, when you master this material and start writing your own research papers, you will want to clearly delineate between technical claims and hunches.

Following the success of batch normalization, its explanation in terms of internal covariate shift has repeatedly surfaced in debates in the technical literature and broader discourse about how to present machine learning research. In a memorable speech given while accepting a Test of Time Award at the 2017 NeurIPS conference, Ali Rahimi used internal covariate shift as a focal point in an argument likening the modern practice of deep learning to alchemy. Subsequently, the example was revisited in detail in a position paper outlining troubling trends in machine learning :cite:Lipton.Steinhardt.2018. Other authors have proposed alternative explanations for the success of batch normalization, some claiming that batch normalization’s success comes despite exhibiting behavior that is in some ways opposite to that claimed in the original paper :cite:Santurkar.Tsipras.Ilyas.ea.2018.

We note that the internal covariate shift is no more worthy of criticism than any of thousands of similarly vague claims made every year in the technical machine learning literature. Likely, its resonance as a focal point of these debates owes to its broad recognizability to the target audience. Batch normalization has proven an indispensable method, applied in nearly all deployed image classifiers, earning the paper that introduced the technique tens of thousands of citations.

Summary

  • During model training, batch normalization continuously adjusts the intermediate output of the neural network by utilizing the mean and standard deviation of the minibatch, so that the values of the intermediate output in each layer throughout the neural network are more stable.
  • The batch normalization methods for fully-connected layers and convolutional layers are slightly different.
  • Like a dropout layer, batch normalization layers have different computation results in training mode and prediction mode.
  • Batch normalization has many beneficial side effects, primarily that of regularization. On the other hand, the original motivation of reducing internal covariate shift seems not to be a valid explanation.

Exercises

  1. Can we remove the bias parameter from the fully-connected layer or the convolutional layer before the batch normalization? Why?
  2. Compare the learning rates for LeNet with and without batch normalization.
    1. Plot the increase in training and test accuracy.
    2. How large can you make the learning rate?
  3. Do we need batch normalization in every layer? Experiment with it.
  4. Can you replace dropout by batch normalization? How does the behavior change?
  5. Fix the parameters beta and gamma, and observe and analyze the results.
  6. Review the online documentation for BatchNorm from the high-level APIs to see the other applications for batch normalization.
  7. Research ideas: think of other normalization transforms that you can apply. Can you apply the probability integral transform? How about a full-rank covariance estimate?
