Implementation of Softmax Regression from Scratch

:label:sec_softmax_scratch

Just as we implemented linear regression from scratch, we believe that softmax regression is similarly fundamental and you ought to know the gory details of how to implement it yourself. We will work with the Fashion-MNIST dataset, just introduced in :numref:sec_fashion_mnist, setting up a data iterator with batch size 256.

```{.python .input} from d2l import mxnet as d2l from mxnet import autograd, np, npx, gluon from IPython import display npx.set_np()

  1. ```{.python .input}
  2. #@tab pytorch
  3. from d2l import torch as d2l
  4. import torch
  5. from IPython import display

```{.python .input}

@tab tensorflow

from d2l import tensorflow as d2l import tensorflow as tf from IPython import display

  1. ```{.python .input}
  2. #@tab all
  3. batch_size = 256
  4. train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

Initializing Model Parameters

As in our linear regression example, each example here will be represented by a fixed-length vector. Each example in the raw dataset is a $28 \times 28$ image. In this section, we will flatten each image, treating them as vectors of length 784. In the future, we will talk about more sophisticated strategies for exploiting the spatial structure in images, but for now we treat each pixel location as just another feature.

Recall that in softmax regression, we have as many outputs as there are classes. Because our dataset has 10 classes, our network will have an output dimension of 10. Consequently, our weights will constitute a $784 \times 10$ matrix and the biases will constitute a $1 \times 10$ row vector. As with linear regression, we will initialize our weights W with Gaussian noise and our biases to take the initial value 0.

```{.python .input} num_inputs = 784 num_outputs = 10

W = np.random.normal(0, 0.01, (num_inputs, num_outputs)) b = np.zeros(num_outputs) W.attach_grad() b.attach_grad()

  1. ```{.python .input}
  2. #@tab pytorch
  3. num_inputs = 784
  4. num_outputs = 10
  5. W = torch.normal(0, 0.01, size=(num_inputs, num_outputs), requires_grad=True)
  6. b = torch.zeros(num_outputs, requires_grad=True)

```{.python .input}

@tab tensorflow

num_inputs = 784 num_outputs = 10

W = tf.Variable(tf.random.normal(shape=(num_inputs, num_outputs), mean=0, stddev=0.01)) b = tf.Variable(tf.zeros(num_outputs))

  1. ## Defining the Softmax Operation
  2. Before implementing the softmax regression model,
  3. let us briefly review how the sum operator work
  4. along specific dimensions in a tensor,
  5. as discussed in :numref:`subseq_lin-alg-reduction` and :numref:`subseq_lin-alg-non-reduction`.
  6. Given a matrix `X` we can sum over all elements (by default) or only
  7. over elements in the same axis,
  8. i.e., the same column (axis 0) or the same row (axis 1).
  9. Note that if `X` is an tensor with shape (2, 3)
  10. and we sum over the columns,
  11. the result will be a vector with shape (3,).
  12. When invoking the sum operator,
  13. we can specify to keep the number of axes in the original tensor,
  14. rather than collapsing out the dimension that we summed over.
  15. This will result in a two-dimensional tensor with shape (1, 3).
  16. ```{.python .input}
  17. #@tab pytorch
  18. X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  19. d2l.reduce_sum(X, 0, keepdim=True), d2l.reduce_sum(X, 1, keepdim=True)

```{.python .input}

@tab mxnet, tensorflow

X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) d2l.reduce_sum(X, 0, keepdims=True), d2l.reduce_sum(X, 1, keepdims=True)

  1. We are now ready to implement the softmax operation.
  2. Recall that softmax consists of three steps:
  3. i) we exponentiate each term (using `exp`);
  4. ii) we sum over each row (we have one row per example in the batch)
  5. to get the normalization constant for each example;
  6. iii) we divide each row by its normalization constant,
  7. ensuring that the result sums to 1.
  8. Before looking at the code, let us recall
  9. how this looks expressed as an equation:
  10. $$
  11. \mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.
  12. $$
  13. The denominator, or normalization constant,
  14. is also sometimes called the *partition function*
  15. (and its logarithm is called the log-partition function).
  16. The origins of that name are in [statistical physics](https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics))
  17. where a related equation models the distribution
  18. over an ensemble of particles.
  19. ```{.python .input}
  20. #@tab mxnet, tensorflow
  21. def softmax(X):
  22. X_exp = d2l.exp(X)
  23. partition = d2l.reduce_sum(X_exp, 1, keepdims=True)
  24. return X_exp / partition # The broadcasting mechanism is applied here

```{.python .input}

@tab pytorch

def softmax(X): X_exp = d2l.exp(X) partition = d2l.reduce_sum(X_exp, 1, keepdim=True) return X_exp / partition # The broadcasting mechanism is applied here

  1. As you can see, for any random input,
  2. we turn each element into a non-negative number.
  3. Moreover, each row sums up to 1,
  4. as is required for a probability.
  5. ```{.python .input}
  6. #@tab mxnet, pytorch
  7. X = d2l.normal(0, 1, (2, 5))
  8. X_prob = softmax(X)
  9. X_prob, d2l.reduce_sum(X_prob, 1)

```{.python .input}

@tab tensorflow

X = tf.random.normal((2, 5), 0, 1) X_prob = softmax(X) X_prob, tf.reduce_sum(X_prob, 1)

  1. Note that while this looks correct mathematically,
  2. we were a bit sloppy in our implementation
  3. because we failed to take precautions against numerical overflow or underflow
  4. due to large or very small elements of the matrix.
  5. ## Defining the Model
  6. Now that we have defined the softmax operation,
  7. we can implement the softmax regression model.
  8. The below code defines how the input is mapped to the output through the network.
  9. Note that we flatten each original image in the batch
  10. into a vector using the `reshape` function
  11. before passing the data through our model.
  12. ```{.python .input}
  13. #@tab all
  14. def net(X):
  15. return softmax(d2l.matmul(d2l.reshape(X, (-1, W.shape[0])), W) + b)

Defining the Loss Function

Next, we need to implement the cross-entropy loss function, as introduced in :numref:sec_softmax. This may be the most common loss function in all of deep learning because, at the moment, classification problems far outnumber regression problems.

Recall that cross-entropy takes the negative log-likelihood of the predicted probability assigned to the true label. Rather than iterating over the predictions with a Python for-loop (which tends to be inefficient), we can pick all elements by a single operator. Below, we create a toy data y_hat with 2 examples of predicted probabilities over 3 classes. Then we pick the probability of the first class in the first example and the probability of the third class in the second example.

```{.python .input}

@tab mxnet, pytorch

y = d2l.tensor([0, 2]) y_hat = d2l.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]]) y_hat[[0, 1], y]

  1. ```{.python .input}
  2. #@tab tensorflow
  3. y_hat = tf.constant([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
  4. y = tf.constant([0, 2])
  5. tf.boolean_mask(y_hat, tf.one_hot(y, depth=y_hat.shape[-1]))

Now we can implement the cross-entropy loss function efficiently with just one line of code.

```{.python .input}

@tab mxnet, pytorch

def cross_entropy(y_hat, y): return - d2l.log(y_hat[range(len(y_hat)), y])

cross_entropy(y_hat, y)

  1. ```{.python .input}
  2. #@tab tensorflow
  3. def cross_entropy(y_hat, y):
  4. return -tf.math.log(tf.boolean_mask(
  5. y_hat, tf.one_hot(y, depth=y_hat.shape[-1])))
  6. cross_entropy(y_hat, y)

Classification Accuracy

Given the predicted probability distribution y_hat, we typically choose the class with the highest predicted probability whenever we must output a hard prediction. Indeed, many applications require that we make a choice. Gmail must categorize an email into “Primary”, “Social”, “Updates”, or “Forums”. It might estimate probabilities internally, but at the end of the day it has to choose one among the classes.

When predictions are consistent with the label class y, they are correct. The classification accuracy is the fraction of all predictions that are correct. Although it can be difficult to optimize accuracy directly (it is not differentiable), it is often the performance measure that we care most about, and we will nearly always report it when training classifiers.

To compute accuracy we do the following. First, if y_hat is a matrix, we assume that the second dimension stores prediction scores for each class. We use argmax to obtain the predicted class by the index for the largest entry in each row. Then we compare the predicted class with the ground-truth y elementwise. Since the equality operator == is sensitive to data types, we convert y_hat‘s data type to match that of y. The result is a tensor containing entries of 0 (false) and 1 (true). Taking the sum yields the number of correct predictions.

```{.python .input}

@tab all

def accuracy(y_hat, y): #@save “””Compute the number of correct predictions.””” if len(y_hat.shape) > 1 and y_hat.shape[1] > 1: y_hat = d2l.argmax(y_hat, axis=1)
cmp = d2l.astype(y_hat, y.dtype) == y return float(d2l.reduce_sum(d2l.astype(cmp, y.dtype)))

  1. We will continue to use the variables `y_hat` and `y`
  2. defined before
  3. as the predicted probability distributions and labels, respectively.
  4. We can see that the first example's prediction class is 2
  5. (the largest element of the row is 0.6 with the index 2),
  6. which is inconsistent with the actual label, 0.
  7. The second example's prediction class is 2
  8. (the largest element of the row is 0.5 with the index of 2),
  9. which is consistent with the actual label, 2.
  10. Therefore, the classification accuracy rate for these two examples is 0.5.
  11. ```{.python .input}
  12. #@tab all
  13. accuracy(y_hat, y) / len(y)

Similarly, we can evaluate the accuracy for any model net on a dataset that is accessed via the data iterator data_iter.

```{.python .input}

@tab mxnet, tensorflow

def evaluateaccuracy(net, data_iter): #@save “””Compute the accuracy for a model on a dataset.””” metric = Accumulator(2) # No. of correct predictions, no. of predictions for , (X, y) in enumerate(data_iter): metric.add(accuracy(net(X), y), d2l.size(y)) return metric[0] / metric[1]

  1. ```{.python .input}
  2. #@tab pytorch
  3. def evaluate_accuracy(net, data_iter): #@save
  4. """Compute the accuracy for a model on a dataset."""
  5. if isinstance(net, torch.nn.Module):
  6. net.eval() # Set the model to evaluation mode
  7. metric = Accumulator(2) # No. of correct predictions, no. of predictions
  8. for _, (X, y) in enumerate(data_iter):
  9. metric.add(accuracy(net(X), y), d2l.size(y))
  10. return metric[0] / metric[1]

Here Accumulator is a utility class to accumulate sums over multiple variables. In the above evaluate_accuracy function, we create 2 variables in the Accumulator instance for storing both the number of correct predictions and the number of predictions, respectively. Both will be accumulated over time as we iterate over the dataset.

```{.python .input}

@tab all

class Accumulator: #@save “””For accumulating sums over n variables.””” def init(self, n): self.data = [0.0] * n

  1. def add(self, *args):
  2. self.data = [a + float(b) for a, b in zip(self.data, args)]
  3. def reset(self):
  4. self.data = [0.0] * len(self.data)
  5. def __getitem__(self, idx):
  6. return self.data[idx]
  1. Because we initialized the `net` model with random weights,
  2. the accuracy of this model should be close to random guessing,
  3. i.e., 0.1 for 10 classes.
  4. ```{.python .input}
  5. #@tab all
  6. evaluate_accuracy(net, test_iter)

Training

The training loop for softmax regression should look strikingly familiar if you read through our implementation of linear regression in :numref:sec_linear_scratch. Here we refactor the implementation to make it reusable. First, we define a function to train for one epoch. Note that updater is a general function to update the model parameters, which accepts the batch size as an argument. It can be either a wrapper of the d2l.sgd function or a framework’s built-in optimization function.

```{.python .input} def train_epoch_ch3(net, train_iter, loss, updater): #@save “””Train a model within one epoch (defined in Chapter 3).”””

  1. # Sum of training loss, sum of training accuracy, no. of examples
  2. metric = Accumulator(3)
  3. if isinstance(updater, gluon.Trainer):
  4. updater = updater.step
  5. for X, y in train_iter:
  6. # Compute gradients and update parameters
  7. with autograd.record():
  8. y_hat = net(X)
  9. l = loss(y_hat, y)
  10. l.backward()
  11. updater(X.shape[0])
  12. metric.add(float(l.sum()), accuracy(y_hat, y), y.size)
  13. # Return training loss and training accuracy
  14. return metric[0] / metric[2], metric[1] / metric[2]
  1. ```{.python .input}
  2. #@tab pytorch
  3. def train_epoch_ch3(net, train_iter, loss, updater): #@save
  4. """The training loop defined in Chapter 3."""
  5. # Set the model to training mode
  6. if isinstance(net, torch.nn.Module):
  7. net.train()
  8. # Sum of training loss, sum of training accuracy, no. of examples
  9. metric = Accumulator(3)
  10. for X, y in train_iter:
  11. # Compute gradients and update parameters
  12. y_hat = net(X)
  13. l = loss(y_hat, y)
  14. if isinstance(updater, torch.optim.Optimizer):
  15. updater.zero_grad()
  16. l.backward()
  17. updater.step()
  18. metric.add(float(l) * len(y), accuracy(y_hat, y),
  19. y.size().numel())
  20. else:
  21. l.sum().backward()
  22. updater(X.shape[0])
  23. metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
  24. # Return training loss and training accuracy
  25. return metric[0] / metric[2], metric[1] / metric[2]

```{.python .input}

@tab tensorflow

def train_epoch_ch3(net, train_iter, loss, updater): #@save “””The training loop defined in Chapter 3.”””

  1. # Sum of training loss, sum of training accuracy, no. of examples
  2. metric = Accumulator(3)
  3. for X, y in train_iter:
  4. # Compute gradients and update parameters
  5. with tf.GradientTape() as tape:
  6. y_hat = net(X)
  7. # Keras implementations for loss takes (labels, predictions)
  8. # instead of (predictions, labels) that users might implement
  9. # in this book, e.g. `cross_entropy` that we implemented above
  10. if isinstance(loss, tf.keras.losses.Loss):
  11. l = loss(y, y_hat)
  12. else:
  13. l = loss(y_hat, y)
  14. if isinstance(updater, tf.keras.optimizers.Optimizer):
  15. params = net.trainable_variables
  16. grads = tape.gradient(l, params)
  17. updater.apply_gradients(zip(grads, params))
  18. else:
  19. updater(X.shape[0], tape.gradient(l, updater.params))
  20. # Keras loss by default returns the average loss in a batch
  21. l_sum = l * float(tf.size(y)) if isinstance(
  22. loss, tf.keras.losses.Loss) else tf.reduce_sum(l)
  23. metric.add(l_sum, accuracy(y_hat, y), tf.size(y))
  24. # Return training loss and training accuracy
  25. return metric[0] / metric[2], metric[1] / metric[2]
  1. Before showing the implementation of the training function,
  2. we define a utility class that plot data in animation.
  3. Again, it aims to simplify code in the rest of the book.
  4. ```{.python .input}
  5. #@tab all
  6. class Animator: #@save
  7. """For plotting data in animation."""
  8. def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
  9. ylim=None, xscale='linear', yscale='linear',
  10. fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
  11. figsize=(3.5, 2.5)):
  12. # Incrementally plot multiple lines
  13. if legend is None:
  14. legend = []
  15. d2l.use_svg_display()
  16. self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
  17. if nrows * ncols == 1:
  18. self.axes = [self.axes, ]
  19. # Use a lambda function to capture arguments
  20. self.config_axes = lambda: d2l.set_axes(
  21. self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
  22. self.X, self.Y, self.fmts = None, None, fmts
  23. def add(self, x, y):
  24. # Add multiple data points into the figure
  25. if not hasattr(y, "__len__"):
  26. y = [y]
  27. n = len(y)
  28. if not hasattr(x, "__len__"):
  29. x = [x] * n
  30. if not self.X:
  31. self.X = [[] for _ in range(n)]
  32. if not self.Y:
  33. self.Y = [[] for _ in range(n)]
  34. for i, (a, b) in enumerate(zip(x, y)):
  35. if a is not None and b is not None:
  36. self.X[i].append(a)
  37. self.Y[i].append(b)
  38. self.axes[0].cla()
  39. for x, y, fmt in zip(self.X, self.Y, self.fmts):
  40. self.axes[0].plot(x, y, fmt)
  41. self.config_axes()
  42. display.display(self.fig)
  43. display.clear_output(wait=True)

The following training function then trains a model net on a training dataset accessed via train_iter for multiple epochs, which is specified by num_epochs. At the end of each epoch, the model is evaluated on a testing dataset accessed via test_iter. We will leverage the Animator class to visualize the training progress.

```{.python .input}

@tab all

def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater): #@save “””Train a model (defined in Chapter 3).””” animator = Animator(xlabel=’epoch’, xlim=[1, num_epochs], ylim=[0.3, 0.9], legend=[‘train loss’, ‘train acc’, ‘test acc’]) for epoch in range(num_epochs): train_metrics = train_epoch_ch3(net, train_iter, loss, updater) test_acc = evaluate_accuracy(net, test_iter) animator.add(epoch + 1, train_metrics + (test_acc,)) train_loss, train_acc = train_metrics assert train_loss < 0.5, train_loss assert train_acc <= 1 and train_acc > 0.7, train_acc assert test_acc <= 1 and test_acc > 0.7, test_acc

  1. As an implementation from scratch,
  2. we use the minibatch stochastic gradient descent defined in :numref:`sec_linear_scratch`
  3. to optimize the loss function of the model with a learning rate 0.1.
  4. ```{.python .input}
  5. #@tab mxnet, pytorch
  6. lr = 0.1
  7. def updater(batch_size):
  8. return d2l.sgd([W, b], lr, batch_size)

```{.python .input}

@tab tensorflow

class Updater(): #@save “””For updating parameters using minibatch stochastic gradient descent.””” def init(self, params, lr): self.params = params self.lr = lr

  1. def __call__(self, batch_size, grads):
  2. d2l.sgd(self.params, grads, self.lr, batch_size)

updater = Updater([W, b], lr=0.1)

  1. Now we train the model with 10 epochs.
  2. Note that both the number of epochs (`num_epochs`),
  3. and learning rate (`lr`) are adjustable hyperparameters.
  4. By changing their values, we may be able
  5. to increase the classification accuracy of the model.
  6. ```{.python .input}
  7. #@tab all
  8. num_epochs = 10
  9. train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater)

Prediction

Now that training is complete, our model is ready to classify some images. Given a series of images, we will compare their actual labels (first line of text output) and the predictions from the model (second line of text output).

```{.python .input}

@tab all

def predict_ch3(net, test_iter, n=6): #@save “””Predict labels (defined in Chapter 3).””” for X, y in test_iter: break trues = d2l.get_fashion_mnist_labels(y) preds = d2l.get_fashion_mnist_labels(d2l.argmax(net(X), axis=1)) titles = [true +’\n’ + pred for true, pred in zip(trues, preds)] d2l.show_images(d2l.reshape(X[0:n], (n, 28, 28)), 1, n, titles=titles[0:n])

predict_ch3(net, test_iter) ```

Summary

  • With softmax regression, we can train models for multiclass classification.
  • The training loop of softmax regression is very similar to that in linear regression: retrieve and read data, define models and loss functions, then train models using optimization algorithms. As you will soon find out, most common deep learning models have similar training procedures.

Exercises

  1. In this section, we directly implemented the softmax function based on the mathematical definition of the softmax operation. What problems might this cause? Hint: try to calculate the size of $\exp(50)$.
  2. The function cross_entropy in this section was implemented according to the definition of the cross-entropy loss function. What could be the problem with this implementation? Hint: consider the domain of the logarithm.
  3. What solutions you can think of to fix the two problems above?
  4. Is it always a good idea to return the most likely label? For example, would you do this for medical diagnosis?
  5. Assume that we want to use softmax regression to predict the next word based on some features. What are some problems that might arise from a large vocabulary?

:begin_tab:mxnet Discussions :end_tab:

:begin_tab:pytorch Discussions :end_tab:

:begin_tab:tensorflow Discussions :end_tab: