# Concise Implementation of Softmax Regression
:label:`sec_softmax_concise`

Just as high-level APIs of deep learning frameworks made it much easier to implement linear regression in :numref:`sec_linear_concise`, we will find it similarly (or possibly more) convenient for implementing classification models. Let us stick with the Fashion-MNIST dataset and keep the batch size at 256 as in :numref:`sec_softmax_scratch`.

```{.python .input}
from d2l import mxnet as d2l
from mxnet import gluon, init, npx
from mxnet.gluon import nn
npx.set_np()
```

```{.python .input}
#@tab pytorch
from d2l import torch as d2l
import torch
from torch import nn
```

```{.python .input}
#@tab tensorflow
from d2l import tensorflow as d2l
import tensorflow as tf
```

```{.python .input}
#@tab all
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
```

## Initializing Model Parameters

As mentioned in :numref:`sec_softmax`, the output layer of softmax regression is a fully-connected layer. Therefore, to implement our model, we just need to add one fully-connected layer with 10 outputs to our `Sequential`. Again, here, the `Sequential` is not really necessary, but we might as well form the habit since it will be ubiquitous when implementing deep models. Again, we initialize the weights at random with zero mean and standard deviation 0.01.

```{.python .input}
net = nn.Sequential()
net.add(nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))
```

```{.python .input}
#@tab pytorch
# PyTorch does not implicitly reshape the inputs. Thus we define a layer to
# reshape the inputs in our network
class Reshape(torch.nn.Module):
    def forward(self, x):
        return x.view(-1, 784)

net = nn.Sequential(Reshape(), nn.Linear(784, 10))

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)
```

```{.python .input}
#@tab tensorflow
net = tf.keras.models.Sequential()
net.add(tf.keras.layers.Flatten(input_shape=(28, 28)))
weight_initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)
net.add(tf.keras.layers.Dense(10, kernel_initializer=weight_initializer))
```

## Softmax Implementation Revisited
:label:`subsec_softmax-implementation-revisited`

In the previous example of :numref:`sec_softmax_scratch`, we calculated our model's output and then ran this output through the cross-entropy loss. Mathematically, that is a perfectly reasonable thing to do. However, from a computational perspective, exponentiation can be a source of numerical stability issues.

Recall that the softmax function calculates $\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$, where $\hat y_j$ is the $j^\mathrm{th}$ element of the predicted probability distribution $\hat{\mathbf{y}}$ and $o_j$ is the $j^\mathrm{th}$ element of the logits $\mathbf{o}$. If some of the $o_k$ are very large (i.e., very positive), then $\exp(o_k)$ might be larger than the largest number we can represent for certain data types (i.e., *overflow*). This would make the denominator (and/or numerator) `inf` (infinity) and we wind up encountering either 0, `inf`, or `nan` (not a number) for $\hat y_j$. In these situations we do not get a well-defined return value for cross-entropy.

One trick to get around this is to first subtract $\max(o_k)$ from all $o_k$ before proceeding with the softmax calculation. You can verify that shifting each $o_k$ by a constant does not change the return value of softmax. After the subtraction and normalization step, it might be that some $o_j$ have large negative values and thus that the corresponding $\exp(o_j)$ will take values close to zero. These might be rounded to zero due to finite precision (i.e., *underflow*), making $\hat y_j$ zero and giving us `-inf` for $\log(\hat y_j)$. A few steps down the road in backpropagation, we might find ourselves faced with a screenful of the dreaded `nan` results.

Fortunately, we are saved by the fact that even though we are computing exponential functions, we ultimately intend to take their log (when calculating the cross-entropy loss). By combining the softmax and cross-entropy operators, we can escape the numerical stability issues that might otherwise plague us during backpropagation. As shown in the equation below, we avoid calculating $\exp(o_j)$ and can use $o_j$ directly, thanks to the canceling in $\log(\exp(\cdot))$:

$$
\begin{aligned}
\log{(\hat y_j)} & = \log\left( \frac{\exp(o_j)}{\sum_k \exp(o_k)}\right) \\
& = \log{(\exp(o_j))}-\log{\left( \sum_k \exp(o_k) \right)} \\
& = o_j -\log{\left( \sum_k \exp(o_k) \right)}.
\end{aligned}
$$

We will want to keep the conventional softmax function handy in case we ever want to evaluate the output probabilities of our model. But instead of passing softmax probabilities into our new loss function, we will just pass the logits and compute the softmax and its log all at once inside the cross-entropy loss function, which does smart things like the ["LogSumExp trick"](https://en.wikipedia.org/wiki/LogSumExp).
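
To make the overflow and underflow issues above concrete, here is a small, purely illustrative sketch in plain NumPy (not part of the model in this section; the logit values are made up just to force the failure modes). It contrasts a naive softmax with the max-subtraction and LogSumExp computations.

```{.python .input}
#@tab all
# Plain NumPy illustration only; values chosen to trigger overflow/underflow
import numpy

o = numpy.array([1000., 9000., 10000.])  # extreme logits

# Naive softmax: every exp(o_k) overflows to inf, so inf / inf yields nan
# (NumPy also prints overflow and invalid-value warnings here)
naive = numpy.exp(o) / numpy.exp(o).sum()

# Max-subtraction trick: the largest exponent becomes 0, avoiding overflow,
# but the smaller entries underflow to 0, so their log would be -inf
shifted = o - o.max()
stable = numpy.exp(shifted) / numpy.exp(shifted).sum()

# LogSumExp: o_j - max(o) - log(sum_k exp(o_k - max(o))) stays finite
log_probs = shifted - numpy.log(numpy.exp(shifted).sum())

print(naive)              # [nan nan nan]
print(stable)             # [0. 0. 1.]
print(numpy.log(stable))  # [-inf -inf 0.] (with a divide-by-zero warning)
print(log_probs)          # [-9000. -1000. 0.], finite, as in the equation above
```

The built-in loss functions defined below accept the raw logits and take care of this bookkeeping internally.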
```{.python .input}
loss = gluon.loss.SoftmaxCrossEntropyLoss()
```

```{.python .input}
#@tab pytorch
loss = nn.CrossEntropyLoss()
```

```{.python .input}
#@tab tensorflow
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```

## Optimization Algorithm

Here, we use minibatch stochastic gradient descent with a learning rate of 0.1 as the optimization algorithm. Note that this is the same choice as in the linear regression example, which illustrates the general applicability of the optimizers.

```{.python .input}
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
```

```{.python .input}
#@tab pytorch
trainer = torch.optim.SGD(net.parameters(), lr=0.1)
```

```{.python .input}
#@tab tensorflow
trainer = tf.keras.optimizers.SGD(learning_rate=.1)
```

## Training

Next we call the training function defined in :numref:`sec_softmax_scratch` to train the model.

```{.python .input}
#@tab all
num_epochs = 10
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
```

As before, this algorithm converges to a solution that achieves a decent accuracy, albeit this time with far fewer lines of code.
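
If you would like to read off the final test accuracy as a number rather than from the training plot, a minimal sketch follows; it assumes that the `evaluate_accuracy` helper defined in :numref:`sec_softmax_scratch` was saved to the `d2l` package.

```{.python .input}
#@tab all
# Assumes `evaluate_accuracy` from the previous section is available via d2l
test_acc = d2l.evaluate_accuracy(net, test_iter)
print(f'test accuracy: {test_acc:.3f}')
```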

## Summary

* Using high-level APIs, we can implement softmax regression much more concisely.
* From a computational perspective, implementing softmax regression has intricacies. Note that in many cases, a deep learning framework takes additional precautions beyond these most well-known tricks to ensure numerical stability, saving us from even more pitfalls that we would encounter if we tried to code all of our models from scratch in practice.

## Exercises

  1. Try adjusting the hyperparameters, such as the batch size, number of epochs, and learning rate, to see what the results are.
2. Increase the number of epochs for training. Why might the test accuracy decrease after a while? How could we fix this?

:begin_tab:`mxnet`
Discussions
:end_tab:

:begin_tab:`pytorch`
Discussions
:end_tab:

:begin_tab:`tensorflow`
Discussions
:end_tab: