# Implementation of Recurrent Neural Networks from Scratch
:label:`sec_rnn_scratch`

In this section we will implement an RNN from scratch for a character-level language model, according to our descriptions in :numref:`sec_rnn`. Such a model will be trained on H. G. Wells' *The Time Machine*. As before, we start by reading the dataset, which is introduced in :numref:`sec_language_model`.

```{.python .input}
%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import autograd, gluon, init, np, npx
npx.set_np()
```

```{.python .input}
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import math
import torch
from torch import nn
from torch.nn import functional as F
```

```{.python .input}
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
import math
import numpy as np
import tensorflow as tf
```

```{.python .input}
#@tab all
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
```

```{.python .input}
#@tab tensorflow
train_random_iter, vocab_random_iter = d2l.load_data_time_machine(
    batch_size, num_steps, use_random_iter=True)
```

## One-Hot Encoding

Recall that each token is represented as a numerical index in `train_iter`. Feeding these indices directly to a neural network might make it hard to learn. We often represent each token as a more expressive feature vector. The easiest representation is called *one-hot encoding*, which is introduced in :numref:`subsec_classification-problem`.

In a nutshell, we map each index to a different unit vector: assume that the number of different tokens in the vocabulary is $N$ (`len(vocab)`) and the token indices range from 0 to $N-1$. If the index of a token is the integer $i$, then we create a vector of all 0s with a length of $N$ and set the element at position $i$ to 1. This vector is the one-hot vector of the original token. The one-hot vectors with indices 0 and 2 are shown below.

```{.python .input}
npx.one_hot(np.array([0, 2]), len(vocab))
```

```{.python .input}
#@tab pytorch
F.one_hot(torch.tensor([0, 2]), len(vocab))
```

```{.python .input}
#@tab tensorflow
tf.one_hot(tf.constant([0, 2]), len(vocab))
```

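To make the mapping itself concrete, here is a minimal pure-Python sketch of the same idea. It is illustrative only; the `one_hot_row` helper is hypothetical and the book relies on the framework `one_hot` functions above.

```python
# Illustrative sketch: build a one-hot vector by hand.
def one_hot_row(index, num_classes):
    vec = [0] * num_classes  # a vector of all 0s with length `num_classes`
    vec[index] = 1           # set position `index` to 1
    return vec

one_hot_row(2, 5)  # [0, 0, 1, 0, 0]
```
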

The shape of the minibatch that we sample each time is (batch size, number of time steps). The `one_hot` function transforms such a minibatch into a three-dimensional tensor whose last dimension equals the vocabulary size (`len(vocab)`). We often transpose the input so that we obtain an output of shape (number of time steps, batch size, vocabulary size). This will allow us to more conveniently loop through the outermost dimension for updating hidden states of a minibatch, time step by time step.

```{.python .input}
X = d2l.reshape(d2l.arange(10), (2, 5))
npx.one_hot(X.T, 28).shape
```

```{.python .input}
#@tab pytorch
X = d2l.reshape(d2l.arange(10), (2, 5))
F.one_hot(X.T, 28).shape
```

```{.python .input}
#@tab tensorflow
X = d2l.reshape(d2l.arange(10), (2, 5))
tf.one_hot(tf.transpose(X), 28).shape
```

## Initializing the Model Parameters

Next, we initialize the model parameters for the RNN model. The number of hidden units `num_hiddens` is a tunable hyperparameter. When training language models, the inputs and outputs are from the same vocabulary. Hence, they have the same dimension, which is equal to the vocabulary size.

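
Concretely, writing the vocabulary size as $d$ (`vocab_size`) and the number of hidden units as $h$ (`num_hiddens`), the parameters created below have the shapes

$$\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}, \quad \mathbf{W}_{hh} \in \mathbb{R}^{h \times h}, \quad \mathbf{b}_h \in \mathbb{R}^{h}, \quad \mathbf{W}_{hq} \in \mathbb{R}^{h \times d}, \quad \mathbf{b}_q \in \mathbb{R}^{d}.$$
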
```{.python .input}
def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return np.random.normal(scale=0.01, size=shape, ctx=device)

    # Hidden layer parameters
    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = d2l.zeros(num_hiddens, ctx=device)
    # Output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = d2l.zeros(num_outputs, ctx=device)
    # Attach gradients
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params
```

```{.python .input}
#@tab pytorch
def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    # Hidden layer parameters
    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = d2l.zeros(num_hiddens, device=device)
    # Output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = d2l.zeros(num_outputs, device=device)
    # Attach gradients
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params
```
```{.python .input}
#@tab tensorflow
def get_params(vocab_size, num_hiddens):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return d2l.normal(shape=shape, stddev=0.01, mean=0, dtype=tf.float32)

    # Hidden layer parameters
    W_xh = tf.Variable(normal((num_inputs, num_hiddens)), dtype=tf.float32)
    W_hh = tf.Variable(normal((num_hiddens, num_hiddens)), dtype=tf.float32)
    b_h = tf.Variable(d2l.zeros(num_hiddens), dtype=tf.float32)
    # Output layer parameters
    W_hq = tf.Variable(normal((num_hiddens, num_outputs)), dtype=tf.float32)
    b_q = tf.Variable(d2l.zeros(num_outputs), dtype=tf.float32)
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    return params
```

## RNN Model

To define an RNN model, we first need an `init_rnn_state` function to return the hidden state at initialization. It returns a tensor filled with 0 and with a shape of (batch size, number of hidden units). Using tuples makes it easier to handle situations where the hidden state contains multiple variables, which we will encounter in later sections.

```{.python .input}
def init_rnn_state(batch_size, num_hiddens, device):
    return (d2l.zeros((batch_size, num_hiddens), ctx=device), )
```

```{.python .input}
#@tab pytorch
def init_rnn_state(batch_size, num_hiddens, device):
    return (d2l.zeros((batch_size, num_hiddens), device=device), )
```

```{.python .input}
#@tab tensorflow
def init_rnn_state(batch_size, num_hiddens):
    return (d2l.zeros((batch_size, num_hiddens)), )
```

The following `rnn` function defines how to compute the hidden state and the output at a time step. Note that the RNN model loops through the outermost dimension of `inputs` so that it updates the hidden state `H` of a minibatch, time step by time step. In addition, the activation function here is the $\tanh$ function. As described in :numref:`sec_mlp`, the mean value of the $\tanh$ function is 0 when the elements are uniformly distributed over the real numbers.

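
For reference, each pass through the loop below computes the recurrence from :numref:`sec_rnn`,

$$\mathbf{H}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h), \qquad \mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q,$$

where the output $\mathbf{O}_t$ corresponds to `Y` in the code.
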
```{.python .input}
def rnn(inputs, state, params):
    # Shape of `inputs`: (`num_steps`, `batch_size`, `vocab_size`)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    # Shape of `X`: (`batch_size`, `vocab_size`)
    for X in inputs:
        H = np.tanh(np.dot(X, W_xh) + np.dot(H, W_hh) + b_h)
        Y = np.dot(H, W_hq) + b_q
        outputs.append(Y)
    return np.concatenate(outputs, axis=0), (H,)
```

```{.python .input}
#@tab pytorch
def rnn(inputs, state, params):
    # Here `inputs` shape: (`num_steps`, `batch_size`, `vocab_size`)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    # Shape of `X`: (`batch_size`, `vocab_size`)
    for X in inputs:
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
        Y = torch.mm(H, W_hq) + b_q
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H,)
```
```{.python .input}
#@tab tensorflow
def rnn(inputs, state, params):
    # Here `inputs` shape: (`num_steps`, `batch_size`, `vocab_size`)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    # Shape of `X`: (`batch_size`, `vocab_size`)
    for X in inputs:
        X = tf.reshape(X, [-1, W_xh.shape[0]])
        H = tf.tanh(tf.matmul(X, W_xh) + tf.matmul(H, W_hh) + b_h)
        Y = tf.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return d2l.concat(outputs, axis=0), (H,)
```

With all the needed functions defined, we next create a class to wrap these functions and store the parameters of an RNN model implemented from scratch.

```{.python .input}
class RNNModelScratch:  #@save
    """An RNN Model implemented from scratch."""
    def __init__(self, vocab_size, num_hiddens, device, get_params,
                 init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        X = npx.one_hot(X.T, self.vocab_size)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, ctx):
        return self.init_state(batch_size, self.num_hiddens, ctx)
```
```{.python .input}
#@tab pytorch
class RNNModelScratch:  #@save
    """An RNN Model implemented from scratch."""
    def __init__(self, vocab_size, num_hiddens, device,
                 get_params, init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, device):
        return self.init_state(batch_size, self.num_hiddens, device)
```

```{.python .input}
#@tab tensorflow
class RNNModelScratch:  #@save
    """An RNN Model implemented from scratch."""
    def __init__(self, vocab_size, num_hiddens, init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state, params):
        X = tf.one_hot(tf.transpose(X), self.vocab_size)
        X = tf.cast(X, tf.float32)
        return self.forward_fn(X, state, params)

    def begin_state(self, batch_size):
        return self.init_state(batch_size, self.num_hiddens)
```
Let us check whether the outputs have the correct shapes, e.g., to ensure that the dimensionality of the hidden state remains unchanged.

```{.python .input}
#@tab mxnet
num_hiddens = 512
model = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params,
                        init_rnn_state, rnn)
state = model.begin_state(X.shape[0], d2l.try_gpu())
Y, new_state = model(X.as_in_context(d2l.try_gpu()), state)
Y.shape, len(new_state), new_state[0].shape
```

```{.python .input}
#@tab pytorch
num_hiddens = 512
model = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params,
                        init_rnn_state, rnn)
state = model.begin_state(X.shape[0], d2l.try_gpu())
Y, new_state = model(X.to(d2l.try_gpu()), state)
Y.shape, len(new_state), new_state[0].shape
```

```{.python .input}
#@tab tensorflow
# Defining a TensorFlow training strategy
device_name = d2l.try_gpu()._device_name
strategy = tf.distribute.OneDeviceStrategy(device_name)
num_hiddens = 512
with strategy.scope():
    model = RNNModelScratch(len(vocab), num_hiddens,
                            init_rnn_state, rnn)
state = model.begin_state(X.shape[0])
params = get_params(len(vocab), num_hiddens)
Y, new_state = model(X, state, params)
Y.shape, len(new_state), new_state[0].shape
```

We can see that the output shape is (number of time steps $\times$ batch size, vocabulary size), while the hidden state shape remains the same, i.e., (batch size, number of hidden units).

## Prediction

Let us first define the prediction function to generate new characters following the user-provided `prefix`, which is a string containing several characters. When looping through these beginning characters in `prefix`, we keep passing the hidden state to the next time step without generating any output. This is called the *warm-up* period, during which the model updates itself (e.g., updates the hidden state) but does not make predictions. After the warm-up period, the hidden state is generally better than its initialized value at the beginning. So we generate the predicted characters and emit them.

```{.python .input}
def predict_ch8(prefix, num_preds, model, vocab, device):  #@save
    """Generate new characters following the `prefix`."""
    state = model.begin_state(batch_size=1, ctx=device)
    outputs = [vocab[prefix[0]]]
    get_input = lambda: d2l.reshape(
        d2l.tensor([outputs[-1]], ctx=device), (1, 1))
    for y in prefix[1:]:  # Warm-up period
        _, state = model(get_input(), state)
        outputs.append(vocab[y])
    for _ in range(num_preds):  # Predict `num_preds` steps
        y, state = model(get_input(), state)
        outputs.append(int(y.argmax(axis=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])
```

```{.python .input}
#@tab pytorch
def predict_ch8(prefix, num_preds, model, vocab, device):  #@save
    """Generate new characters following the `prefix`."""
    state = model.begin_state(batch_size=1, device=device)
    outputs = [vocab[prefix[0]]]
    get_input = lambda: d2l.reshape(d2l.tensor(
        [outputs[-1]], device=device), (1, 1))
    for y in prefix[1:]:  # Warm-up period
        _, state = model(get_input(), state)
        outputs.append(vocab[y])
    for _ in range(num_preds):  # Predict `num_preds` steps
        y, state = model(get_input(), state)
        outputs.append(int(y.argmax(dim=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])
```

```{.python .input}
#@tab tensorflow
def predict_ch8(prefix, num_preds, model, vocab, params):  #@save
    """Generate new characters following the `prefix`."""
    state = model.begin_state(batch_size=1)
    outputs = [vocab[prefix[0]]]
    get_input = lambda: d2l.reshape(d2l.tensor([outputs[-1]]), (1, 1)).numpy()
    for y in prefix[1:]:  # Warm-up period
        _, state = model(get_input(), state, params)
        outputs.append(vocab[y])
    for _ in range(num_preds):  # Predict `num_preds` steps
        y, state = model(get_input(), state, params)
        outputs.append(int(y.numpy().argmax(axis=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])
```

Now we can test the `predict_ch8` function. We specify the prefix as `time traveller ` and have it generate 10 additional characters. Given that we have not trained the network, it will generate nonsensical predictions.

```{.python .input}
#@tab mxnet,pytorch
predict_ch8('time traveller ', 10, model, vocab, d2l.try_gpu())
```

```{.python .input}
#@tab tensorflow
predict_ch8('time traveller ', 10, model, vocab, params)
```

## Gradient Clipping

For a sequence of length $T$, we compute the gradients over these $T$ time steps in an iteration, which results in a chain of matrix-products with length $\mathcal{O}(T)$ during backpropagation. As mentioned in :numref:`sec_numerical_stability`, this may cause numerical instability, e.g., the gradients may either explode or vanish when $T$ is large. Therefore, RNN models often need extra help to stabilize training.

Generally speaking, when solving an optimization problem, we take update steps for the model parameter, say in the vector form $\mathbf{x}$, in the direction of the negative gradient $\mathbf{g}$ on a minibatch. For example, with $\eta > 0$ as the learning rate, in one iteration we update $\mathbf{x}$ as $\mathbf{x} - \eta \mathbf{g}$. Let us further assume that the objective function $f$ is well behaved, say, *Lipschitz continuous* with constant $L$. That is to say, for any $\mathbf{x}$ and $\mathbf{y}$ we have

$$|f(\mathbf{x}) - f(\mathbf{y})| \leq L \|\mathbf{x} - \mathbf{y}\|.$$

In this case we can safely assume that if we update the parameter vector by $\eta \mathbf{g}$, then

$$|f(\mathbf{x}) - f(\mathbf{x} - \eta\mathbf{g})| \leq L \eta\|\mathbf{g}\|,$$

which means that we will not observe a change by more than $L \eta \|\mathbf{g}\|$. This is both a curse and a blessing. On the curse side, it limits the speed of making progress; whereas on the blessing side, it limits the extent to which things can go wrong if we move in the wrong direction.

Sometimes the gradients can be quite large and the optimization algorithm may fail to converge. We could address this by reducing the learning rate $\eta$. But what if we only *rarely* get large gradients? In this case such an approach may appear entirely unwarranted. One popular alternative is to clip the gradient $\mathbf{g}$ by projecting it back onto a ball of a given radius, say $\theta$, via

$$\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.$$

By doing so we know that the gradient norm never exceeds $\theta$ and that the updated gradient is entirely aligned with the original direction of $\mathbf{g}$. It also has the desirable side-effect of limiting the influence any given minibatch (and within it any given sample) can exert on the parameter vector. This bestows a certain degree of robustness to the model. Gradient clipping provides a quick fix to gradient explosion. While it does not entirely solve the problem, it is one of the many techniques to alleviate it.

Below we define a function to clip the gradients of a model that is implemented from scratch or a model constructed by the high-level APIs. Also note that we compute the gradient norm over all the model parameters.

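
To make the projection above concrete, here is a minimal framework-agnostic NumPy sketch (an illustration only, not the book's implementation, which follows below): the gradient is rescaled only when its norm exceeds $\theta$.

```python
import numpy as np

def clip_gradient(g, theta):
    """Project the gradient g back onto the ball of radius theta."""
    norm = np.sqrt((g ** 2).sum())
    if norm > theta:
        g = g * (theta / norm)  # rescale so that the new norm equals theta
    return g

clip_gradient(np.array([3.0, 4.0]), theta=1.0)  # norm 5 -> scaled to [0.6, 0.8]
```
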
```{.python .input}
def grad_clipping(model, theta):  #@save
    """Clip the gradient."""
    if isinstance(model, gluon.Block):
        params = [p.data() for p in model.collect_params().values()]
    else:
        params = model.params
    norm = math.sqrt(sum((p.grad ** 2).sum() for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```

```{.python .input}
#@tab pytorch
def grad_clipping(model, theta):  #@save
    """Clip the gradient."""
    if isinstance(model, nn.Module):
        params = [p for p in model.parameters() if p.requires_grad]
    else:
        params = model.params
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```

```{.python .input}
#@tab tensorflow
def grad_clipping(grads, theta):  #@save
    """Clip the gradient."""
    theta = tf.constant(theta, dtype=tf.float32)
    norm = tf.math.sqrt(sum((tf.reduce_sum(grad ** 2)).numpy()
                            for grad in grads))
    norm = tf.cast(norm, tf.float32)
    new_grad = []
    if tf.greater(norm, theta):
        for grad in grads:
            new_grad.append(grad * theta / norm)
    else:
        for grad in grads:
            new_grad.append(grad)
    return new_grad
```

## Training

Before training the model, let us define a function to train the model in one epoch. It differs from how we train the model of :numref:`sec_softmax_scratch` in three places:

1. Different sampling methods for sequential data (random sampling and sequential partitioning) will result in differences in the initialization of hidden states.
2. We clip the gradients before updating the model parameters. This ensures that the model does not diverge even when gradients blow up at some point during the training process.
3. We use perplexity to evaluate the model. As discussed in :numref:`subsec_perplexity`, this ensures that sequences of different length are comparable.

Specifically, when sequential partitioning is used, we initialize the hidden state only at the beginning of each epoch. Since the $i^\mathrm{th}$ subsequence example in the next minibatch is adjacent to the current $i^\mathrm{th}$ subsequence example, the hidden state at the end of the current minibatch will be used to initialize the hidden state at the beginning of the next minibatch. In this way, historical information of the sequence stored in the hidden state might flow over adjacent subsequences within an epoch. However, the computation of the hidden state at any point depends on all the previous minibatches in the same epoch, which complicates the gradient computation. To reduce computational cost, we detach the gradient before processing any minibatch so that the gradient computation of the hidden state is always limited to the time steps in one minibatch.

When using random sampling, we need to re-initialize the hidden state for each iteration since each example is sampled at a random position. The same as the `train_epoch_ch3` function in :numref:`sec_softmax_scratch`, `updater` is a general function to update the model parameters. It can be either the `d2l.sgd` function implemented from scratch or the built-in optimization function in a deep learning framework.

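
To see what detaching the state does, here is a tiny, self-contained PyTorch sketch (the tensors below are hypothetical and not part of the model above): once the state is detached, backpropagation stops at that point, so the gradient only covers the later steps.

```python
import torch

W = torch.randn(3, 3, requires_grad=True)
H = torch.randn(2, 3)  # hypothetical initial state, no gradient needed
for t in range(4):
    H = torch.tanh(H @ W)
    if t == 1:
        H = H.detach()  # cut the graph here: steps 0-1 receive no gradient
H.sum().backward()
# `W.grad` now reflects only the two loop steps after the detach
print(W.grad.norm())
```
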

```{.python .input}
def train_epoch_ch8(model, train_iter, loss, updater, device,  #@save
                    use_random_iter):
    """Train a model within one epoch (defined in Chapter 8)."""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize `state` when either it is the first iteration or
            # using random sampling
            state = model.begin_state(batch_size=X.shape[0], ctx=device)
        else:
            for s in state:
                s.detach()
        y = Y.T.reshape(-1)
        X, y = X.as_in_ctx(device), y.as_in_ctx(device)
        with autograd.record():
            y_hat, state = model(X, state)
            l = loss(y_hat, y).mean()
        l.backward()
        grad_clipping(model, 1)
        updater(batch_size=1)  # Since the `mean` function has been invoked
        metric.add(l * d2l.size(y), d2l.size(y))
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()
```
```{.python .input}
#@tab pytorch
def train_epoch_ch8(model, train_iter, loss, updater, device,  #@save
                    use_random_iter):
    """Train a model within one epoch (defined in Chapter 8)."""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize `state` when either it is the first iteration or
            # using random sampling
            state = model.begin_state(batch_size=X.shape[0], device=device)
        else:
            if isinstance(model, nn.Module) and not isinstance(state, tuple):
                # `state` is a tensor for `nn.GRU`
                state.detach_()
            else:
                # `state` is a tuple of tensors for `nn.LSTM` and
                # for our custom scratch implementation
                for s in state:
                    s.detach_()
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        y_hat, state = model(X, state)
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()
            l.backward()
            grad_clipping(model, 1)
            updater.step()
        else:
            l.backward()
            grad_clipping(model, 1)
            # Since the `mean` function has been invoked
            updater(batch_size=1)
        metric.add(l * d2l.size(y), d2l.size(y))
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()
```

```{.python .input}
#@tab tensorflow
def train_epoch_ch8(model, train_iter, loss, updater,  #@save
                    params, use_random_iter):
    """Train a model within one epoch (defined in Chapter 8)."""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize `state` when either it is the first iteration or
            # using random sampling
            state = model.begin_state(batch_size=X.shape[0])
        with tf.GradientTape(persistent=True) as g:
            g.watch(params)
            y_hat, state = model(X, state, params)
            y = d2l.reshape(Y, (-1))
            l = loss(y, y_hat)
        grads = g.gradient(l, params)
        grads = grad_clipping(grads, 1)
        updater.apply_gradients(zip(grads, params))

        # Keras loss by default returns the average loss in a batch
        # l_sum = l * float(d2l.size(y)) if isinstance(
        #     loss, tf.keras.losses.Loss) else tf.reduce_sum(l)
        metric.add(l * d2l.size(y), d2l.size(y))
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()
```
The training function supports an RNN model implemented either from scratch or using high-level APIs.

```{.python .input}
def train_ch8(model, train_iter, vocab, lr, num_epochs, device,  #@save
              use_random_iter=False):
    """Train a model (defined in Chapter 8)."""
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    # Initialize
    if isinstance(model, gluon.Block):
        model.initialize(ctx=device, force_reinit=True,
                         init=init.Normal(0.01))
        trainer = gluon.Trainer(model.collect_params(),
                                'sgd', {'learning_rate': lr})
        updater = lambda batch_size: trainer.step(batch_size)
    else:
        updater = lambda batch_size: d2l.sgd(model.params, lr, batch_size)
    predict = lambda prefix: predict_ch8(prefix, 50, model, vocab, device)
    # Train and predict
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(
            model, train_iter, loss, updater, device, use_random_iter)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, [ppl])
    print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
    print(predict('time traveller'))
    print(predict('traveller'))
```

```{.python .input}
#@tab pytorch
#@save
def train_ch8(model, train_iter, vocab, lr, num_epochs, device,
              use_random_iter=False):
    """Train a model (defined in Chapter 8)."""
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    # Initialize
    if isinstance(model, nn.Module):
        updater = torch.optim.SGD(model.parameters(), lr)
    else:
        updater = lambda batch_size: d2l.sgd(model.params, lr, batch_size)
    predict = lambda prefix: predict_ch8(prefix, 50, model, vocab, device)
    # Train and predict
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(
            model, train_iter, loss, updater, device, use_random_iter)
        if (epoch + 1) % 10 == 0:
            print(predict('time traveller'))
            animator.add(epoch + 1, [ppl])
    print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
    print(predict('time traveller'))
    print(predict('traveller'))
```
```{.python .input}
#@tab tensorflow
#@save
def train_ch8(model, train_iter, vocab, num_hiddens, lr, num_epochs, strategy,
              use_random_iter=False):
    """Train a model (defined in Chapter 8)."""
    with strategy.scope():
        params = get_params(len(vocab), num_hiddens)
        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        updater = tf.keras.optimizers.SGD(lr)
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    predict = lambda prefix: predict_ch8(prefix, 50, model, vocab, params)
    # Train and predict
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(
            model, train_iter, loss, updater, params, use_random_iter)
        if (epoch + 1) % 10 == 0:
            print(predict('time traveller'))
            animator.add(epoch + 1, [ppl])
    device = d2l.try_gpu()._device_name
    print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
    print(predict('time traveller'))
    print(predict('traveller'))
```

Now we can train the RNN model. Since we use only 10,000 tokens in the dataset, the model needs more epochs to converge better.

```{.python .input}
#@tab mxnet,pytorch
num_epochs, lr = 500, 1
train_ch8(model, train_iter, vocab, lr, num_epochs, d2l.try_gpu())
```

```{.python .input}
#@tab tensorflow
num_epochs, lr = 500, 1
train_ch8(model, train_iter, vocab, num_hiddens, lr, num_epochs, strategy)
```

Finally, let us check the results of using the random sampling method.

```{.python .input}
#@tab mxnet,pytorch
train_ch8(model, train_iter, vocab, lr, num_epochs, d2l.try_gpu(),
          use_random_iter=True)
```

```{.python .input}
#@tab tensorflow
params = get_params(len(vocab_random_iter), num_hiddens)
train_ch8(model, train_random_iter, vocab_random_iter, num_hiddens, lr,
          num_epochs, strategy, use_random_iter=True)
```

While implementing the above RNN model from scratch is instructive, it is not convenient. In the next section we will see how to improve the RNN model, such as by making it easier to implement and faster to run.

## Summary

* We can train an RNN-based character-level language model to generate text following the user-provided text prefix.
* A simple RNN language model consists of input encoding, RNN modeling, and output generation.
* RNN models need state initialization for training, though random sampling and sequential partitioning initialize the state in different ways.
* When using sequential partitioning, we need to detach the gradient to reduce computational cost.
* A warm-up period allows a model to update itself (e.g., obtain a better hidden state than its initialized value) before making any prediction.
* Gradient clipping prevents gradient explosion, but it cannot fix vanishing gradients.

## Exercises

1. Show that one-hot encoding is equivalent to picking a different embedding for each object.
2. Adjust the hyperparameters (e.g., number of epochs, number of hidden units, number of time steps in a minibatch, and learning rate) to improve the perplexity.
    * How low can you go?
    * Replace random sampling with sequential partitioning. Does this lead to better performance?
    * Replace one-hot encoding with learnable embeddings. Does this lead to better performance?
    * How well will it work on other books by H. G. Wells, e.g., *The War of the Worlds*?
3. Modify the prediction function so that it uses sampling rather than picking the most likely next character.
    * What happens?
    * Bias the model towards more likely outputs, e.g., by sampling from $q(x_t \mid x_{t-1}, \ldots, x_1) \propto P(x_t \mid x_{t-1}, \ldots, x_1)^\alpha$ for $\alpha > 1$.
4. Run the code in this section without clipping the gradient. What happens?
5. Change sequential partitioning so that it does not separate hidden states from the computational graph. Does the running time change? How about the perplexity?
6. Replace the activation function used in this section with ReLU and repeat the experiments in this section. Do we still need gradient clipping? Why?

:begin_tab:`mxnet`
Discussions
:end_tab:

:begin_tab:`pytorch`
Discussions
:end_tab:

:begin_tab:`tensorflow`
Discussions
:end_tab: