双向循环神经网络 - 《循环神经网络RNN》

深度循环神经网络的结构图
前向计算
- 公式推导
- 代码实现
反向传播
- 正向循环的反向传播
- 逆向循环的反向传播
代码实现
keras实现
- MNIST分类-单向循环神经网络
- MNIST分类-双向循环神经网络
代码位置

深度循环神经网络的结构图

前面学习的内容，都是因为“过去”的时间步的状态对“未来”的时间步的状态有影响，在本节中，我们将学习一种双向影响的结构，即双向循环神经网络。

比如在一个语音识别的模型中，可能前面的一个词听上去比较模糊，会产生多个猜测，但是后面的词都很清晰，于是可以用后面的词来为前面的词提供一个最有把握（概率最大）的猜测。再比如，在手写识别应用中，前面的笔划与后面的笔划是相互影响的，特别是后面的笔划对整个字的识别有较大的影响。

在本节中会出现两组相似的词：前向计算、反向传播、正向循环、逆向循环。区别如下：

前向计算：是指神经网络中通常所说的前向计算，包括正向循环的前向计算和逆向循环的前向计算。
反向传播：是指神经网络中通常所说的反向传播，包括正向循环的反向传播和逆向循环的反向传播。
正向循环：是指双向循环神经网络中的从左到右时间步。在正向过程中，会存在前向计算和反向传播。
逆向循环：是指双向循环神经网络中的从右到左时间步。在逆向过程中，也会存在前向计算和反向传播。

很多资料中关于双向循环神经网络的示意如图19-22所示。

双向循环神经网络 - 图1

在图19-22中，h_{tn}中的n表示时间步，在图中取值为1至4。

$$h{t1}$$至$$h{t4}$$是正向循环的四个隐层状态值，$$U$$、$$V$$、$$W$$ 分别是它们的权重矩阵值；
_$$h’{t1}$$至_$$h’{t4}$$是逆向循环的四个隐层状态值，$$U’$$、$$V’$$、 $$W’$$ 分别是它们的权重矩阵值；
$$S{t1}$$至$$S{t4}$$是正逆两个方向的隐层状态值的和。

但是，请大家记住，图19-22和上面的相关解释是不正确的！主要的问题集中在 s_t 是如何生成的。

h_t 和 h’_t 到s_t之间不是矩阵相乘的关系，所以没有 V 和 V’这两个权重矩阵。

正向循环的最后一个时间步h{t4}和逆向循环的第一个时间步h{t4}’共同生成s{t4}，这也是不对的。因为对于正向循环来说，用 h{t4} 没问题。但是对于逆向循环来说，h’{t4} 只是第一个时间步的结果，后面的计算还未发生，所以 h’{t4} 非常不准确。

正确的双向循环神经网络图应该如图19-23所示。

双向循环神经网络 - 图2

用h1/s1表示正向循环的隐层状态，U1、W1表示权重矩阵；用h2/s2表示逆向循环的隐层状态，U2、W2表示权重矩阵。s 是 h 的激活函数结果。

请注意上下两组x{t1}至x{t4}的顺序是相反的：

对于正向循环的最后一个时间步来说，$$x{t4}$$ 作为输入，$$s1{t4}$$是最后一个时间步的隐层值；
对于逆向循环的最后一个时间步来说，$$x{t1}$$ 作为输入，$$s2{t4}$$是最后一个时间步的隐层值；
然后 $$s1{t4}$$ 和 $$s2{t4}$$ 拼接得到 $$s_{t4}$$，再通过与权重矩阵 $$V$$ 相乘得出 $$Z$$。

这就解决了图19-22中的逆向循环在第一个时间步的输出不准确的问题，对于两个方向的循环，都是用最后一个时间步的输出。

图19-23中的 s 节点有两种，一种是绿色实心的，表示有实际输出；另一种是绿色空心的，表示没有实际输出，对于没有实际输出的节点，也不需要做反向传播的计算。

如果需要在每个时间步都有输出，那么图19-23也是一种合理的结构，而图19-22就无法解释了。

前向计算

我们先假设应用场景只需要在最后一个时间步有输出，所以t2所代表的所有中间步都没有a、loss、y三个节点（用空心的圆表示），只有最后一个时间步有输出。

与前面的单向循环网络不同的是，由于有逆向网络的存在，在逆向过程中，t3是第一个时间步，t1是最后一个时间步，所以t1也应该有输出。

公式推导

h1 = x \cdot U1 + s1_{t-1} \cdot W1 \tag{1}

注意公式1在t1时，s1_{t-1}是空，所以加法的第二项不存在。

s1 = Tanh(h1) \tag{2}

h2 = x \cdot U2 + s2_{t-1} \cdot W2 \tag{3}

注意公式3在t1时，s2_{t-1}是空，所以加法的第二项不存在。而且 x 是颠倒时序后的值。

s2 = Tanh(h2) \tag{4}

s = s1 \oplus s2 \tag{5}

公式5有几种实现方式，比如sum（矩阵求和）、concat（矩阵拼接）、mul（矩阵相乘）、ave（矩阵平均），我们在这里使用矩阵求和，这样在反向传播时的公式比较容易推导。

z = s \cdot V \tag{6}

a = Softmax(z) \tag{7}

公式4、5、6、7只在最后一个时间步发生。

代码实现

由于是双向的，所以在主过程中，存在一正一反两个计算链，1表示正向，2表示逆向，3表示输出时的计算。

class timestep(object):
    def forward_1(self, x1, U1, bU1, W1, prev_s1, isFirst):
        ...
    def forward_2(self, x2, U2, bU2, W2, prev_s2, isFirst):
        ...
    def forward_3(self, V, bV, isLast):
        ...

反向传播

正向循环的反向传播

先推导正向循环的反向传播公式，即关于h1、s1节点的计算。

对于最后一个时间步（即\tau）：

\frac{\partial Loss}{\partial z\tau} = \frac{\partial loss\tau}{\partial z\tau}=a\tau-y\tau \rightarrow dz\tau \tag{8}

对于其它时间步来说$dz_t=0$，因为不需要输出。

因为s=s1 + s2，所以\frac{\partial s}{\partial s1}=1，代入下面的公式中：

\begin{aligned} \frac{\partial Loss}{\partial h1\tau}=\frac{\partial loss\tau}{\partial h1\tau}=\frac{\partial loss\tau}{\partial z\tau}\frac{\partial z\tau}{\partial s\tau}\frac{\partial s\tau}{\partial s1\tau}\frac{\partial s1\tau}{\partial h1\tau} \ &=dz\tau \cdot V^T \odot \sigma’(s1\tau) \rightarrow dh1\tau \end{aligned} \tag{9}

其中，下标\tau表示最后一个时间步，\sigma’(s1)表示激活函数的导数，s1是激活函数的数值。下同。

比较公式9和通用循环神经网络模型中的公式9，形式上是完全相同的，原因是\frac{\partial s}{\partial s1}=1，并没有给我们带来任何额外的计算，所以关于其他时间步的推导也应该相同。

对于中间的所有时间步，除了本时间步的losst回传误差外，后一个时间步的h1{t+1}也会回传误差：

\begin{aligned} \frac{\partial Loss}{\partial h1t} = \frac{\partial loss_t}{\partial z_t}\frac{\partial z_t}{\partial s_t}\frac{\partial s_t}{\partial s1_t}\frac{\partial s1_t}{\partial h1_t} + \frac{\partial Loss}{\partial h1{t+1}}\frac{\partial h1{t+1}}{\partial s1{t}}\frac{\partial s1t}{\partial h1_t} \=dz_t \cdot V^T \odot \sigma’(s1_t) + \frac{\partial Loss}{\partial h1{t+1}} \cdot W1^T \odot \sigma’(s1t) \=(dz_t \cdot V^T + dh1{t+1} \cdot W1^T) \odot \sigma’(s1_t) \rightarrow dh1_t \end{aligned} \tag{10}

公式10中的dh1{t+1}，就是上一步中计算得到的dh1_t，如果追溯到最开始，即公式9中的dh1\tau。因此，先有最后一个时间步的dh1_\tau，然后依次向前推，就可以得到所有时间步的dh1_t。

对于V来说，只有当前时间步的损失函数会给它反向传播的误差，与别的时间步没有关系，所以有：

\frac{\partial loss_t}{\partial V_t} = \frac{\partial loss_t}{\partial z_t}\frac{\partial z_t}{\partial V_t}= s_t^T \cdot dz_t \rightarrow dV_t \tag{11}

对于U1，后面的时间步都会给它反向传播误差，但是我们只从h1节点考虑：

\frac{\partial Loss}{\partial U1_t} = \frac{\partial Loss}{\partial h1_t}\frac{\partial h1_t}{\partial U1_t}= x^T_t \cdot dh1_t \rightarrow dU1_t \tag{12}

对于W1，和U1的考虑是一样的，只从当前时间步的h1节点考虑：

\frac{\partial Loss}{\partial W1t} = \frac{\partial Loss}{\partial h1_t}\frac{\partial h1_t}{\partial W1_t}= s1{t-1}^T \cdot dh1_t \rightarrow dW1_t \tag{13}

对于第一个时间步，s1_{t-1}不存在，所以没有dW1：

dW1 = 0 \tag{14}

逆向循环的反向传播

逆向循环的反向传播和正向循环一模一样，只是把 1 变成 2 即可，比如公式13变成：

$$
\frac{\partial Loss}{\partial W2t} = \frac{\partial Loss}{\partial h2_t}\frac{\partial h2_t}{\partial W2_t}= s2{t-1}^T \cdot dh2_t \rightarrow dW2_t
$$

代码实现

单向循环神经网络的效果

为了与单向的循环神经网络比较，笔者实现了一个MNIST分类，超参如下：

    net_type = NetType.MultipleClassifier # 多分类
    output_type = OutputType.LastStep     # 只在最后一个时间步输出
    num_step = 28
    eta = 0.005                           # 学习率
    max_epoch = 100
    batch_size = 128
    num_input = 28
    num_hidden = 32                       # 隐层神经元32个
    num_output = 10

得到的训练结果如下：

...
99:42784:0.005000 loss=0.212298, acc=0.943200
99:42999:0.005000 loss=0.200447, acc=0.947200
save last parameters...
testing...
loss=0.186573, acc=0.948800
load best parameters...
testing...
loss=0.176821, acc=0.951700

最好的时间点的权重矩阵参数得到的准确率为95.17%，损失函数值为0.176821。

双向循环神经网络的效果

    eta = 0.01
    max_epoch = 100
    batch_size = 128
    num_step = 28
    num_input = 28
    num_hidden1 = 20          # 正向循环隐层神经元20个
    num_hidden2 = 20          # 逆向循环隐层神经元20个
    num_output = 10

得到的结果如图19-23所示。

双向循环神经网络 - 图3

下面是打印输出：

...
save best parameters...
98:42569:0.002000 loss=0.163360, acc=0.955200
99:42784:0.002000 loss=0.164529, acc=0.954200
99:42999:0.002000 loss=0.163679, acc=0.955200
save last parameters...
testing...
loss=0.144703, acc=0.958000
load best parameters...
testing...
loss=0.146799, acc=0.958000

最好的时间点的权重矩阵参数得到的准确率为95.59%，损失函数值为0.153259。

比较

表19-12 单向和双向循环神经网络的比较

	单向	双向
参数个数	2281	2060
准确率	95.17%	95.8%
损失函数值	0.176	0.144

keras实现

MNIST分类-单向循环神经网络

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

import matplotlib.pyplot as plt

from ExtendedDataReader.MnistImageDataReader import *

from keras.models import Sequential, load_model
from keras.layers import SimpleRNN, Dense, Reshape



def load_data():
    dataReader = MnistImageDataReader(mode="timestep")
    dataReader.ReadData()
    dataReader.NormalizeX()
    dataReader.NormalizeY(NetType.MultipleClassifier, base=0)
    dataReader.Shuffle()
    dataReader.GenerateValidationSet(k=12)
    x_train, y_train = dataReader.XTrain, dataReader.YTrain
    x_test, y_test = dataReader.XTest, dataReader.YTest
    x_val, y_val = dataReader.XDev, dataReader.YDev
    x_train = x_train.squeeze()
    x_test = x_test.squeeze()
    x_val = x_val.squeeze()

    x_test_raw = dataReader.XTestRaw[0:64]
    y_test_raw = dataReader.YTestRaw[0:64]

    return x_train, y_train, x_test, y_test, x_val, y_val, x_test_raw, y_test_raw


def build_model():
    model = Sequential()
    model.add(SimpleRNN(input_shape=(28,28),
                        units=8))
    model.add(Dense(10, activation='softmax'))
    model.compile(optimizer='Adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

#画出训练过程中训练和验证的精度与损失
def draw_train_history(history):
    plt.figure(1)

    # summarize history for accuracy
    plt.subplot(211)
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'])

    # summarize history for loss
    plt.subplot(212)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'])
    plt.show()


def show_result(x, y, y_raw):
    x = x / 255
    fig, ax = plt.subplots(nrows=8, ncols=8, figsize=(11, 11))
    for i in range(64):
        ax[i // 8, i % 8].imshow(x[i].transpose(1, 2, 0).squeeze())
        print(y[i, 0])
        print(y_raw[i, 0])
        if y[i, 0] == y_raw[i, 0]:
            ax[i // 8, i % 8].set_title(y[i, 0])
        else:
            ax[i // 8, i % 8].set_title(y[i, 0], fontdict={'color':'r'})
        ax[i // 8, i % 8].axis('off')
    # endfor
    plt.show()


if __name__ == '__main__':
    x_train, y_train, x_test, y_test, x_val, y_val, x_test_raw, y_test_raw = load_data()
    print(x_train.shape)
    print(y_train.shape)
    print(x_test.shape)
    print(y_test.shape)
    print(x_val.shape)
    print(y_train.shape)

    model = build_model()
    history = model.fit(x_train, y_train,
                        epochs=10,
                        batch_size=64,
                        validation_data=(x_val, y_val))
    model.save("Base_MNIST.h5")
    print(model.summary())
    draw_train_history(history)

    loss, accuracy = model.evaluate(x_test, y_test)
    print("test loss: {}, test accuracy: {}".format(loss, accuracy))

    z = model.predict(x_test[0:64])
    show_result(x_test_raw[0:64], np.argmax(z, axis=1).reshape(64, 1), y_test_raw[0:64])

模型输出

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
simple_rnn_1 (SimpleRNN)     (None, 8)                 296       
_________________________________________________________________
dense_1 (Dense)              (None, 10)                90        
=================================================================
Total params: 386
Trainable params: 386
Non-trainable params: 0
_________________________________________________________________

test loss: 1.0559768465042114, test accuracy: 0.6190000176429749

损失以及准确率曲线

双向循环神经网络 - 图4

模型分类结果

双向循环神经网络 - 图5

MNIST分类-双向循环神经网络

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

import matplotlib.pyplot as plt

from ExtendedDataReader.MnistImageDataReader import *

from keras.models import Sequential, load_model
from keras.layers import SimpleRNN, Dense, Bidirectional



def load_data():
    dataReader = MnistImageDataReader(mode="timestep")
    dataReader.ReadData()
    dataReader.NormalizeX()
    dataReader.NormalizeY(NetType.MultipleClassifier, base=0)
    dataReader.Shuffle()
    dataReader.GenerateValidationSet(k=12)
    x_train, y_train = dataReader.XTrain, dataReader.YTrain
    x_test, y_test = dataReader.XTest, dataReader.YTest
    x_val, y_val = dataReader.XDev, dataReader.YDev
    x_train = x_train.squeeze()
    x_test = x_test.squeeze()
    x_val = x_val.squeeze()

    x_test_raw = dataReader.XTestRaw[0:64]
    y_test_raw = dataReader.YTestRaw[0:64]

    return x_train, y_train, x_test, y_test, x_val, y_val, x_test_raw, y_test_raw


def build_model():
    model = Sequential()
    model.add(Bidirectional(SimpleRNN(units=8), input_shape=(28,28)))
    model.add(Dense(10, activation='softmax'))
    model.compile(optimizer='Adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

#画出训练过程中训练和验证的精度与损失
def draw_train_history(history):
    plt.figure(1)

    # summarize history for accuracy
    plt.subplot(211)
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'])

    # summarize history for loss
    plt.subplot(212)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'])
    plt.show()


def show_result(x, y, y_raw):
    x = x / 255
    fig, ax = plt.subplots(nrows=8, ncols=8, figsize=(11, 11))
    for i in range(64):
        ax[i // 8, i % 8].imshow(x[i].transpose(1, 2, 0).squeeze())
        print(y[i, 0])
        print(y_raw[i, 0])
        if y[i, 0] == y_raw[i, 0]:
            ax[i // 8, i % 8].set_title(y[i, 0])
        else:
            ax[i // 8, i % 8].set_title(y[i, 0], fontdict={'color':'r'})
        ax[i // 8, i % 8].axis('off')
    # endfor
    plt.show()


if __name__ == '__main__':
    x_train, y_train, x_test, y_test, x_val, y_val, x_test_raw, y_test_raw = load_data()
    print(x_train.shape)
    print(y_train.shape)
    print(x_test.shape)
    print(y_test.shape)
    print(x_val.shape)
    print(y_train.shape)

    model = build_model()
    history = model.fit(x_train, y_train,
                        epochs=10,
                        batch_size=64,
                        validation_data=(x_val, y_val))
    model.save("BiRNN_MNIST.h5")
    print(model.summary())
    draw_train_history(history)

    loss, accuracy = model.evaluate(x_test, y_test)
    print("test loss: {}, test accuracy: {}".format(loss, accuracy))

    z = model.predict(x_test[0:64])
    show_result(x_test_raw[0:64], np.argmax(z, axis=1).reshape(64, 1), y_test_raw[0:64])

模型输出

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bidirectional_1 (Bidirection (None, 16)                592       
_________________________________________________________________
dense_1 (Dense)              (None, 10)                170       
=================================================================
Total params: 762
Trainable params: 762
Non-trainable params: 0
_________________________________________________________________

test loss: 0.4622671669960022, test accuracy: 0.8648999929428101

损失以及准确率曲线

双向循环神经网络 - 图6

模型分类结果

双向循环神经网络 - 图7

代码位置

原代码位置：ch19, Level7

其中，Level7_Base_MNIST.py是单向的循环神经网络，Level7_BiRnn_MNIST.py是双向的循环神经网络。

个人代码：Base_MNIST**