Basic Concepts of the GRU

The LSTM has many variants, among which the Gated Recurrent Unit (GRU) is the most common and currently one of the most popular. The GRU was proposed by Cho et al. in 2014 and simplifies the LSTM in several ways:

  1. The GRU reduces the LSTM's three gates to two: the reset gate $$r_t$$ and the update gate $$z_t$$.
  2. The GRU does not keep a separate cell state $$c_t$$; it keeps only the hidden state $$h_t$$ as the cell output, so its structure stays consistent with that of a vanilla RNN.
  3. The reset gate acts directly on the previous hidden state $$h_{t-1}$$.

Forward Computation of the GRU

GRU Cell Structure

Figure 20-7 shows the structure of a GRU cell.

[Figure 20-7: structure of the GRU cell]

The forward computation of a GRU cell is as follows (a small numpy sketch of these four equations appears after the list):

  1. Update gate

$$z_t = \sigma(h_{t-1} \cdot W_z + x_t \cdot U_z) \tag{1}$$

  2. Reset gate

$$r_t = \sigma(h_{t-1} \cdot W_r + x_t \cdot U_r) \tag{2}$$

  3. Candidate hidden state

$$\tilde{h}_t = \tanh((r_t \circ h_{t-1}) \cdot W_h + x_t \cdot U_h) \tag{3}$$

  4. Hidden state

$$h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \tag{4}$$
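To make the data flow concrete, here is a minimal numpy sketch of Eqs. (1)-(4) for a single time step. The function name, the row-vector convention, and the shapes are assumptions for illustration only; the actual implementation used in this chapter appears later in the Code Implementation section.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One GRU step following eqs. (1)-(4). Assumed shapes (row-vector convention):
# x_t: (1, input_size), h_prev: (1, hidden_size),
# W_*: (hidden_size, hidden_size), U_*: (input_size, hidden_size).
def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    z_t = sigmoid(h_prev @ W_z + x_t @ U_z)               # eq. (1): update gate
    r_t = sigmoid(h_prev @ W_r + x_t @ U_r)               # eq. (2): reset gate
    h_tilde = np.tanh((r_t * h_prev) @ W_h + x_t @ U_h)   # eq. (3): candidate state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde              # eq. (4): new hidden state
    return h_t
```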

How the GRU Works

From the formulas above we can see that the GRU uses the update gate and the reset gate to control how much of the long-term state is forgotten or retained, and how much of the current input is admitted. Both gates pass their pre-activations through the sigmoid function, which maps them into the interval [0, 1] and thereby implements the gating behavior.

First, the previous hidden state $$h_{t-1}$$ is filtered by the reset gate and combined with the current input to form the candidate state $$\tilde{h}_t$$, which the $$\tanh$$ function maps into the interval [-1, 1].

Then the update gate takes care of both forgetting and remembering. As the hidden-state formula shows, $$z_t$$ performs selective forgetting and remembering: $$(1-z_t)$$ and $$z_t$$ are coupled, so the more of the previous state that is forgotten, the more of the current information is remembered. This single gate thus plays the roles of both $$f_t$$ and $$i_t$$ in the LSTM, as the small numerical example below illustrates.
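A tiny numerical example of the interpolation in Eq. (4), with arbitrarily chosen values: where $$z_t$$ is close to 0 the old state survives, and where it is close to 1 the candidate state takes over.

```python
import numpy as np

h_prev  = np.array([0.9, -0.5, 0.2])    # previous hidden state h_{t-1}
h_tilde = np.array([0.1,  0.8, -0.6])   # candidate state from eq. (3)
z       = np.array([0.05, 0.5, 0.95])   # update gate: 0 -> keep old, 1 -> take new

h = (1 - z) * h_prev + z * h_tilde      # eq. (4)
print(h)                                # [ 0.86  0.15 -0.56]
```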

Backpropagation in the GRU

Having worked through the backpropagation derivation for the LSTM, the derivation for the GRU is comparatively simple. As before, we take the GRU cell at layer $l$ and time step $t$ as an example and derive the backward pass.

As with the LSTM, let the error arriving at the cell at layer $l$ and time step $t$ be the sum of the error passed back along the time axis, $$\delta_{t}^{l}$$, and the error passed down from the layer above, $$\delta_{x_t}^{l+1}$$; we abbreviate this sum as $$\delta_{t}$$.

Let:

$$z_{z_t} = h_{t-1} \cdot W_z + x_t \cdot U_z \tag{5}$$

$$z_{r_t} = h_{t-1} \cdot W_r + x_t \cdot U_r \tag{6}$$

$$z_{\tilde{h}_t} = (r_t \circ h_{t-1}) \cdot W_h + x_t \cdot U_h \tag{7}$$

Then:

$$
\begin{aligned}
\delta_{z_{z_t}} &= \frac{\partial{loss}}{\partial{h_t}} \cdot \frac{\partial{h_t}}{\partial{z_t}} \cdot \frac{\partial{z_t}}{\partial{z_{z_t}}} \\
&= \delta_t \cdot (-diag[h_{t-1}] + diag[\tilde{h}_t]) \cdot diag[z_t \circ (1-z_t)] \\
&= \delta_t \circ (\tilde{h}_t - h_{t-1}) \circ z_t \circ (1-z_t)
\end{aligned} \tag{8}
$$

$$
\begin{aligned}
\delta_{z_{\tilde{h}_t}} &= \frac{\partial{loss}}{\partial{h_t}} \cdot \frac{\partial{h_t}}{\partial{\tilde{h}_t}} \cdot \frac{\partial{\tilde{h}_t}}{\partial{z_{\tilde{h}_t}}} \\
&= \delta_t \cdot diag[z_t] \cdot diag[1-(\tilde{h}_t)^2] \\
&= \delta_t \circ z_t \circ (1-(\tilde{h}_t)^2)
\end{aligned} \tag{9}
$$

$$
\begin{aligned}
\delta_{z_{r_t}} &= \frac{\partial{loss}}{\partial{\tilde{h}_t}} \cdot \frac{\partial{\tilde{h}_t}}{\partial{z_{\tilde{h}_t}}} \cdot \frac{\partial{z_{\tilde{h}_t}}}{\partial{r_t}} \cdot \frac{\partial{r_t}}{\partial{z_{r_t}}} \\
&= \delta_{z_{\tilde{h}_t}} \cdot W_h^T \cdot diag[h_{t-1}] \cdot diag[r_t \circ (1-r_t)] \\
&= \delta_{z_{\tilde{h}_t}} \cdot W_h^T \circ h_{t-1} \circ r_t \circ (1-r_t)
\end{aligned} \tag{10}
$$

From these, the gradient of each learnable parameter at time step $t$ can be obtained:

$$dW_{h,t} = \frac{\partial{loss}}{\partial{z_{\tilde{h}_t}}} \cdot \frac{\partial{z_{\tilde{h}_t}}}{\partial{W_h}} = (r_t \circ h_{t-1})^T \cdot \delta_{z_{\tilde{h}_t}} \tag{11}$$

$$dU_{h,t} = \frac{\partial{loss}}{\partial{z_{\tilde{h}_t}}} \cdot \frac{\partial{z_{\tilde{h}_t}}}{\partial{U_h}} = x_t^T \cdot \delta_{z_{\tilde{h}_t}} \tag{12}$$

$$dW_{r,t} = \frac{\partial{loss}}{\partial{z_{r_t}}} \cdot \frac{\partial{z_{r_t}}}{\partial{W_r}} = h_{t-1}^T \cdot \delta_{z_{r_t}} \tag{13}$$

$$dU_{r,t} = \frac{\partial{loss}}{\partial{z_{r_t}}} \cdot \frac{\partial{z_{r_t}}}{\partial{U_r}} = x_t^T \cdot \delta_{z_{r_t}} \tag{14}$$

$$dW_{z,t} = \frac{\partial{loss}}{\partial{z_{z_t}}} \cdot \frac{\partial{z_{z_t}}}{\partial{W_z}} = h_{t-1}^T \cdot \delta_{z_{z_t}} \tag{15}$$

$$dU_{z,t} = \frac{\partial{loss}}{\partial{z_{z_t}}} \cdot \frac{\partial{z_{z_t}}}{\partial{U_z}} = x_t^T \cdot \delta_{z_{z_t}} \tag{16}$$

The final gradient of each learnable parameter is the sum of its per-step gradients over all time steps (a short code sketch of this accumulation follows the equations):

$$dW_h = \sum_{t=1}^{\tau} dW_{h,t} = \sum_{t=1}^{\tau} (r_t \circ h_{t-1})^T \cdot \delta_{z_{\tilde{h}_t}} \tag{17}$$

$$dU_h = \sum_{t=1}^{\tau} dU_{h,t} = \sum_{t=1}^{\tau} x_t^T \cdot \delta_{z_{\tilde{h}_t}} \tag{18}$$

$$dW_r = \sum_{t=1}^{\tau} dW_{r,t} = \sum_{t=1}^{\tau} h_{t-1}^T \cdot \delta_{z_{r_t}} \tag{19}$$

$$dU_r = \sum_{t=1}^{\tau} dU_{r,t} = \sum_{t=1}^{\tau} x_t^T \cdot \delta_{z_{r_t}} \tag{20}$$

$$dW_z = \sum_{t=1}^{\tau} dW_{z,t} = \sum_{t=1}^{\tau} h_{t-1}^T \cdot \delta_{z_{z_t}} \tag{21}$$

$$dU_z = \sum_{t=1}^{\tau} dU_{z,t} = \sum_{t=1}^{\tau} x_t^T \cdot \delta_{z_{z_t}} \tag{22}$$
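The summation in Eqs. (17)-(22) is simply an accumulation of the per-step gradients from Eqs. (11)-(16). The sketch below uses random stand-in tensors (all names hypothetical) just to show the shapes and the sum over time for Eq. (17).

```python
import numpy as np

tau, hidden_size = 4, 3
dW_h = np.zeros((hidden_size, hidden_size))
for t in range(tau):
    r_t      = np.random.rand(1, hidden_size)      # reset gate r_t at step t
    h_prev   = np.random.randn(1, hidden_size)     # h_{t-1} at step t
    delta_zn = np.random.randn(1, hidden_size)     # delta_{z_ht} from eq. (9)
    dW_h += np.dot((r_t * h_prev).T, delta_zn)     # per-step eq. (11), summed as in eq. (17)
print(dW_h.shape)                                  # (hidden_size, hidden_size), same shape as W_h
```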

The current GRU cell passes error back to the previous time step ($t-1$) and down to the previous layer ($l-1$), as follows:

Passing error back through time:

$$
\begin{aligned}
\delta_{h_{t-1}} &= \frac{\partial{loss}}{\partial{h_{t-1}}} \\
&= \frac{\partial{loss}}{\partial{h_t}} \cdot \frac{\partial{h_t}}{\partial{h_{t-1}}} + \frac{\partial{loss}}{\partial{z_{\tilde{h}_t}}} \cdot \frac{\partial{z_{\tilde{h}_t}}}{\partial{h_{t-1}}} + \frac{\partial{loss}}{\partial{z_{r_t}}} \cdot \frac{\partial{z_{r_t}}}{\partial{h_{t-1}}} + \frac{\partial{loss}}{\partial{z_{z_t}}} \cdot \frac{\partial{z_{z_t}}}{\partial{h_{t-1}}} \\
&= \delta_{t} \circ (1-z_t) + \delta_{z_{\tilde{h}_t}} \cdot W_h^T \circ r_t + \delta_{z_{r_t}} \cdot W_r^T + \delta_{z_{z_t}} \cdot W_z^T
\end{aligned} \tag{23}
$$

Passing error down through the layers:

$$
\begin{aligned}
\delta_{x_t} &= \frac{\partial{loss}}{\partial{x_t}} \\
&= \frac{\partial{loss}}{\partial{z_{\tilde{h}_t}}} \cdot \frac{\partial{z_{\tilde{h}_t}}}{\partial{x_t}} + \frac{\partial{loss}}{\partial{z_{r_t}}} \cdot \frac{\partial{z_{r_t}}}{\partial{x_t}} + \frac{\partial{loss}}{\partial{z_{z_t}}} \cdot \frac{\partial{z_{z_t}}}{\partial{x_t}} \\
&= \delta_{z_{\tilde{h}_t}} \cdot U_h^T + \delta_{z_{r_t}} \cdot U_r^T + \delta_{z_{z_t}} \cdot U_z^T
\end{aligned} \tag{24}
$$

This completes the derivation of backpropagation for the GRU.
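As a quick sanity check on Eqs. (8)-(10) and (23), the following self-contained numpy sketch compares the analytic gradient of a one-step GRU with a finite-difference estimate. All names here are hypothetical, and the "loss" is simply the sum of $$h_t$$ so that $$\partial loss / \partial h_t$$ is a vector of ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One GRU step; the scalar "loss" is sum(h_t).
def step_loss(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(h_prev @ Wz + x @ Uz)
    r = sigmoid(h_prev @ Wr + x @ Ur)
    n = np.tanh((r * h_prev) @ Wh + x @ Uh)
    h = (1 - z) * h_prev + z * n
    return h.sum(), (z, r, n)

n_in, n_hid = 2, 3
x, h_prev = rng.standard_normal((1, n_in)), rng.standard_normal((1, n_hid))
Wz, Wr, Wh = (rng.standard_normal((n_hid, n_hid)) for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((n_in, n_hid)) for _ in range(3))

_, (z, r, n) = step_loss(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh)
delta = np.ones((1, n_hid))                               # dloss/dh_t

# analytic gradients from eqs. (8)-(10) and (23)
dzz = delta * (n - h_prev) * z * (1 - z)
dzn = delta * z * (1 - n * n)
dzr = (dzn @ Wh.T) * h_prev * r * (1 - r)
dh_prev = delta * (1 - z) + (dzn @ Wh.T) * r + dzr @ Wr.T + dzz @ Wz.T

# finite-difference estimate of dloss/dh_{t-1}
eps, num = 1e-6, np.zeros_like(h_prev)
for i in range(n_hid):
    hp, hm = h_prev.copy(), h_prev.copy()
    hp[0, i] += eps
    hm[0, i] -= eps
    num[0, i] = (step_loss(x, hp, Wz, Uz, Wr, Ur, Wh, Uh)[0]
                 - step_loss(x, hm, Wz, Uz, Wr, Ur, Wh, Uh)[0]) / (2 * eps)

print(np.max(np.abs(dh_prev - num)))   # expected to be tiny (~1e-9), matching eq. (23)
```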

Code Implementation

This section implements the forward computation and backpropagation of a GRU cell. For simplicity and consistency, the test case is again binary subtraction.

Initialization

This example implements a GRU cell without biases, so only the input dimension and the hidden-layer dimension need to be initialized.

```python
def __init__(self, input_size, hidden_size):
    self.input_size = input_size
    self.hidden_size = hidden_size
```

Forward Computation

```python
def forward(self, x, h_p, W, U):
    self.get_params(W, U)
    self.x = x
    self.z = Sigmoid().forward(np.dot(h_p, self.wz) + np.dot(x, self.uz))  # update gate, eq. (1)
    self.r = Sigmoid().forward(np.dot(h_p, self.wr) + np.dot(x, self.ur))  # reset gate, eq. (2)
    self.n = Tanh().forward(np.dot((self.r * h_p), self.wn) + np.dot(x, self.un))  # candidate state, eq. (3)
    self.h = (1 - self.z) * h_p + self.z * self.n                          # hidden state, eq. (4)

def split_params(self, w, size):
    s = []
    for i in range(3):
        s.append(w[(i*size):((i+1)*size)])
    return s[0], s[1], s[2]

# Get shared parameters, and split them to fit 3 gates, in the order of
# z, r, \tilde{h} (n stands for \tilde{h} in code)
def get_params(self, W, U):
    self.wz, self.wr, self.wn = self.split_params(W, self.hidden_size)
    self.uz, self.ur, self.un = self.split_params(U, self.input_size)
```
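A short usage sketch, assuming the cell class above is called `GRUCell` and that the `Sigmoid` and `Tanh` activations from MiniFramework are importable; it mainly shows the stacked parameter shapes that `split_params` expects.

```python
import numpy as np

input_size, hidden_size = 2, 4
cell = GRUCell(input_size, hidden_size)          # hypothetical name for the cell class above

# W stacks wz, wr, wn along axis 0; U stacks uz, ur, un in the same order
W = np.random.randn(3 * hidden_size, hidden_size) * 0.1
U = np.random.randn(3 * input_size, hidden_size) * 0.1

x   = np.random.randn(1, input_size)             # one time step of input
h_p = np.zeros((1, hidden_size))                 # initial hidden state
cell.forward(x, h_p, W, U)
print(cell.h.shape)                              # (1, hidden_size)
```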

Backpropagation

```python
def backward(self, h_p, in_grad):
    self.dzz = in_grad * (self.n - h_p) * self.z * (1 - self.z)            # eq. (8)
    self.dzn = in_grad * self.z * (1 - self.n * self.n)                    # eq. (9)
    self.dzr = np.dot(self.dzn, self.wn.T) * h_p * self.r * (1 - self.r)   # eq. (10)
    self.dwn = np.dot((self.r * h_p).T, self.dzn)   # eq. (11)
    self.dun = np.dot(self.x.T, self.dzn)           # eq. (12)
    self.dwr = np.dot(h_p.T, self.dzr)              # eq. (13)
    self.dur = np.dot(self.x.T, self.dzr)           # eq. (14)
    self.dwz = np.dot(h_p.T, self.dzz)              # eq. (15)
    self.duz = np.dot(self.x.T, self.dzz)           # eq. (16)
    self.merge_params()
    # pass to previous time step, eq. (23)
    self.dh = (in_grad * (1 - self.z)
               + np.dot(self.dzn, self.wn.T) * self.r
               + np.dot(self.dzr, self.wr.T)
               + np.dot(self.dzz, self.wz.T))
    # pass to previous layer, eq. (24)
    self.dx = np.dot(self.dzn, self.un.T) + np.dot(self.dzr, self.ur.T) + np.dot(self.dzz, self.uz.T)
```

Finally, all of the split gradients are merged back together so that the parameter update can be applied in one step.

```python
def merge_params(self):
    self.dW = np.concatenate((self.dwz, self.dwr, self.dwn), axis=0)
    self.dU = np.concatenate((self.duz, self.dur, self.dun), axis=0)
```

Final Results

Figure 20-8 shows the training process and how the loss and accuracy curves evolve.

[Figure 20-8: training process with loss and accuracy curves]

The model reaches 100% accuracy on the validation set, which verifies the correctness of the network.

```
  x1: [1, 1, 1, 0]
- x2: [1, 0, 0, 0]
------------------
true: [0, 1, 1, 0]
pred: [0, 1, 1, 0]
14 - 8 = 6
====================
  x1: [1, 1, 0, 0]
- x2: [0, 0, 0, 0]
------------------
true: [1, 1, 0, 0]
pred: [1, 1, 0, 0]
12 - 0 = 12
====================
  x1: [1, 0, 1, 0]
- x2: [0, 0, 0, 1]
------------------
true: [1, 0, 0, 1]
pred: [1, 0, 0, 1]
10 - 1 = 9
```
Keras Implementation

```python
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

import numpy as np
import matplotlib.pyplot as plt
from MiniFramework.DataReader_2_0 import *
from keras.models import Sequential
from keras.layers import GRU, Dense

train_file = "../data/ch19.train_minus.npz"
test_file = "../data/ch19.test_minus.npz"

def load_data():
    dataReader = DataReader_2_0(train_file, test_file)
    dataReader.ReadData()
    dataReader.Shuffle()
    dataReader.GenerateValidationSet(k=10)
    x_train, y_train = dataReader.XTrain, dataReader.YTrain
    x_test, y_test = dataReader.XTest, dataReader.YTest
    x_val, y_val = dataReader.XDev, dataReader.YDev
    return x_train, y_train, x_test, y_test, x_val, y_val

def build_model():
    model = Sequential()
    model.add(GRU(input_shape=(4, 2), units=4))
    model.add(Dense(4, activation='sigmoid'))
    model.compile(optimizer='Adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# plot training and validation accuracy and loss over the course of training
def draw_train_history(history):
    plt.figure(1)
    # summarize history for accuracy
    plt.subplot(211)
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'])
    # summarize history for loss
    plt.subplot(212)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'])
    plt.show()

def test(x_test, y_test, model):
    print("testing...")
    count = x_test.shape[0]
    result = model.predict(x_test)
    r = np.random.randint(0, count, 10)
    for i in range(10):
        idx = r[i]
        x1 = x_test[idx, :, 0]
        x2 = x_test[idx, :, 1]
        print(" x1:", reverse(x1))
        print("- x2:", reverse(x2))
        print("------------------")
        print("true:", reverse(y_test[idx]))
        print("pred:", reverse(result[idx]))
        x1_dec = int("".join(map(str, reverse(x1))), 2)
        x2_dec = int("".join(map(str, reverse(x2))), 2)
        print("{0} - {1} = {2}".format(x1_dec, x2_dec, (x1_dec - x2_dec)))
        print("====================")
    # end for

def reverse(a):
    l = a.tolist()
    l.reverse()
    return l

if __name__ == '__main__':
    x_train, y_train, x_test, y_test, x_val, y_val = load_data()
    print(x_train.shape)
    print(y_train.shape)
    print(x_test.shape)
    print(x_val.shape)
    model = build_model()
    history = model.fit(x_train, y_train,
                        epochs=200,
                        batch_size=64,
                        validation_data=(x_val, y_val))
    print(model.summary())
    draw_train_history(history)
    loss, accuracy = model.evaluate(x_test, y_test)
    print("test loss: {}, test accuracy: {}".format(loss, accuracy))
    test(x_test, y_test, model)
```

Model Output

test loss: 0.6068302603328929, test accuracy: 0.623161792755127

Loss and Accuracy Curves

[Figure: loss and accuracy curves of the Keras model]

Code Location

Original code location: ch20, Level2

Personal code: GRU_BinaryNumberMinus