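Basic techniques

The Jacobian matrix can be used in the chain rule.
Shape of the gradient: the gradient of the loss with respect to a matrix has the same shape as that matrix, so shapes line up directly during the gradient update.
Compute the gradient for a single element first, then collect the results into vectorized form.
Apply the chain rule one step at a time.
Differentiating softmax: handle the derivative for the correct class and for the incorrect classes as separate cases.
Common identities
Each identity can be derived the same way: compute the gradient with respect to a single element, then vectorize, adjusting dimensions as needed.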
Matrix-vector product, gradient with respect to the vector
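
As a sketch of the standard result, with the symbols $W$, $x$, and $z$ chosen here for illustration (they are assumptions, not notation confirmed by the text), the single-element-then-vectorize recipe gives:

```latex
z = Wx, \qquad W \in \mathbb{R}^{n \times m}, \quad x \in \mathbb{R}^{m}, \quad z \in \mathbb{R}^{n}

\left( \frac{\partial z}{\partial x} \right)_{ij}
  = \frac{\partial z_i}{\partial x_j}
  = \frac{\partial}{\partial x_j} \sum_{k=1}^{m} W_{ik} x_k
  = W_{ij}
\qquad \Longrightarrow \qquad
\frac{\partial z}{\partial x} = W
```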

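Gradient with respect to the matrix
For $z = Wx$, computing $\partial z / \partial W$ directly is complicated, because the result is a three-dimensional tensor. If the loss is a scalar $J$, we can instead compute $\partial J / \partial W$ directly, one element at a time. Writing $\delta = \partial J / \partial z$ as a column vector, this gives:
$\partial J / \partial W = \delta x^{\top}$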
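
The identity $\partial J / \partial W = \delta x^{\top}$ (with $\delta = \partial J / \partial z$ and $z = Wx$) can be sanity-checked numerically. The sketch below assumes a toy loss $J = \tfrac{1}{2}\lVert Wx \rVert^2$, chosen only because its $\delta$ is simply $z$; the loss and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)

def loss(W_):
    # Toy scalar loss (an assumption for this sketch): J = 0.5 * ||W x||^2
    z = W_ @ x
    return 0.5 * float(z @ z)

# For this loss, delta = dJ/dz = z
delta = W @ x
analytic = np.outer(delta, x)  # delta x^T, same shape as W

# Central differences, perturbing one element of W at a time
h = 1e-5
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        W_plus = W.copy()
        W_plus[i, j] += h
        W_minus = W.copy()
        W_minus[i, j] -= h
        numeric[i, j] = (loss(W_plus) - loss(W_minus)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(analytic.shape == W.shape, max_abs_diff)
```

The analytic gradient has the same shape as `W`, which is the shape convention stated under the basic techniques.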

Gradient of a vector with respect to itself
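
In the usual notation (the symbol $x \in \mathbb{R}^{n}$ is an assumption made for illustration), each component depends only on itself, so the Jacobian is the identity matrix:

```latex
\left( \frac{\partial x}{\partial x} \right)_{ij}
  = \frac{\partial x_i}{\partial x_j}
  = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}
\qquad \Longrightarrow \qquad
\frac{\partial x}{\partial x} = I
```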

Gradient of an elementwise function of a vector
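
A sketch of the standard derivation, assuming $z = f(x)$ where a scalar function $f$ is applied to each component (the symbols are assumptions made for illustration):

```latex
z_i = f(x_i)
\qquad \Longrightarrow \qquad
\frac{\partial z_i}{\partial x_j}
  = \begin{cases} f'(x_i) & i = j \\ 0 & i \neq j \end{cases}
\qquad \Longrightarrow \qquad
\frac{\partial z}{\partial x} = \operatorname{diag}\bigl( f'(x) \bigr)
```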

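Also note: multiplying by a diagonal matrix is equivalent to elementwise multiplication by the column vector formed from its diagonal entries.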
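
The equivalence between a diagonal-matrix product and an elementwise product can be seen in a few lines. This is a minimal sketch that assumes a sigmoid as the elementwise function; `upstream` is a hypothetical stand-in for the gradient arriving from the next step of the chain rule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-1.0, 0.0, 2.0])
upstream = np.array([0.5, -1.5, 3.0])  # hypothetical dJ/dz from the next step

# f'(x) for the sigmoid, computed per element
local = sigmoid(x) * (1.0 - sigmoid(x))

via_diag = np.diag(local) @ upstream  # explicit n x n Jacobian
via_elementwise = local * upstream    # same values, no n x n matrix needed

print(np.allclose(via_diag, via_elementwise))
```

In practice the elementwise form is usually preferred, since it avoids building the full diagonal Jacobian.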

Cross-entropy loss, gradient with respect to the logits
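
For a softmax output $\hat{y} = \mathrm{softmax}(\theta)$, a one-hot label $y$, and loss $J = -\sum_i y_i \log \hat{y}_i$, the standard result is $\partial J / \partial \theta = \hat{y} - y$. The symbols and the example values below are assumptions made for illustration; the sketch compares that formula with central differences.

```python
import numpy as np

def softmax(theta):
    shifted = theta - np.max(theta)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(theta, y):
    return float(-np.sum(y * np.log(softmax(theta))))

theta = np.array([2.0, -1.0, 0.5, 0.0])
y = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot label, correct class at index 2

analytic = softmax(theta) - y  # y_hat - y

# Central differences, one logit at a time
h = 1e-5
numeric = np.zeros_like(theta)
for i in range(theta.size):
    step = np.zeros_like(theta)
    step[i] = h
    numeric[i] = (cross_entropy(theta + step, y)
                  - cross_entropy(theta - step, y)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The two cases from the softmax rule of thumb are visible in `analytic`: the entry for the correct class is $\hat{y}_k - 1$ (negative), while every other entry is just $\hat{y}_i$ (positive).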


Gradient check: Numeric Gradient

Check formula

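The gradient is computed numerically and used to verify that a derived gradient formula is correct.
$\dfrac{\partial f}{\partial x_i} \approx \dfrac{f(x + h e_i) - f(x - h e_i)}{2h}$, where $e_i$ is the $i$-th unit vector and $h$ is a small step.
This has to be recomputed separately for every parameter; in the code below, that means two evaluations of f per element of x.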

Implementation

import numpy as np

def eval_numerical_gradient(f, x):
    """
    A naive implementation of the numerical gradient of f at x.
    - f should be a function that takes a single argument and returns a scalar
    - x is the point (a float numpy array) at which to evaluate the gradient;
      it is perturbed in place and restored before returning
    """
    grad = np.zeros(x.shape)
    h = 0.00001
    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h  # increment by h
        fxh_plus = f(x)  # evaluate f(x + h)
        x[ix] = old_value - h  # decrement by h
        fxh_minus = f(x)  # evaluate f(x - h)
        x[ix] = old_value  # restore to previous value (very important!)
        # compute the partial derivative with the central difference
        grad[ix] = (fxh_plus - fxh_minus) / (2 * h)  # the slope
        it.iternext()  # step to next dimension
    return grad
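
A gradient check typically compares the numerical result with the derived formula through a relative error. The sketch below is one way to do that and is not taken from the original notes: `numerical_gradient` is a compact stand-in for `eval_numerical_gradient` so the example runs on its own, and `relative_error`, the test function, and the input values are hypothetical choices.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Compact central-difference stand-in for eval_numerical_gradient
    grad = np.zeros_like(x)
    for ix in np.ndindex(x.shape):
        old_value = x[ix]
        x[ix] = old_value + h
        f_plus = f(x)
        x[ix] = old_value - h
        f_minus = f(x)
        x[ix] = old_value  # restore before moving to the next element
        grad[ix] = (f_plus - f_minus) / (2 * h)
    return grad

def relative_error(a, b, eps=1e-12):
    # Hypothetical helper: largest elementwise relative difference
    return float(np.max(np.abs(a - b) / np.maximum(eps, np.abs(a) + np.abs(b))))

def f(x):
    # Example function with a known gradient: f(x) = sum(x^3)
    return float(np.sum(x ** 3))

def analytic_grad(x):
    return 3.0 * x ** 2

x = np.array([[1.0, -2.0], [0.5, 3.0]])
x_before = x.copy()

num = numerical_gradient(f, x)
err = relative_error(num, analytic_grad(x))
print(err)
```

A small relative error suggests the derived formula is right; a large one usually points to a mistake in the derivation or its implementation. Because each element costs two evaluations of `f`, checks like this are normally run on small inputs only.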
