I. Custom Operations

This note covers ways of extending torch.autograd and torch.nn, and of writing custom C operations and the extensions that make use of them.

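1.1 Extending torch.autograd

Adding an operation to autograd requires implementing a new Function subclass for that operation. autograd calls Functions to compute both results and gradients, so every new operation must implement the following two methods: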

1.1.1 forward()

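The code that performs the operation. It can take as many arguments as you want, and some of them can be optional if you give them default values. Python objects of any type are accepted here. Tensor arguments that track history (i.e., with requires_grad=True) are converted to ones that do not track history before the call, and their use is registered in the graph. Note that this logic does not traverse lists, dicts, or any other data structures; it only considers Tensors that are direct arguments to the call. You can return a single Tensor output, or a tuple of Tensors if there are multiple outputs. Also, refer to the documentation of Function for descriptions of useful methods that can be called only from forward().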

1.1.2 backward()

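The gradient formula. It is given as many Tensor arguments as there were outputs, each representing the gradient w.r.t. that output. It should return as many Tensors as there were inputs, each containing the gradient w.r.t. its corresponding input. If an input did not require a gradient (needs_input_grad is a tuple of booleans indicating whether each input needs gradient computation), or was a non-Tensor object, you can return None for it. Also, if forward() has optional arguments, you can return more gradients than there were inputs, as long as the extra ones are all None.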
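As a small sketch of this contract (the TwoOutputs name and the constants 2 and 3 are made up for illustration), the function below has one input and two outputs, so backward receives two gradients and returns one:

```python
import torch
from torch.autograd import Function


class TwoOutputs(Function):
    """Hypothetical example: one input, two outputs (2*x and 3*x)."""

    @staticmethod
    def forward(ctx, x):
        # Multiple outputs are returned as a tuple of tensors.
        return 2 * x, 3 * x

    @staticmethod
    def backward(ctx, grad_a, grad_b):
        # One gradient argument per output; one return value per input.
        # d(2x)/dx = 2 and d(3x)/dx = 3, so the two contributions add up.
        return 2 * grad_a + 3 * grad_b


x = torch.tensor([1.0, 2.0], requires_grad=True)
a, b = TwoOutputs.apply(x)
(a.sum() + b.sum()).backward()
print(x.grad)  # each entry is 2 + 3 = 5
```

If only one of the two outputs is used downstream, the gradient for the other one arrives as a tensor of zeros under the default settings, so the same backward still works.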

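It is the user's responsibility to use the ctx methods in forward() correctly, so that the new Function works properly with the autograd engine: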

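save_for_backward(): must be used to save any input or output tensors of forward that will be needed later in backward.
mark_dirty(): must be used to mark any input that is modified in place by forward.
mark_non_differentiable(): must be used to tell the engine that an output is not differentiable.
set_materialize_grads(): can be used to tell the autograd engine to optimize gradient computation in cases where an output does not depend on the input, by not materializing the grad tensors passed to backward. That is, if set to False, a None object in Python, or an "undefined tensor" in C++ (a tensor x for which x.defined() is False), is not converted to a tensor filled with zeros before backward is called. Supporting this optimization means your custom autograd function has to handle gradients represented this way, which is why it is opt-in. The default is True.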

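By default, all output tensors of a differentiable type are set to require gradients and get all autograd metadata attached. If you do not want them to require gradients, use the mark_non_differentiable() method mentioned above. Output tensors of a non-differentiable type (integer types, for example) are not marked as requiring gradients.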
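To make the ctx methods above more concrete, here is a minimal sketch. The PositivePart function is hypothetical and not part of PyTorch; it returns a floating-point mask as a second output and marks it as non-differentiable, while saving it for backward through save_for_backward():

```python
import torch
from torch.autograd import Function


class PositivePart(Function):
    """Hypothetical example: returns x * mask together with the mask itself."""

    @staticmethod
    def forward(ctx, x):
        # Floating-point output, so it would require grad by default.
        mask = (x > 0).to(x.dtype)
        # The mask is piecewise constant: tell the engine not to track it.
        ctx.mark_non_differentiable(mask)
        # Tensors needed in backward go through save_for_backward.
        ctx.save_for_backward(mask)
        return x * mask, mask

    @staticmethod
    def backward(ctx, grad_out, grad_mask):
        (mask,) = ctx.saved_tensors
        # grad_mask is ignored because the mask was marked non-differentiable.
        return grad_out * mask


x = torch.tensor([-1.0, 2.0, 3.0], requires_grad=True)
y, mask = PositivePart.apply(x)
print(y.requires_grad, mask.requires_grad)
y.sum().backward()
print(x.grad)
```

Without the mark_non_differentiable() call, the mask would be returned with requires_grad set, even though no meaningful gradient flows through it.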

1.1.3 Example 1

import torch
from torch.autograd import Function

# Inherit from Function
class LinearFunction(Function):
    # Note that both forward and backward are @staticmethods
    @staticmethod
    def forward(ctx, input, weight, bias=None):  # bias is optional because it has a default value
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so backward gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # A very convenient pattern: at the top of backward, unpack saved_tensors
        # and initialize all gradients w.r.t. the inputs to None.
        # Because additional trailing Nones are ignored, the return statement
        # stays simple even when the function has optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None
        # The needs_input_grad checks below are optional and only improve efficiency.
        # You can skip them to keep the code simpler: returning a gradient for
        # an input that does not require one is not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)
        return grad_input, grad_weight, grad_bias

The operation defined above is now ready to use. For convenience, alias its apply method:

linear = LinearFunction.apply

1.1.4 Example 2

Next, an example of a function that takes a non-Tensor argument:

class MulConstant(Function):
    @staticmethod
    def forward(ctx, tensor, constant):
        # ctx can be used to stash information needed by the backward computation
        ctx.constant = constant
        return tensor * constant
    @staticmethod
    def backward(ctx, grad_output):
        # Return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None

Next, we optimize the example above by calling set_materialize_grads(False):

class MulConstant(Function):
    @staticmethod
    def forward(ctx, tensor, constant):
        ctx.set_materialize_grads(False)
        ctx.constant = constant
        return tensor * constant
    @staticmethod
    def backward(ctx, grad_output):
        # Here we must handle a grad_output that is None. In that case we can
        # skip the unnecessary computation and simply return None.
        if grad_output is None:
            return None, None
        # backward must return as many gradients as forward had inputs,
        # which is why two values are returned above as well.
        return grad_output * ctx.constant, None

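Note: the inputs to backward, i.e. grad_output, can also be tensors that track history. So if backward is implemented with differentiable operations (for example, by invoking another custom Function), higher-order derivatives will work. In that case, tensors saved with save_for_backward() can also be used in backward and have gradients flowing back through them, whereas tensors saved directly on ctx will not. If you need gradients to flow back for a tensor saved on ctx, make it an output of the custom Function and save it with save_for_backward().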
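A minimal sketch of this behaviour, assuming a hypothetical Cube function whose backward uses only differentiable tensor operations, so that the second derivative can be obtained by differentiating the backward pass:

```python
import torch
from torch.autograd import Function


class Cube(Function):
    """Hypothetical example: y = x ** 3 with a differentiable backward."""

    @staticmethod
    def forward(ctx, x):
        # Saved with save_for_backward, so gradients can flow back through x.
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Built only from differentiable ops, so it can be differentiated again.
        return grad_output * 3 * x ** 2


x = torch.tensor(3.0, requires_grad=True)
y = Cube.apply(x)
# create_graph=True records the backward pass so it can itself be differentiated.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)
print(dy_dx.item(), d2y_dx2.item())  # 3*x**2 = 27 and 6*x = 18 at x = 3
```

Had x been stored as a plain attribute (ctx.x = x) instead, the first derivative would still be computed, but no gradient would flow back through x in the second pass.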

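1.2 Checking gradients

You will probably want to check that the backward method you implemented really computes the derivatives of your function. This can be done by comparing it against numerical approximations based on small finite differences: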

import torch
from torch.autograd import Function, gradcheck

# Inherit from Function
class LinearFunction(Function):
    # Note that both forward and backward are @staticmethods
    @staticmethod
    def forward(ctx, input, weight, bias=None):  # bias is optional because it has a default value
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so backward gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # A very convenient pattern: at the top of backward, unpack saved_tensors
        # and initialize all gradients w.r.t. the inputs to None.
        # Because additional trailing Nones are ignored, the return statement
        # stays simple even when the function has optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None
        # The needs_input_grad checks below are optional and only improve efficiency.
        # You can skip them to keep the code simpler: returning a gradient for
        # an input that does not require one is not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)
        return grad_input, grad_weight, grad_bias

# gradcheck takes a tuple of tensors as input, checks whether the gradients
# computed with these tensors are close enough to the numerical approximations,
# and returns True if they all satisfy this condition.
linear = LinearFunction.apply
input = (torch.randn(20, 20, dtype=torch.double, requires_grad=True),
         torch.randn(30, 20, dtype=torch.double, requires_grad=True))
test = gradcheck(linear, input, eps=1e-6, atol=1e-4)
print(test)
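The idea behind such a check can be sketched by hand. The helper below (numerical_grad is an illustrative name, and this is not how gradcheck is actually implemented) estimates each partial derivative with a central difference and compares the result with the gradient autograd computes:

```python
import torch


def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    grad = torch.zeros_like(x)
    flat_x = x.detach().clone().view(-1)
    flat_grad = grad.view(-1)  # a view, so writes below also update grad
    for i in range(flat_x.numel()):
        orig = flat_x[i].item()
        flat_x[i] = orig + eps
        f_plus = f(flat_x.view(x.shape)).item()
        flat_x[i] = orig - eps
        f_minus = f(flat_x.view(x.shape)).item()
        flat_x[i] = orig  # restore the perturbed entry
        flat_grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad


def f(t):
    return (t ** 3).sum()


# Double precision keeps the rounding error of the difference quotient small.
x = torch.tensor([0.5, -1.0, 2.0], dtype=torch.double, requires_grad=True)
(analytic,) = torch.autograd.grad(f(x), x)
numeric = numerical_grad(f, x)
print(torch.allclose(analytic, numeric, atol=1e-6))
```

This is also why the gradcheck inputs above use dtype=torch.double: in single precision the finite-difference estimate is usually too noisy for a tight tolerance.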

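For more details on finite-difference gradient comparisons, see Numerical gradient checking. If your function is used in higher-order derivatives (differentiating the backward pass), you can use the gradgradcheck function from the same package to check the higher-order derivatives.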
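A short sketch of checking both orders, using a hypothetical Sin function whose backward is written with differentiable operations (the eps and atol values simply mirror the gradcheck call above):

```python
import torch
from torch.autograd import Function, gradcheck, gradgradcheck


class Sin(Function):
    """Hypothetical example with a backward that supports double backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sin(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # torch.cos is differentiable, so this backward can be differentiated too.
        return grad_output * torch.cos(x)


x = torch.randn(5, dtype=torch.double, requires_grad=True)
first_ok = gradcheck(Sin.apply, (x,), eps=1e-6, atol=1e-4)
second_ok = gradgradcheck(Sin.apply, (x,), eps=1e-6, atol=1e-4)
print(first_ok, second_ok)
```

By default both functions raise an exception when a check fails, so reaching the print statement already means both checks passed.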

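II. Extending torch.nn

nn exposes two kinds of interfaces: modules and their functional versions. You can extend it in both ways, but we recommend using modules for all kinds of layers that hold parameters or buffers, and the functional form for parameter-less operations such as activation functions and pooling.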
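The difference between the two interfaces can be seen in a small sketch with built-in layers (the input values are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[-1.0, 0.5, 2.0]])

# A layer with state is a module: it owns its weight and bias.
layer = nn.Linear(3, 2)
n_layer_params = len(list(layer.parameters()))

# A parameter-less operation works equally well as a plain function.
relu_out = F.relu(x)
n_relu_params = len(list(nn.ReLU().parameters()))
same = torch.equal(relu_out, nn.ReLU()(x))

print(n_layer_params, n_relu_params, same)
```

The module form of ReLU exists mainly so that it can be placed inside containers; it holds no state of its own.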

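Adding a functional version of an operation is already fully covered in the section above.

2.1 Adding a Module

Since nn relies heavily on autograd, adding a new Module requires implementing a Function that performs the operation and can compute its gradient. From here on, assume that we want to implement a Linear module and that the function is already implemented as in the listing above. Very little code is needed. Two methods have to be implemented: __init__(), which sets up the parameters and buffers, and forward(), which calls the Function.

Here is the implementation of Linear: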

import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, input_features, output_features, bias=True):
        super(Linear, self).__init__()
        self.input_features = input_features
        self.output_features = output_features
        # nn.Parameter is a special kind of Tensor that is automatically registered
        # as a module parameter once it is assigned as an attribute.
        # Parameters and buffers need to be registered, otherwise they will not
        # appear in .parameters() (parameters only) and will not be converted
        # when e.g. .cuda() is called. Use .register_buffer() to register buffers.
        # nn.Parameters require gradients by default.
        self.weight = nn.Parameter(torch.empty(output_features, input_features))
        if bias:
            self.bias = nn.Parameter(torch.empty(output_features))
        else:
            # Always register every possible parameter; optional ones can be None
            self.register_parameter('bias', None)
        # Not a very smart way to initialize the weights
        self.weight.data.uniform_(-0.1, 0.1)
        if self.bias is not None:
            self.bias.data.uniform_(-0.1, 0.1)

    def forward(self, input):
        # LinearFunction is the Function from Example 1; see the autograd
        # section for an explanation of what happens here
        return LinearFunction.apply(input, self.weight, self.bias)

    def extra_repr(self):
        # (Optional) extra information about this module; it is shown when
        # an instance of the class is printed
        return 'input_features={}, output_features={}, bias={}'.format(
            self.input_features, self.output_features, self.bias is not None
        )
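The registration rules mentioned in the comments can be checked with a small sketch. The Scaler module is hypothetical and exists only to contrast a parameter, a registered buffer, and an ordinary tensor attribute:

```python
import torch
import torch.nn as nn


class Scaler(nn.Module):
    """Hypothetical module contrasting three kinds of tensor attributes."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(3))        # registered parameter
        self.register_buffer("offset", torch.zeros(3))  # registered buffer
        self.plain = torch.zeros(3)                     # ordinary attribute

    def forward(self, x):
        return x * self.scale + self.offset


m = Scaler().double()  # converts registered parameters and buffers only
param_names = [name for name, _ in m.named_parameters()]
state_keys = sorted(m.state_dict().keys())
print(param_names, state_keys)
print(m.scale.dtype, m.offset.dtype, m.plain.dtype)
```

Only scale shows up in the parameters, both scale and offset are stored in the state dict, and the unregistered plain tensor keeps its original dtype after the conversion. The same applies to device moves such as .cuda().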
