This article is based on PyTorch 1.7.0 (https://github.com/pytorch/pytorch/tree/v1.7.0). If anything here is unclear or incorrect, please point it out in the comments.
The PyTorch autograd notes make an important point: "An important thing to note is that the graph is recreated from scratch at every iteration, and this is exactly what allows for using arbitrary Python control flow statements, that can change the overall shape and size of the graph at every iteration. You don't have to encode all possible paths before you launch the training - what you run is what you differentiate."
The dynamic computation graph needed for backpropagation is built during the forward pass; it is called dynamic precisely because the graph is rebuilt from scratch on every iteration, i.e., on every forward pass.
This article walks through how that dynamic graph is constructed.
import torch
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)
c = torch.add(a, b)   # c = a + b = 3
d = torch.mul(a, c)   # d = a * c = a * (a + b)
d.backward()          # dd/da = c + a = 4, dd/db = a = 1
print(f"a grad:{a.grad} grad_fn:{a.grad_fn}")
print(f"b grad:{b.grad} grad_fn:{b.grad_fn}")
print(f"c grad:{c.grad} grad_fn:{c.grad_fn}")
print(f"d grad:{d.grad} grad_fn:{d.grad_fn}")
"""
a grad:4.0 grad_fn:None
b grad:1.0 grad_fn:None
c grad:None grad_fn:<AddBackward0 object at 0x7fdb27bcbf90>
d grad:None grad_fn:<MulBackward0 object at 0x7fdb27bcb210>
"""
For how a tensor itself is created, see the earlier article on tensor construction; this article focuses on torch.add() and torch.mul().
1 torch.add()
Stepping through c = torch.add(a, b) in a debugger, execution jumps to THPVariable_add().
THPVariable_add() breaks down into three parts:
- parse the arguments
- dispatch to the add() implementation and get back a C++-level tensor
- wrap that tensor into a THPVariable (a tensor usable at the Python level)
Here is the implementation of THPVariable_add():
[torch/csrc/autograd/generated/python_torch_functions.cpp]
// add
static PyObject * THPVariable_add(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PythonArgParser parser({
    "add(Tensor input, Scalar alpha, Tensor other, *, Tensor out=None)|deprecated",
    "add(Tensor input, Tensor other, *, Scalar alpha=1, Tensor out=None)",
  }, /*traceable=*/true);

  ParsedArgs<4> parsed_args;
  auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
  if(_r.has_torch_function()) {
    return handle_torch_function(_r, nullptr, args, kwargs, THPVariableFunctionsModule, "torch");
  }
  switch (_r.idx) {
    ...
    case 1: {
      if (_r.isNone(3)) {
        // aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
        auto dispatch_add = [](const Tensor & self, const Tensor & other, Scalar alpha) -> Tensor {
          pybind11::gil_scoped_release no_gil;
          return self.add(other, alpha);
        };
        return wrap(dispatch_add(_r.tensor(0), _r.tensor(1), _r.scalar(2)));
      } else {
        ...
      }
    }
  }
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
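A side note on the _r.has_torch_function() check above: if any argument overrides __torch_function__, control is handed to that override instead of dispatch_add. A minimal sketch of my own using the modern subclass protocol (this exact pattern works on recent PyTorch; support details in 1.7 may differ):

import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # This runs before the normal dispatch path in THPVariable_add.
        print("intercepted:", func.__name__)
        return super().__torch_function__(func, types, args, kwargs or {})

t = torch.tensor([1.0, 2.0]).as_subclass(LoggingTensor)
torch.add(t, t)   # prints "intercepted: add"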
In this example, _r.idx equals 1, so execution enters case 1. _r.isNone(3) is True because no out tensor was passed; _r.tensor(0) returns a C++-level tensor, and so does _r.tensor(1).
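As a quick illustration of the matched schema (my own example): the second overload means torch.add(input, other, alpha=...) computes input + alpha * other.

import torch

a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)

# Matches "add(Tensor input, Tensor other, *, Scalar alpha=1, Tensor out=None)"
# (_r.idx == 1) with out left as None, so _r.isNone(3) is True.
print(torch.add(a, b))           # tensor(3., grad_fn=<AddBackward0>)
print(torch.add(a, b, alpha=2))  # tensor(5., grad_fn=<AddBackward0>): a + 2*b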
Argument parsing itself will not be dissected further here; next we look at add_Tensor().
1.1 add_Tensor()
dispatch_add(_r.tensor(0), _r.tensor(1), _r.scalar(2)) performs the addition and returns a C++-level tensor. dispatch_add() calls Tensor::add():
[build/aten/src/ATen/core/TensorMethods.cpp]
// aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
Tensor Tensor::add(const Tensor & other, Scalar alpha) const {
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("aten::add", "Tensor")
      .typed<Tensor (const Tensor &, const Tensor &, Scalar)>();
  return op.call(const_cast<Tensor&>(*this), other, alpha);
}
Tensor::add() goes through the dispatch mechanism and ultimately calls add_Tensor():
[torch/csrc/autograd/generated/VariableType_2.cpp]
Tensor add_Tensor(const Tensor & self, const Tensor & other, Scalar alpha) {
  auto& self_ = unpack(self, "self", 0);
  auto& other_ = unpack(other, "other", 1);
  auto _any_requires_grad = compute_requires_grad( self, other );
  (void)_any_requires_grad;
  std::shared_ptr<AddBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<AddBackward0>(new AddBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self, other ));
    grad_fn->other_scalar_type = other.scalar_type();
    grad_fn->alpha = alpha;
    grad_fn->self_scalar_type = self.scalar_type();
  }
  ...
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::add(self_, other_, alpha);
  })();
  auto result = std::move(tmp);
  ...
  if (grad_fn) {
    set_history(flatten_tensor_args( result ), grad_fn);
  }
  return result;
}
All the code under torch/csrc/autograd/generated/ is generated at build time by the scripts in tools/autograd/.
add_Tensor() breaks down into three parts:
- build the backward node AddBackward0
- compute the addition result result
- attach the backward node AddBackward0 to result
1.1.1 Building the backward node AddBackward0
compute_requires_grad(self, other) determines whether gradients are needed: if either self or other has requires_grad set to True, the result requires grad. In this example both inputs do, so _any_requires_grad is True.
[torch/csrc/autograd/generated/VariableType_2.cpp]
Tensor add_Tensor(const Tensor & self, const Tensor & other, Scalar alpha) {
  ...
  auto _any_requires_grad = compute_requires_grad( self, other );
  (void)_any_requires_grad;
  std::shared_ptr<AddBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<AddBackward0>(new AddBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self, other ));
    grad_fn->other_scalar_type = other.scalar_type();
    grad_fn->alpha = alpha;
    grad_fn->self_scalar_type = self.scalar_type();
  }
  ...
}
The statement grad_fn = std::shared_ptr<AddBackward0>(new AddBackward0(), deleteNode) creates an AddBackward0 node and assigns it to grad_fn. AddBackward0 is defined as follows:
[torch/csrc/autograd/generated/Functions.h]
struct TORCH_API AddBackward0 : public TraceableFunction {
  using TraceableFunction::TraceableFunction;
  variable_list apply(variable_list&& grads) override;
  ...
  ScalarType other_scalar_type;
  Scalar alpha;
  ScalarType self_scalar_type;
};
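The generated apply() body lives in torch/csrc/autograd/generated/Functions.cpp and is not shown in this article. Conceptually, it computes the following (a Python sketch of my own, ignoring the dtype casts that self_scalar_type and other_scalar_type exist for):

# For result = self + alpha * other:
def add_backward0_sketch(grad, alpha):
    grad_self = grad           # d(result)/d(self) = 1
    grad_other = grad * alpha  # d(result)/d(other) = alpha
    return grad_self, grad_other

Note that neither gradient depends on the input values, which is why AddBackward0 stores no tensors.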
The call grad_fn->set_next_edges(collect_next_edges(self, other)) then collects the edge list for the inputs and stores it in grad_fn's next_edges_ member. collect_next_edges() is defined as follows:
[torch/csrc/autograd/function.h]
/// Return the next edges of all the given variables, or tuples of variables.
template <typename... Variables>
edge_list collect_next_edges(Variables&&... variables) {
  if (!GradMode::is_enabled())
    return {};
  detail::MakeNextFunctionList make;
  make.apply(std::forward<Variables>(variables)...);
  return std::move(make.next_edges);
}
collect_next_edges() is implemented via the MakeNextFunctionList struct:
[torch/csrc/autograd/function.h]
// Implementation of `collect_next_edges` (see below).
struct MakeNextFunctionList : IterArgs<MakeNextFunctionList> {
  edge_list next_edges;
  using IterArgs<MakeNextFunctionList>::operator();
  void operator()(const Variable& variable) {
    if (variable.defined()) {
      next_edges.push_back(impl::gradient_edge(variable));
    } else {
      next_edges.emplace_back();
    }
  }
  ...
};
gradient_edge(), called inside MakeNextFunctionList, returns an Edge instance:
[torch/csrc/autograd/variable.cpp]
Edge gradient_edge(const Variable& self) {
  if (const auto& gradient = self.grad_fn()) {
    return Edge(gradient, self.output_nr());
  } else {
    return Edge(grad_accumulator(self), 0);
  }
}
If self was created internally (a non-leaf node, i.e., produced by an operation), the edge wraps self's grad_fn member; otherwise (a leaf node created by the user) it wraps an AccumulateGrad instance. The AccumulateGrad node is created by grad_accumulator():
[torch/csrc/autograd/variable.cpp]
std::shared_ptr<Node> grad_accumulator(const Variable& self) {
  auto autograd_meta = get_autograd_meta(self);
  ...
  auto intrusive_from_this = c10::intrusive_ptr<at::TensorImpl>::reclaim(self.unsafeGetTensorImpl());
  result = std::make_shared<AccumulateGrad>(Variable(std::move(intrusive_from_this)));
  autograd_meta->grad_accumulator_ = result;
  return result;
}
The variable member of AccumulateGrad points at self's TensorImpl; it is what lets the backward pass accumulate gradients into the leaf node.
[torch/csrc/autograd/functions/accumulate_grad.h]
struct TORCH_API AccumulateGrad : public Node {
  explicit AccumulateGrad(Variable variable_);
  variable_list apply(variable_list&& grads) override;
  ...
  Variable variable;
};
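AccumulateGrad's apply() adds the incoming gradient into variable.grad rather than overwriting it, which is why gradients accumulate across backward calls unless explicitly zeroed. A quick check of my own:

import torch

a = torch.tensor(1.0, requires_grad=True)
for _ in range(2):
    (a * 3).backward()
print(a.grad)  # tensor(6.): 3 + 3, accumulated by AccumulateGrad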
For c = torch.add(a, b), where both a and b are leaves, collect_next_edges() ultimately returns [(AccumulateGrad, 0), (AccumulateGrad, 0)]. set_next_edges() then moves this edge list into grad_fn's next_edges_ member:
[torch/csrc/autograd/function.h]
void set_next_edges(edge_list&& next_edges) {
  next_edges_ = std::move(next_edges);
}
At this point grad_fn is fully initialized.
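From Python, next_edges_ surfaces as grad_fn.next_functions; here is a quick sanity check of my own, using the example from the top of the article:

import torch

a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)
c = torch.add(a, b)

# Two edges, one per input; both inputs are leaves, so both edges
# point at AccumulateGrad nodes with input number 0.
print(c.grad_fn.next_functions)
# ((<AccumulateGrad object at ...>, 0), (<AccumulateGrad object at ...>, 0))
print(c.grad_fn.next_functions[0][0].variable is a)  # True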
1.1.2 Computing the addition result result
[torch/csrc/autograd/generated/VariableType_2.cpp]
Tensor add_Tensor(const Tensor & self, const Tensor & other, Scalar alpha) {
  ...
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::add(self_, other_, alpha);
  })();
  auto result = std::move(tmp);
  ...
}
at::add() goes through the dispatch mechanism and ultimately calls the ATen kernel add_Tensor() to perform the actual addition:
[build/aten/src/ATen/RegisterCPU.cpp]
Tensor add_Tensor(const Tensor & self, const Tensor & other, Scalar alpha) {
  structured_add_out_functional op;
  op.meta(self, other, alpha);
  op.impl(op.outputs_[0], self, other, alpha);
  return std::move(op.outputs_[0]);
}
1.1.3 Attaching the backward node AddBackward0 to result
[torch/csrc/autograd/generated/VariableType_2.cpp]
Tensor add_Tensor(const Tensor & self, const Tensor & other, Scalar alpha) {
  ...
  if (grad_fn) {
    set_history(flatten_tensor_args( result ), grad_fn);
  }
  ...
}
The set_history() call in add_Tensor() above resolves to the std::vector<Variable> overload of set_history() below, which iterates over the output tensors and calls the single-tensor overload to associate grad_fn with result:
[torch/csrc/autograd/functions/utils.h]
inline void set_history(
    at::Tensor& variable,
    const std::shared_ptr<Node>& grad_fn) {
  AT_ASSERT(grad_fn);
  if (variable.defined()) {
    TORCH_INTERNAL_ASSERT(isDifferentiableType(variable.scalar_type()));
    auto output_nr = grad_fn->add_input_metadata(variable);
    impl::set_gradient_edge(variable, {grad_fn, output_nr});
  } else {
    ...
  }
}

inline void set_history(
    std::vector<Variable>&& variables,
    const std::shared_ptr<Node>& grad_fn) {
  for (auto& variable : variables) {
    set_history(variable, grad_fn);
  }
}
Since most operations have a single output, output_nr is usually 0. Some operations, such as split, produce multiple outputs; with 3 outputs, grad_fn must be attached to each of them, and their output_nr values are 0, 1, and 2 respectively, as the sketch below shows.
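A quick way to see output_nr from Python (my own check; Tensor exposes it as the .output_nr attribute):

import torch

x = torch.randn(6, requires_grad=True)
y = x * 2                 # ordinary single-output op
parts = y.split(2)        # one op, three outputs

print(y.output_nr)                   # 0
print([p.output_nr for p in parts])  # [0, 1, 2]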
The impl::set_gradient_edge() called by the single-tensor set_history() is implemented as follows:
[torch/csrc/autograd/variable.cpp]
void set_gradient_edge(const Variable& self, Edge edge) {
  auto* meta = materialize_autograd_meta(self);
  meta->grad_fn_ = std::move(edge.function);
  meta->output_nr_ = edge.input_nr;
  if (self.is_view()) {
    // NB: is_view() ==> get_autograd_meta()
    auto diff_view_meta = static_cast<torch::autograd::DifferentiableViewMeta*>(meta);
    diff_view_meta->set_attr_version(self._version());
  }
}
1.2 Wrapping the tensor into a THPVariable
wrap() is called to wrap the C++-level Tensor:
[torch/csrc/autograd/utils/wrap_outputs.h]
inline PyObject* wrap(at::Tensor tensor) {
  return THPVariable_Wrap(Variable(std::move(tensor)));
}
wrap() calls THPVariable_Wrap():
[torch/csrc/autograd/python_variable.cpp]
PyObject * THPVariable_Wrap(Variable var)
{
  if (!var.defined()) {
    Py_RETURN_NONE;
  }
  if (auto obj = torch::autograd::impl::pyobj(var)) {
    Py_INCREF(obj);
    return obj;
  }
  return THPVariable_NewWithVar((PyTypeObject *)THPVariableClass, std::move(var));
}
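The pyobj cache consulted above means wrapping the same C++ Variable twice hands back the identical Python object. This can be observed (a check of my own) through an attribute such as .grad, which wraps a stored C++ Variable on every access:

import torch

a = torch.tensor(1.0, requires_grad=True)
(a * 2).backward()

# Each access to a.grad calls THPVariable_Wrap on the same C++ Variable;
# after the first wrap, the cached PyObject is returned, so identity holds.
print(a.grad is a.grad)  # True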
THPVariable_NewWithVar() is what actually wraps the C++ tensor into a THPVariable that can be used at the Python level:
[torch/csrc/autograd/python_variable.cpp]
// Creates a new Python object for a Variable. The Variable must not already
// have a PyObject* associated with it.
static PyObject* THPVariable_NewWithVar(PyTypeObject* type, Variable var)
{
  PyObject* obj = type->tp_alloc(type, 0);
  if (obj) {
    auto v = (THPVariable*) obj;
    new (&v->cdata) Variable(std::move(var));
    torch::autograd::impl::set_pyobj(v->cdata, obj);
  }
  return obj;
}
At this point the graph construction for torch.add() is complete: c's grad_fn_ points at an AddBackward0 node, and that node's next_edges_ point at the AccumulateGrad nodes of a and b (each with input_nr 0).
2 torch.mul()
Like torch.add(), torch.mul() ultimately calls mul_Tensor() to build the C++-level tensor.
2.1 mul_Tensor()
[torch/csrc/autograd/generated/VariableType_4.cpp]
Tensor mul_Tensor(const Tensor & self, const Tensor & other) {
  auto& self_ = unpack(self, "self", 0);
  auto& other_ = unpack(other, "other", 1);
  auto _any_requires_grad = compute_requires_grad( self, other );
  (void)_any_requires_grad;
  std::shared_ptr<MulBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<MulBackward0>(new MulBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self, other ));
    if (grad_fn->should_compute_output(1)) {
      grad_fn->self_ = SavedVariable(self, false);
    }
    grad_fn->other_scalar_type = other.scalar_type();
    grad_fn->self_scalar_type = self.scalar_type();
    if (grad_fn->should_compute_output(0)) {
      grad_fn->other_ = SavedVariable(other, false);
    }
  }
  ...
  auto tmp = ([&]() {
    at::AutoNonVariableTypeMode non_var_type_mode(true);
    return at::mul(self_, other_);
  })();
  auto result = std::move(tmp);
  ...
  if (grad_fn) {
    set_history(flatten_tensor_args( result ), grad_fn);
  }
  return result;
}
torch.mul() differs from torch.add() in one key way:
- the derivative of addition does not depend on the inputs
- the derivative of multiplication does depend on the inputs
Because the derivative of multiplication depends on the inputs, they must be saved when MulBackward0 is built: the two guarded assignments above, grad_fn->self_ = SavedVariable(self, false) and grad_fn->other_ = SavedVariable(other, false), store them as SavedVariable instances. Note how the guards mirror the math: self is only needed to compute the gradient for other (output 1), and other only for the gradient of self (output 0), which is exactly what the should_compute_output() checks encode. MulBackward0 is defined as follows:
[torch/csrc/autograd/generated/Functions.h]
struct TORCH_API MulBackward0 : public TraceableFunction {
  using TraceableFunction::TraceableFunction;
  variable_list apply(variable_list&& grads) override;
  std::string name() const override { return "MulBackward0"; }
  void release_variables() override {
    std::lock_guard<std::mutex> lock(mutex_);
    self_.reset_data();
    self_.reset_grad_function();
    other_.reset_data();
    other_.reset_grad_function();
  }
  SavedVariable self_;
  ScalarType other_scalar_type;
  ScalarType self_scalar_type;
  SavedVariable other_;
};
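As with AddBackward0, the generated apply() body lives in Functions.cpp; conceptually (again a Python sketch of my own, ignoring dtype handling) it uses the saved inputs:

# For result = self * other:
def mul_backward0_sketch(grad, saved_self, saved_other):
    grad_self = grad * saved_other   # d(result)/d(self) = other
    grad_other = grad * saved_self   # d(result)/d(other) = self
    return grad_self, grad_other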
SavedVariable is defined as follows:
[torch/csrc/autograd/saved_variable.h]
/// A snapshot of a variable at a certain version. A `SavedVariable` stores
/// enough information to reconstruct a variable from a certain point in time.
class TORCH_API SavedVariable {
 private:
  at::Tensor data_;
  // This field is used to store the forward AD gradients associated with
  // the saved Tensor. Note that this shared_ptr must never be shared with
  // either the saved Tensor or the unpacked Tensor. See note [ Using ForwardGrad ]
  std::shared_ptr<ForwardGrad> fw_grad_;
  // The gradient function associated with this node. If has_grad_fn
  // is false, then this is a leaf node. Note that the grad_fn is not saved if
  // it would create a circular reference. In that case, the grad_fn must be
  // passed in to the unpack function when reconstructing the Variable.
  std::shared_ptr<Node> grad_fn_;
  // Weak version of grad_fn_ that prevents leaks in rebase_history() for
  // inplace views.
  std::weak_ptr<Node> weak_grad_fn_;
  std::weak_ptr<Node> grad_accumulator_;
  c10::VariableVersion version_counter_;
  uint32_t saved_version_ = 0;
  uint32_t output_nr_ = 0;
  bool was_default_constructed_ = true;
  bool requires_grad_ = false;
  bool has_grad_fn_ = false;
  bool is_inplace_view_ = false;
};
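The version_counter_/saved_version_ pair is what lets autograd detect that a saved input was modified in place between the forward and backward passes. A small demonstration of my own, using the article's running example:

import torch

a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)
c = torch.add(a, b)   # AddBackward0 saves nothing
d = torch.mul(a, c)   # MulBackward0 saved a and c as SavedVariables

c.add_(1)             # bump c's version counter after it was saved
try:
    d.backward()
except RuntimeError as e:
    print(e)  # "... has been modified by an inplace operation ..."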
At this point, the dynamic computation graph for the code below has been fully built; a follow-up article will analyze how backpropagation traverses it.
import torch
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)
c = torch.add(a, b)
d = torch.mul(a, c)
d.backward()
print(f"a grad:{a.grad} grad_fn:{a.grad_fn}")
print(f"b grad:{b.grad} grad_fn:{b.grad_fn}")
print(f"c grad:{c.grad} grad_fn:{c.grad_fn}")
print(f"d grad:{d.grad} grad_fn:{d.grad_fn}")
"""
a grad:4.0 grad_fn:None
b grad:1.0 grad_fn:None
c grad:None grad_fn:<AddBackward0 object at 0x7fdb27bcbf90>
d grad:None grad_fn:<MulBackward0 object at 0x7fdb27bcb210>
"""