This article is based on PyTorch 1.7.0: https://github.com/pytorch/pytorch/tree/v1.7.0

1 Understanding the Node Class

1.1 Node

[torch/csrc/autograd/function.h] A Node is an abstract class that represents an operation taking zero or more input Variables and producing zero or more output Variables. All functions in PyTorch’s autograd machinery derive from this class and override its apply method. Instances of such subclasses will then be invokeable via the call operator.

The Node class (Node is equivalent to the earlier Function class) is the abstract base class for operations. In this article, "operation" means a backward class such as AddBackward0. A Node takes zero or more Variables (Variable is equivalent to Tensor) as inputs and produces zero or more Variables as outputs; for example, AccumulateGrad has one input and no outputs, while GraphRoot has no inputs and at least one output. Every backward class derives from Node and overrides its apply() method; an instance of a backward class is invoked through the call operator (), which in turn calls apply(), e.g.:

```cpp
auto add_backward = std::make_shared<AddBackward0>();  // construct the backward node
(*add_backward)(inputs);  // the call operator; equivalent to calling apply()
```
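
To make the dispatch concrete, here is a minimal runnable sketch using the C++ frontend (assuming a program linked against libtorch; `DoubledGradDemo` is a made-up subclass for illustration, not a real PyTorch node):

```cpp
#include <torch/torch.h>
#include <iostream>

using torch::autograd::Node;
using torch::autograd::variable_list;

// Made-up Node subclass: overrides apply(), which the call operator invokes.
struct DoubledGradDemo : public Node {
  variable_list apply(variable_list&& grads) override {
    return {grads[0] * 2};  // pretend gradient rule: double the incoming grad
  }
};

int main() {
  auto node = std::make_shared<DoubledGradDemo>();
  // operator() forwards to apply(); see section 2.2.2 below.
  variable_list out = (*node)({torch::ones({2})});
  std::cout << out[0] << std::endl;  // tensor of 2s
}
```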

1.2 Nodes in the Autograd Graph

[torch/csrc/autograd/function.h] When viewing the autograd system as a graph, Nodes are the vertices or nodes, connected to each other via (directed) Edges, which themselves are represented via (Node, input_nr) pairs. Variables are the outputs to and inputs of Nodes, and travel between these edges during execution of the graph. When two or more Edges (from different sources) point at the same input to a Node, the values produced along all of these edges are implicitly summed prior to being forwarded to the target Node.

If we view the autograd system as a graph, Node instances are its vertices (nodes), connected to one another by directed edges; each edge is an Edge instance, represented as a (Node, input_nr) pair. Variables are the inputs and outputs of Nodes and travel along the edges while the graph is executed (that is, during the backward pass). When two or more edges from different sources point at the same input of a Node, the values produced along those edges are implicitly summed before being forwarded to the target Node, as the sketch below demonstrates.
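
A minimal sketch of this implicit summation using the C++ frontend (assuming a program linked against libtorch; the same experiment works from Python):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::ones({2}, torch::requires_grad());
  // x feeds two different multiplications, so two edges point at the same
  // input of x's AccumulateGrad node; the engine sums the two partial
  // gradients (2 and 3) before handing them to AccumulateGrad.
  auto y = x * 2 + x * 3;
  y.sum().backward();
  std::cout << x.grad() << std::endl;  // 5 for each element
}
```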

1.3 Interface

[torch/csrc/autograd/function.h]
The most important method on Node is the call operator, which takes in a list of variables and produces a list of variables. The precise size of these lists can be determined with num_inputs() and num_outputs(). Nodes are stitched together via their next_edge interface, which lets you manipulate the set of outgoing edges of a Node. You can add an edge with add_next_edge(), retrieve an edge with next_edge(index) and iterate over them via the next_edges() method. Other methods exist for integration with the JIT and other parts of PyTorch. Every Node has a sequence number that increases monotonically in the order of Node construction. This sequence number is thread local. This means that when Nodes A, B and C are created consecutively in the same thread, their sequence numbers will be ordered A < B < C. If, however, A and B are created in one thread and C is created in a new thread, there are no guarantees w.r.t. the ordering of C relative to A or B.

The most important method on Node is the call operator (node(...)), whose inputs and outputs are both lists of Variables. Nodes are stitched together through the next_edge interface; the sketch below pokes at these interfaces on a small graph.
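
A small sketch (assuming libtorch) that walks these interfaces on a real graph:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  auto a = torch::ones({2}, torch::requires_grad());
  auto b = torch::ones({2}, torch::requires_grad());
  auto fn = (a * b).grad_fn();  // MulBackward0
  std::cout << fn->name() << std::endl;
  // One outgoing edge per forward input; for a leaf tensor the edge points
  // at its AccumulateGrad node, and input_nr selects that node's input slot.
  for (const auto& edge : fn->next_edges()) {
    std::cout << edge.function->name() << ", input_nr = " << edge.input_nr << std::endl;
  }
}
```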

2 Node Class Members and Methods

[torch/csrc/autograd/function.h, https://github.com/pytorch/pytorch/blob/v1.7.0/torch/csrc/autograd/function.h]
```cpp
using tensor_list = std::vector<at::Tensor>;
using variable_list = std::vector<Variable>;
using edge_list = std::vector<Edge>;
```

2.1 Data Members

[torch/csrc/autograd/function.h, https://github.com/pytorch/pytorch/blob/v1.7.0/torch/csrc/autograd/function.h]
```cpp
struct TORCH_API Node : std::enable_shared_from_this<Node> {
  // Since `Node`s are neither copyable nor moveable, we can have const fields.
  const uint64_t sequence_nr_;
  // Id of the thread that created the instance
  uint64_t thread_id_ = 0;
  ...
  std::mutex mutex_;
  edge_list next_edges_;
  PyObject* pyobj_ = nullptr; // weak reference
  std::unique_ptr<AnomalyMetadata> anomaly_metadata_ = nullptr;
  std::vector<std::unique_ptr<FunctionPreHook>> pre_hooks_;
  std::vector<std::unique_ptr<FunctionPostHook>> post_hooks_;
  at::SmallVector<InputMetadata, 2> input_metadata_;
};
```
• sequence_nr_
• next_edges_

As described in 1.2 Nodes in the Autograd Graph, an edge is a directed edge between nodes, represented as a (Node, input_nr) pair. Every Node instance holds a next_edges_ member that is used to reach the next nodes in the graph.

[torch/csrc/autograd/edge.h]
```cpp
/// Represents a particular input of a function.
struct Edge {
  Edge() noexcept : function(nullptr), input_nr(0) {}

  Edge(std::shared_ptr<Node> function_, uint32_t input_nr_) noexcept
      : function(std::move(function_)), input_nr(input_nr_) {}

  /// The function this `Edge` points to.
  std::shared_ptr<Node> function;

  /// The identifier of a particular input to the function.
  uint32_t input_nr;
};
```
• pre_hooks_
• post_hooks_
• input_metadata_

[torch/csrc/autograd/input_metadata.h]
```cpp
struct InputMetadata {
 private:
  const at::TensorOptions options_;
  at::DimVector shape_;
  at::Device device_ = at::kCPU;
  c10::Stream stream_ = c10::Stream(c10::Stream::Default::DEFAULT, device_);
};
```
    

2.2 Methods

2.2.1 Constructors

[torch/csrc/autograd/function.h, https://github.com/pytorch/pytorch/blob/v1.7.0/torch/csrc/autograd/function.h]
```cpp
struct TORCH_API Node : std::enable_shared_from_this<Node> {
  /// Construct a new `Node` with the given `next_edges`. `sequence_nr` is
  /// a (currently THE) hint to prioritization in the backward() pass, with
  /// higher sequence numbers prioritized before lower sequence numbers.
  explicit Node(
      uint64_t sequence_nr,
      edge_list&& next_edges = edge_list())
      : sequence_nr_(sequence_nr),
        next_edges_(std::move(next_edges)) {...}

  explicit Node(edge_list&& next_edges = edge_list())
      : Node(at::sequence_number::get_and_increment(), std::move(next_edges)) {}
};
```
    

2.2.2 The Call Operator

[torch/csrc/autograd/function.h, https://github.com/pytorch/pytorch/blob/v1.7.0/torch/csrc/autograd/function.h]
```cpp
struct TORCH_API Node : std::enable_shared_from_this<Node> {
  /// Evaluates the function on the given inputs and returns the result of the
  /// function call.
  variable_list operator()(variable_list&& inputs) {
    ...
    return apply(std::move(inputs));
    ...
  }
};
```
    

operator() calls apply(); every subclass of Node must override apply(). One user-visible way such a subclass enters the graph is shown in the sketch below.
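
From user code, the usual way to add a custom Node subclass to the graph is torch::autograd::Function: internally (torch/csrc/autograd/custom_function.h) a CppNode, which derives from Node, is spliced into the graph, and its apply() dispatches to the backward() defined here. A minimal sketch, assuming libtorch:

```cpp
#include <torch/torch.h>
#include <iostream>

struct Square : public torch::autograd::Function<Square> {
  static torch::Tensor forward(torch::autograd::AutogradContext* ctx, torch::Tensor x) {
    ctx->save_for_backward({x});
    return x * x;
  }
  static torch::autograd::tensor_list backward(
      torch::autograd::AutogradContext* ctx, torch::autograd::tensor_list grads) {
    auto x = ctx->get_saved_variables()[0];
    return {2 * x * grads[0]};  // d(x^2)/dx = 2x
  }
};

int main() {
  auto x = torch::tensor({3.0, 3.0}, torch::requires_grad());
  Square::apply(x).sum().backward();
  std::cout << x.grad() << std::endl;  // 6 6
}
```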

2.2.3 Graph Connectivity API

• input_metadata_-related methods

The inputs of a grad_fn Node correspond to the Tensor outputs of the forward function; see the sketch after the following code block.

[torch/csrc/autograd/function.h, https://github.com/pytorch/pytorch/blob/v1.7.0/torch/csrc/autograd/function.h]
```cpp
struct TORCH_API Node : std::enable_shared_from_this<Node> { 
  // Inputs. NOTE: inputs of the grad_fn correspond to Tensor outputs of the
  // forward function.

  // Marker for expected undefined input
  struct undefined_input {};

  /// Adds the type and shape metadata for a new input. Returns the index
  /// of the new input.
  uint32_t add_input_metadata(
    const at::TensorOptions& options
  , at::IntArrayRef shape
  , at::Device device) noexcept {
    uint32_t input_nr = input_metadata_.size();
    input_metadata_.emplace_back(options, shape, device);
    return input_nr;
  }

  uint32_t add_input_metadata(const at::Tensor& t) noexcept {
    uint32_t input_nr = input_metadata_.size();
    input_metadata_.emplace_back(t);
    return input_nr;
  }

  /// Adds a placeholder for an input that will not be used.
  uint32_t add_input_metadata(undefined_input u) noexcept {
    uint32_t input_nr = input_metadata_.size();
    input_metadata_.emplace_back();
    return input_nr;
  }

  uint32_t num_inputs() const noexcept {
    return input_metadata_.size();
  }

  const InputMetadata& input_metadata(size_t index) const {
    return input_metadata_[index];
  }

  /**
   * Note: Function Streams
   * A function's stream (for a given device type) is the stream of the first
   * element of its input buffer on a device of that type.
   *
   * If all elements are on the same device they MUST share a stream. If
   * elements are on different devices (across multiple GPUs, for example)
   * they may have different streams.
   */
  c10::optional<c10::Stream> stream(const c10::DeviceType device_type) {
    for (const auto& metadata : input_metadata_) {
      if (metadata.device().type() == device_type) return metadata.stream();
    }

    return c10::nullopt;
  }

  void clear_input_metadata() {
    input_metadata_.clear();
  }
};
```
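
A quick way to see this from user code (a sketch assuming libtorch): each forward output gets one InputMetadata entry on the grad_fn, so num_inputs() reports the number of forward outputs.

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::ones({2, 3}, torch::requires_grad());
  auto y = x.sum();
  // SumBackward0 recorded metadata for its single forward output, which is
  // where the incoming gradient will arrive during backward.
  std::cout << y.grad_fn()->num_inputs() << std::endl;  // 1
}
```
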
• edge-related methods

[torch/csrc/autograd/function.h, https://github.com/pytorch/pytorch/blob/v1.7.0/torch/csrc/autograd/function.h]
```cpp
    struct TORCH_API Node : std::enable_shared_from_this<Node> { 
    // Outputs ("Next Edges")
    
    const Edge& next_edge(size_t index) const noexcept {
      return next_edges_[index];
    }
    
    void set_next_edge(size_t index, Edge edge) {
      next_edges_[index] = std::move(edge);
    }
    
    void add_next_edge(Edge edge) {
      next_edges_.push_back(std::move(edge));
    }
    
    void set_next_edges(edge_list&& next_edges) {
      next_edges_ = std::move(next_edges);
    }
    
    const edge_list& next_edges() const noexcept {
      return next_edges_;
    }
    
    edge_list& next_edges() noexcept {
      return next_edges_;
    }
    
    uint32_t num_outputs() const noexcept {
      return next_edges_.size();
    }
};
```
    
• Miscellaneous methods (e.g. for integration with the JIT and other parts of PyTorch) are omitted here.

  • Hook API

[torch/csrc/autograd/function.h, https://github.com/pytorch/pytorch/blob/v1.7.0/torch/csrc/autograd/function.h]
```cpp
    struct TORCH_API Node : std::enable_shared_from_this<Node> { 
    // Hook API
    //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    uintptr_t add_post_hook(std::unique_ptr<FunctionPostHook>&& post_hook) {
      post_hooks_.push_back(std::move(post_hook));
      // Use the raw pointer as the unique key to identify this hook. This key
      // can then be used in del_post_hook(key) to remove this hook.
      return reinterpret_cast<std::uintptr_t>(post_hooks_.back().get());
    }
    
    const std::vector<std::unique_ptr<FunctionPostHook>>& post_hooks() const
        noexcept {
      return post_hooks_;
    }
    
    // delete a post hook matching the key
    bool del_post_hook(const uintptr_t& key) {
      for (auto it = post_hooks_.begin(); it != post_hooks_.end(); ++it) {
        if (key == reinterpret_cast<std::uintptr_t>(it->get())) {
          post_hooks_.erase(it);
          return true;
        }
      }
      return false;
    }
    
    std::vector<std::unique_ptr<FunctionPostHook>>& post_hooks() noexcept {
      return post_hooks_;
    }
    
    void add_pre_hook(std::unique_ptr<FunctionPreHook>&& pre_hook) {
      pre_hooks_.push_back(std::move(pre_hook));
    }
    
    const std::vector<std::unique_ptr<FunctionPreHook>>& pre_hooks() const
        noexcept {
      return pre_hooks_;
    }
    
    std::vector<std::unique_ptr<FunctionPreHook>>& pre_hooks() noexcept {
      return pre_hooks_;
    }
};
```
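
As a sketch of how these hooks fire during backward (assuming libtorch and access to the internal torch::autograd::FunctionPostHook interface from torch/csrc/autograd/function_hook.h; PrintPostHook is a made-up hook):

```cpp
#include <torch/torch.h>
#include <iostream>

// Made-up post hook: observes a node's gradients and passes them through.
struct PrintPostHook : torch::autograd::FunctionPostHook {
  torch::autograd::variable_list operator()(
      const torch::autograd::variable_list& outputs,
      const torch::autograd::variable_list& inputs) override {
    std::cout << "post hook: " << outputs.size() << " outgoing grad(s)" << std::endl;
    return outputs;  // must return the (possibly modified) gradients
  }
};

int main() {
  auto x = torch::ones({2}, torch::requires_grad());
  auto y = x * 2;
  y.grad_fn()->add_post_hook(std::make_unique<PrintPostHook>());  // on MulBackward0
  y.sum().backward();  // prints once, when MulBackward0 finishes
}
```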
    

3 The Node Class Hierarchy

3.1 Direct Subclasses of Node

3.1.1 AccumulateGrad

In the computational graph, a leaf tensor's node is an AccumulateGrad; calling AccumulateGrad's apply() updates the leaf tensor's grad.

[torch/csrc/autograd/functions/accumulate_grad.h]
```cpp
    struct TORCH_API AccumulateGrad : public Node {
    explicit AccumulateGrad(Variable variable_);
    
    variable_list apply(variable_list&& grads) override;
    
    static at::Tensor callHooks(
        const Variable& variable,
        at::Tensor new_grad) {
      for (auto& hook : impl::hooks(variable)) {
        new_grad = (*hook)({new_grad})[0];
      }
      return new_grad;
    }
    
    // variable: the variable whose grad we're accumulating.
    // variable_grad: the current grad for the variable.
    // new_grad: new grad we want to accumulate for the variable.
    // num_expected_refs: the number of refs we expect to hold internally
    //                    such that it is safe to avoid cloning the grad
    //                    if use_count() of the grad is less than or equal
    //                    to this value (in addition to post_hooks).
    // update_grad: Function that is used to update grad for the variable.
    //              The argument to the function is a Tensor which
    //              is used to set a new value for the grad.
    template <typename T>
    static void accumulateGrad(
        const Variable& variable,
        at::Tensor& variable_grad,
        const at::Tensor& new_grad,
        size_t num_expected_refs,
        const T& update_grad) {
      if (!variable_grad.defined()) {
        if (!GradMode::is_enabled() &&
            !new_grad.is_sparse() &&
            new_grad.use_count() <= num_expected_refs &&
            utils::obeys_layout_contract(new_grad, variable)) {
          // we aren't setting up for double-backward
          // not sparse
          // no other user-visible tensor references new_grad
          // new_grad obeys the "Gradient Layout Contract"
          // Under these conditions, we can steal new_grad without a deep copy.
          update_grad(new_grad.detach());
        } else if (
            !GradMode::is_enabled() && new_grad.is_sparse() &&
            new_grad._indices().is_contiguous() &&
            new_grad._values().is_contiguous() &&
            // Use count for indices and values should always be <=1 since the
            // SparseTensor should be the only one holding a reference to these.
            new_grad._indices().use_count() <= 1 &&
            new_grad._values().use_count() <= 1 &&
            new_grad.use_count() <= num_expected_refs) {
          // Can't detach sparse tensor (since metadata changes are not allowed
          // after detach), so just create a new one for the grad which is a
          // shallow copy. We need a shallow copy so that modifying the original
          // grad tensor doesn't modify the grad we accumulate.
          // We only skip clone if indices and values themselves are contiguous
          // for backward compatibility reasons. Since without this optimization,
          // earlier we would clone the entire SparseTensor which cloned indices
          // and values.
          // For details see https://github.com/pytorch/pytorch/issues/34375.
          update_grad(at::_sparse_coo_tensor_unsafe(
              new_grad._indices(),
              new_grad._values(),
              new_grad.sizes(),
              new_grad.options()));
        } else {
          if (new_grad.is_sparse()) {
            update_grad(new_grad.clone());
          } else {
            // Deep copies new_grad according to the "Gradient Layout Contract."
            update_grad(utils::clone_obey_contract(new_grad, variable));
          }
        }
      } else if (!GradMode::is_enabled()) {
        // This case is not strictly necessary, but it makes the first-order only
        // case slightly more efficient.
        if (variable_grad.is_sparse() && !new_grad.is_sparse()) {
          // If `variable_grad` is sparse and `new_grad` is not sparse, their
          // sum is not sparse, and we must change the TensorImpl type of
          // `variable_grad` for it to store the result. However, changing the
          // TensorImpl type of a tensor requires changing the tensor itself, and
          // thus in this case we have to change the grad tensor.
          auto result = new_grad + variable_grad;
          CHECK_RESULT(result, variable);
          update_grad(std::move(result));
        } else if (!at::inplaceIsVmapCompatible(variable_grad, new_grad)) {
          // Ideally we'd perform an in-place operation to avoid changing
          // the grad tensor. However, if that's impossible because the grads
          // are vmap-incompatible (See NOTE: [vmap-incompatible in-place operations]),
          // then we just add them out-of-place.
          auto result = variable_grad + new_grad;
          CHECK_RESULT(result, variable);
          update_grad(std::move(result));
        } else {
          // In this case we can avoid changing the grad tensor. There are three
          // scenarios when we'll hit this case:
          //
          // 1. `variable_grad` is sparse, and `new_grad` is sparse.
          // 2. `variable_grad` is dense, and `new_grad` is sparse.
          // 3. `variable_grad` is dense, and `new_grad` is dense.
          //
          // In all of these three cases, `variable_grad += new_grad` is a
          // valid operation which adds `new_grad` to `variable_grad` in
          // place. `variable_grad` is thus still referring to the same tensor
          // after the operation.
          // Also DistributedDataParallel(DDP) package relies on grad being
          // mutated in place for saving peak memory usage. DDP will still
          // work correctly if it is mutated out of place here, but DDP will
          // maintain one extra copy of grad tensors in buffer and thus
          // increase peak memory usage.
          variable_grad += new_grad;
          CHECK_RESULT(variable_grad, variable);
          // ^ We could enforce the contract more aggressively here by writing:
          // if (variable_grad.is_sparse() || new_grad.is_sparse()) {
          //   variable_grad += new_grad;
          // } else if (obeys_layout_contract(variable_grad, variable)) {
          //   variable_grad += new_grad;
          // } else {
          //   result = at::empty_strided(variable.sizes(), variable.strides(),
          //                              variable.options().memory_format(c10::nullopt));
          //   update_grad(at::native::add_out(result, variable_grad, new_grad, 1.0);
          // }
          // However, that accumulation is sometimes in place and sometimes not,
          // which may break user code.
        }
      } else {
        at::Tensor result;
        if (variable_grad.is_sparse() && !new_grad.is_sparse()) {
          // CPU backend throws an error on sparse + dense, so prefer dense + sparse here.
          result = new_grad + variable_grad;
        } else {
          // Assumes operator+ result typically matches strides of first arg,
          // and hopes variable_grad was originally created obeying layout contract.
          result = variable_grad + new_grad;
        }
        CHECK_RESULT(result, variable);
        update_grad(std::move(result));
        // ^ We could enforce the contract more aggressively here by saying
        // if (obeys_layout_contract(new_grad, variable)) {
        //   update_grad(new_grad + variable_grad);
        // } else {
        //   update_grad(variable_grad + new_grad);
        // }
        // such that the stashed grad is likely to have the right strides if
        // either variable_grad or new_grad already has the right strides.
        // We could enforce the contract with certainty by saying
        // auto result = variable_grad + new_grad (or vice versa), checking result's
        // layout, and copying to an obedient clone if necessary before update_grad.
        // The copy would require another gmem pass.  We can't create empty result with
        // the right layout then add_out into it with a single kernel, because GradMode
        // is enabled in this branch, and add_out isn't differentiable.
        // Maybe more trouble than it's worth.
      }
    }
    
    Variable variable;
};
```
    
  • apply()

[torch/csrc/autograd/functions/accumulate_grad.cpp]
```cpp
    auto AccumulateGrad::apply(variable_list&& grads) -> variable_list {
    check_input_variables("AccumulateGrad", grads, 1, 0);
    
    if (!grads[0].defined())
      return {};
    if (variable.grad_fn())
      throw std::logic_error(
          "leaf variable has been moved into the graph interior");
    if (!variable.requires_grad())
      return {};
    
    // std::move(grads[0]) to avoid bumping up refcount
    at::Tensor new_grad = callHooks(variable, std::move(grads[0]));
    
    // Acquire lock to here protect thread safety on variable, this ensures
    // AccumulateGrad does not race to shared variable from different threads
    // when updating the gradients. We don't ensure thread safety on hooks
    // and rely on user to provide thread safe hooks
    // see Note [Thread Safety on Autograd Node]
    std::lock_guard<std::mutex> lock(mutex_);
    
    at::Tensor& grad = variable.mutable_grad();
    
    // If the function has post hooks (for example, a DDP allreduce hook),
    // call_function in Engine.cpp will temporarily bump the expected refcount
    // by one, hence the addition of !post_hooks().empty() for 'num_expected_refs'
    // in addition to the one reference that we're holding.
    // 'num_expected_refs' is used to determine whether or not we should clone
    // the grad or can steal the grad.
    accumulateGrad(
        variable,
        grad,
        new_grad,
        1 + !post_hooks().empty() /* num_expected_refs */,
        [&grad](at::Tensor&& grad_update) { grad = std::move(grad_update); });
    
    return variable_list();
}
```
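
The accumulating behavior is visible from user code: a second backward pass adds into .grad instead of overwriting it. A minimal sketch, assuming libtorch:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::ones({2}, torch::requires_grad());
  (x * 3).sum().backward();  // x.grad() becomes 3 3
  (x * 3).sum().backward();  // AccumulateGrad sums the new gradient into x.grad()
  std::cout << x.grad() << std::endl;  // 6 6
}
```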
    

3.1.2 GraphRoot

The starting point (root) of the backward graph.

[torch/csrc/autograd/functions/basic_ops.h]
```cpp
    struct TORCH_API GraphRoot : public Node {
    GraphRoot(edge_list functions, variable_list inputs)
        : Node(std::move(functions)),
          outputs(std::move(inputs)) {
      // Ensures calls to stream() on a GraphRoot instance reflect current stream(s)
      // on devices of root grad tensors at the time the instance is constructed.
      for (const auto& t : outputs) {
        add_input_metadata(t);
      }
    }
    
    variable_list apply(variable_list&& inputs) override {
      return outputs;
    }
    
    variable_list outputs;
};
```
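
The outputs member holds exactly the root gradients passed to backward(); apply() ignores its inputs and returns them to seed the engine. A sketch, assuming libtorch:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::ones({2}, torch::requires_grad());
  auto y = x * 2;
  // The explicit root gradient below becomes the GraphRoot's `outputs`.
  y.backward(torch::tensor({1.0f, 10.0f}));
  std::cout << x.grad() << std::endl;  // 2 20
}
```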
    

3.1.3 TraceableFunction

[torch/csrc/autograd/function.h]
```cpp
    /// See Node::is_traceable() for definition.
    struct TraceableFunction : public Node {
    using Node::Node;
    bool is_traceable() final {
      return true;
    }
};
```
    

3.2 Indirect Subclasses of Node

• MulBackward0

[torch/csrc/autograd/generated/Functions.h]
```cpp
struct TORCH_API MulBackward0 : public TraceableFunction {
  using TraceableFunction::TraceableFunction;
  variable_list apply(variable_list&& grads) override;
  std::string name() const override { return "MulBackward0"; }
  void release_variables() override {
    std::lock_guard<std::mutex> lock(mutex_);
    self_.reset_data();
    self_.reset_grad_function();
    other_.reset_data();
    other_.reset_grad_function();
  }

  SavedVariable self_;
  ScalarType other_scalar_type;
  ScalarType self_scalar_type;
  SavedVariable other_;
};
```

[torch/csrc/autograd/generated/Functions.cpp]
```cpp
variable_list MulBackward0::apply(variable_list&& grads) {
  std::lock_guard<std::mutex> lock(mutex_);

  IndexRangeGenerator gen;
  auto self_ix = gen.range(1);
  auto other_ix = gen.range(1);
  variable_list grad_inputs(gen.size());
  auto& grad = grads[0];
  auto self = self_.unpack();
  auto other = other_.unpack();
  bool any_grad_defined = any_variable_defined(grads);
  if (should_compute_output({ other_ix })) {
    auto grad_result = any_grad_defined ? (mul_tensor_backward(grad, self, other_scalar_type)) : Tensor();
    copy_range(grad_inputs, other_ix, grad_result);
  }
  if (should_compute_output({ self_ix })) {
    auto grad_result = any_grad_defined ? (mul_tensor_backward(grad, other, self_scalar_type)) : Tensor();
    copy_range(grad_inputs, self_ix, grad_result);
  }
  return grad_inputs;
}
```
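
For reference, this generated apply() is what runs when a product is differentiated: the gradient w.r.t. self is grad * other, and vice versa. A sketch, assuming libtorch:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  auto a = torch::tensor({2.0f}, torch::requires_grad());
  auto b = torch::tensor({5.0f}, torch::requires_grad());
  auto c = a * b;
  std::cout << c.grad_fn()->name() << std::endl;  // MulBackward0
  c.backward();
  // MulBackward0 routed grad * other to a and grad * self to b:
  std::cout << a.grad().item<float>() << " " << b.grad().item<float>() << std::endl;  // 5 2
}
```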