The difference between model.train() and model.eval() mainly shows up in two kinds of layers: Batch Normalization and Dropout.

Usage summary

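When a model uses Dropout to combat overfitting, these two methods make it behave differently:

With model.train(), dropout is active and randomly zeroes neurons with the configured probability.

With model.eval(), dropout is disabled and the layer passes its input through unchanged.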
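
The behaviour described above can be checked directly on a standalone nn.Dropout layer. This is a minimal sketch: the input tensor, its size, and the seed are arbitrary choices made for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random mask reproducible

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                      # training mode: elements are randomly zeroed
y_train = drop(x)
num_zeroed = int((y_train == 0).sum())
# Elements that survive are scaled by 1 / (1 - p), which is 2.0 for p = 0.5
survivors = y_train[y_train != 0]

drop.eval()                       # evaluation mode: dropout acts as the identity
y_eval = drop(x)

print(f"zeroed in train mode: {num_zeroed} of {x.numel()}")
print(f"survivor value in train mode: {survivors[0].item()}")
print(f"eval output equals input: {torch.equal(y_eval, x)}")
```

The exact number of zeroed elements depends on the seed and the PyTorch version, so it is printed rather than assumed.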

Relevant source code (excerpt)

 def train(self: T, mode: bool = True) -> T:
     r"""Sets the module in training mode.
     This has any effect only on certain modules. See documentations of
     particular modules for details of their behaviors in training/evaluation
     mode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,
     etc.
     Args:
         mode (bool): whether to set training mode (``True``) or evaluation
                      mode (``False``). Default: ``True``.
     Returns:
         Module: self
     """
     if not isinstance(mode, bool):
         raise ValueError("training mode is expected to be boolean")
     self.training = mode
     for module in self.children():
         module.train(mode)
     return self

 def eval(self: T) -> T:
     r"""Sets the module in evaluation mode.
     This has any effect only on certain modules. See documentations of
     particular modules for details of their behaviors in training/evaluation
     mode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,
     etc.
     This is equivalent with :meth:`self.train(False) <torch.nn.Module.train>`.
     See :ref:`locally-disable-grad-doc` for a comparison between
     `.eval()` and several similar mechanisms that may be confused with it.
     Returns:
         Module: self
     """
     return self.train(False)
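
Because train() recurses through self.children(), calling it on the top-level model flips the training flag of every submodule. The sketch below illustrates this with a hypothetical toy model; the layer sizes are arbitrary.

```python
import torch.nn as nn

# Hypothetical toy model, used only to show how the flag propagates.
model = nn.Sequential(
    nn.Linear(8, 4),
    nn.BatchNorm1d(4),
    nn.Dropout(p=0.2),
)

model.eval()   # same as model.train(False)
flags_eval = [m.training for m in model.modules()]

model.train()  # same as model.train(True)
flags_train = [m.training for m in model.modules()]

# model.modules() yields the container itself plus its three layers.
print("after eval(): ", flags_eval)
print("after train():", flags_train)
```

Both methods return the module itself, so calls such as model.eval() can be chained or used inline.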

The self.training flag set here is used inside Dropout to control whether dropout is applied:

class Dropout(_DropoutNd):
 r"""During training, randomly zeroes some of the elements of the input
 tensor with probability :attr:`p` using samples from a Bernoulli
 distribution. Each channel will be zeroed out independently on every forward
 call.
 This has proven to be an effective technique for regularization and
 preventing the co-adaptation of neurons as described in the paper
 `Improving neural networks by preventing co-adaptation of feature
 detectors`_ .
 Furthermore, the outputs are scaled by a factor of :math:`\frac{1}{1-p}` during
 training. This means that during evaluation the module simply computes an
 identity function.
 Args:
     p: probability of an element to be zeroed. Default: 0.5
     inplace: If set to ``True``, will do this operation in-place. Default: ``False``
 Shape:
     - Input: :math:`(*)`. Input can be of any shape
     - Output: :math:`(*)`. Output is of the same shape as input
 Examples::
     >>> m = nn.Dropout(p=0.2)
     >>> input = torch.randn(20, 16)
     >>> output = m(input)
 .. _Improving neural networks by preventing co-adaptation of feature
     detectors: https://arxiv.org/abs/1207.0580
 """
 def forward(self, input: Tensor) -> Tensor:
     return F.dropout(input, self.p, self.training, self.inplace)

Introduction to Dropout

Dropout function parameters

torch.nn.functional.dropout

torch.nn.functional.dropout(input, p=0.5, training=True, inplace=False)

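input: the input tensor
p: the probability of zeroing each element (note that this is not a plain random drop: the surviving elements are also scaled by 1/(1-p))
training: whether to apply dropout; True applies it, False leaves the input unchanged
inplace: whether to modify the input tensor in place instead of allocating a new tensor for the result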
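
A small sketch of how the training argument behaves. Unlike the nn.Dropout module, the functional form does not know about model.train() or model.eval() unless self.training is passed in explicitly; TinyNet below is a hypothetical module written only to show that pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.ones(4, 4)

# training=False: the input passes through unchanged.
out_eval = F.dropout(x, p=0.5, training=False)

# training=True: elements are zeroed with probability p and the
# remaining ones are scaled by 1 / (1 - p) = 2.0.
out_train = F.dropout(x, p=0.5, training=True)


class TinyNet(nn.Module):
    """Hypothetical module that forwards its own training flag."""

    def forward(self, inp):
        return F.dropout(inp, p=0.5, training=self.training)


net = TinyNet().eval()   # eval() sets self.training to False
net_out = net(x)

print(torch.equal(out_eval, x), torch.equal(net_out, x))
print(out_train)
```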

torch.nn.Dropout

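torch.nn.Dropout(p=0.5, inplace=False)

When the module is called, only input needs to be passed; the remaining arguments come from attributes stored on the module itself (self.p, self.training, self.inplace).

This module is usually combined with model.train() and model.eval(): by setting self.training, these two methods enable dropout during training and disable it during testing.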

torch.nn.Dropout2d

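torch.nn.Dropout2d(p=0.5, inplace=False)

Dropout for image-like feature maps such as color images, where entire channels are zeroed rather than individual elements. Input: (N, C, H, W), i.e. (batch size N, channels C, height H, width W)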
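
To illustrate the channel-wise behaviour, the following sketch applies nn.Dropout2d to a tensor of ones and checks that every H x W map is either fully zeroed or fully kept. The tensor shape and seed are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only for reproducibility

drop2d = nn.Dropout2d(p=0.5)
drop2d.train()

x = torch.ones(2, 8, 4, 4)   # (N, C, H, W)
y = drop2d(x)

# Flatten each H x W map and compare its smallest and largest value:
# if they are equal, the whole channel was dropped or kept as a unit.
flat = y.flatten(2)                           # shape (N, C, H*W)
per_channel_min = flat.min(dim=2).values
per_channel_max = flat.max(dim=2).values
channel_is_uniform = torch.equal(per_channel_min, per_channel_max)

print("every channel dropped or kept as a whole:", channel_is_uniform)
print("per-channel values:\n", per_channel_max)
```

Kept channels hold the value 1 / (1 - p) = 2.0, the same rescaling used by nn.Dropout.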

torch.nn.Dropout3d

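torch.nn.Dropout3d(p=0.5, inplace=False)

Dropout for 3D data such as colored point clouds, again zeroing entire channels. Input: (N, C, D, H, W), i.e. (batch size N, channels C, depth D, height H, width W)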

How dropout works

Summary

Voting effect

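Suppose we train five different fully connected networks on the same data; they may well produce different results. A voting mechanism can then pick the majority answer, which tends to improve accuracy and robustness. The same idea applies within a single network: each training step with dropout effectively trains a different sub-network, and although the individual sub-networks may overfit to different degrees, they share one loss function, so optimizing them together amounts to averaging them, which helps prevent overfitting fairly effectively.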

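Reducing neuron co-adaptation

Dropout reduces complex co-adaptation between neurons. Randomly removing hidden units makes the fully connected network sparser to some degree, which weakens the cooperative effects between features. In other words, some features may only work through the joint action of a fixed set of hidden units; dropout instead forces each unit to work well together with randomly chosen other units. This weakens co-adaptation between units and improves generalization.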

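Analysis

First, dropout can be understood as a form of model averaging: after each dropout step, the resulting model can be viewed as a sub-network of the full network. (Note that training with dropout usually takes noticeably longer.)

Dropout samples a thinner network from the original one, as shown on the right of the figure below:
[Figure 1: the thinned network produced by dropout is shown on the right]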

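The original network computes:
[Figure 2: forward computation of the original network]
With dropout, the computation becomes the following, where the Bernoulli function randomly generates 0 or 1 according to the probability p: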
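
The formula images are not reproduced in this text. As a rough sketch of what such formulas usually look like, assuming common layer notation (none of these symbols are taken from the original figures): y^(l) is the output of layer l, w and b are the weights and bias, f is the activation function, r is the random mask, and q is the probability that a mask entry is 1, so q = 1 - p when p is the drop probability as in PyTorch.

```latex
% Original network
z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \mathbf{y}^{(l)} + b_i^{(l+1)}, \qquad
y_i^{(l+1)} = f\left(z_i^{(l+1)}\right)

% With dropout: mask the previous layer's output before the linear step
r_j^{(l)} \sim \mathrm{Bernoulli}(q), \qquad
\tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} \odot \mathbf{y}^{(l)}, \qquad
z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}, \qquad
y_i^{(l+1)} = f\left(z_i^{(l+1)}\right)
```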

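One more point to note: after masking neurons, the remaining activations must be rescaled. With p as the probability of dropping a neuron (the PyTorch convention), the survivors are multiplied by 1/(1-p) during training. If no rescaling is done during training, the weights must be rescaled at test time instead, i.e.:
[Figure 4: test-time weight rescaling]
This is because during training each neuron is removed with probability p, whereas at test time every neuron is present, so the weights w must be multiplied by (1-p), giving (1-p)w. (In formulations where p denotes the probability of keeping a neuron, these factors become 1/p and pw.)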
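
The two rescaling strategies can be compared numerically. The sketch below is plain Python rather than PyTorch, and the helper names inverted_dropout and plain_dropout are hypothetical; p_drop denotes the probability of dropping a unit.

```python
import random


def inverted_dropout(values, p_drop, rng):
    """Training-time dropout that rescales survivors by 1 / (1 - p_drop)."""
    keep = 1.0 - p_drop
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def plain_dropout(values, p_drop, rng):
    """Training-time dropout without any rescaling."""
    keep = 1.0 - p_drop
    return [v if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
p_drop = 0.2
x = [1.0] * 100_000

mean_inverted = sum(inverted_dropout(x, p_drop, rng)) / len(x)
mean_plain = sum(plain_dropout(x, p_drop, rng)) / len(x)

# With rescaling during training, the average activation stays close to
# the unscaled test-time activation of 1.0.
# Without it, the average is close to (1 - p_drop), so the test-time
# weights have to be multiplied by (1 - p_drop) to match.
test_time_scale = 1.0 - p_drop

print(f"mean with rescaling during training: {mean_inverted:.3f}")
print(f"mean without rescaling:              {mean_plain:.3f}")
print(f"test-time scale needed without it:   {test_time_scale}")
```

PyTorch's Dropout follows the first strategy, which is why evaluation mode can simply compute the identity function.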