Preface

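AlexNet, proposed by Alex Krizhevsky et al. in 2012, is a groundbreaking convolutional neural network. Drawing on two of Krizhevsky's papers, "ImageNet Classification with Deep Convolutional Neural Networks" and "One weird trick for parallelizing convolutional neural networks", this article walks through the steps for reproducing the model and its PyTorch implementation. I refer to the network in the former paper as AlexNetV1 and the one in the latter as AlexNetV2.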

Full Code

For the complete code, training logs, and model files, see:
https://github.com/ethanyanjiali/deep-vision/tree/master/CNNs/imagenet-2012/pytorch#alexnet

Network Architecture

AlexNetV1

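Because GPU memory was limited at the time, V1 used a two-tower design. The network's kernels are split into two halves, each placed on its own GPU. Each tower computes features with its half of the kernels, the GPUs exchange activations only at certain layers, and the results are merged at the end. Since a single modern GPU can easily hold the entire model, this article reproduces it as a single tower and therefore uses twice the per-tower channel counts shown in Figure 2 of the paper.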
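To make the two-tower versus single-tower difference concrete, here is a minimal sketch for the second convolutional layer. Using `groups=2` to imitate the two towers on one device is my own illustration and not the paper's multi-GPU code; `count_params` is a hypothetical helper.

```python
import torch
import torch.nn as nn


def count_params(module):
    """Total number of learnable parameters in a module (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters())


# Two-tower style: groups=2 splits the 96 input channels into two groups of 48,
# and each group feeds its own 128 filters (assumption: this imitates the towers).
two_tower_conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)
# Single-tower style, as used in this article: every filter sees all 96 channels.
single_tower_conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2)

x = torch.randn(1, 96, 27, 27)
two_tower_shape = tuple(two_tower_conv2(x).shape)
single_tower_shape = tuple(single_tower_conv2(x).shape)
two_tower_params = count_params(two_tower_conv2)
single_tower_params = count_params(single_tower_conv2)

print(two_tower_shape, two_tower_params)
print(single_tower_shape, single_tower_params)
```

Both layers produce the same output shape, but the grouped layer has roughly half the weights, because each filter is connected to only 48 of the 96 input channels.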

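AlexNetV1 has a textbook deep convolutional architecture: a convolution layer, an activation layer, a normalization layer, and a pooling layer together form one feature-extraction unit, and several such units in series form a feature extractor, also called an encoder. The feature extractor's goal is to turn raw RGB data into a feature vector. After it, fully connected layers (dense layers) paired with Dropout layers form classification units, and several of those make up a classifier, which maps the feature vector to the training target. This layout became the basic structure of many later networks. Individual networks modify particular layers, for example by switching to Batch Normalization or average pooling, but the overall pattern stays the same. The complete network code is listed below, followed by an explanation of each part: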

# coding: utf-8
import torch
import torch.nn as nn
# [1] ImageNet Classification with Deep Convolutional Neural Networks
# https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
# [2] http://cs231n.github.io/convolutional-networks
# [3] https://prateekvjoshi.com/2016/04/05/what-is-local-response-normalization-in-convolutional-neural-networks/
class AlexNetV1(nn.Module):
    def __init__(self):
        super(AlexNetV1, self).__init__()
        # formula
        # [conv layer]
        # output_size = floor((input_size - kernel_size + 2 * padding) / stride) + 1
        # padding = ((output_size - 1) * stride + kernel_size - input_size) / 2
        # [pooling layer]
        # output_size = floor((input_size - kernel_size) / stride) + 1
        # where input_size and output_size are the square image side length
        self.features = nn.Sequential(
            # "The first convolutional layer filters the 224×224×3 input image with
            # 96 kernels of size 11×11×3 with a stride of 4 pixels."[1]
            # Also from [1]Fig.2, next layer is 55x55x48, output channels is 48. (I use 96=2x48 here)
            # hence padding = ((55 - 1) * 4 + 11 - 224) / 2 = 1.5, rounded up to 2
            # to verify, output = floor((224 - 11 + 2 * 2) / 4) + 1 = 55
            nn.Conv2d(3, 96, 11, stride=4, padding=2),
            # The ReLU non-linearity is applied to the output of every convolutional
            # and fully-connected layer.
            nn.ReLU(inplace=True),
            # "The second convolutional layer takes as input the (response-normalized
            # and pooled) output of the first convolutional layer"
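            # LocalResponseNorm's first argument is the number of neighboring channels
            # to normalize over, not the channel count; [1] uses n=5, alpha=1e-4, beta=0.75, k=2
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),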
            # From Fig.2 in [1], there's a maxpooling layer after first conv layer
            # Also from Fig.2 in [1], the pooling reduces dimension from 55x55 to 27x27
            # hence it's likely that they use overlapping pooling with kernel=3, stride=2
            # to verify, output_size = (55 - 3) / 2 + 1 = 27
            nn.MaxPool2d(3, 2),
            # "The second convolutional layer takes ... with 256 kernels of size 5 × 5 × 48."[1]
            # From Fig.2 in [1], output channels is 128. (I use 256=2x128 here)
            # To keep dimension same as 27, we can infer that stride = 1, padding = 2
            # output_size = (27 - 5 + 2 * 2) / 1 + 1 = 27
            nn.Conv2d(96, 256, 5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            # "The third convolutional layer has 384 kernels of size 3 × 3 ×
            # 256 connected to the (normalized, pooled) outputs of the second convolutional layer"[1]
            # size is the normalization neighborhood (n=5 in [1]), not the channel count 256
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            # From Fig.2 in [1], there's a maxpooling layer after second conv layer
            # Also from Fig.2 in [1], the pooling reduces dimension from 27x27 to 13x13
            # similar to last one, output_size = (27 - 3) / 2 + 1 = 13
            nn.MaxPool2d(3, 2),
            # "The third, fourth, and fifth convolutional layers are connected to one another
            # without any intervening pooling or normalization layers"[1]
            # Also from Fig.2 in [1], next layer is 13x13x192, and it uses a kernel size of 3.
            # (I use 384=2x192 here)
            # to keep dimension same as 13, we can infer that stride = 1, padding = 1
            # output_size = (13 - 3 + 2 * 1) / 1 + 1 = 13
            nn.Conv2d(256, 384, 3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            # same as last conv layer
            # output_size = (13 - 3 + 2 * 1) / 1 + 1 = 13
            # (I use 384=2x192 here)
            nn.Conv2d(384, 384, 3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            # From Fig.2 in [1], the output channels drop to 128
            # (I use 256=2x128 here)
            # output_size = (13 - 3 + 2 * 1) / 1 + 1 = 13
            nn.Conv2d(384, 256, 3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            # there's another pooling layer after 5th conv layer from Fig.2 in [1]
            # output_size = (13 - 3) / 2 + 1 = 6
            nn.MaxPool2d(3, 2),
        )
        self.classifier = nn.Sequential(
            # "We use dropout in the first two fully-connected layers of Figure 2.
            # Without dropout, our network exhibits substantial overfitting.
            # Dropout roughly doubles the number of iterations required to converge."[1]
            # "...consists of setting to zero the output of each hidden neuron with probability 0.5"[1]
            nn.Dropout(p=0.5),
            # From Fig.2 in [1], the first FC layer has 4096 (2x2048) activations
            nn.Linear(6 * 6 * 256, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            # From Fig.2 in [1], the second FC layer also has 4096 activations
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            # "The output of the last fully-connected layer is fed to a 1000-way softmax which produces
            # a distribution over the 1000 class labels."[1]
            nn.Linear(4096, 1000),
            # There's no softmax here because we use CrossEntropyLoss which already includes Softmax
            # https://discuss.pytorch.org/t/vgg-output-layer-no-softmax/9273/5
        )
    def forward(self, x):
        x = self.features(x)
        # flatten the output from conv layers, but keep batch size
        x = x.view(x.size(0), 6 * 6 * 256)
        x = self.classifier(x)
        return x

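In the __init__ function we define the modules the network needs, and in the forward function we define the order in which they are called. PyTorch builds the computation graph dynamically by following Python's execution order, which is considerably simpler than defining a static graph up front, as early versions of TensorFlow required.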

self.features = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4, padding=2),
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
    nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, 5, stride=1, padding=2),
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
    nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, 3, stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, 3, stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, 3, stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(3, 2),
)

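The two main parts of the network are clearly visible here: features is the feature extractor and classifier is the classifier. All channel counts are twice the per-tower values shown in the paper's Figure 2, because the two towers are merged into one. In features, following Figure 2 of the paper, every convolutional layer is followed by a ReLU nonlinearity. As the authors describe, the first two convolutional layers are also followed by LRN (local response normalization) to improve accuracy. The network starts with an 11×11 convolution, followed by a pooling layer that reduces the spatial size, then a 5×5 convolution and another pooling layer. Three 3×3 convolutions and a final pooling layer complete the feature extractor. The basic idea is that the lower layers use large kernels to cover a wide area of the image, while the higher layers use small kernels to preserve detail. Later networks showed, however, that small kernels are actually more efficient. Compared with the networks that followed, AlexNet hardly counts as deep, but it inspired the VGG and GoogLeNet teams to keep making networks deeper.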
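The spatial sizes quoted in the code comments can be checked with a few lines of plain Python. This is a small sketch of the output-size formula applied layer by layer; `conv_out` and `trace_feature_sizes` are hypothetical helper names, and the layer list simply mirrors the kernel, stride, and padding values used in features.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a conv or pooling layer, using floor division."""
    return (size - kernel + 2 * padding) // stride + 1


def trace_feature_sizes(size=224):
    """Return (layer name, output side length) for each size-changing layer."""
    layers = [
        # (name, kernel, stride, padding)
        ("conv1", 11, 4, 2),
        ("pool1", 3, 2, 0),
        ("conv2", 5, 1, 2),
        ("pool2", 3, 2, 0),
        ("conv3", 3, 1, 1),
        ("conv4", 3, 1, 1),
        ("conv5", 3, 1, 1),
        ("pool3", 3, 2, 0),
    ]
    sizes = []
    for name, kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes


sizes = trace_feature_sizes()
for name, side in sizes:
    print(name, side)
```

The final side length of 6, together with the 256 output channels of the last convolution, is where the 6 * 6 * 256 input size of the first linear layer comes from.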
Once feature extraction is done, the extracted features are fed into the fully connected layers for classification.

self.classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(6 * 6 * 256, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)

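The authors place a Dropout layer before each of the first two linear layers. By randomly zeroing some activations during training, it acts as a regularizer and reduces the risk of overfitting. The first linear layer takes the flattened output of the feature extractor as input (6 × 6 × 256 = 9216 values), and the layers after it take 4096. The last linear layer outputs scores for 1000 classes. No Softmax is added here, because the loss function used later, CrossEntropyLoss, already applies it.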
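Two points from this paragraph can be verified directly. The sketch below, which uses random tensors as stand-ins for real logits and activations, checks that CrossEntropyLoss gives the same value as log-softmax followed by negative log-likelihood, and that Dropout only zeroes activations in training mode.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Random stand-ins for the raw outputs of the last linear layer.
logits = torch.randn(4, 1000)
targets = torch.tensor([1, 10, 100, 999])

# CrossEntropyLoss on raw logits ...
ce = nn.CrossEntropyLoss()(logits, targets)
# ... matches log-softmax followed by negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
same_loss = torch.allclose(ce, manual)

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 4096)

dropout.train()   # training mode: about half the activations are zeroed
zeroed_in_train = int((dropout(x) == 0).sum())

dropout.eval()    # evaluation mode: Dropout is the identity
zeroed_in_eval = int((dropout(x) == 0).sum())

print(same_loss, zeroed_in_train, zeroed_in_eval)
```

This is also why net.eval() should be called before validation: otherwise Dropout keeps dropping activations and the measured accuracy is too low.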
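The forward pass is very simple: it chains the feature extractor and the classifier defined above, flattening the feature extractor's output in between so that the linear layers can consume it.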

def forward(self, x):
    x = self.features(x)
    x = x.view(x.size(0), 6 * 6 * 256)
    x = self.classifier(x)
    return x
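
As a small illustration of the flattening step, the sketch below uses a random tensor as a stand-in for the output of features and shows that the explicit view call is equivalent to torch.flatten, which does not require hard-coding 6 * 6 * 256. Using torch.flatten is an alternative I am suggesting, not what the code above does.

```python
import torch

# Stand-in for the output of self.features on a batch of 8 images.
features_out = torch.randn(8, 256, 6, 6)

# The explicit form used in forward(): keep the batch dimension, flatten the rest.
flat_view = features_out.view(features_out.size(0), 6 * 6 * 256)

# Equivalent form that infers the flattened size automatically.
flat_auto = torch.flatten(features_out, start_dim=1)

print(flat_view.shape, torch.equal(flat_view, flat_auto))
```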

AlexNetV2

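Shortly after V1 came out, Krizhevsky proposed a revised network. It drops the two-tower design in favor of a single tower and adjusts some parameters to improve performance on a single GPU. I call it AlexNetV2 here.
The complete network code comes first: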

# coding: utf-8
import torch
import torch.nn as nn
import torch.nn.functional as Func
# can use the below import should you choose to initialize the weights of your Net
import torch.nn.init as Init
# [1] One weird trick for parallelizing convolutional neural networks https://arxiv.org/pdf/1404.5997.pdf
class AlexNetV2(nn.Module):
    '''
    This implements the network from the second version of AlexNet
    '''
    def __init__(self):
        super(AlexNetV2, self).__init__()
        # "In detail, the single-column model has 64, 192, 384, 384, 256 filters
        # in the five convolutional layers, respectively"[1]
        # "It has the same number of layers as the two-tower model, and the
        # (x, y) map dimensions in each layer are equivalent to
        # the (x, y) map dimensions in the two-tower model.
        # The minor difference in parameters and connections
        # arises from a necessary adjustment in the number of
        # kernels in the convolutional layers, due to the unrestricted
        # layer-to-layer connectivity in the single-tower model."[1]
        # According to the above, I just need to change the # of output channels
        # Please refer to ./alexnet1 for detailed calculation
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            # The VGG paper later showed that LRN is not necessary,
            # hence most AlexNet implementations don't include it.
            # However, for study purposes, I still added this layer
            # (size is the number of neighboring channels, not the channel count)
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(64, 192, 5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(192, 384, 3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, 3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, 2),
        )
        self.classifier = nn.Sequential(
            # This part is the same as in ./alexnet1, as mentioned above
            nn.Dropout(p=0.5),
            nn.Linear(6 * 6 * 256, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, 1000),
            # "Another difference is that instead of a softmax
            # final layer with multinomial logistic regression
            # cost, this model’s final layer has 1000 independent logistic
            # units, trained to minimize cross-entropy"[1]
        )
    def forward(self, x):
        x = self.features(x)
        # flatten the output from conv layers, but keep batch size
        x = x.view(x.size(0), 6 * 6 * 256)
        x = self.classifier(x)
        return x

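As you can see, the basic structure is very similar to V1: the same five convolutional layers with the corresponding LRN and max-pooling layers. The same restrictions apply to weight initialization.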
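To see how much the changed filter counts matter, the convolutional parameter totals of the two versions can be compared with plain arithmetic. This is a rough sketch: `conv_params` and `feature_params` are hypothetical helpers, and only the convolution weights and biases are counted (LRN, ReLU, and pooling have no learnable parameters).

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of one square-kernel convolution layer."""
    return in_channels * out_channels * kernel * kernel + out_channels


def feature_params(channels, kernels=(11, 5, 3, 3, 3)):
    """Total conv parameters for a stack fed by a 3-channel RGB image."""
    total = 0
    in_channels = 3
    for out_channels, kernel in zip(channels, kernels):
        total += conv_params(in_channels, out_channels, kernel)
        in_channels = out_channels
    return total


# Output channels of the five conv layers in each version of this article's code.
v1_conv_params = feature_params((96, 256, 384, 384, 256))
v2_conv_params = feature_params((64, 192, 384, 384, 256))

print(v1_conv_params, v2_conv_params, v1_conv_params - v2_conv_params)
```

The classifier is identical in both versions, so the whole difference in model size comes from the first three convolutional layers.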

Training Code

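The key parts of the training code are shown here. For the complete training code, see train.py in the GitHub repo linked at the top of this article.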
https://github.com/ethanyanjiali/deep-vision/tree/master/CNNs/imagenet-2012/pytorch#alexnet
https://github.com/ethanyanjiali/deep-vision/tree/master/CNNs/imagenet-2012/pytorch/train.py#L26

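import torch.nn as nn
import torch.optim as optim
from torchvision import transforms
# Rescale, RandomHorizontalFlip, RandomCrop and ToTensor are assumed to be the
# transforms defined alongside train.py in the repo (Rescale is not part of torchvision);
# AlexNetV1 is the class defined earlier in this article.
transform = transforms.Compose([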
    # "Therefore, we down-sampled the images to a fixed resolution of 256 × 256" alexnet1.[1]
    Rescale(256),
    RandomHorizontalFlip(0.5),
    RandomCrop(224),
    ToTensor(),
])
batch_size = 128
# instantiate the neural network
net = AlexNetV1()
# define the loss function using CrossEntropyLoss
criterion = nn.CrossEntropyLoss()
# define the params updating function using SGD
optimizer = optim.SGD(
    net.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=0.0005,
)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    factor=0.1,
    mode="max",
)

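In the preprocessing transform, I first rescale each image to a 256×256 square and then apply some data augmentation: a random horizontal flip, followed by a random 224×224 crop from the 256×256 image for training. The last step converts the NumPy data into the Tensor structure that PyTorch requires. Data augmentation looks simple, but in practice it has a crucial effect on training results.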
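The flip and crop steps can be sketched in a few lines of NumPy. This is only an illustration of the idea on a height × width × channel array, not the repo's actual transform classes; `random_flip_and_crop` and its parameters are hypothetical names.

```python
import numpy as np


def random_flip_and_crop(image, crop=224, p_flip=0.5, rng=None):
    """Randomly mirror an HxWxC image left-right, then take a random crop x crop patch."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < p_flip:
        image = image[:, ::-1, :]  # reverse the width axis
    height, width = image.shape[:2]
    top = int(rng.integers(0, height - crop + 1))
    left = int(rng.integers(0, width - crop + 1))
    return image[top:top + crop, left:left + crop, :]


# Stand-in for an image that has already been rescaled to 256x256.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
patch = random_flip_and_crop(image, rng=rng)
print(patch.shape)
```

Because the crop offset and the flip are drawn anew each time, the network rarely sees exactly the same pixels twice for a given training image.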
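Training uses mini-batches. I chose a batch size of 128; if your GPU does not have 16 GB of memory, increase or decrease this number to avoid running out of memory. The loss function is PyTorch's built-in CrossEntropyLoss. Note that it already includes Softmax, so do not add another one to the network, or the loss will not be computed correctly. The optimizer is classic stochastic gradient descent, with hyperparameters set as described in the paper. The learning rate scheduler is something I added later: I adjusted the rate by hand at first, then switched to ReduceLROnPlateau, which cuts the learning rate to 10% of its current value when top-1 accuracy has not improved for 10 epochs.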
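The scheduler's behavior can be observed without training anything. In the sketch below a single dummy parameter stands in for the network and a constant 0.5 stands in for a top-1 accuracy that never improves; both are assumptions made purely for illustration. The patience of 10 is the PyTorch default, written out explicitly here.

```python
import torch
import torch.optim as optim

# A single dummy parameter stands in for net.parameters().
param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.SGD([param], lr=0.01, momentum=0.9, weight_decay=0.0005)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="max",    # the monitored metric (accuracy) should increase
    factor=0.1,    # multiply the learning rate by 0.1 on a plateau
    patience=10,   # tolerate 10 epochs without improvement
)

lr_history = []
for epoch in range(12):
    optimizer.step()       # no gradients here, so the parameter is left unchanged
    scheduler.step(0.5)    # report a top-1 accuracy that never improves
    lr_history.append(optimizer.param_groups[0]["lr"])

print(lr_history)
```

The first epoch sets the baseline, the next ten count as non-improving epochs, and the learning rate is reduced on the following one.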

Conclusion

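Although AlexNet has no advantage in either accuracy or model size and is rarely used in production, it established the classic structure of many deep neural networks. Studying it gives a glimpse of how CNNs evolved and lays a good foundation for learning other networks. Many of AlexNet's parameter choices are empirical, with little theoretical backing, yet a single different parameter, such as a kernel size in the architecture or the learning rate during training, can lead to very different results. This shows how much hyperparameter tuning matters for deep neural networks.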
