VGG Blocks

A VGG block consists of several consecutive convolutional layers, each with padding 1 and a 3x3 window, followed by a max-pooling layer with a 2x2 window and stride 2. The input keeps its shape through every convolutional layer because, as noted earlier, preserving the output shape requires total padding of kernel_size - 1 rows. For a 3x3 kernel that means two extra rows in total, one added above and one below, i.e. padding = 1.
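As a quick sanity check (a minimal sketch using standard PyTorch), a 3x3 convolution with padding 1 leaves the spatial dimensions unchanged:

```python
import torch
from torch import nn

# A 3x3 kernel with padding=1 adds one row/column on each side,
# so the output height and width match the input.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
x = torch.rand(1, 1, 8, 8)
print(conv(x).shape)  # torch.Size([1, 1, 8, 8])
```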

For a given receptive field (the local region of the input that affects an output element), stacking small kernels is preferable to using one large kernel: the added depth lets the network learn more complex patterns, and it comes at a lower cost (fewer parameters). In VGG, three 3x3 kernels replace a 7x7 kernel and two 3x3 kernels replace a 5x5 kernel. This keeps the receptive field the same while deepening the network, which tends to improve its performance.
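The parameter saving is easy to verify. A rough sketch, counting weights only (biases ignored) and assuming C input and C output channels throughout:

```python
def conv_params(k, c_in, c_out):
    # Weight count of a single k x k convolutional layer, biases ignored.
    return k * k * c_in * c_out

C = 256
stacked = 3 * conv_params(3, C, C)  # three stacked 3x3 layers: receptive field 7x7
single = conv_params(7, C, C)       # one 7x7 layer: same receptive field
print(stacked, single)  # 1769472 3211264
```

The stacked version uses 27C^2 weights versus 49C^2 for the single layer (about 45% fewer), while also inserting two extra non-linearities.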

    import torch
    from torch import nn

    def vgg_block(num_convs, in_channels, out_channels):
        blk = []
        for i in range(num_convs):
            if i == 0:
                blk.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
            else:
                blk.append(nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1))
            blk.append(nn.ReLU())  # a ReLU after every convolution
        blk.append(nn.MaxPool2d(kernel_size=2))  # halves height and width
        return nn.Sequential(*blk)

    if __name__ == '__main__':
        print(vgg_block(5, 3, 5))

Output:

    Sequential(
      (0): Conv2d(3, 5, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): ReLU()
      (2): Conv2d(5, 5, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (3): ReLU()
      (4): Conv2d(5, 5, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (5): ReLU()
      (6): Conv2d(5, 5, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (7): ReLU()
      (8): Conv2d(5, 5, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (9): ReLU()
      (10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
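Feeding a random tensor through a block confirms the shape behavior: the convolutions preserve height and width, and the pooling layer halves them. The snippet below repeats the vgg_block definition from above so that it runs on its own:

```python
import torch
from torch import nn

# Same vgg_block as above, repeated so this snippet is self-contained.
def vgg_block(num_convs, in_channels, out_channels):
    blk = []
    for i in range(num_convs):
        blk.append(nn.Conv2d(in_channels if i == 0 else out_channels,
                             out_channels, kernel_size=3, padding=1))
        blk.append(nn.ReLU())
    blk.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*blk)

blk = vgg_block(2, 3, 16)
X = torch.rand(1, 3, 32, 32)
print(blk(X).shape)  # torch.Size([1, 16, 16, 16])
```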

VGG Network

Like AlexNet and LeNet, the VGG network consists of a convolutional module followed by a fully connected module. The convolutional module chains several vgg_blocks, with hyperparameters given by the variable conv_arch, which specifies the number of convolutional layers and the input and output channel counts of each block. The fully connected module is the same as in AlexNet.

    # FlattenLayer is a small helper (from an earlier section) that flattens
    # each example into a vector before the fully connected layers.
    class FlattenLayer(nn.Module):
        def forward(self, x):
            return x.view(x.shape[0], -1)

    def vgg(conv_arch, fc_features, fc_hidden_units):
        net = []
        for num_convs, in_channels, out_channels in conv_arch:
            net.append(vgg_block(num_convs, in_channels, out_channels))
        net.append(nn.Sequential(FlattenLayer(),
                                 nn.Linear(fc_features, fc_hidden_units),
                                 nn.ReLU(),
                                 nn.Dropout(0.5),
                                 nn.Linear(fc_hidden_units, fc_hidden_units),
                                 nn.ReLU(),
                                 nn.Dropout(0.5),
                                 nn.Linear(fc_hidden_units, 10)))
        return nn.Sequential(*net)

    if __name__ == '__main__':
        conv_arch = ((1, 1, 64), (1, 64, 128), (2, 128, 256), (2, 256, 512), (2, 512, 512))
        # After 5 vgg_blocks, height and width are halved 5 times: 224 / 32 = 7
        fc_features = 512 * 7 * 7  # c * w * h
        fc_hidden_units = 4096  # arbitrary
        net = vgg(conv_arch, fc_features, fc_hidden_units)
        print(net)
        X = torch.rand(1, 1, 224, 224)
        print(net(X))

Output:

    Sequential(
      (0): Sequential(
        (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): ReLU()
        (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (1): Sequential(
        (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): ReLU()
        (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (2): Sequential(
        (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): ReLU()
        (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (3): ReLU()
        (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (3): Sequential(
        (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): ReLU()
        (2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (3): ReLU()
        (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (4): Sequential(
        (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): ReLU()
        (2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (3): ReLU()
        (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (5): Sequential(
        (0): FlattenLayer()
        (1): Linear(in_features=25088, out_features=4096, bias=True)
        (2): ReLU()
        (3): Dropout(p=0.5, inplace=False)
        (4): Linear(in_features=4096, out_features=4096, bias=True)
        (5): ReLU()
        (6): Dropout(p=0.5, inplace=False)
        (7): Linear(in_features=4096, out_features=10, bias=True)
      )
    )
    tensor([[ 0.0160, -0.0046,  0.0058, -0.0016,  0.0075, -0.0157, -0.0055, -0.0055,
              0.0034,  0.0153]], grad_fn=<AddmmBackward>)

VGG's design of halving height and width while doubling the channel count means that most of the convolutional layers have the same parameter size and computational cost.
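Halving five times turns 224 into 7, which is where fc_features = 512 * 7 * 7 comes from; a one-line check:

```python
size = 224
for _ in range(5):  # one halving per vgg_block's max-pooling layer
    size //= 2
print(size, 512 * size * size)  # 7 25088
```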

Training the Model

Since Fashion-MNIST is much simpler than ImageNet, we can shrink the network accordingly:

    ratio = 8
    conv_arch = ((1, 1, 64//ratio), (1, 64//ratio, 128//ratio), (2, 128//ratio, 256//ratio),
                 (2, 256//ratio, 512//ratio), (2, 512//ratio, 512//ratio))
    fc_features = fc_features // ratio
    fc_hidden_units = fc_hidden_units // ratio
    net = vgg(conv_arch, fc_features, fc_hidden_units)
    batch_size = 64
    # load_data_fashion_mnist and train_ch5 are helper functions from earlier sections
    train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=224)
    lr, num_epochs = 0.001, 5
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    train_ch5(net, train_iter, test_iter, batch_size, optimizer, device="cuda", num_epochs=num_epochs)
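The call above hard-codes device="cuda" and will fail on a CPU-only machine. A common guard (standard PyTorch, shown here as an optional tweak) picks the device at runtime:

```python
import torch

# Fall back to the CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
```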

Training output:

    training on cuda
    epoch 1, loss 0.6016, train acc 0.773, test acc 0.871, time 67.3 sec
    epoch 2, loss 0.3298, train acc 0.880, test acc 0.894, time 65.5 sec
    epoch 3, loss 0.2827, train acc 0.896, test acc 0.905, time 66.3 sec
    epoch 4, loss 0.2485, train acc 0.909, test acc 0.913, time 66.0 sec
    epoch 5, loss 0.2289, train acc 0.916, test acc 0.914, time 64.8 sec