8.5.1 Steps for model training:

  • Step 1: Build the data loader function.
  • Step 2: Build the model training function.
  • Step 3: Build the model validation function.
  • Step 4: Call the training and validation functions and print logs.
  • Step 5: Plot the train/validation loss and accuracy comparison curves.
  • Step 6: Save the model.

8.5.2 Step 1: Build the data loader function

  import pandas as pd
  from sklearn.utils import shuffle
  from functools import reduce
  from collections import Counter
  from bert_chinese_encode import get_bert_encode
  import torch
  import torch.nn as nn

  def data_loader(data_path, batch_size, split=0.2):
      """
      description: load data from the persisted file, split it into a training
                   set and a validation set, and batch both of them
      :param data_path: path of the persisted training data
      :param batch_size: batch size for the training and validation sets
      :param split: fraction of the data used for validation
      :return: training data generator, validation data generator,
               number of training samples, number of validation samples
      """
      # Read the csv data with pandas
      data = pd.read_csv(data_path, header=None, sep="\t")
      # Print the positive/negative sample counts of the whole dataset
      print("Positive/negative sample counts of the dataset:")
      print(dict(Counter(data[0].values)))
      # Shuffle the dataset
      data = shuffle(data).reset_index(drop=True)
      # Split into training and validation sets
      split_point = int(len(data) * split)
      valid_data = data[:split_point]
      train_data = data[split_point:]
      # The validation set must contain at least one full batch
      if len(valid_data) < batch_size:
          raise ValueError("Batch size or split not match!")

      def _loader_generator(data):
          """
          description: generator yielding one batch of training/validation data at a time
          :param data: training data or validation data
          :return: a generator over batches of the given data
          """
          # Walk through the dataset in steps of one batch
          for batch in range(0, len(data), batch_size):
              # Pre-define the tensor lists for this batch
              batch_encoded = []
              batch_labels = []
              # Convert one batch_size slice of the data into list form,
              # [[label, text_1, text_2]], and iterate over it row by row
              for item in data[batch: batch + batch_size].values.tolist():
                  # Each row holds two sentences; encode them with the
                  # pretrained Chinese BERT model (the transfer-learning step)
                  encoded = get_bert_encode(item[1], item[2])
                  # Append the encoded row to the pre-defined list
                  batch_encoded.append(encoded)
                  # Likewise append the corresponding label
                  batch_labels.append([item[0]])
              # Use reduce to turn the lists into the tensors the model expects;
              # encoded has shape (batch_size, 2*max_len, embedding_size)
              encoded = reduce(lambda x, y: torch.cat((x, y), dim=0), batch_encoded)
              labels = torch.tensor(reduce(lambda x, y: x + y, batch_labels))
              # Yield the data and labels as a generator
              yield (encoded, labels)

      # Apply _loader_generator to the training and validation sets to obtain
      # their generators, and also return the sample counts of both sets
      return _loader_generator(train_data), _loader_generator(valid_data), len(train_data), len(valid_data)

  • Code location: /data/doctor_online/bert_serve/train.py
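
  • Note: get_bert_encode is defined in bert_chinese_encode.py in an earlier
    section; data_loader only relies on it returning a tensor of shape
    (1, 2*max_len, embedding_size) for a sentence pair. The stand-in below is a
    hypothetical sketch built on the transformers library (the names _tokenizer,
    _model and the max_len default are assumptions, not the actual helper):

  import torch
  from transformers import BertTokenizer, BertModel

  _tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
  _model = BertModel.from_pretrained("bert-base-chinese")

  def get_bert_encode(text_1, text_2, max_len=10):
      # Encode the sentence pair, padding/truncating to 2*max_len positions
      inputs = _tokenizer(text_1, text_2, padding="max_length",
                          truncation=True, max_length=2 * max_len,
                          return_tensors="pt")
      with torch.no_grad():
          outputs = _model(**inputs)
      # Shape: (1, 2*max_len, 768), one vector per token position
      return outputs.last_hidden_state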

  • Input parameters:

  # Path of the data file
  data_path = "./train_data.csv"
  # Define the batch size
  batch_size = 32

  • Call:

  train_data_labels, valid_data_labels, \
      train_data_len, valid_data_len = data_loader(data_path, batch_size)
  print(next(train_data_labels))
  print(next(valid_data_labels))
  print("train_data_len:", train_data_len)
  print("valid_data_len:", valid_data_len)

  • Output:

  (tensor([[[-0.7295,  0.8199,  0.8320,  ...,  0.0933,  1.2171,  0.4833],
            [ 0.8707,  1.0131, -0.2556,  ...,  0.2179, -1.0671,  0.1946],
            [ 0.0344, -0.5605, -0.5658,  ...,  1.0855, -0.9122,  0.0222]]]),
   tensor([0, 0, 1, 1, 1, 1, 0, 1, 0, 0, ..., 1, 0, 1, 0, 1, 1, 1, 1]))
  (tensor([[[-0.5263, -0.3897, -0.5725,  ...,  0.5523, -0.2289, -0.8796],
            [ 0.0468, -0.5291, -0.0247,  ...,  0.4221, -0.2501, -0.0796],
            [-0.2133, -0.5552, -0.0584,  ..., -0.8031,  0.1753, -0.3476]]]),
   tensor([0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, ..., 0, 0, 1, 0, 1, 1, 1]))
  train_data_len: 22186
  valid_data_len: 5546
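
  • Note: the two returned loader objects are Python generators, which are
    single-pass: once iterated (or exhausted via next), they yield nothing more.
    This is why step 4 below calls data_loader again at the start of every
    epoch. A minimal illustration:

  # Generators are spent after one full pass, so fresh ones are
  # created per epoch rather than reused.
  train_gen, valid_gen, _, _ = data_loader(data_path, batch_size)
  for encoded, labels in train_gen:
      pass  # first pass consumes the generator
  assert next(train_gen, None) is None  # nothing left to yield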

8.5.3 Step 2: Build the model training function

  # Load the fine-tuning network
  from finetuning_net import Net
  import torch.optim as optim
  # Define embedding_size and char_size
  # (max_len must match the value used in get_bert_encode; 10 is assumed here)
  max_len = 10
  embedding_size = 768
  char_size = 2 * max_len
  # Instantiate the fine-tuning network
  net = Net(embedding_size, char_size)
  # Define the cross-entropy loss function
  criterion = nn.CrossEntropyLoss()
  # Define the SGD optimizer
  optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

  def train(train_data_labels):
      """
      description: training function; updates the model parameters and
                   accumulates the loss and the number of correct predictions
      :param train_data_labels: generator over training data and labels
      :return: sum of the per-batch average losses and the running count of
               correctly predicted labels over the whole training pass
      """
      # Initialize the running loss and the running correct-prediction count
      train_running_loss = 0.0
      train_running_acc = 0.0
      # Iterate over the training data/label generator,
      # updating the model parameters once per batch
      for train_tensor, train_labels in train_data_labels:
          # Reset the gradients for this batch
          optimizer.zero_grad()
          # Get the outputs of the fine-tuning network
          train_outputs = net(train_tensor)
          # Compute the average loss of this batch
          train_loss = criterion(train_outputs, train_labels)
          # Add this batch's average loss to train_running_loss
          train_running_loss += train_loss.item()
          # Backpropagate the loss
          train_loss.backward()
          # Let the optimizer update the model parameters
          optimizer.step()
          # Accumulate the number of correct labels in this batch,
          # used later to compute the accuracy
          train_running_acc += (train_outputs.argmax(1) == train_labels).sum().item()
      return train_running_loss, train_running_acc

  • Code location: /data/doctor_online/bert_serve/train.py
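
  • Note: Net is imported from finetuning_net.py, which is defined elsewhere in
    this chapter. For orientation only, the sketch below shows what such a head
    might look like, assuming it flattens the (batch_size, 2*max_len,
    embedding_size) BERT encoding and maps it to two classes; the layer sizes
    and dropout value are illustrative, not the actual definition. It returns
    raw logits because nn.CrossEntropyLoss applies log-softmax internally.

  import torch
  import torch.nn as nn

  class Net(nn.Module):
      """Hypothetical sketch of the fine-tuning head; sizes are illustrative."""
      def __init__(self, embedding_size, char_size, dropout=0.2):
          super(Net, self).__init__()
          self.embedding_size = embedding_size
          self.char_size = char_size
          self.dropout = nn.Dropout(p=dropout)
          # Flattened BERT encoding -> small hidden layer -> 2 classes
          self.fc1 = nn.Linear(char_size * embedding_size, 8)
          self.fc2 = nn.Linear(8, 2)

      def forward(self, x):
          # x: (batch_size, char_size, embedding_size); flatten each sample
          x = x.view(-1, self.char_size * self.embedding_size)
          x = self.dropout(torch.relu(self.fc1(x)))
          # Raw logits; nn.CrossEntropyLoss handles the softmax
          return self.fc2(x)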

8.5.4 Step 3: Build the model validation function

  def valid(valid_data_labels):
      """
      description: validation function; evaluates the model on held-out data
                   and accumulates the loss and the number of correct predictions
      :param valid_data_labels: generator over validation data and labels
      :return: sum of the per-batch average losses and the running count of
               correctly predicted labels over the whole validation pass
      """
      # Initialize the running loss and the running correct-prediction count
      valid_running_loss = 0.0
      valid_running_acc = 0.0
      # Iterate over the validation data/label generator
      for valid_tensor, valid_labels in valid_data_labels:
          # Do not track gradients
          with torch.no_grad():
              # Get the outputs of the fine-tuning network
              valid_outputs = net(valid_tensor)
              # Compute the average loss of this batch
              valid_loss = criterion(valid_outputs, valid_labels)
              # Add this batch's average loss to valid_running_loss
              valid_running_loss += valid_loss.item()
              # Accumulate the number of correct labels in this batch,
              # used later to compute the accuracy
              valid_running_acc += (valid_outputs.argmax(1) == valid_labels).sum().item()
      return valid_running_loss, valid_running_acc

  • Code location: /data/doctor_online/bert_serve/train.py
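
  • Note: if the fine-tuning head contains layers such as Dropout, good PyTorch
    practice is to switch the network into evaluation mode while validating and
    back afterwards. A hedged sketch of how valid could be wrapped, assuming
    the same net and criterion as above:

  net.eval()             # disable dropout etc. during evaluation
  with torch.no_grad():  # no gradient tracking for the whole pass
      valid_running_loss, valid_running_acc = 0.0, 0.0
      for valid_tensor, valid_labels in valid_data_labels:
          valid_outputs = net(valid_tensor)
          valid_running_loss += criterion(valid_outputs, valid_labels).item()
          valid_running_acc += (valid_outputs.argmax(1) == valid_labels).sum().item()
  net.train()            # restore training mode before the next epoch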

8.5.5 Step 4: Call the training and validation functions and print logs

  # Define the number of training epochs
  epochs = 20
  # Define lists holding each epoch's loss and accuracy, used for plotting
  all_train_losses = []
  all_valid_losses = []
  all_train_acc = []
  all_valid_acc = []
  # Train for the specified number of epochs
  for epoch in range(epochs):
      # Print the epoch number
      print("Epoch:", epoch + 1)
      # Get fresh training/validation generators and the corresponding
      # sample counts from the data loader (the generators are single-pass)
      train_data_labels, valid_data_labels, train_data_len, valid_data_len = data_loader(data_path, batch_size)
      # Run one training pass
      train_running_loss, train_running_acc = train(train_data_labels)
      # Run one validation pass
      valid_running_loss, valid_running_acc = valid(valid_data_labels)
      # Compute each epoch's average loss: train_running_loss and
      # valid_running_loss are sums of per-batch average losses, so multiplying
      # by batch_size gives the epoch's total loss, and dividing by the sample
      # count gives the epoch's average loss
      train_average_loss = train_running_loss * batch_size / train_data_len
      valid_average_loss = valid_running_loss * batch_size / valid_data_len
      # train_running_acc and valid_running_acc are the accumulated counts of
      # correct labels, so dividing by the sample counts gives the accuracies
      train_average_acc = train_running_acc / train_data_len
      valid_average_acc = valid_running_acc / valid_data_len
      # Append this epoch's losses and accuracies to the global lists for plotting
      all_train_losses.append(train_average_loss)
      all_valid_losses.append(valid_average_loss)
      all_train_acc.append(train_average_acc)
      all_valid_acc.append(valid_average_acc)
      # Print this epoch's training/validation loss and accuracy
      print("Train Loss:", train_average_loss, "|", "Train Acc:", train_average_acc)
      print("Valid Loss:", valid_average_loss, "|", "Valid Acc:", valid_average_acc)

  print('Finished Training')

  • Code location: /data/doctor_online/bert_serve/train.py
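
  • Note: the averaging above deserves one worked check. train_running_loss is
    a sum of per-batch mean losses, so multiplying by batch_size and dividing
    by the sample count is approximately the mean of those per-batch means; it
    is exact only when batch_size divides the sample count, since the final
    partial batch is weighted as if it were full. A small sanity check with
    assumed numbers:

  batch_size = 32
  train_data_len = 22186                                          # 693 * 32 + 10
  num_batches = (train_data_len + batch_size - 1) // batch_size   # 694
  # Hypothetical epoch where every batch has mean loss 0.6931:
  running_loss = 0.6931 * num_batches
  print(running_loss * batch_size / train_data_len)               # ~0.6938, not 0.6931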

  • Output:

  Epoch: 1
  Train Loss: 0.693169563147374 | Train Acc: 0.5084898843930635
  Valid Loss: 0.6931480603018824 | Valid Acc: 0.5042777377521613
  {1: 14015, 0: 13720}
  Epoch: 2
  Train Loss: 0.6931440165277162 | Train Acc: 0.514992774566474
  Valid Loss: 0.6931474804019379 | Valid Acc: 0.5026567002881844
  {1: 14015, 0: 13720}
  Epoch: 3
  Train Loss: 0.6931516138804441 | Train Acc: 0.5
  Valid Loss: 0.69314516217633 | Valid Acc: 0.5065291786743515
  {1: 14015, 0: 13720}
  Epoch: 4
  Train Loss: 0.6931474804878235 | Train Acc: 0.5065028901734104
  Valid Loss: 0.6931472256650842 | Valid Acc: 0.5052233429394812
  {1: 14015, 0: 13720}
  Epoch: 5
  Train Loss: 0.6931474804878235 | Train Acc: 0.5034320809248555
  Valid Loss: 0.6931475739314165 | Valid Acc: 0.5055385446685879
  {1: 14015, 0: 13720}
  Epoch: 6
  Train Loss: 0.6931492934337241 | Train Acc: 0.5126445086705202
  Valid Loss: 0.6931462547277512 | Valid Acc: 0.5033771613832853
  {1: 14015, 0: 13720}
  Epoch: 7
  Train Loss: 0.6931459204309938 | Train Acc: 0.5095736994219653
  Valid Loss: 0.6931174922229921 | Valid Acc: 0.5065742074927954
  {1: 14015, 0: 13720}
  Epoch: 8
  Train Loss: 0.5545259035391614 | Train Acc: 0.759393063583815
  Valid Loss: 0.4199462383770805 | Valid Acc: 0.9335374639769453
  {1: 14015, 0: 13720}
  Epoch: 9
  Train Loss: 0.4011955714294676 | Train Acc: 0.953757225433526
  Valid Loss: 0.3964169790877045 | Valid Acc: 0.9521793948126801
  {1: 14015, 0: 13720}
  Epoch: 10
  Train Loss: 0.3893018603497158 | Train Acc: 0.9669436416184971
  Valid Loss: 0.3928600374491139 | Valid Acc: 0.9525846541786743
  {1: 14015, 0: 13720}
  Epoch: 11
  Train Loss: 0.3857506763383832 | Train Acc: 0.9741690751445087
  Valid Loss: 0.38195425426582097 | Valid Acc: 0.9775306195965417
  {1: 14015, 0: 13720}
  Epoch: 12
  Train Loss: 0.38368317760484066 | Train Acc: 0.9772398843930635
  Valid Loss: 0.37680484129046155 | Valid Acc: 0.9780259365994236
  {1: 14015, 0: 13720}
  Epoch: 13
  Train Loss: 0.37407022137517876 | Train Acc: 0.9783236994219653
  Valid Loss: 0.3750278927192564 | Valid Acc: 0.9792867435158501
  {1: 14015, 0: 13720}
  Epoch: 14
  Train Loss: 0.3707401707682306 | Train Acc: 0.9801300578034682
  Valid Loss: 0.37273150721097886 | Valid Acc: 0.9831592219020173
  {1: 14015, 0: 13720}
  Epoch: 15
  Train Loss: 0.37279492521906177 | Train Acc: 0.9817557803468208
  Valid Loss: 0.3706809586123362 | Valid Acc: 0.9804574927953891
  {1: 14015, 0: 13720}
  Epoch: 16
  Train Loss: 0.37660940017314315 | Train Acc: 0.9841040462427746
  Valid Loss: 0.3688154769390392 | Valid Acc: 0.984600144092219
  {1: 14015, 0: 13720}
  Epoch: 17
  Train Loss: 0.3749892661681754 | Train Acc: 0.9841040462427746
  Valid Loss: 0.3688570175760074 | Valid Acc: 0.9817633285302594
  {1: 14015, 0: 13720}
  Epoch: 18
  Train Loss: 0.37156562515765945 | Train Acc: 0.9826589595375722
  Valid Loss: 0.36880484627028365 | Valid Acc: 0.9853656340057637
  {1: 14015, 0: 13720}
  Epoch: 19
  Train Loss: 0.3674713007976554 | Train Acc: 0.9830202312138728
  Valid Loss: 0.366314563545954 | Valid Acc: 0.9850954610951008
  {1: 14015, 0: 13720}
  Epoch: 20
  Train Loss: 0.36878046806837095 | Train Acc: 0.9842846820809249
  Valid Loss: 0.367835852100114 | Valid Acc: 0.9793317723342939
  Finished Training

8.5.6 Step 5: Plot the train/validation loss and accuracy comparison curves

  # Import the plotting toolkit
  import matplotlib.pyplot as plt
  from matplotlib.pyplot import MultipleLocator

  # Create the first figure
  plt.figure(0)
  # Plot the training loss curve
  plt.plot(all_train_losses, label="Train Loss")
  # Plot the validation loss curve in red
  plt.plot(all_valid_losses, color="red", label="Valid Loss")
  # Define the x-axis tick locator with an interval of 1, i.e. one tick per epoch
  x_major_locator = MultipleLocator(1)
  # Get the current axes handle
  ax = plt.gca()
  # Set the x-axis tick interval
  ax.xaxis.set_major_locator(x_major_locator)
  # Set the x-axis range
  plt.xlim(1, epochs)
  # Place the legend in the upper left corner
  plt.legend(loc='upper left')
  # Save the figure
  plt.savefig("./loss.png")

  # Create the second figure
  plt.figure(1)
  # Plot the training accuracy curve
  plt.plot(all_train_acc, label="Train Acc")
  # Plot the validation accuracy curve in red
  plt.plot(all_valid_acc, color="red", label="Valid Acc")
  # Define the x-axis tick locator with an interval of 1, i.e. one tick per epoch
  x_major_locator = MultipleLocator(1)
  # Get the current axes handle
  ax = plt.gca()
  # Set the x-axis tick interval
  ax.xaxis.set_major_locator(x_major_locator)
  # Set the x-axis range
  plt.xlim(1, epochs)
  # Place the legend in the upper left corner
  plt.legend(loc='upper left')
  # Save the figure
  plt.savefig("./acc.png")

  • Code location: /data/doctor_online/bert_serve/train.py
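
  • Note: one small caveat in the plotting code above: plt.plot(all_train_losses)
    places the points at x = 0 .. epochs-1, so plt.xlim(1, epochs) crops the
    first epoch out of view. A possible fix is to pass an explicit 1-based x
    range:

  # Align points with 1-based epoch numbers so xlim(1, epochs) shows them all
  epoch_axis = range(1, epochs + 1)
  plt.plot(epoch_axis, all_train_losses, label="Train Loss")
  plt.plot(epoch_axis, all_valid_losses, color="red", label="Valid Loss")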

  • Train/validation loss comparison curves:

[8.5 Model Training - Figure 1: train/validation loss comparison curves]


  • Train/validation accuracy comparison curves:

[8.5 Model Training - Figure 2: train/validation accuracy comparison curves]


  • Analysis:
    • According to the loss comparison curves, the fine-tuned model begins to capture the regularities in the data around epoch 8, where the loss drops sharply; this shows the model is extracting useful features from the corpus and converging. According to the accuracy comparison curves, the validation accuracy levels off around epochs 10-11 and finally settles at roughly 98%.

8.5.7 Step 6: Save the model

  import time

  # Timestamp for the saved model
  time_ = int(time.time())
  # Save path
  MODEL_PATH = './model/BERT_net_%d.pth' % time_
  # Save the model parameters
  torch.save(net.state_dict(), MODEL_PATH)

  • Code location: /data/doctor_online/bert_serve/train.py

  • Output:
    • A file named BERT_net_<timestamp>.pth is generated under the ./model/ directory of /data/doctor_online/bert_serve/.
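
  • To reuse the saved parameters later (for example in the serving code), the
    state dict is loaded back into a freshly constructed network; a minimal
    sketch, assuming the same Net definition and MODEL_PATH as above:

  # Rebuild the network and restore the saved parameters
  net = Net(embedding_size, char_size)
  net.load_state_dict(torch.load(MODEL_PATH))
  net.eval()  # switch to evaluation mode before serving predictions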

8.5.8 Section summary:

  • Learned the steps for model training:
    • Step 1: Build the data loader function.
    • Step 2: Build the model training function.
    • Step 3: Build the model validation function.
    • Step 4: Call the training and validation functions and print logs.
    • Step 5: Plot the train/validation loss and accuracy comparison curves.
    • Step 6: Save the model.