Welcome to my knowledge base!
I am Armor, and this is the blog series "Armor's Hands-on Natural Language Processing", presented as figures, text, and code. The goal of this blog is to teach, and to build, a reusable text classification codebase on top of the deep learning framework pytorch (non-BERT models), and then to show how, once training is done, the model can be deployed to a simple web application with flask so you can turn it into a small project of your own. The tutorial follows the model-building workflow and is split into 7 stages:

| Module | Content |
| --- | --- |
| 01. Overview | Concept introduction and an overview of the main content |
| 02. Config.py | Preparing the configuration parameters |
| 03. DataSet.py | Using torchtext |
| 04. Model | Building the models |
| 05. train_fine_tune.py | Training with torch |
| 06. Classify | Prediction |
| 07. app.py | Deployment |

Update
Source code: https://github.com/Armorhtk/Text_Classification_Based_Torch_and_Simple_Deployment
The repository contains minor additions and changes compared with this post.

Update
A small amount of code has been changed to support loading pretrained word vectors: in Config you simply set whether to use word vectors and which kind to use.
The word vectors are a small slice of the vocabulary of the Tencent AI Lab embeddings (trained on an 8-million-entry corpus); only three truncated w2v files are provided: 10000-small, 50000-small, and 500000-large.
Project source code (word-vector version): https://pan.baidu.com/s/1f4U-tsz4oJODxsk9kq_t9g?pwd=2021
After loading the pretrained word vectors, the F1 score improves noticeably.

1 Overview

1.1 Torch version issues

Before starting the tutorial, first check your torch version. If torch and torchtext are already installed and the GPU can be used normally, you can skip this subsection.

When running the code, the following two errors commonly appear:

  1. When installing pytorch or torchtext, the matching version cannot be found.
  2. CUDA is detected, but tensors/models cannot be moved to the GPU with .cuda().

    These (and similar) errors usually come down to one of two causes:

  1. The CUDA version is unsuitable; reinstall CUDA and cuDNN.

  2. The pytorch and torchtext versions do not match each other.

1) Check your CUDA version first
Open a conda prompt or cmd and run nvcc --version.
Look at the last line of the output; here CUDA is version 11.0.
2) Look up the torch and torchtext versions that match your CUDA version
Installing version 1.7.0 or later is recommended; earlier versions have assorted bugs. The official compatibility table maps python, pytorch, torchvision (torchtext follows the same numbering) and CUDA versions onto each other.
My Python environment is 3.7 and CUDA is 11.0, so I installed torch 1.8.0 and torchtext 0.9.0.
3) Download and install. Installing directly with pip is not recommended, as it is slow and error-prone; instead download the whl file from the website, which installs quickly and without errors.
Download page: https://download.pytorch.org/whl/torch_stable.html
Use Ctrl+F on that page to search for your version and keywords, e.g. cu111/torch-1.8.0. Several results appear: cpxx indicates the Python version and the suffix indicates the operating system; here choose the cp37, win_amd64 build.
After downloading, simply pip install the whl file:

```
pip install your_download_path\xxx.whl
```
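
Once installed, a quick sanity check (my own suggestion, not from the original post) covers both of the errors listed above:

```python
import torch
import torchtext

print(torch.__version__)          # expect something like 1.8.0+cu111
print(torchtext.__version__)      # expect 0.9.0 to pair with torch 1.8.0
print(torch.cuda.is_available())  # True if the CUDA setup is usable
print(torch.zeros(1).cuda())      # raises an error here if .cuda() cannot be used
```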

1.2 Learn deployment first, just for fun

1) Building the web interface
After setting up the environment you are probably tired of studying already. Learning is painful, so why not do some mindless copy-pasting first and get a taste of what deployment looks like once the course is finished; it is a great motivation boost!
After training the deep learning text classification model, we save the best model and deploy it to the web so that others can use it. First download templates.zip, which provides a bare-bones front-end page; the page and its features can be improved however you like. We have only one requirement: for the text entered by the user, return the model's predicted class and the class probability distribution. templates.zip: https://pan.baidu.com/s/1Tr8m0BdLw2zDtnuvH95F-Q?pwd=2021
Steps
1. In the project root directory, unzip templates.zip to get a templates folder containing index.html (the home page), css, and images.
2. In the project root directory, create app.py with the following content:

```python
import random
from flask import Flask, render_template, request

app = Flask(__name__, template_folder='templates', static_folder="templates", static_url_path='')

# Model prediction function
def model_predict(text):
    # This function normally lives in another .py file; it is only a demo here for illustration
    response = {}
    rand1 = random.random()
    response["dog"] = round(rand1, 4)
    response["cat"] = round(1 - rand1, 4)
    results = {value: key for key, value in response.items()}
    result = results[max(results.keys())]
    return response, result

# Fetch the form data from the front end and respond
@app.route('/predict', methods=['GET', 'POST'])
def predict():
    isHead = False
    if request.method == 'POST':
        sent1 = request.form.get('sentence1')
        sent2 = request.form.get('sentence2')
        text = sent1 + sent2
        # result is the predicted label, response is the soft-label distribution
        response, result = model_predict(text)
        return render_template('index.html', isHead=isHead, sent1=sent1, sent2=sent2, response=response, result=result)

# Home page
@app.route('/', methods=['GET', 'POST'])
def index():
    isHead = True
    return render_template('index.html', isHead=isHead)

# Main entry point
if __name__ == '__main__':
    app.run()
```

2) Local deployment and tunnelling to the outside world
1. Run the startup file to serve the app locally.
Option 1: launch it directly from PyCharm (fine for trying things out, but not great for keeping the service running).
Option 2: use the command line: first activate the Python environment of the classification model, then cd into the project directory, and finally run the following command there (app.py is the startup file in this example):

```
python app.py
```

2. Use ngrok for tunnelling so that other people can reach the app.
ngrok is only a stop-gap; if you have a cloud server, it is better to deploy there yourself.
Step 1: open the ngrok site https://dashboard.ngrok.com/
Step 2: download ngrok as instructed on the Download page; the Windows build is the simplest.
Step 3: after downloading, follow the three short steps on the official site to install and start it.
In the third step, the "50" in "ngrok http 50" is the port to tunnel; it must match the port Flask listens on, so change it. Flask above uses the default port 5000, so when deploying the command becomes "ngrok http 5000".
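
If you would rather run Flask on a different port (a hypothetical variation, not part of the original project), set it in app.run and point ngrok at that same port:

```python
# Hypothetical: serve on port 8080 instead of the default 5000,
# then expose it with `ngrok http 8080`.
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080)
```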
After startup, the ngrok console switches to showing the tunnel information.
Anyone can then reach the web page through the http or https address it prints:

```
http://a909-119-85-172-34.ngrok.io
https://a909-119-85-172-34.ngrok.io
```

If everything worked, the page loads and the application runs normally.

1.3 How many steps does a text classification app take?

How many steps does it take to build a text classification application?

Answer: four steps: read the data (DataSet) - build the model (Model) - train the model (Train_fine_tune) - predict on new data (Classify).

How do we make the framework highly reusable?

1. Create a configuration file (Config) and put every tunable parameter into it; later changes only touch the parameters in Config.
2. Don't hard-code behaviour: prefer configurable parameters that keep the code flexible, or write utility functions (Utils) for the recurring special cases.
3. Fix the random seeds and write proper file I/O and model-saving modules, so the same result can be reproduced.

2 Config: the configuration file

To improve code reuse, we usually create a Config file and store the frequently changed data, model, training, and validation parameters in it as variables.
Note: you do not have to think of every possible setting up front; the contents of Config keep growing while the project is built.
An example configuration file is given in the full listing below.
Here `task_name` is the dataset name; the next time you want to train on a different dataset, you only need to change the value of `task_name`. There are also several file-name variables such as `train_file`, `test_file`, and `valid_file` for the training, test, and validation sets, which further improves reusability: for every new dataset you only write one data-processing function that produces files with those three names, and the code can then be run for training without rebuilding anything from scratch.
The configuration file itself involves no new concepts, but it is extremely important for the overall project: every later file imports it.

Full Config code:

```python
# General parameters
SEED = 1234
# data preprocess
task_name = "TongHuaShun"
data_path = 'dataset'
train_file = 'train.csv'
valid_file = 'test.csv'
test_file = 'test.csv'
predict_file = 'test_data_A.txt'
result_file = 'result.csv'
max_length = 33
batch_size = 128
# data label list
label_list = ['未攻击用户', '攻击用户']
class_number = len(label_list)
if class_number == 2:
    type_avrage = 'binary'
else:
    type_avrage = 'macro'
"""micro, macro"""
# train details
epochs = 500
learning_rate = 5e-4
best_metric = "f1"  # one of acc, p, r, f1
early_stopping_nums = 10
# model lists
model_name = "TextRNN"
```

3 DataSet: the dataset

This section shows how to use torchtext to build a TabularDataset in the format torch expects.

3.1 What the dataset looks like

Datasets come in all shapes, so designing a single general preprocessing method is very hard. Whatever the dataset, though, after preprocessing it ends up in (text, label) form, so this post does not cover dataset-specific processing and always uses the (text, label) form as the input format, as shown below:

| text | label |
| --- | --- |
| 涨停出了,今天肯定封不上板了 | 未攻击用户 |
| 忽悠,继续忽悠 | 攻击用户 |
| …… | …… |

Before loading the data, create a dataset directory under the project, create one sub-folder per task inside it, and produce the training, validation, and test sets as train.csv, valid.csv, and test.csv (a minimal splitting sketch is shown below).
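As an illustration only (not part of the original project), here is a minimal sketch of such a data-processing function, assuming a single labelled CSV called `raw.csv` with `text` and `label` columns:

```python
import os
import pandas as pd

def build_splits(raw_csv, out_dir, valid_ratio=0.1, test_ratio=0.1, seed=1234):
    # Shuffle once, then cut into train/valid/test and write the three files
    # that Config (train_file / valid_file / test_file) expects.
    df = pd.read_csv(raw_csv)[["text", "label"]].sample(frac=1.0, random_state=seed)
    n = len(df)
    n_test = int(n * test_ratio)
    n_valid = int(n * valid_ratio)
    os.makedirs(out_dir, exist_ok=True)
    df.iloc[:n_test].to_csv(os.path.join(out_dir, "test.csv"), index=False)
    df.iloc[n_test:n_test + n_valid].to_csv(os.path.join(out_dir, "valid.csv"), index=False)
    df.iloc[n_test + n_valid:].to_csv(os.path.join(out_dir, "train.csv"), index=False)

# e.g. build_splits("raw.csv", "dataset/TongHuaShun")
```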

3.2 Random seeds

Fixing the random seed is extremely important for reproducing the same results. In pytorch, add the following code:

```python
import torch
import random
import numpy as np
from Config import *

torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.backends.cudnn.deterministic = True
```

3.3 Tokenization and torchtext

1) jieba
We use jieba for word segmentation: a regex first keeps only the Chinese characters, then jieba.lcut segments the string and the token list is returned.

```python
import jieba
import re

def x_tokenize(x):
    str = re.sub('[^\u4e00-\u9fa5]', "", x)
    return jieba.lcut(str)

print(x_tokenize("A股涨停出了,今天肯定封不上板了 "))
```

Output:

```
['股', '涨停', '出', '了', '今天', '肯定', '封不上', '板', '了']
```
Further reading:
A recommended article: "NLP分词PK大赏:开源分词组件对比及一骑红尘的jieba分词源码剖析" (a comparison of open-source Chinese word segmenters plus a walk-through of the jieba source).
It evaluates the open-source segmenters and finds that jieba has the lowest accuracy (an F-score of only about 0.82), but its big advantage is that it is by far the fastest. Worth a look if you are interested.
For how to use HIT's LTP, you can also read another post of mine:
[3] Using pyltp for sentence splitting, word segmentation, POS tagging, and named entity recognition

2) torchtext
For a detailed explanation of torchtext, see the relevant Zhihu article.
torchtext contains the following components:
Field: holds the preprocessing configuration, e.g. the tokenizer, whether to lowercase, the start/end tokens, the padding token, the vocabulary, and so on.
Dataset: inherits from pytorch's Dataset and is used to load data. TabularDataset lets you finish loading simply by specifying the path, format, and Field information. torchtext also ships pre-built Dataset objects for common datasets that can be loaded directly, and the splits method loads the training, validation, and test sets at the same time.
Iterator: the iterator that feeds data to the model, with support for customized batching.

3) Creating the Field objects
Put simply, creating a Field object means creating an object that holds the data-preprocessing configuration.

Code flow:

  • Create two Field objects, one for text and one for label.

text holds the text content: it needs to be tokenized and returned as a list, needs a truncation length, and keeps a vocabulary.
label holds the label: it is not tokenized into a sequence and needs neither truncation nor a vocabulary.

  • Define two simple getter functions that return TEXT and LABEL, so later modules can fetch them conveniently.

The Field parameters used here are:
sequential: whether to treat the data as a sequence; if False, tokenization cannot be used. Default: True.
fix_length: pad or truncate every example to this length, padding with pad_token. Default: None.
tokenize: the tokenization function; it must be a callable. Default: str.split.
use_vocab: whether to use a Vocab object; if False, the data must already be numeric. Default: True.
pad_token: the token used for padding. Default: "<pad>".
unk_token: the token used for out-of-vocabulary words. Default: "<unk>".

```python
import os
from torchtext.legacy import data

# One Field object for the text content and one for the label
TEXT = data.Field(sequential=True,
                  tokenize=x_tokenize,
                  fix_length=max_length,
                  use_vocab=True)
LABEL = data.Field(sequential=False,
                   use_vocab=False)

def getTEXT():
    return TEXT

def getLabel():
    return LABEL
```
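
As a quick check (my own snippet, not in the original post), `Field.preprocess` applies the configured tokenizer, so you can inspect what TEXT will do to a raw sentence:

```python
# Assumes TEXT and x_tokenize are defined as above
tokens = TEXT.preprocess("A股涨停出了,今天肯定封不上板了")
print(tokens)  # e.g. ['股', '涨停', '出', '了', '今天', '肯定', '封不上', '板', '了']
```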

4) Building the TabularDataset
torchtext's Dataset inherits from pytorch's Dataset and adds a method for downloading and decompressing archives (.zip, .gz, .tgz are supported); its splits method reads the training, validation, and test sets in one call, and TabularDataset reads CSV, TSV, or JSON files very conveniently.

Code flow:

  • Create and split out the training, validation, and test sets with TabularDataset.splits.
  • After the data is loaded, build the TEXT vocabulary; pretrained word vectors can be supplied at this point (not used in this post).

The TabularDataset.splits parameters used here are:
path: the common prefix (folder) of the dataset files
train: training set file name
validation: validation set file name
test: test set file name
format: file format
skip_header: whether to skip the header row
csv_reader_params: the delimiter the file is split on
fields: must be passed in the same order as the columns; for a column you do not use, pass None in its position

```python
# Create the TabularDataset and split the data
train, dev, test = data.TabularDataset.splits(path=os.path.join(data_path, task_name),
                                              train=train_file,
                                              validation=valid_file,
                                              test=test_file,
                                              format='csv',
                                              skip_header=True,
                                              csv_reader_params={'delimiter': ','},
                                              fields=[("text", TEXT), ('label', LABEL)])
# Build the vocabulary
TEXT.build_vocab(train)
```
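
To sanity-check the loaded data (an optional snippet, not in the original), look at one parsed example and the vocabulary:

```python
# Assumes the splits above succeeded
print(vars(train[0]))        # {'text': [... tokens ...], 'label': ...}
print(len(TEXT.vocab))       # vocabulary size (includes <unk> and <pad>)
print(TEXT.vocab.itos[:10])  # first ten entries of the index-to-string table
```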

5) Iterator: the data iterator
The Iterator is torchtext's output to the model: it provides the usual data handling (shuffling, sorting, and so on), can change the batch size dynamically, and also has a splits method that produces the training, validation, and test iterators at the same time.
The code uses BucketIterator which, unlike the plain Iterator, puts examples of similar length into the same batch.

To minimise padding, BucketIterator sorts the examples of the whole dataset (by some rule) before batching, grouping examples of similar length together; the lengths within a batch are then comparable and the amount of padding is minimal. To make the batches slightly different every time, some noise is added to the lengths before sorting.

Code flow:

  • Create the training, validation, and test iterators with BucketIterator.splits.
  • Define a simple getter function that returns the iterators for later use.

The BucketIterator.splits parameters used here are:
datasets: the datasets to build iterators for
batch_size: the batch size
shuffle: whether to shuffle the data
sort: sorts the whole dataset in ascending order, whereas sort_within_batch only sorts the data inside each batch.
sort_within_batch: when set to True, the data inside each mini-batch is sorted in descending order according to sort_key.
repeat: whether to keep repeating the iterator across epochs; default False

```python
train_iter, val_iter, test_iter = data.BucketIterator.splits(datasets=(train, dev, test),
                                                             batch_size=batch_size,
                                                             shuffle=True,
                                                             sort=False,
                                                             sort_within_batch=False,
                                                             repeat=False)

def getIter():
    return train_iter, val_iter, test_iter
```
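
One thing worth checking (an optional snippet, not in the original): Field defaults to batch_first=False, so batch.text comes out as [seq_len, batch_size], which is why the training code later transposes it with torch.t:

```python
import torch

# Assumes the iterators above were created
batch = next(iter(train_iter))
print(batch.text.shape)        # [max_length, batch_size]: sequence dimension first
print(batch.label.shape)       # [batch_size]
feature = torch.t(batch.text)  # [batch_size, max_length], the shape the models expect
```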

3.4 Full code

The full code is as follows:

```python
import os
import re
import jieba
import torch
import random
import numpy as np
from torchtext.legacy import data
from Config import *

torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.backends.cudnn.deterministic = True

def x_tokenize(x):
    str = re.sub('[^\u4e00-\u9fa5]', "", x)
    return jieba.lcut(str)

TEXT = data.Field(sequential=True,
                  tokenize=x_tokenize,
                  fix_length=max_length,
                  use_vocab=True)
LABEL = data.Field(sequential=False,
                   use_vocab=False)

train, dev, test = data.TabularDataset.splits(path=os.path.join(data_path, task_name),
                                              train=train_file,
                                              validation=valid_file,
                                              test=test_file,
                                              format='csv',
                                              skip_header=True,
                                              csv_reader_params={'delimiter': ','},
                                              fields=[("text", TEXT), ('label', LABEL)])
TEXT.build_vocab(train)

train_iter, val_iter, test_iter = data.BucketIterator.splits(datasets=(train, dev, test),
                                                             batch_size=batch_size,
                                                             shuffle=True,
                                                             sort=False,
                                                             sort_within_batch=False,
                                                             repeat=False)

def getTEXT():
    return TEXT

def getLabel():
    return LABEL

def getIter():
    return train_iter, val_iter, test_iter
```

4 Model: building the models

In the pre-BERT era, building text classification models is not hard; searching for "torch" plus the model's name turns up code for essentially all of them, so I will not belabour the point.
The models can also be kept in a dedicated model folder, with one .py file per model inside it.

4.1 TextCNN

```python
import torch
from Config import SEED, class_number
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
import DataSet
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        Vocab = len(DataSet.getTEXT().vocab)  # vocabulary size
        Dim = 256           # word embedding dimension
        Cla = class_number  # number of classes
        Ci = 1              # number of input channels
        Knum = 256          # number of kernels per size
        Ks = [2, 3, 4]      # list of kernel sizes, e.g. [2, 3, 4]
        self.embed = nn.Embedding(Vocab, Dim)  # word embeddings, randomly initialised here
        # the initial embedding weights could be set here
        self.convs = nn.ModuleList([nn.Conv2d(Ci, Knum, (K, Dim)) for K in Ks])  # convolution layers
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(len(Ks) * Knum, Cla)  # fully connected layer

    def forward(self, x):
        # [batch size, text len]
        x = self.embed(x)
        # [batch size, text len, emb dim]
        x = x.unsqueeze(1)
        # [batch size, Ci, text len, emb dim]
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        # len(Ks) * [batch size, Knum, text len]
        x = [F.max_pool1d(line, line.size(2)).squeeze(2) for line in x]
        # len(Ks) * [batch size, Knum]
        x = torch.cat(x, 1)
        # [batch size, Knum * len(Ks)]
        x = self.dropout(x)
        # [batch size, Knum * len(Ks)]
        logit = self.fc(x)
        # [batch size, Cla]
        return logit
```
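
A standalone shape check (my own sketch, not part of the project) that mirrors this forward pass with a dummy vocabulary instead of the DataSet vocabulary, just to see the tensor shapes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy stand-ins for DataSet/Config: 1000-token vocabulary, 2 classes
vocab, dim, cla, knum, ks = 1000, 256, 2, 256, [2, 3, 4]
embed = nn.Embedding(vocab, dim)
convs = nn.ModuleList([nn.Conv2d(1, knum, (k, dim)) for k in ks])
fc = nn.Linear(len(ks) * knum, cla)

x = torch.randint(0, vocab, (8, 33))                     # [batch=8, text len=33]
h = embed(x).unsqueeze(1)                                # [8, 1, 33, 256]
h = [F.relu(c(h)).squeeze(3) for c in convs]             # 3 x [8, 256, 33-k+1]
h = [F.max_pool1d(t, t.size(2)).squeeze(2) for t in h]   # 3 x [8, 256]
logit = fc(torch.cat(h, 1))                              # [8, 2]
print(logit.shape)
```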

4.2 TextRNN

```python
import torch
from Config import SEED, class_number
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
import DataSet
import torch.nn as nn

class TextRNN(nn.Module):
    def __init__(self):
        super(TextRNN, self).__init__()
        Vocab = len(DataSet.getTEXT().vocab)  # vocabulary size
        Dim = 256               # word embedding dimension
        dropout = 0.2
        hidden_size = 256       # hidden size
        num_classes = class_number  # number of classes
        num_layers = 2          # two-layer LSTM
        self.embedding = nn.Embedding(Vocab, Dim)  # word embeddings, randomly initialised here
        self.lstm = nn.LSTM(Dim, hidden_size, num_layers,
                            bidirectional=True, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # [batch size, text len]
        x = self.embedding(x)
        # [batch size, text len, embedding]
        output, (hidden, cell) = self.lstm(x)
        # output = [batch size, text len, num_directions * hidden_size]
        output = self.fc(output[:, -1, :])  # hidden state of the last time step
        # output = [batch size, num_classes]
        return output
```

4.3 TextRCNN

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import DataSet
from Config import SEED, class_number, max_length
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

class TextRCNN(nn.Module):
    def __init__(self,
                 vocab_size=len(DataSet.getTEXT().vocab),  # vocabulary size (number of words/characters)
                 n_class=class_number,  # number of classes
                 embed_dim=256,         # embedding dimension
                 rnn_hidden=256,
                 dropout=0.2
                 ):
        super(TextRCNN, self).__init__()
        self.rnn_hidden = rnn_hidden
        self.embed_dim = embed_dim
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embed_dim,
        )
        self.maxpool = nn.MaxPool1d(max_length)
        self.lstm = nn.LSTM(embed_dim, rnn_hidden, 2,
                            bidirectional=True, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(in_features=embed_dim + 2 * rnn_hidden,
                            out_features=n_class)

    def forward(self, x):
        # [batch, text_len]
        x = self.embedding(x)
        # [batch, text_len, embed_dim]
        # output, h_n = self.gru(x)
        output, _ = self.lstm(x)
        x = torch.cat([x, output], dim=2)
        # [batch, text_len, 2*rnn_hidden + embed_dim]
        x = F.relu(x)
        x = x.permute(0, 2, 1)
        x = self.maxpool(x).squeeze()
        # x = F.max_pool2d(x, (x.shape[1], 1))
        # x = x.reshape(-1, 2 * self.rnn_hidden + self.embed_dim)
        # [batch, 2*rnn_hidden + embed_dim]
        x = self.fc(x)  # [batch, n_class]
        # x = torch.sigmoid(x)
        return x
```

4.4 TextRNN_Attention

```python
import torch
from Config import SEED, class_number
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
import DataSet
import torch.nn as nn
import torch.nn.functional as F

class TextRNN_Attention(nn.Module):
    def __init__(self):
        super(TextRNN_Attention, self).__init__()
        Vocab = len(DataSet.getTEXT().vocab)  # vocabulary size
        Dim = 256               # word embedding dimension
        dropout = 0.2
        hidden_size = 256       # hidden size
        num_classes = class_number  # number of classes
        num_layers = 2          # two-layer LSTM
        self.embedding = nn.Embedding(Vocab, Dim)  # word embeddings, randomly initialised here
        self.lstm = nn.LSTM(Dim, hidden_size, num_layers,
                            bidirectional=True, batch_first=True, dropout=dropout)
        self.tanh1 = nn.Tanh()
        self.w = nn.Parameter(torch.zeros(hidden_size * 2))
        self.tanh2 = nn.Tanh()
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # [batch size, text len]
        x = self.embedding(x)
        # [batch size, text len, embedding]
        output, (hidden, cell) = self.lstm(x)
        # output = [batch size, text len, num_directions * hidden_size]
        M = self.tanh1(output)
        # [batch size, text len, num_directions * hidden_size]
        alpha = F.softmax(torch.matmul(M, self.w), dim=1).unsqueeze(-1)
        # [batch size, text len, 1]
        out = output * alpha
        # [batch size, text len, num_directions * hidden_size]
        out = torch.sum(out, 1)
        # [batch size, num_directions * hidden_size]
        out = F.relu(out)
        # [batch size, num_directions * hidden_size]
        out = self.fc(out)
        # [batch size, num_classes]
        return out
```

4.5 Transformer

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import copy
import DataSet
from Config import SEED, class_number, max_length
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

class Transformer(nn.Module):
    def __init__(self,
                 vocab_size=len(DataSet.getTEXT().vocab),  # vocabulary size
                 seq_len=max_length,
                 n_class=class_number,  # number of classes
                 device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
                 embed_dim=256,  # embedding dimension
                 dim_model=256,
                 dropout=0.2,
                 num_head=8,
                 hidden=512,
                 num_encoder=4,
                 ):
        super(Transformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.postion_embedding = Positional_Encoding(embed_dim, seq_len, dropout, device)
        self.encoder = Encoder(dim_model, num_head, hidden, dropout)
        self.encoders = nn.ModuleList([
            copy.deepcopy(self.encoder)
            for _ in range(num_encoder)])
        self.fc1 = nn.Linear(seq_len * dim_model, n_class)

    def forward(self, x):
        out = self.embedding(x)
        out = self.postion_embedding(out)
        for encoder in self.encoders:
            out = encoder(out)
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        return out

class Encoder(nn.Module):
    def __init__(self, dim_model, num_head, hidden, dropout):
        super(Encoder, self).__init__()
        self.attention = Multi_Head_Attention(dim_model, num_head, dropout)
        self.feed_forward = Position_wise_Feed_Forward(dim_model, hidden, dropout)

    def forward(self, x):
        out = self.attention(x)
        out = self.feed_forward(out)
        return out

class Positional_Encoding(nn.Module):
    def __init__(self, embed, pad_size, dropout, device):
        super(Positional_Encoding, self).__init__()
        self.device = device
        self.pe = torch.tensor(
            [[pos / (10000.0 ** (i // 2 * 2.0 / embed)) for i in range(embed)] for pos in range(pad_size)])
        self.pe[:, 0::2] = np.sin(self.pe[:, 0::2])
        self.pe[:, 1::2] = np.cos(self.pe[:, 1::2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = x + nn.Parameter(self.pe, requires_grad=False).to(self.device)
        out = self.dropout(out)
        return out

class Scaled_Dot_Product_Attention(nn.Module):
    '''Scaled Dot-Product Attention'''
    def __init__(self):
        super(Scaled_Dot_Product_Attention, self).__init__()

    def forward(self, Q, K, V, scale=None):
        '''
        Args:
            Q: [batch_size, len_Q, dim_Q]
            K: [batch_size, len_K, dim_K]
            V: [batch_size, len_V, dim_V]
            scale: scaling factor; sqrt(dim_K) in the paper
        Return:
            the context tensor after self-attention
        '''
        attention = torch.matmul(Q, K.permute(0, 2, 1))
        if scale:
            attention = attention * scale
        # if mask:  # TODO change this
        #     attention = attention.masked_fill_(mask == 0, -1e9)
        attention = F.softmax(attention, dim=-1)
        context = torch.matmul(attention, V)
        return context

class Multi_Head_Attention(nn.Module):
    def __init__(self, dim_model, num_head, dropout=0.0):
        super(Multi_Head_Attention, self).__init__()
        self.num_head = num_head
        assert dim_model % num_head == 0
        self.dim_head = dim_model // self.num_head
        self.fc_Q = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_K = nn.Linear(dim_model, num_head * self.dim_head)
        self.fc_V = nn.Linear(dim_model, num_head * self.dim_head)
        self.attention = Scaled_Dot_Product_Attention()
        self.fc = nn.Linear(num_head * self.dim_head, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        batch_size = x.size(0)
        Q = self.fc_Q(x)
        K = self.fc_K(x)
        V = self.fc_V(x)
        Q = Q.view(batch_size * self.num_head, -1, self.dim_head)
        K = K.view(batch_size * self.num_head, -1, self.dim_head)
        V = V.view(batch_size * self.num_head, -1, self.dim_head)
        # if mask:  # TODO
        #     mask = mask.repeat(self.num_head, 1, 1)  # TODO change this
        scale = K.size(-1) ** -0.5  # scaling factor
        context = self.attention(Q, K, V, scale)
        context = context.view(batch_size, -1, self.dim_head * self.num_head)
        out = self.fc(context)
        out = self.dropout(out)
        out = out + x  # residual connection
        out = self.layer_norm(out)
        return out

class Position_wise_Feed_Forward(nn.Module):
    def __init__(self, dim_model, hidden, dropout=0.0):
        super(Position_wise_Feed_Forward, self).__init__()
        self.fc1 = nn.Linear(dim_model, hidden)
        self.fc2 = nn.Linear(hidden, dim_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(dim_model)

    def forward(self, x):
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        out = self.dropout(out)
        out = out + x  # residual connection
        out = self.layer_norm(out)
        return out
```

4.6 FastText

```python
import torch
from Config import SEED, class_number
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
import DataSet
import torch.nn as nn

class FastText(nn.Module):
    def __init__(self):
        super(FastText, self).__init__()
        Vocab = len(DataSet.getTEXT().vocab)  # vocabulary size
        Dim = 256           # word embedding dimension
        Cla = class_number  # number of classes
        hidden_size = 128
        self.embed = nn.Embedding(Vocab, Dim)  # word embeddings, randomly initialised here
        self.fc = nn.Sequential(              # sequential container:
            nn.Linear(Dim, hidden_size),      # first a linear layer
            nn.BatchNorm1d(hidden_size),      # then BatchNorm1d
            nn.ReLU(inplace=True),            # then a ReLU activation
            nn.Linear(hidden_size, Cla)       # and a final linear layer
        )

    def forward(self, x):
        # [batch size, text len]
        x = self.embed(x)
        x = torch.mean(x, dim=1)
        # [batch size, Dim]
        logit = self.fc(x)
        # [batch size, Cla]
        return logit
```

4.7 DPCNN

```python
import torch
from Config import SEED, class_number
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
import DataSet
import torch.nn as nn
import torch.nn.functional as F

class DPCNN(nn.Module):
    def __init__(self):
        super(DPCNN, self).__init__()
        Vocab = len(DataSet.getTEXT().vocab)  # vocabulary size
        embed_dim = 256     # word embedding dimension
        Cla = class_number  # number of classes
        ci = 1              # input channel size
        kernel_num = 250    # output channel size
        # embed_dim = trial.suggest_int("n_embedding", 200, 300, 50)
        self.embed = nn.Embedding(Vocab, embed_dim, padding_idx=1)
        self.conv_region = nn.Conv2d(ci, kernel_num, (3, embed_dim), stride=1)
        self.conv = nn.Conv2d(kernel_num, kernel_num, (3, 1), stride=1)
        self.max_pool = nn.MaxPool2d(kernel_size=(3, 1), stride=2)
        self.max_pool_2 = nn.MaxPool2d(kernel_size=(2, 1))
        self.padding = nn.ZeroPad2d((0, 0, 1, 1))  # pad top and bottom
        self.relu = nn.ReLU()
        self.fc = nn.Linear(kernel_num, Cla)

    def forward(self, x):
        x = self.embed(x)           # x: (batch, seq_len, embed_dim)
        x = x.unsqueeze(1)          # x: (batch, 1, seq_len, embed_dim)
        m = self.conv_region(x)     # [batch_size, 250, seq_len-3+1, 1]
        x = self.padding(m)         # [batch_size, 250, seq_len, 1]
        x = self.relu(x)            # [batch_size, 250, seq_len, 1]
        x = self.conv(x)            # [batch_size, 250, seq_len-3+1, 1]
        x = self.padding(x)         # [batch_size, 250, seq_len, 1]
        x = self.relu(x)            # [batch_size, 250, seq_len, 1]
        x = self.conv(x)            # [batch_size, 250, seq_len-3+1, 1]
        x = x + m
        while x.size()[2] > 2:
            x = self._block(x)
        if x.size()[2] == 2:
            x = self.max_pool_2(x)  # [batch_size, 250, 1, 1]
        x = x.squeeze()             # [batch_size, 250]
        logit = F.log_softmax(self.fc(x), dim=1)
        return logit

    def _block(self, x):            # for example: [batch_size, 250, 4, 1]
        px = self.max_pool(x)       # [batch_size, 250, 1, 1]
        x = self.padding(px)        # [batch_size, 250, 3, 1]
        x = F.relu(x)
        x = self.conv(x)            # [batch_size, 250, 1, 1]
        x = self.padding(x)
        x = F.relu(x)
        x = self.conv(x)
        # shortcut
        x = x + px
        return x
```

4.8 Capsule

```python
import torch
# Add an embedding_dim variable to Config to control the word-vector dimension
from Config import SEED, class_number, load_embedding, embedding_dim
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
import DataSet
import torch.nn as nn
import torch.nn.functional as F

class GRULayer(nn.Module):
    def __init__(self, hidden_size, embed_dim):
        super(GRULayer, self).__init__()
        self.gru = nn.GRU(input_size=embed_dim, hidden_size=hidden_size, bidirectional=True)

    def init_weights(self):
        ih = (param.data for name, param in self.named_parameters() if 'weight_ih' in name)
        hh = (param.data for name, param in self.named_parameters() if 'weight_hh' in name)
        b = (param.data for name, param in self.named_parameters() if 'bias' in name)
        for k in ih:
            nn.init.xavier_uniform_(k)
        for k in hh:
            nn.init.orthogonal_(k)
        for k in b:
            nn.init.constant_(k, 0)

    def forward(self, x):
        return self.gru(x)

class CapsuleLayer(nn.Module):
    def __init__(self, input_dim_capsule, num_capsule, dim_capsule, routings, activation='default'):
        super(CapsuleLayer, self).__init__()
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.t_epsilon = 1e-7  # epsilon used by squash
        if activation == 'default':
            self.activation = self.squash
        else:
            self.activation = nn.ReLU(inplace=True)
        self.W = nn.Parameter(nn.init.xavier_normal_(torch.empty(1, input_dim_capsule, num_capsule * dim_capsule)))

    def forward(self, x):
        u_hat_vecs = torch.matmul(x, self.W)
        batch_size = x.size(0)
        input_num_capsule = x.size(1)
        u_hat_vecs = u_hat_vecs.view((batch_size, input_num_capsule, self.num_capsule, self.dim_capsule))
        u_hat_vecs = u_hat_vecs.permute(0, 2, 1, 3)
        b = torch.zeros_like(u_hat_vecs[:, :, :, 0])
        outputs = 0
        for i in range(self.routings):
            b = b.permute(0, 2, 1)
            c = F.softmax(b, dim=2)
            c = c.permute(0, 2, 1)
            b = b.permute(0, 2, 1)
            outputs = self.activation(torch.einsum('bij,bijk->bik', (c, u_hat_vecs)))
            if i < self.routings - 1:
                b = torch.einsum('bik,bijk->bij', (outputs, u_hat_vecs))
        return outputs

    def squash(self, x, axis=-1):
        s_squared_norm = (x ** 2).sum(axis, keepdim=True)
        scale = torch.sqrt(s_squared_norm + self.t_epsilon)
        return x / scale

class Capsule(nn.Module):
    def __init__(self, Vocab=len(DataSet.getTEXT().vocab),
                 embed_dim=embedding_dim,
                 num_classes=class_number):
        super(Capsule, self).__init__()
        # hyperparameters
        self.hidden_size = 128
        self.num_capsule = 10
        self.dim_capsule = 16
        self.routings = 5
        self.dropout_p = 0.25
        # 1. word embedding
        self.embed = nn.Embedding(Vocab, embed_dim)
        if load_embedding == "w2v":
            weight_matrix = DataSet.getTEXT().vocab.vectors
            self.embed.weight.data.copy_(weight_matrix)
        elif load_embedding == "glove":
            weight_matrix = DataSet.getTEXT().vocab.vectors
            self.embed.weight.data.copy_(weight_matrix)
        # self.embedding = nn.Embedding(args.n_vocab, args.embed, padding_idx=args.n_vocab - 1)
        # 2. one GRU layer
        self.gru = GRULayer(hidden_size=self.hidden_size, embed_dim=embed_dim)
        self.gru.init_weights()  # explicit initialisation of the GRU parameters
        # 3. capsule layer
        self.capsule = CapsuleLayer(input_dim_capsule=self.hidden_size * 2, num_capsule=self.num_capsule,
                                    dim_capsule=self.dim_capsule, routings=self.routings)
        # 4. classification head
        self.classify = nn.Sequential(
            nn.Dropout(p=self.dropout_p, inplace=True),
            nn.Linear(self.num_capsule * self.dim_capsule, num_classes),
        )

    def forward(self, input_ids):
        batch_size = input_ids.size(0)
        embed = self.embed(input_ids)
        # print(embed.size())  # torch.Size([2, 128, 300])
        output, _ = self.gru(embed)  # output.size(): torch.Size([2, 128, 256])
        cap_out = self.capsule(output)
        # print(cap_out.size())  # torch.Size([2, 10, 16])
        cap_out = cap_out.view(batch_size, -1)
        return self.classify(cap_out)
```

5 Train_fine_tune: training and fine-tuning

5.1 Model dictionary and device

  • To improve reusability, we want to switch models just by changing the model_name string in Config. At the top of the file we therefore build a dictionary keyed by model name; later, when the model is created, looking the name up in this dictionary gives exactly that behaviour.
  • device is set so that the GPU is used when available.

```python
import os
import pandas as pd
import torch
import torch.nn.functional as F
from tqdm import tqdm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score
from Config import *
import DataSet
from model.TextCNN import TextCNN
from model.TextRCNN import TextRCNN
from model.TextRNN import TextRNN
from model.TextRNN_Attention import TextRNN_Attention
from model.Transformer import Transformer
from model.FastText import FastText

model_select = {"TextCNN": TextCNN(),
                "TextRNN": TextRNN(),
                "TextRCNN": TextRCNN(),
                "TextRNN_Attention": TextRNN_Attention(),
                "Transformer": Transformer(),
                "FastText": FastText(),
                }
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```

5.2 Evaluation metrics

With `sklearn.metrics` it is easy to compute the usual evaluation metrics; we wrap them in a single function and collect them in a dict, so later steps only have to operate on that result dict. `round` controls the number of decimal places.

```python
def mutil_metrics(y_true, y_predict, type_avrage='macro'):
    result = {}
    result["acc"] = accuracy_score(y_true=y_true, y_pred=y_predict)
    result["precision"] = precision_score(y_true=y_true, y_pred=y_predict, average=type_avrage)
    result["recall"] = recall_score(y_true=y_true, y_pred=y_predict, average=type_avrage)
    result["f1"] = f1_score(y_true=y_true, y_pred=y_predict, average=type_avrage)
    # result["kappa"] = cohen_kappa_score(y1=y_true, y2=y_predict)
    for k, v in result.items():
        result[k] = round(v, 5)
    return result
```
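
A quick usage check (my own example, not from the post):

```python
# Dummy labels: 0 = 未攻击用户, 1 = 攻击用户
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(mutil_metrics(y_true, y_pred, type_avrage='macro'))
# {'acc': 0.6, 'precision': 0.58333, 'recall': 0.58333, 'f1': 0.58333}
```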

5.3 Training and validation

1) Training and validation
If you have used torch before, the training/validation flow is familiar: two for loops, one over epochs and one over the data iterator in units of batches, processing one batch at a time. Validation can be run every few batches or once per epoch.

```python
def train_model(train_iter, dev_iter, model, name, device):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 25], gamma=0.6)
    model.train()
    best_score = 0
    early_stopping = 0
    print('training...')
    # one epoch at a time
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0.0
        train_score = 0
        total_train_num = len(train_iter)
        progress_bar = tqdm(enumerate(train_iter), total=len(train_iter))
        # each step is one batch
        for i, batch in progress_bar:
            feature = batch.text
            target = batch.label
            with torch.no_grad():
                feature = torch.t(feature)
            feature, target = feature.to(device), target.to(device)
            optimizer.zero_grad()
            # feed the model and get the output distribution
            logit = model(feature)
            # compute the loss
            loss = F.cross_entropy(logit, target)
            # back-propagate the gradients
            loss.backward()
            optimizer.step()
            scheduler.step()
            total_loss += loss.item()
            # compute the evaluation score
            result = mutil_metrics(target.cpu().numpy(), torch.argmax(logit, dim=1).cpu().numpy(), type_avrage='macro')
            train_score += result[best_metric]
        # validation phase
        print('>>> Epoch_{}, Train loss is {}, {}:{} \n'.format(epoch, total_loss / total_train_num, best_metric, train_score / total_train_num))
        model.eval()
        total_loss = 0.0
        vaild_score = 0
        total_valid_num = len(dev_iter)
        for i, batch in enumerate(dev_iter):
            feature = batch.text  # (W, N) (N)
            target = batch.label
            with torch.no_grad():
                feature = torch.t(feature)
                feature, target = feature.to(device), target.to(device)
                out = model(feature)
                loss = F.cross_entropy(out, target)
                total_loss += loss.item()
                valid_result = mutil_metrics(target.cpu().numpy(), torch.argmax(out, dim=1).cpu().numpy(), type_avrage='binary')
                vaild_score += valid_result[best_metric]
        print('>>> Epoch_{}, Valid loss:{}, {}:{} \n'.format(epoch, total_loss / total_valid_num, best_metric, vaild_score / total_valid_num))
        if vaild_score / total_valid_num > best_score:
            early_stopping = 0
            print('save model...')
            best_score = vaild_score / total_valid_num
            saveModel(model, name=name)
        else:
            early_stopping += 1
        if early_stopping == early_stopping_nums:
            break
```

2) Saving the model
A helper that saves the best model, using the model name as the file name.

```python
def saveModel(model, name):
    torch.save(model, 'done_model/' + name + '_model.pkl')
```
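
This saves the whole pickled module object. A common alternative (my suggestion, not what the project does) is to save only the weights, which is more robust to later code changes:

```python
# Save / load only the parameters instead of the whole pickled module
torch.save(model.state_dict(), 'done_model/' + name + '_model.pt')

# later, rebuild the architecture first and then load the weights:
model = model_select[model_name]
model.load_state_dict(torch.load('done_model/' + model_name + '_model.pt'))
```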

5.4 Testing and saving the experiment record

1) Testing

```python
def test_model(test_iter, name, device):
    model = torch.load('done_model/' + name + '_model.pkl')
    model = model.to(device)
    model.eval()
    y_true = []
    y_pred = []
    for batch in test_iter:
        feature = batch.text
        target = batch.label
        with torch.no_grad():
            feature = torch.t(feature)
            feature, target = feature.to(device), target.to(device)
            out = model(feature)
            y_true.extend(target.cpu().numpy())
            y_pred.extend(torch.argmax(out, dim=1).cpu().numpy())
    result = mutil_metrics(y_true, y_pred, type_avrage='macro')
    print('>>> Test {} Result:{} \n'.format(name, result))
    from sklearn.metrics import classification_report
    print(classification_report(y_true, y_pred, target_names=label_list, digits=3))
    save_experimental_details(result)
```

2) Saving the experiment record
Once the model is trained, we definitely want to keep a record of the run: besides the final results, it should include the important parameters of the experiment, and also the task information, so that different tasks can be analysed side by side.

```python
def save_experimental_details(test_result):
    save_path = os.path.join(data_path, task_name, result_file)
    var = [task_name, model_name, SEED, batch_size, max_length, learning_rate]
    names = ['task_dataset', 'model_name', 'Seed', 'Batch_size', 'Max_lenth', 'lr']
    vars_dict = {k: v for k, v in zip(names, var)}
    results = dict(test_result, **vars_dict)
    keys = list(results.keys())
    values = list(results.values())
    if not os.path.exists(save_path):
        ori = []
        ori.append(values)
        new_df = pd.DataFrame(ori, columns=keys)
        new_df.to_csv(os.path.join(data_path, task_name, result_file), index=False, sep='\t')
    else:
        df = pd.read_csv(save_path, sep='\t')
        new = pd.DataFrame(results, index=[1])
        df = df.append(new, ignore_index=True)
        df.to_csv(save_path, index=False, sep='\t')
    data_diagram = pd.read_csv(save_path, sep='\t')
    print('test_results \n', data_diagram)
```
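
One caveat (my note, not in the original): DataFrame.append was deprecated and later removed in pandas 2.x, so on newer pandas the else-branch can be written with pd.concat instead:

```python
# Equivalent to df.append(new, ignore_index=True) on pandas >= 2.0
df = pd.concat([df, new], ignore_index=True)
```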

6 Classify: prediction

For the prediction part I am running out of steam, so just skim through it.
model_predict returns the classification result for a single example; predict_csv predicts over a whole file and is used for batch processing or for preparing a competition submission.

```python
import jieba
import torch
import DataSet
import pandas as pd
import os
import re
from Config import max_length, label_list, model_name, data_path, task_name, predict_file

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def x_tokenize(x):
    str = re.sub('[^\u4e00-\u9fa5]', "", x)
    return jieba.lcut(str)

def getModel(name):
    model = torch.load('done_model/' + name + '_model.pkl')
    return model

def model_predict(model, sentence):
    model.eval()
    tokenized = x_tokenize(sentence)
    indexed = [DataSet.getTEXT().vocab.stoi[t] for t in tokenized]
    if len(indexed) > max_length:
        indexed = indexed[:max_length]
    else:
        for i in range(max_length - len(indexed)):
            indexed.append(DataSet.getTEXT().vocab.stoi['<pad>'])
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    tensor = torch.t(tensor)
    out = torch.softmax(model(tensor), dim=1)
    result = label_list[torch.argmax(out, dim=1)]
    response = {k: v for k, v in zip(label_list, out.cpu().detach().numpy()[0])}
    return response, result

def predict_csv(model):
    model.eval()
    outs = []
    df = pd.read_csv(os.path.join(data_path, task_name, predict_file), sep='\t')
    for sentence in df["query"]:
        tokenized = x_tokenize(sentence)
        indexed = [DataSet.getTEXT().vocab.stoi[t] for t in tokenized]
        if len(indexed) > max_length:
            indexed = indexed[:max_length]
        else:
            for i in range(max_length - len(indexed)):
                indexed.append(DataSet.getTEXT().vocab.stoi['<pad>'])
        tensor = torch.LongTensor(indexed).to(device)
        tensor = tensor.unsqueeze(1)
        tensor = torch.t(tensor)
        out = label_list[torch.argmax(model(tensor), dim=1)]
        outs.append(out)
    df["label"] = outs
    sumbit = df[["query", "label"]]
    sumbit.to_csv(os.path.join(data_path, task_name, "predict.txt"), index=False, sep='\t')

def load_model():
    model = getModel(model_name)
    model = model.to(device)
    return model

if __name__ == "__main__":
    model = load_model()
    sent1 = '你要那么确定百分百,家里房子卖了,上'
    response, result = model_predict(model, sent1)
    print('Probability distribution: {}\nPredicted label: {}'.format(response, result))
    predict_csv(model)
```

7 app.py: deploying the application

7.1 Rewriting app.py

Import model_predict and load_model from Classify. The earlier predict function does not need to be rewritten wholesale; only two places change:

  • Below the app object, add one line model = load_model() to load the trained model.
  • Pass the model into model_predict() inside predict():

```python
from Classify import model_predict, load_model
from flask import Flask, render_template, request

app = Flask(__name__, template_folder='templates', static_folder="templates", static_url_path='')

model = load_model()

# Fetch the form data from the front end and respond
@app.route('/predict', methods=['GET', 'POST'])
def predict():
    isHead = False
    if request.method == 'POST':
        sent1 = request.form.get('sentence1')
        sent2 = request.form.get('sentence2')
        text = sent1 + sent2
        # result is the predicted label, response is the soft-label distribution
        response, result = model_predict(model, text)
        return render_template('index.html', isHead=isHead, sent1=sent1, sent2=sent2, response=response, result=result)

@app.route('/', methods=['GET', 'POST'])
def index():
    isHead = True
    return render_template('index.html', isHead=isHead)

if __name__ == '__main__':
    app.run()
```
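
To check the deployed endpoint from Python (my own snippet; the form field names sentence1 and sentence2 come from the template), you can post a request against the locally running app:

```python
import requests

resp = requests.post("http://127.0.0.1:5000/predict",
                     data={"sentence1": "忽悠,", "sentence2": "继续忽悠"})
print(resp.status_code)  # 200 if the app rendered the result page
```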

7.2 The deployed text classifier in action

(Screenshot: the web page showing the predicted label and the class probability distribution for the entered text.)