Since Google introduced BERT in 2018, pre-trained models have become the workhorse of NLP, and ready-to-use pre-trained models are now widely available. In this series I walk through a highly reusable text-classification codebase built on a pre-trained BERT model, split across three posts. This first post covers the data-loading code.
Source code download: https://pan.baidu.com/s/1pvJmbaLU9fldm9eBjvKKcg
Extraction code: 2021

(1) Data Loading

The data-loading code lives in data_loader.py. Below I go through each class and function and explain how they fit together.
The file contains the following classes and functions:
In brief:
InputExample: a sample object. One is created for each raw sample; its internal methods can be overridden to fit your task.
InputFeatures: a feature object. One is created for each encoded sample; its internal methods can be overridden to fit your task.
iflytekProcessor: the processor. It reads the raw data file and returns InputExample objects. The class name is not fixed: rename it and adapt the reading logic to your own task. Since the iflytek dataset is used here, the processor is named iflytekProcessor.
convert_examples_to_features: the conversion function. It takes InputExample objects and turns them into InputFeatures objects.
load_and_cache_examples: the caching function. It caches the InputFeatures produced by convert_examples_to_features to disk, so they do not have to be rebuilt on every training run.

Logic: iflytekProcessor reads the raw file into InputExample objects; convert_examples_to_features encodes them into InputFeatures; load_and_cache_examples caches the features and wraps them in a TensorDataset for training.
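
A minimal sketch of this chain, assuming args and tokenizer are already constructed (load_and_cache_examples performs the first two steps internally):

```python
# illustrative call chain, not a separate script in the repo
processor = iflytekProcessor(args)
examples = processor.get_examples("train")                                       # raw rows  -> InputExample
features = convert_examples_to_features(examples, args.max_seq_len, tokenizer)   # examples  -> InputFeatures
dataset = load_and_cache_examples(args, tokenizer, mode="train")                 # cached features -> TensorDataset
```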

InputExample

This class is straightforward. Its __init__ defines a few attributes of an input sample:
guid: a unique id for the example.
words: the example sentence.
label: the example's label.
Nothing needs to change for plain classification. For text matching you can add a second text field, e.g. self.words_pair = words_pair, depending on your task (a hedged variant is sketched after the class definition below).

```python
import copy
import json


class InputExample(object):
    """
    A single training/test example for simple sequence classification.

    Args:
        guid: Unique id for the example.
        words: list. The words of the sequence.
        label: (Optional) string. The label of the example.
    """

    def __init__(self, guid, words, label=None):
        self.guid = guid
        self.words = words
        self.label = label

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
```
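
For a sentence-pair task (text matching), a hedged variant could look like the following; the class name and the words_pair field are illustrative and not part of the repo:

```python
class InputPairExample(InputExample):
    """Illustrative example object for sentence-pair tasks (not in the original code)."""

    def __init__(self, guid, words, words_pair, label=None):
        super().__init__(guid, words, label=label)
        self.words_pair = words_pair  # the second sentence of the pair
```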

InputFeatures

This class describes BERT's input format: input_ids, attention_mask and token_type_ids, plus a label_id.
If you are also using a BERT model, nothing needs to change here; for a different pre-trained model, adapt the fields to that model's input format.

```python
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, attention_mask, token_type_ids, label_id):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.token_type_ids = token_type_ids
        self.label_id = label_id

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
```

iflytekProcessor

```python
# relies on the module-level imports of data_loader.py: os, logging, pandas as pd,
# and get_labels from utils; logger = logging.getLogger(__name__)


class iflytekProcessor(object):
    """Processor for the iflytek classification data set."""

    def __init__(self, args):
        self.args = args
        self.labels = get_labels(args)
        self.input_text_file = 'data.csv'

    @classmethod
    def _read_file(cls, input_file, quotechar=None):
        """Reads a csv file into a DataFrame."""
        df = pd.read_csv(input_file)
        return df

    def _create_examples(self, datas, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for i, rows in datas.iterrows():
            try:
                guid = "%s-%s" % (set_type, i)
                # 1. input text
                words = rows["text"]
                # 2. label
                label = rows["labels"]
            except Exception:
                # skip malformed rows instead of appending undefined values
                print(rows)
                continue
            examples.append(InputExample(guid=guid, words=words, label=label))
        return examples

    def get_examples(self, mode):
        """
        Args:
            mode: train, dev, test
        """
        data_path = os.path.join(self.args.data_dir, self.args.task, mode)
        logger.info("LOOKING AT {}".format(data_path))
        return self._create_examples(datas=self._read_file(os.path.join(data_path, self.input_text_file)),
                                     set_type=mode)
```
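
The processor expects a data.csv with text and labels columns under data_dir/task/mode/. A hedged illustration of that layout (the paths and contents are examples, not fixed by the repo):

```python
import os
import pandas as pd

# build a tiny csv in the expected format: one text column and one integer label column
os.makedirs("data/iflytek/train", exist_ok=True)
df = pd.DataFrame({"text": ["今天天气不错", "这个app很好用"], "labels": [3, 17]})
df.to_csv("data/iflytek/train/data.csv", index=False)  # directory layout mirrors get_examples()
```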

convert_examples_to_features

This function encodes the examples the way BERT expects: each example is tokenized and converted into input_ids, attention_mask and token_type_ids, padded or truncated to max_seq_len, and the result is returned as a list of InputFeatures.

```python
def convert_examples_to_features(examples, max_seq_len, tokenizer,
                                 cls_token_segment_id=0,
                                 pad_token_segment_id=0,
                                 sequence_a_segment_id=0,
                                 mask_padding_with_zero=True):
    # Setting based on the current model type
    cls_token = tokenizer.cls_token
    sep_token = tokenizer.sep_token
    unk_token = tokenizer.unk_token
    pad_token_id = tokenizer.pad_token_id

    features = []
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            logger.info("Writing example %d of %d" % (ex_index, len(examples)))

        # Tokenize word by word (for NER)
        tokens = []
        for word in example.words:
            word_tokens = tokenizer.tokenize(word)
            if not word_tokens:
                word_tokens = [unk_token]  # For handling the bad-encoded word
            tokens.extend(word_tokens)

        # Account for [CLS] and [SEP]
        special_tokens_count = 2
        if len(tokens) > max_seq_len - special_tokens_count:
            tokens = tokens[:(max_seq_len - special_tokens_count)]

        # Add [SEP] token
        tokens += [sep_token]
        token_type_ids = [sequence_a_segment_id] * len(tokens)

        # Add [CLS] token
        tokens = [cls_token] + tokens
        token_type_ids = [cls_token_segment_id] + token_type_ids

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding_length = max_seq_len - len(input_ids)
        input_ids = input_ids + ([pad_token_id] * padding_length)
        attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
        token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)

        assert len(input_ids) == max_seq_len, "Error with input length {} vs {}".format(len(input_ids), max_seq_len)
        assert len(attention_mask) == max_seq_len, "Error with attention mask length {} vs {}".format(len(attention_mask), max_seq_len)
        assert len(token_type_ids) == max_seq_len, "Error with token type length {} vs {}".format(len(token_type_ids), max_seq_len)

        label_id = int(example.label)

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s" % example.guid)
            logger.info("tokens: %s" % " ".join([str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("attention_mask: %s" % " ".join([str(x) for x in attention_mask]))
            logger.info("token_type_ids: %s" % " ".join([str(x) for x in token_type_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))

        features.append(
            InputFeatures(input_ids=input_ids,
                          attention_mask=attention_mask,
                          token_type_ids=token_type_ids,
                          label_id=label_id,
                          ))

    return features
```
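
A hedged usage sketch of the function above; the tokenizer checkpoint and max_seq_len are illustrative, not fixed by the repo:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
example = InputExample(guid="train-0", words="今天天气不错", label="3")
features = convert_examples_to_features([example], max_seq_len=16, tokenizer=tokenizer)
print(features[0].input_ids)       # [CLS] + token ids + [SEP], zero-padded to length 16
print(features[0].attention_mask)  # 1s over real tokens, 0s over padding
```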

load_and_cache_examples

The load-and-cache function does two things, as its name suggests: reading and saving.

1. First, the cache file name cached_features_file is built from the arguments, of the form cached_{mode}_{task}_{model name}_{max_seq_len} (for example, cached_train_iflytek_bert-base-chinese_128).
2. Then it checks whether that cache file already exists under the data directory. If it does, the saved features are loaded directly; if not, they are created:

  • first, the processor produces the examples;
  • then, convert_examples_to_features produces the features, which are saved to the cache file.

3. Finally, the features are converted to tensors and wrapped into a dataset with TensorDataset.

```python
def load_and_cache_examples(args, tokenizer, mode):
    # processors maps a task name to its processor class, e.g. {'iflytek': iflytekProcessor}
    processor = processors[args.task](args)

    # Load data features from cache or dataset file
    cached_features_file = os.path.join(
        args.data_dir,
        'cached_{}_{}_{}_{}'.format(
            mode,
            args.task,
            list(filter(None, args.model_name_or_path.split("/"))).pop(),
            args.max_seq_len
        )
    )
    print(cached_features_file)

    if os.path.exists(cached_features_file):
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        # Load data features from dataset file
        logger.info("Creating features from dataset file at %s", args.data_dir)
        if mode == "train":
            examples = processor.get_examples("train")
        elif mode == "dev":
            examples = processor.get_examples("dev")
        elif mode == "test":
            examples = processor.get_examples("test")
        else:
            raise Exception("For mode, only train, dev, test is available")

        features = convert_examples_to_features(examples,
                                                args.max_seq_len,
                                                tokenizer,
                                                )
        logger.info("Saving features into cached file %s", cached_features_file)
        torch.save(features, cached_features_file)

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor(
        [f.input_ids for f in features],
        dtype=torch.long
    )
    all_attention_mask = torch.tensor(
        [f.attention_mask for f in features],
        dtype=torch.long
    )
    all_token_type_ids = torch.tensor(
        [f.token_type_ids for f in features],
        dtype=torch.long
    )
    all_label_ids = torch.tensor(
        [f.label_id for f in features],
        dtype=torch.long
    )

    dataset = TensorDataset(all_input_ids, all_attention_mask,
                            all_token_type_ids, all_label_ids)
    return dataset
```
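
A hedged sketch of consuming the returned TensorDataset, assuming args and tokenizer are already set up (the batch size and sampler choice are illustrative):

```python
from torch.utils.data import DataLoader, RandomSampler

train_dataset = load_and_cache_examples(args, tokenizer, mode="train")
train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=32)
# each batch unpacks in the same order the tensors were stacked above
input_ids, attention_mask, token_type_ids, label_ids = next(iter(train_loader))
```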

Output preview

Running load_and_cache_examples logs the first few converted examples (guid, tokens, input_ids, attention_mask, token_type_ids and label), which is a quick sanity check that the encoding is correct.

Preview: the next part covers model construction and training.

(2) The Model

Source code

In the source code the models live in a separate model folder. Let's first look at module.py, which contains a simple fully connected network used as the classifier.
The classifier's structure is minimal, just two layers:

  • a dropout layer
  • a Linear layer

```python
# module.py
import torch.nn as nn


# classifier head
class IntentClassifier(nn.Module):
    def __init__(self, input_dim, num_labels, dropout_rate=0.):
        super(IntentClassifier, self).__init__()
        self.dropout = nn.Dropout(dropout_rate)
        self.linear = nn.Linear(input_dim, num_labels)

    def forward(self, x):
        x = self.dropout(x)
        return self.linear(x)
```
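
A quick smoke test of this head; the dimensions below are illustrative (BERT-base hidden size 768, and iflytek has on the order of a hundred labels):

```python
import torch

clf = IntentClassifier(input_dim=768, num_labels=119, dropout_rate=0.1)
logits = clf(torch.randn(4, 768))  # a fake batch of 4 pooled BERT vectors
print(logits.shape)                # torch.Size([4, 119])
```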

Next, the key part: the BERT model code.

```python
import torch
import torch.nn as nn
from transformers import BertPreTrainedModel, BertModel, BertConfig
from torchcrf import CRF

from .module import IntentClassifier


class ClsBERT(BertPreTrainedModel):
    def __init__(self, config, args, label_lst):
        super(ClsBERT, self).__init__(config)
        self.args = args
        self.num_labels = len(label_lst)
        self.bert = BertModel(config=config)  # Load pretrained bert
        self.classifier = IntentClassifier(config.hidden_size, self.num_labels, args.dropout_rate)

    def forward(self, input_ids, attention_mask, token_type_ids, label_ids):
        outputs = self.bert(input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)  # sequence_output, pooled_output, (hidden_states), (attentions)
        sequence_output = outputs[0]
        pooled_output = outputs[1]  # [CLS]

        logits = self.classifier(pooled_output)
        outputs = ((logits),) + outputs[2:]  # add hidden states and attention if they are here

        # 1. Intent Softmax
        if label_ids is not None:
            if self.num_labels == 1:
                loss_fct = nn.MSELoss()
                loss = loss_fct(logits.view(-1), label_ids.view(-1))
            else:
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), label_ids.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)
```

The key is the forward part:

```python
outputs = self.bert(input_ids, attention_mask=attention_mask,
                    token_type_ids=token_type_ids)  # sequence_output, pooled_output, (hidden_states), (attentions)
sequence_output = outputs[0]  # or: sequence_output = outputs.last_hidden_state
pooled_output = outputs[1]    # [CLS]; or: pooled_output = outputs.pooler_output
```

When fine-tuning with transformers, self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids) returns two main values. One is sequence_output, the last-layer hidden states of every token, with shape (batch_size, sequence_length, bert_hidden_size). The other is pooled_output, the last-layer hidden state of the first token ([CLS]) further processed by a linear layer and a tanh activation, with shape (batch_size, bert_hidden_size).

  • sequence_output can be obtained via outputs[0] or outputs.last_hidden_state.
  • pooled_output can be obtained via outputs[1] or outputs.pooler_output.

For classification, a common recipe is to pool the last-layer output and feed it into a linear layer. In this code you can pass outputs.pooler_output directly to the linear layer, or use outputs.last_hidden_state.mean(dim=1) instead; in my tests the latter performs slightly better. A hedged pooling helper is sketched below.
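
A minimal sketch of the two pooling choices, assuming plain tensors for the model outputs; the masked mean is an extra refinement that is not in the repo:

```python
import torch


def pool_bert_output(last_hidden_state, pooler_output, attention_mask, strategy="mean"):
    """Illustrative helper: pick between the [CLS] pooler_output and a masked mean."""
    if strategy == "cls":
        return pooler_output                          # (batch_size, hidden_size)
    mask = attention_mask.unsqueeze(-1).float()       # (batch_size, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)    # zero out padding positions before summing
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens per sample
    return summed / counts                            # masked mean, (batch_size, hidden_size)
```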

Improving the BERT output

We know that BERT-base consists of 12 transformer layers. What if we want the vector from one specific layer, or want to concatenate several layers?
The documentation of BertModel (a BertPreTrainedModel subclass) describes the return value outputs as follows:
Outputs: Tuple comprising various elements depending on the configuration (config) and inputs:
last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)
Sequence of hidden-states at the output of the last layer of the model.
pooler_output: torch.FloatTensor of shape (batch_size, hidden_size)
Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during BERT pretraining. This output is usually not a good summary of the semantic content of the input; you're often better off averaging or pooling the sequence of hidden-states for the whole input sequence.
hidden_states: (optional, returned when config.output_hidden_states=True) list of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size):
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions: (optional, returned when config.output_attentions=True) list of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length): Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
According to the documentation, the **hidden_states** entry returns the output of every layer, each with shape (batch_size, sequence_length, hidden_size). To get this list you must set output_hidden_states=True (in the config or in the forward call), otherwise **hidden_states** is not returned. A small sketch of how to enable it follows.
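
A minimal sketch of the two equivalent ways to enable it (the checkpoint name is an assumption):

```python
from transformers import BertConfig, BertModel

# option 1: enable it in the config once, at load time
config = BertConfig.from_pretrained("bert-base-chinese", output_hidden_states=True)
model = BertModel.from_pretrained("bert-base-chinese", config=config)

# option 2: request it per call, as the modified ClsBERT below does
# outputs = model(input_ids, attention_mask=..., output_hidden_states=True)
```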

Let's now modify the code to take the output of the third-to-last transformer layer:

```python
class ClsBERT(BertPreTrainedModel):
    def __init__(self, config, args, label_lst):
        super(ClsBERT, self).__init__(config)
        self.args = args
        self.num_labels = len(label_lst)
        self.bert = BertModel(config=config)  # Load pretrained bert
        self.classifier = IntentClassifier(config.hidden_size, self.num_labels, args.dropout_rate)

    def forward(self, input_ids, attention_mask, token_type_ids, label_ids):
        # added: output_hidden_states=True so that outputs.hidden_states is returned
        outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
                            output_hidden_states=True)

        # changed: pooled_output
        #   hidden_states[-3] is the output of the third-to-last layer
        #   mean(dim=1) average-pools over the sequence, giving (batch_size, hidden_size)
        pooled_output = outputs.hidden_states[-3].mean(dim=1)

        logits = self.classifier(pooled_output)
        outputs = ((logits),) + outputs[2:]  # add hidden states and attention if they are here

        # 1. Intent Softmax
        if label_ids is not None:
            if self.num_labels == 1:
                loss_fct = nn.MSELoss()
                loss = loss_fct(logits.view(-1), label_ids.view(-1))
            else:
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), label_ids.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)
```

Since we can access every layer's output, we can also concatenate vectors from different layers. The modified code is below:
  • first create an empty tensor to accumulate into (it must share the hidden states' float dtype and batch dimension so that torch.cat works);
  • then loop over the chosen layers and torch.cat their mean-pooled vectors;
  • remember to enlarge the input dimension of the linear classifier so it matches the hidden size of the concatenated pooled_output.

```python
class ClsBERT(BertPreTrainedModel):
    def __init__(self, config, args, label_lst):
        super(ClsBERT, self).__init__(config)
        self.args = args
        self.num_labels = len(label_lst)
        self.bert = BertModel(config=config)  # Load pretrained bert
        # concat_num is assumed to be configurable; -3 means "concatenate the last three layers"
        self.concat_num = -3
        # the classifier input must match the concatenated size: |concat_num| * hidden_size
        self.classifier = IntentClassifier(config.hidden_size * abs(self.concat_num),
                                           self.num_labels, args.dropout_rate)

    def forward(self, input_ids, attention_mask, token_type_ids, label_ids):
        # added: output_hidden_states=True
        outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
                            output_hidden_states=True)

        # changed: pooled_output
        #   start from an empty float tensor with the right batch dimension,
        #   then concatenate the mean-pooled output of each selected layer
        pooled_output = torch.empty(input_ids.size(0), 0, device=input_ids.device)
        for layer in outputs.hidden_states[self.concat_num:]:
            pooled_output = torch.cat((pooled_output, layer.mean(dim=1)), dim=1)
        # end of change

        logits = self.classifier(pooled_output)
        outputs = ((logits),) + outputs[2:]  # add hidden states and attention if they are here

        # 1. Intent Softmax
        if label_ids is not None:
            if self.num_labels == 1:
                loss_fct = nn.MSELoss()
                loss = loss_fct(logits.view(-1), label_ids.view(-1))
            else:
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), label_ids.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)
```
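
A quick shape check of the concatenation idea, with illustrative sizes (batch 4, sequence length 32, BERT-base hidden size 768, 12 layers plus embeddings):

```python
import torch

hidden_states = [torch.randn(4, 32, 768) for _ in range(13)]  # embeddings + 12 transformer layers
pooled = torch.cat([layer.mean(dim=1) for layer in hidden_states[-3:]], dim=1)
print(pooled.shape)  # torch.Size([4, 2304]) -> the classifier input must be 3 * 768
```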

This concludes the model code, together with the discussion of how BERT's output representations can be used and improved.
The optimizer, learning rate schedule and loss function are covered in the next post on the training code.

Preview: the following parts cover hyperparameter tuning and prediction.

(3) Training

The Trainer class below ties everything together: grouping parameters for the optimizer with separate learning rates for BERT and the classifier, the linear warmup schedule, optional adversarial training (FGM/PGD), periodic evaluation with early stopping on the dev kappa score, and saving the model and results.

```python
import os
import logging
import random

import numpy as np
import torch
import pandas as pd
from tqdm import tqdm, trange
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import BertConfig, AdamW, get_linear_schedule_with_warmup

from adversarial_training import FGM, PGD
from utils import MODEL_CLASSES, compute_metrics, get_labels

logger = logging.getLogger(__name__)


class Trainer(object):
    def __init__(self, args, train_dataset=None, dev_dataset=None, test_dataset=None):
        self.args = args
        self.train_dataset = train_dataset
        self.dev_dataset = dev_dataset
        self.test_dataset = test_dataset
        self.test_results = None

        self.label_lst = get_labels(args)

        self.config_class, self.model_class, _ = MODEL_CLASSES[args.model_type]
        self.config = self.config_class.from_pretrained(args.model_name_or_path, finetuning_task=args.task)
        self.model = self.model_class.from_pretrained(args.model_name_or_path,
                                                      config=self.config,
                                                      args=args,
                                                      label_lst=self.label_lst,
                                                      )

        # GPU or CPU
        self.device = "cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu"
        self.model.to(self.device)

        # for adversarial training
        self.adv_trainer = None
        if self.args.at_method:
            if self.args.at_method == "fgm":
                self.adv_trainer = FGM(
                    self.model, epsilon=self.args.epsilon_for_at
                )
            elif self.args.at_method == "pgd":
                self.adv_trainer = PGD(
                    self.model,
                    epsilon=self.args.epsilon_for_at,
                    alpha=self.args.alpha_for_at,
                )
            else:
                raise ValueError(
                    "un-supported adversarial training method: {} !!!".format(self.args.at_method)
                )

    def train(self):
        train_sampler = RandomSampler(self.train_dataset)
        train_dataloader = DataLoader(self.train_dataset, sampler=train_sampler, batch_size=self.args.train_batch_size)

        if self.args.max_steps > 0:
            t_total = self.args.max_steps
            self.args.num_train_epochs = self.args.max_steps // (len(train_dataloader) // self.args.gradient_accumulation_steps) + 1
        else:
            t_total = len(train_dataloader) // self.args.gradient_accumulation_steps * self.args.num_train_epochs

        # print parameter names (debug)
        for n, p in self.model.named_parameters():
            print(n)

        # BERT parameters get their own (lower) learning rate
        optimizer_grouped_parameters = []
        bert_params = list(self.model.bert.named_parameters())
        # Prepare optimizer and schedule (linear warmup and decay)
        no_decay = ['bias', 'LayerNorm.weight']
        optimizer_grouped_parameters += [
            {
                'params': [p for n, p in bert_params if not any(nd in n for nd in no_decay)],
                'weight_decay': self.args.weight_decay,
                "lr": self.args.learning_rate,
            },
            {
                'params': [p for n, p in bert_params if any(nd in n for nd in no_decay)],
                'weight_decay': 0.0,
                'lr': self.args.learning_rate,
            }
        ]
        # Linear classifier parameters
        linear_params = list(self.model.classifier.named_parameters())
        no_decay = ['bias', 'LayerNorm.weight']
        optimizer_grouped_parameters += [
            {
                'params': [p for n, p in linear_params if not any(nd in n for nd in no_decay)],
                'weight_decay': self.args.weight_decay,
                "lr": self.args.linear_learning_rate,
            },
            {
                'params': [p for n, p in linear_params if any(nd in n for nd in no_decay)],
                'weight_decay': 0.0,
                'lr': self.args.linear_learning_rate,
            }
        ]
        optimizer = AdamW(optimizer_grouped_parameters, lr=self.args.learning_rate, eps=self.args.adam_epsilon)
        scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=self.args.warmup_steps, num_training_steps=t_total)

        # Train!
        logger.info("***** Running training *****")
        logger.info("  Num examples = %d", len(self.train_dataset))
        logger.info("  Num Epochs = %d", self.args.num_train_epochs)
        logger.info("  Total train batch size = %d", self.args.train_batch_size)
        logger.info("  Gradient Accumulation steps = %d", self.args.gradient_accumulation_steps)
        logger.info("  Total optimization steps = %d", t_total)
        logger.info("  Logging steps = %d", self.args.logging_steps)
        logger.info("  Save steps = %d", self.args.save_steps)

        wait = 0
        global_step = 0
        tr_loss = 0.0
        best_score = 0.0
        self.model.zero_grad()

        train_iterator = trange(int(self.args.num_train_epochs), desc="Epoch")
        for _ in train_iterator:
            epoch_iterator = tqdm(train_dataloader, desc="Iteration")
            for step, batch in enumerate(epoch_iterator):
                self.model.train()
                batch = tuple(t.to(self.device) for t in batch)  # GPU or CPU

                inputs = {'input_ids': batch[0],
                          'attention_mask': batch[1],
                          'label_ids': batch[3],
                          }
                if self.args.model_type != 'distilbert':
                    inputs['token_type_ids'] = batch[2]
                outputs = self.model(**inputs)
                loss = outputs[0]

                if self.args.gradient_accumulation_steps > 1:
                    loss = loss / self.args.gradient_accumulation_steps

                loss.backward()

                # adversarial training start
                if self.args.at_method is not None:
                    if random.uniform(0, 1) > self.args.probs_for_at:
                        logger.info("not to do adv training at this step!")
                    else:
                        logger.info("do adv training at this step!")
                        if self.args.at_method == "fgm":
                            self.adv_trainer.attack()  # the embeddings are perturbed here
                            # optimizer.zero_grad()  # uncomment if you do not want to accumulate gradients
                            outputs_at = self.model(**inputs)
                            loss_at = outputs_at[0]
                            loss_at.backward()  # backprop: adversarial gradients are added on top of the normal ones
                            self.adv_trainer.restore()  # restore the embedding parameters
                        elif self.args.at_method == "pgd":
                            self.adv_trainer.backup_grad()  # save the normal gradients
                            # adversarial training loop
                            for t in range(self.args.steps_for_at):
                                # add adversarial perturbation to the embeddings; back up param.data on the first attack
                                self.adv_trainer.attack(is_first_attack=(t == 0))
                                if t != self.args.steps_for_at - 1:
                                    optimizer.zero_grad()
                                else:
                                    self.adv_trainer.restore_grad()  # restore the normal gradients
                                outputs_at = self.model(**inputs)
                                loss_at = outputs_at[0]
                                loss_at.backward()  # accumulate adversarial gradients on top of the normal ones
                            self.adv_trainer.restore()  # restore the embedding parameters
                # adversarial training end

                tr_loss += loss.item()
                if (step + 1) % self.args.gradient_accumulation_steps == 0:
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.args.max_grad_norm)
                    optimizer.step()
                    scheduler.step()  # Update learning rate schedule
                    self.model.zero_grad()
                    global_step += 1

                    if self.args.logging_steps > 0 and global_step % self.args.logging_steps == 0:
                        results = self.evaluate("dev")
                        if best_score < results["kappa"]:
                            wait = 0
                            best_score = results["kappa"]
                            self.save_model()
                        else:
                            wait += 1
                            print("early stop {}/{}".format(wait, self.args.wait_patient))
                        if wait >= self.args.wait_patient:
                            break

                    # if self.args.save_steps > 0 and global_step % self.args.save_steps == 0:
                    #     self.save_model()

                if 0 < self.args.max_steps < global_step:
                    epoch_iterator.close()
                    break

            if 0 < self.args.max_steps < global_step:
                train_iterator.close()
                break

        return global_step, tr_loss / global_step

    def evaluate(self, mode):
        if mode == 'test':
            dataset = self.test_dataset
        elif mode == 'dev':
            dataset = self.dev_dataset
        else:
            raise Exception("Only dev and test dataset available")

        eval_sampler = SequentialSampler(dataset)
        eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=self.args.eval_batch_size)

        # Eval!
        logger.info("***** Running evaluation on %s dataset *****", mode)
        logger.info("  Num examples = %d", len(dataset))
        logger.info("  Batch size = %d", self.args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None

        self.model.eval()

        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            batch = tuple(t.to(self.device) for t in batch)
            with torch.no_grad():
                inputs = {'input_ids': batch[0],
                          'attention_mask': batch[1],
                          'label_ids': batch[3],
                          }
                if self.args.model_type != 'distilbert':
                    inputs['token_type_ids'] = batch[2]
                outputs = self.model(**inputs)
                tmp_eval_loss, logits = outputs[:2]

                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1

            # Intent prediction
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs['label_ids'].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(
                    out_label_ids, inputs['label_ids'].detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        results = {
            "loss": round(eval_loss, 7)
        }

        # Intent result
        preds = np.argmax(preds, axis=1)
        total_result = compute_metrics(preds, out_label_ids)
        results.update(total_result)

        if mode == 'test':
            self.test_results = results
            self.save_results()

        logger.info("***** Eval results *****")
        for key in sorted(results.keys()):
            logger.info("  %s = %s", key, str(results[key]))
        return results

    def save_model(self):
        # Save model checkpoint (Overwrite)
        if not os.path.exists(self.args.model_dir):
            os.makedirs(self.args.model_dir)
        model_to_save = self.model.module if hasattr(self.model, 'module') else self.model
        model_to_save.save_pretrained(self.args.model_dir)

        # Save training arguments together with the trained model
        torch.save(self.args, os.path.join(self.args.model_dir, 'training_args.bin'))
        logger.info("Saving model checkpoint to %s", self.args.model_dir)

    def load_model(self):
        # Check whether model exists
        if not os.path.exists(self.args.model_dir):
            raise Exception("Model doesn't exists! Train first!")

        try:
            self.model = self.model_class.from_pretrained(self.args.model_dir,
                                                          args=self.args,
                                                          label_lst=self.label_lst,)
            self.model.to(self.device)
            logger.info("***** Model Loaded *****")
        except Exception:
            raise Exception("Some model files might be missing...")

    def save_results(self):
        if not os.path.exists(self.args.results_dir):
            os.makedirs(self.args.results_dir)
        var = [self.args.task, self.args.learning_rate, self.args.num_train_epochs, self.args.max_seq_len, self.args.seed]
        names = ['task', 'lr', 'epoch', 'max_len', 'seed']
        vars_dict = {k: v for k, v in zip(names, var)}
        results = dict(self.test_results, **vars_dict)
        keys = list(results.keys())
        values = list(results.values())

        file_name = 'results.csv'
        results_path = os.path.join(self.args.results_dir, file_name)

        if not os.path.exists(results_path):
            ori = []
            ori.append(values)
            df1 = pd.DataFrame(ori, columns=keys)
            df1.to_csv(results_path, index=False)
        else:
            df1 = pd.read_csv(results_path)
            new = pd.DataFrame(results, index=[1])
            df1 = df1.append(new, ignore_index=True)
            df1.to_csv(results_path, index=False)

        data_diagram = pd.read_csv(results_path)
        print('test_results', data_diagram)
```
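
A hedged sketch of how Trainer is typically driven from the entry script; the names mirror the code above, but the actual main.py (argparse setup, tokenizer loading) is not shown here:

```python
# assumed driver code; args and tokenizer are set up elsewhere
train_dataset = load_and_cache_examples(args, tokenizer, mode="train")
dev_dataset = load_and_cache_examples(args, tokenizer, mode="dev")
test_dataset = load_and_cache_examples(args, tokenizer, mode="test")

trainer = Trainer(args, train_dataset, dev_dataset, test_dataset)
trainer.train()           # fine-tune, evaluating on dev every logging_steps and keeping the best checkpoint
trainer.load_model()      # reload the best saved checkpoint
trainer.evaluate("test")  # final test-set metrics, appended to results.csv
```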

(4) Hyperparameter Tuning

(5) Prediction