Parts of this article are adapted from "A Visual Guide to Using BERT for the First Time" by Jay Alammar, and it is suitable as a first read for anyone unfamiliar with BERT.
It is a simple tutorial on using a variant of BERT for sentence classification. The example is simple enough to serve as a first introduction to BERT, while still covering the key concepts.

English example

Dataset: SST2

  The dataset used in this article is SST2, which contains sentences from movie reviews, each labeled either positive (value 1) or negative (value 0).

The dataset can be downloaded from: https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv
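
After downloading, a quick peek confirms the layout: column 0 is the review sentence and column 1 is the 0/1 label. A minimal sketch, assuming train.tsv sits in the working directory:

  import pandas as pd

  # Column 0: sentence, column 1: sentiment label (1 positive, 0 negative)
  df = pd.read_csv("train.tsv", delimiter="\t", header=None)
  print(df.shape)    # (number of reviews, 2)
  print(df.head())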
The DistilBERT model files can be downloaded from: https://www.kaggle.com/abhishek/distilbertbaseuncased
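
If the Kaggle download is inconvenient, the same checkpoint can also be pulled straight from the Hugging Face hub by model id (a minimal sketch; the first call downloads and caches the weights):

  import transformers as ppb

  # from_pretrained accepts either a local directory or a hub model id
  tokenizer = ppb.DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
  model = ppb.DistilBertModel.from_pretrained("distilbert-base-uncased")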

The complete code from the official tutorial:

  import numpy as np
  import pandas as pd
  from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
  from sklearn.linear_model import LogisticRegression
  import torch
  import transformers as ppb
  import warnings
  warnings.filterwarnings("ignore")

  # Load the dataset
  df = pd.read_csv("train.tsv", delimiter='\t', header=None)

  # Load the DistilBERT model and create the tokenizer
  pretrain_model_weights_name = 'pretrain_model/distilbert-base-uncased'
  model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, pretrain_model_weights_name)
  tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
  model = model_class.from_pretrained(pretrained_weights)

  # Tokenize each sentence (add_special_tokens inserts [CLS] and [SEP])
  tokenized = []
  for idx in range(len(df.values)):
      tokenized.append(tokenizer.encode(df[0][idx], add_special_tokens=True))
  # print(tokenized)

  # Pad all sequences to the same length
  max_len = 0
  for i in tokenized:
      if len(i) > max_len:
          max_len = len(i)
  padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized])
  # print(padded)

  # Attention mask: 1 for real tokens, 0 for padding
  attention_mask = np.where(padded != 0, 1, 0)
  # print(attention_mask)

  input_ids = torch.tensor(padded)
  attention_mask = torch.tensor(attention_mask)

  # Take the vectors from the last hidden layer
  with torch.no_grad():
      last_hidden_states = model(input_ids, attention_mask=attention_mask)

  # [:, 0, :] selects the hidden state of the first token ([CLS]) of every sentence
  features = last_hidden_states[0][:, 0, :].numpy()
  labels = df[1].values
  # print(features)

  train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
  lr_clf = LogisticRegression()
  lr_clf.fit(train_features, train_labels)
  score = lr_clf.score(test_features, test_labels)
  print(score)

  from sklearn.dummy import DummyClassifier
  clf = DummyClassifier()
  scores = cross_val_score(clf, train_features, train_labels)
  print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
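
The tokenize/pad/mask steps above follow the original tutorial line by line. In newer versions of the transformers library, the tokenizer can perform all three in one call; a minimal sketch of the equivalent, assuming a transformers version with the batched __call__ API (v3.0 or later):

  # padding=True pads every sentence to the longest one in the batch and
  # returns the matching attention_mask alongside the input_ids
  batch = tokenizer(list(df[0]), padding=True, truncation=True, return_tensors="pt")
  with torch.no_grad():
      last_hidden_states = model(batch["input_ids"], attention_mask=batch["attention_mask"])
  features = last_hidden_states[0][:, 0, :].numpy()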

Chinese example

Following the official tutorial, we try the same approach on Chinese:

The Chinese data uses CLUE's public TNews dataset, which can be downloaded from:

https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip

The Chinese pretrained model clue_albert_chinese_tiny can be downloaded from:

https://huggingface.co/clue/albert_chinese_tiny/tree/main
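
One detail worth flagging before the code: in the listing below, AlbertModel is paired with BertTokenizer, not AlbertTokenizer. The CLUE ALBERT checkpoints ship a BERT-style vocab.txt rather than a SentencePiece file, so the model card loads them this way. A minimal sketch using the hub model id:

  from transformers import AlbertModel, BertTokenizer

  # BertTokenizer reads the checkpoint's vocab.txt; AlbertTokenizer would
  # expect a SentencePiece model that these checkpoints do not include
  tokenizer = BertTokenizer.from_pretrained("clue/albert_chinese_tiny")
  model = AlbertModel.from_pretrained("clue/albert_chinese_tiny")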

  # Load the TNews data (one JSON object per line) and keep the first 800 rows
  cn_df = pd.read_json("train.json", lines=True, encoding="utf-8")
  cn_df = cn_df[:800]
  cn_df.head()

  # albert_chinese_tiny is loaded with AlbertModel but tokenized with BertTokenizer
  pretrain_model_weights_name = 'pretrain_model/clue_albert_chinese_tiny'
  model_class, tokenizer_class, pretrained_weights = (ppb.AlbertModel, ppb.BertTokenizer, pretrain_model_weights_name)
  tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
  model = model_class.from_pretrained(pretrained_weights)

  # Tokenize each sentence (add_special_tokens inserts [CLS] and [SEP])
  tokenized = []
  for idx in range(len(cn_df.values)):
      tokenized.append(tokenizer.encode(cn_df["sentence"][idx], add_special_tokens=True))
  # print(tokenized)

  # Pad all sequences to the same length
  max_len = 0
  for i in tokenized:
      if len(i) > max_len:
          max_len = len(i)
  padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized])
  # print(padded)

  # Attention mask: 1 for real tokens, 0 for padding
  attention_mask = np.where(padded != 0, 1, 0)
  # print(attention_mask)

  input_ids = torch.tensor(padded)
  attention_mask = torch.tensor(attention_mask)

  # Take the vectors from the last hidden layer
  with torch.no_grad():
      last_hidden_states = model(input_ids, attention_mask=attention_mask)

  # [:, 0, :] selects the hidden state of the first token ([CLS]) of every sentence
  features = last_hidden_states[0][:, 0, :].numpy()
  labels = cn_df["label"].values
  # print(features)

  train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
  lr_clf = LogisticRegression()
  lr_clf.fit(train_features, train_labels)
  lr_clf.score(test_features, test_labels)

  from sklearn.dummy import DummyClassifier
  clf = DummyClassifier()
  scores = cross_val_score(clf, train_features, train_labels)
  print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
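
With the features extracted and the logistic regression fitted, classifying a fresh headline reuses exactly the same encoding path. A minimal sketch (the sample sentence is invented for illustration):

  # Encode one new sentence the same way as the training data; no padding
  # is needed for a single sentence, so no attention mask is required
  new_ids = torch.tensor([tokenizer.encode("上海举办国际进口博览会", add_special_tokens=True)])
  with torch.no_grad():
      out = model(new_ids)
  new_feature = out[0][:, 0, :].numpy()   # [CLS] vector, shape (1, hidden_size)
  print(lr_clf.predict(new_feature))      # predicted TNews label id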