This article is based in part on "A Visual Guide to Using BERT for the First Time" by Jay Alammar, and works well as a first read for anyone unfamiliar with BERT.
It is a simple tutorial on using a distilled variant of BERT (DistilBERT) for sentence classification. The example is simple enough to serve as a first introduction to BERT, yet it still covers the key concepts involved.
English Example
Dataset: SST2
The dataset used here is SST2. It contains sentences from movie reviews, each carrying a label: positive sentiment (value 1) or negative sentiment (value 0).
Download link for the dataset used in this article: https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv
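As a quick sanity check before running the full script, the snippet below loads the file and prints a few rows and the label distribution. This is a minimal sketch, assuming the file from the link above has been saved as train.tsv in the working directory.

import pandas as pd

# Assumption: the SST2 file downloaded above is saved locally as "train.tsv".
df = pd.read_csv("train.tsv", delimiter='\t', header=None)
print(df.head())             # column 0: review sentence, column 1: label (1 = positive, 0 = negative)
print(df[1].value_counts())  # class balance of the training set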
Download the DistilBERT model files from: https://www.kaggle.com/abhishek/distilbertbaseuncased
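If you would rather not fetch the Kaggle files by hand, the same weights can be loaded by name from the Hugging Face hub. A minimal sketch, assuming network access and the transformers package installed:

import transformers as ppb

# Assumption: loading "distilbert-base-uncased" from the Hugging Face hub instead of the local Kaggle copy.
tokenizer = ppb.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = ppb.DistilBertModel.from_pretrained('distilbert-base-uncased')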
The complete code from the original tutorial:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
import torch
import transformers as ppb
import warnings
warnings.filterwarnings("ignore")

# Read the dataset (the SST2 training file downloaded above)
df = pd.read_csv("train.tsv", delimiter='\t', header=None)

# Load the DistilBERT model and create the tokenizer
pretrain_model_weights_name = 'pretrain_model/distilbert-base-uncased'
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, pretrain_model_weights_name)
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

# Tokenize every sentence (adds the [CLS] and [SEP] special tokens)
tokenized = []
for idx in range(len(df.values)):
    tokenized.append(tokenizer.encode(df[0][idx], add_special_tokens=True))
# print(tokenized)

# Pad every sequence to the same length
max_len = 0
for i in tokenized:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized])
# print(padded)

# Attention mask: 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
# print(attention_mask)

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Take the vectors from the last hidden layer (no gradients needed)
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# [:, 0, :] keeps, for every sentence, the output vector of the first token ([CLS])
features = last_hidden_states[0][:, 0, :].numpy()
labels = df[1].values
# print(features)

train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
score = lr_clf.score(test_features, test_labels)
print(score)

from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
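Note that GridSearchCV is imported above but never used. In the original tutorial it is suggested for tuning the regularization strength C of the logistic regression. A sketch of that search, continuing from the variables defined in the script above and with a parameter grid chosen purely for illustration:

# Illustrative only: grid-search the C parameter of the logistic regression on the training features.
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)
print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)

The tuned C value can then be passed to LogisticRegression before the final fit.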
Chinese Example
Following the original tutorial, we try the same approach on Chinese:
The Chinese data uses a public dataset from the CLUE benchmark (TNEWS); download link:
https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip
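After unzipping, the TNEWS training set is a JSON-lines file (one JSON object per line). A minimal sketch for inspecting it, assuming train.json has been extracted to the working directory:

import pandas as pd

# Assumption: tnews_public.zip has been extracted and train.json sits in the working directory.
cn_df = pd.read_json("train.json", lines=True, encoding="utf-8")
print(cn_df.columns)                        # expect at least "sentence" and "label" columns
print(cn_df[["sentence", "label"]].head())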
Download link for the Chinese clue/albert_chinese_tiny pretrained model:
https://huggingface.co/clue/albert_chinese_tiny/tree/main
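This checkpoint can also be loaded directly by its hub name instead of a local directory. Note that the CLUE ALBERT models are loaded with BertTokenizer rather than AlbertTokenizer, since the checkpoint ships a BERT-style vocabulary instead of a sentencepiece model. A minimal sketch:

import transformers as ppb

# Assumption: loading "clue/albert_chinese_tiny" from the Hugging Face hub instead of a local copy.
# The model card instructs using BertTokenizer because the checkpoint has no sentencepiece vocabulary.
tokenizer = ppb.BertTokenizer.from_pretrained('clue/albert_chinese_tiny')
model = ppb.AlbertModel.from_pretrained('clue/albert_chinese_tiny')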
# Read the TNEWS training file and keep only the first 800 samples
cn_df = pd.read_json("train.json", lines=True, encoding="utf-8")
cn_df = cn_df[:800]
cn_df.head()  # preview (shows output in a notebook)

# Load the albert_chinese_tiny model; it uses BertTokenizer rather than AlbertTokenizer
pretrain_model_weights_name = 'pretrain_model/clue_albert_chinese_tiny'
model_class, tokenizer_class, pretrained_weights = (ppb.AlbertModel, ppb.BertTokenizer, pretrain_model_weights_name)
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

# Tokenize every sentence (adds the [CLS] and [SEP] special tokens)
tokenized = []
for idx in range(len(cn_df.values)):
    tokenized.append(tokenizer.encode(cn_df["sentence"][idx], add_special_tokens=True))
# print(tokenized)

# Pad every sequence to the same length
max_len = 0
for i in tokenized:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized])
# print(padded)

# Attention mask: 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
# print(attention_mask)

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Take the vectors from the last hidden layer (no gradients needed)
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# [:, 0, :] keeps, for every sentence, the output vector of the first token ([CLS])
features = last_hidden_states[0][:, 0, :].numpy()
labels = cn_df["label"].values
# print(features)

train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
print(lr_clf.score(test_features, test_labels))

from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
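To classify a new sentence with the fitted model, run it through the same pipeline: tokenize, encode with ALBERT, take the [CLS] vector, then call predict. A minimal sketch continuing from the script above; the example title is made up:

# Illustrative only: classify one new (hypothetical) news title with the pipeline above.
new_sentence = "这是一条用于演示的新闻标题"
new_ids = torch.tensor([tokenizer.encode(new_sentence, add_special_tokens=True)])
with torch.no_grad():
    new_output = model(new_ids)
new_feature = new_output[0][:, 0, :].numpy()   # [CLS] vector, same representation as the training features
print(lr_clf.predict(new_feature))             # predicted TNEWS label id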
