This post is adapted (in part) from the article "A Visual Guide to Using BERT for the First Time" by Jay Alammar, and is a good first read for anyone not yet familiar with BERT.
It is a simple tutorial on using a variant of BERT for sentence classification. The example is simple enough to serve as a first introduction to BERT, while still covering some of the key concepts.
English example
Dataset: SST2
The dataset used here is SST2. It contains sentences from movie reviews, each labeled as either positive sentiment (value 1) or negative sentiment (value 0).
The dataset can be downloaded from: https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv
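Before running the full script, it helps to take a quick look at the data. A minimal sketch, assuming the file has been saved as train.tsv in the working directory (it has no header row; column 0 is the sentence and column 1 is the label, as in the full code below):
import pandas as pd

df = pd.read_csv("train.tsv", delimiter='\t', header=None)
print(df.head())             # first few (sentence, label) pairs
print(df[1].value_counts())  # how many positive (1) vs. negative (0) examples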
The DistilBERT model files can be downloaded from: https://www.kaggle.com/abhishek/distilbertbaseuncased
The complete code, following the original guide:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split,GridSearchCV,cross_val_score
from sklearn.linear_model import LogisticRegression
import torch
import transformers as ppb
import warnings
warnings.filterwarnings("ignore")
# Load the dataset (no header row; column 0 is the sentence, column 1 is the label)
df = pd.read_csv("train.tsv", delimiter='\t', header=None)
# Load the DistilBERT model and create the tokenizer
pretrain_model_weights_name = 'pretrain_model/distilbert-base-uncased'
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, pretrain_model_weights_name)
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
# Tokenize each sentence, adding the special [CLS] and [SEP] tokens
tokenized = []
for idx in range(len(df.values)):
    tokenized.append(tokenizer.encode(df[0][idx], add_special_tokens=True))
# print(tokenized)
# Pad every sequence with zeros up to the length of the longest sequence
max_len = 0
for i in tokenized:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized])
# print(padded)
# Attention mask: 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
# print(attention_mask)
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)
# Run the sentences through the model and take the last hidden layer's vectors
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
# [:,0,:] selects the output vector at position 0 (the [CLS] token) for every sentence
features = last_hidden_states[0][:,0,:].numpy()
labels = df[1].values
# print(features)
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
score = lr_clf.score(test_features, test_labels)
print(score)
# Baseline: a dummy classifier that ignores the features
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Chinese example
Following the official example, we try the same approach on Chinese data.
The Chinese dataset is the public TNEWS dataset from CLUE, which can be downloaded at:
https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip
The Chinese clue_albert_chinese_tiny pretrained model can be downloaded at:
https://huggingface.co/clue/albert_chinese_tiny/tree/main
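Unlike SST2, TNEWS is a multi-class news-topic classification task, so the label column contains more than two classes (scikit-learn's LogisticRegression handles multi-class targets automatically). A quick way to inspect the data before training, assuming train.json from the zip has been extracted into the working directory:
cn_df = pd.read_json("train.json", lines=True, encoding="utf-8")
print(cn_df.columns.tolist())         # includes the "sentence" and "label" columns used below
print(cn_df["label"].value_counts())  # number of examples per news category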
cn_df = pd.read_json("train.json", lines=True, encoding="utf-8")
cn_df = cn_df[:800]  # use only the first 800 examples
cn_df.head()
pretrain_model_weights_name = 'pretrain_model/clue_albert_chinese_tiny'
# clue_albert_chinese_tiny ships a BERT-style vocabulary, so BertTokenizer (not AlbertTokenizer) is used to load it
model_class, tokenizer_class, pretrained_weights = (ppb.AlbertModel, ppb.BertTokenizer, pretrain_model_weights_name)
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
# Tokenize each sentence, adding the special [CLS] and [SEP] tokens
tokenized = []
for idx in range(len(cn_df.values)):
    tokenized.append(tokenizer.encode(cn_df["sentence"][idx], add_special_tokens=True))
# print(tokenized)
# Pad every sequence with zeros up to the length of the longest sequence
max_len = 0
for i in tokenized:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized])
# print(padded)
# Attention mask: 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
# print(attention_mask)
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)
# Run the sentences through the model and take the last hidden layer's vectors
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
# [:,0,:] selects the output vector at position 0 (the [CLS] token) for every sentence
features = last_hidden_states[0][:,0,:].numpy()
labels = cn_df["label"].values
# print(features)
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
print(lr_clf.score(test_features, test_labels))
# Baseline: a dummy classifier that ignores the features
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))