This article is based in part on "A Visual Guide to Using BERT for the First Time" by Jay Alammar, and works well as a first read for anyone unfamiliar with BERT.
It is a simple tutorial on using a distilled variant of BERT (DistilBERT) for sentence classification. The example is simple enough to serve as a first introduction to BERT, yet it still covers the key concepts involved.
English Example
Dataset: SST2
The dataset used here is SST2. It contains sentences from movie reviews, each carrying a label: positive sentiment (value 1) or negative sentiment (value 0).
Download link for the dataset used in this article: https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv
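As a quick sanity check before running the full script, the snippet below loads the file and prints a few rows and the label distribution. This is a minimal sketch, assuming the file from the link above has been saved as train.tsv in the working directory.

import pandas as pd

# Assumption: the SST2 file downloaded above is saved locally as "train.tsv".
df = pd.read_csv("train.tsv", delimiter='\t', header=None)
print(df.head())             # column 0: review sentence, column 1: label (1 = positive, 0 = negative)
print(df[1].value_counts())  # class balance of the training set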
Download the DistilBERT model files from: https://www.kaggle.com/abhishek/distilbertbaseuncased
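If you would rather not fetch the Kaggle files by hand, the same weights can be loaded by name from the Hugging Face hub. A minimal sketch, assuming network access and the transformers package installed:

import transformers as ppb

# Assumption: loading "distilbert-base-uncased" from the Hugging Face hub instead of the local Kaggle copy.
tokenizer = ppb.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = ppb.DistilBertModel.from_pretrained('distilbert-base-uncased')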
The complete code from the original tutorial:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
import torch
import transformers as ppb
import warnings
warnings.filterwarnings("ignore")

# Read the dataset (the SST2 training file downloaded above)
df = pd.read_csv("train.tsv", delimiter='\t', header=None)

# Load the DistilBERT model and create the tokenizer
pretrain_model_weights_name = 'pretrain_model/distilbert-base-uncased'
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, pretrain_model_weights_name)
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

# Tokenize every sentence (adds the [CLS] and [SEP] special tokens)
tokenized = []
for idx in range(len(df.values)):
    tokenized.append(tokenizer.encode(df[0][idx], add_special_tokens=True))
# print(tokenized)

# Pad every sequence to the same length
max_len = 0
for i in tokenized:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized])
# print(padded)

# Attention mask: 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
# print(attention_mask)

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Take the vectors from the last hidden layer (no gradients needed)
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# [:, 0, :] keeps, for every sentence, the output vector of the first token ([CLS])
features = last_hidden_states[0][:, 0, :].numpy()
labels = df[1].values
# print(features)

train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
score = lr_clf.score(test_features, test_labels)
print(score)

from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
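Note that GridSearchCV is imported above but never used. In the original tutorial it is suggested for tuning the regularization strength C of the logistic regression. A sketch of that search, continuing from the variables defined in the script above and with a parameter grid chosen purely for illustration:

# Illustrative only: grid-search the C parameter of the logistic regression on the training features.
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)
print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)

The tuned C value can then be passed to LogisticRegression before the final fit.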
Chinese Example
Following the original tutorial, we try the same approach on Chinese:
The Chinese data uses a public dataset from the CLUE benchmark (TNEWS); download link:
https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip
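After unzipping, the TNEWS training set is a JSON-lines file (one JSON object per line). A minimal sketch for inspecting it, assuming train.json has been extracted to the working directory:

import pandas as pd

# Assumption: tnews_public.zip has been extracted and train.json sits in the working directory.
cn_df = pd.read_json("train.json", lines=True, encoding="utf-8")
print(cn_df.columns)                        # expect at least "sentence" and "label" columns
print(cn_df[["sentence", "label"]].head())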
Download link for the Chinese clue/albert_chinese_tiny pretrained model:
https://huggingface.co/clue/albert_chinese_tiny/tree/main
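This checkpoint can also be loaded directly by its hub name instead of a local directory. Note that the CLUE ALBERT models are loaded with BertTokenizer rather than AlbertTokenizer, since the checkpoint ships a BERT-style vocabulary instead of a sentencepiece model. A minimal sketch:

import transformers as ppb

# Assumption: loading "clue/albert_chinese_tiny" from the Hugging Face hub instead of a local copy.
# The model card instructs using BertTokenizer because the checkpoint has no sentencepiece vocabulary.
tokenizer = ppb.BertTokenizer.from_pretrained('clue/albert_chinese_tiny')
model = ppb.AlbertModel.from_pretrained('clue/albert_chinese_tiny')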
# Read the TNEWS training file and keep only the first 800 samples
cn_df = pd.read_json("train.json", lines=True, encoding="utf-8")
cn_df = cn_df[:800]
cn_df.head()  # preview (shows output in a notebook)

# Load the albert_chinese_tiny model; it uses BertTokenizer rather than AlbertTokenizer
pretrain_model_weights_name = 'pretrain_model/clue_albert_chinese_tiny'
model_class, tokenizer_class, pretrained_weights = (ppb.AlbertModel, ppb.BertTokenizer, pretrain_model_weights_name)
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

# Tokenize every sentence (adds the [CLS] and [SEP] special tokens)
tokenized = []
for idx in range(len(cn_df.values)):
    tokenized.append(tokenizer.encode(cn_df["sentence"][idx], add_special_tokens=True))
# print(tokenized)

# Pad every sequence to the same length
max_len = 0
for i in tokenized:
    if len(i) > max_len:
        max_len = len(i)
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized])
# print(padded)

# Attention mask: 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)
# print(attention_mask)

input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

# Take the vectors from the last hidden layer (no gradients needed)
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# [:, 0, :] keeps, for every sentence, the output vector of the first token ([CLS])
features = last_hidden_states[0][:, 0, :].numpy()
labels = cn_df["label"].values
# print(features)

train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
print(lr_clf.score(test_features, test_labels))

from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
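To classify a new sentence with the fitted model, run it through the same pipeline: tokenize, encode with ALBERT, take the [CLS] vector, then call predict. A minimal sketch continuing from the script above; the example title is made up:

# Illustrative only: classify one new (hypothetical) news title with the pipeline above.
new_sentence = "这是一条用于演示的新闻标题"
new_ids = torch.tensor([tokenizer.encode(new_sentence, add_special_tokens=True)])
with torch.no_grad():
    new_output = model(new_ids)
new_feature = new_output[0][:, 0, :].numpy()   # [CLS] vector, same representation as the training features
print(lr_clf.predict(new_feature))             # predicted TNEWS label id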
