- To download and use any of the pretrained models on your given task, all it takes is three lines of code.
- An `AutoClass` is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate `AutoClass` for your task and its associated tokenizer with `AutoTokenizer`.
   - AutoClass: https://huggingface.co/docs/transformers/model_doc/auto
   - AutoTokenizer: https://huggingface.co/docs/transformers/v4.19.2/en/model_doc/auto#transformers.AutoTokenizer
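For example, the three lines look like this (a minimal sketch, using the same sentiment checkpoint as the examples below):
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```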
# 1、AutoTokenizer
A tokenizer is responsible for preprocessing text into a format that is understandable to the model.
- First, the tokenizer will split the text into words called tokens. There are multiple rules that govern the tokenization process, including how to split a word and at what level.
- Summary of the tokenizers: https://huggingface.co/docs/transformers/tokenizer_summary
- The most important thing to remember though is you need to instantiate the tokenizer with the same model name to ensure you’re using the same tokenization rules a model was pretrained with.
- Next, the tokenizer converts the tokens into numbers in order to construct a tensor as input to the model. This is known as the model's vocabulary.
```python
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenize1 = tokenizer.tokenize("We are very happy to show you the Transformers library.")
print(tokenize1)
# jy: ['we', 'are', 'very', 'happy', 'to', 'show', 'you', 'the', 'transformers', 'library', '.']
tokenize2 = tokenizer.tokenize("We are very happy to show you the Transformer model.")
print(tokenize2)
# jy: ['we', 'are', 'very', 'happy', 'to', 'show', 'you', 'the', 'transforme', '##r', 'model', '.']

encoding = tokenizer("We are very happy to show you the Transformers library.")
print(encoding)
"""
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
"""
```
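To see that the ids are just indices into the vocabulary, you can map them back to their string tokens (a minimal sketch; `convert_ids_to_tokens` is a standard tokenizer method):
```python
# jy: map the ids from the encoding above back to their string tokens
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```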
- The tokenizer will return a dictionary containing:
   - `input_ids`: numerical representations of your tokens.
      - https://huggingface.co/docs/transformers/glossary#input-ids
   - `attention_mask`: indicates which tokens should be attended to.
- Just like the `pipeline()`, the tokenizer will accept a list of inputs. In addition, the tokenizer can also pad and truncate the text to return a batch with uniform length:
- more details about tokenization: https://huggingface.co/docs/transformers/preprocessing
```python
# jy: for pytorch ---------------------------------------------------------
pt_batch = tokenizer(
    ["We are very happy to show you the Transformers library.",
     "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

# jy: for tensorflow ------------------------------------------------------
tf_batch = tokenizer(
    ["We are very happy to show you the Transformers library.",
     "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf",
)
```
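As a quick sanity check (a sketch; `decode` and dict-style access on the returned batch are standard tokenizer API), the shorter sentence is padded to the length of the longer one, and `attention_mask` marks the padding positions with 0 so the model ignores them:
```python
# jy: the second sentence is shorter, so it is padded with [PAD] tokens;
#     attention_mask is 0 at those padded positions
print(tokenizer.decode(pt_batch["input_ids"][1]))
print(pt_batch["attention_mask"][1])
```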
# 2、AutoModel
## (1)PyTorch version
- Transformers provides a simple and unified way to load pretrained instances. This means you can load an `AutoModel` like you would load an `AutoTokenizer`. The only difference is selecting the correct `AutoModel` for the task. Since you are doing text (or sequence) classification, load `AutoModelForSequenceClassification`:
   - AutoModel: https://huggingface.co/docs/transformers/v4.19.2/en/model_doc/auto#transformers.AutoModel
   - task summary (which AutoModel class to use for which task): https://huggingface.co/docs/transformers/task_summary
```python
from transformers import AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
# jy: you can pass your preprocessed batch of inputs directly to the model; you
#     just have to unpack the dictionary by adding ** (see the "1、AutoTokenizer"
#     section above for how pt_batch was built):
pt_outputs = pt_model(**pt_batch)

# jy: the model outputs the final activations in the logits attribute; apply the
#     softmax function to the logits to retrieve the probabilities:
from torch import nn

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
print(pt_predictions)
"""
tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)
"""
```
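To turn these probabilities into a readable label, you can take the argmax and look it up in the model config (a minimal sketch; `id2label` is a standard attribute of Transformers model configs, and for this checkpoint it maps indices to star ratings):
```python
import torch

# jy: pick the highest-probability class per input and map it to its label string
pred_ids = torch.argmax(pt_predictions, dim=-1)
print([pt_model.config.id2label[i] for i in pred_ids.tolist()])
```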
<a name="cS4Sn"></a>
## (2)TensorFlow version
```python
from transformers import TFAutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

# jy: see the "1、AutoTokenizer" section above for how tf_batch was built
tf_outputs = tf_model(tf_batch)

import tensorflow as tf

tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
print(tf_predictions)
```
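The same label lookup works on the TensorFlow side (a sketch under the same `id2label` assumption as above):
```python
# jy: argmax over the class axis, then map indices to label strings
pred_ids = tf.math.argmax(tf_predictions, axis=-1).numpy().tolist()
print([tf_model.config.id2label[i] for i in pred_ids])
```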
## (3)Code example and walkthrough
```python
# jy: the AutoTokenizer class is defined in /transformers/models/auto/tokenization_auto.py;
#     it maps model names to the corresponding tokenizer classes, and its from_pretrained
#     class method loads the tokenizer for a given model;
from transformers import AutoTokenizer
# jy: the AutoModel class is defined in /transformers/models/auto/modeling_auto.py; it maps
#     model names to the corresponding PyTorch model classes, and its from_pretrained class
#     method loads the corresponding model;
from transformers import AutoModel
# jy: the TFAutoModel class is defined in /transformers/models/auto/modeling_tf_auto.py; it
#     maps model names to the corresponding TensorFlow model classes, and its from_pretrained
#     class method loads the corresponding model;
from transformers import TFAutoModel

# jy: the tokenizer is responsible for all the preprocessing the pretrained model
#     expects, and can be called directly on a single string (as in this example)
#     or a list.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# jy: 1) PyTorch version ---------------------------------------------
#     The model is a regular PyTorch ``nn.Module``:
#     https://pytorch.org/docs/stable/nn.html#torch.nn.Module
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world!", return_tensors="pt")
# jy: the resulting inputs is a dictionary, which you can use in downstream code
#     or simply pass directly to your model using the ** argument-unpacking operator.
print(inputs)
outputs = model(**inputs)
print(outputs)

# jy: 2) TensorFlow version ------------------------------------------
#     The model is a TensorFlow ``tf.keras.Model``:
#     https://www.tensorflow.org/api_docs/python/tf/keras/Model
tf_model = TFAutoModel.from_pretrained("bert-base-uncased")
tf_inputs = tokenizer("Hello world!", return_tensors="tf")
# jy: the resulting tf_inputs is a dictionary, which you can use in downstream code
#     or simply pass directly to your model using the ** argument-unpacking operator.
print(tf_inputs)
tf_outputs = tf_model(**tf_inputs)
print(tf_outputs)
```
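Because `AutoModel` loads the bare encoder without a task head, the output exposes hidden states rather than logits. A quick check (a minimal sketch, reusing `outputs` from the PyTorch half above):
```python
# jy: shape is (batch_size, sequence_length, hidden_size); hidden_size is 768
#     for bert-base-uncased
print(outputs.last_hidden_state.shape)
```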
- See the SimCSE code-flow analysis; it covers the following:
   - loading the tokenizer class (including downloading and caching the related files) and tokenizing input sentences
   - the loading flow of the PyTorch model under the given namespace ("princeton-nlp/sup-simcse-bert-base-uncased")
- The TF model-loading flow is similar to the PyTorch one (not yet analyzed in detail).
## (4)Summary
- All Transformers models (PyTorch or TensorFlow) output the tensors before the final activation function (like softmax), because the final activation function is often fused with the loss.
- Models are a standard `torch.nn.Module` or a `tf.keras.Model`, so you can use them in your usual training loop.
- However, to make things easier, Transformers provides a `Trainer` class for PyTorch that adds functionality for distributed training, mixed precision, and more.
- For TensorFlow, you can use the `fit` method from Keras.
- Transformers model outputs are special dataclasses, so their attributes are autocompleted in an IDE. The model outputs also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice, or a string), in which case the attributes that are `None` are ignored.
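For example, the three access styles below reach the same tensor (a minimal sketch, reusing `pt_outputs` from the PyTorch example in section 2):
```python
# jy: attribute, dict-style, and tuple-style access are interchangeable;
#     integer indexing skips attributes that are None (such as the loss here)
print(pt_outputs.logits.shape)
print(pt_outputs["logits"].shape)
print(pt_outputs[0].shape)
```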
# 3、Save a model
## (1)PyTorch version
- Once your model is fine-tuned, you can save it with its tokenizer using `PreTrainedModel.save_pretrained()`:
   - PreTrainedModel.save_pretrained(): https://huggingface.co/docs/transformers/v4.19.2/en/main_classes/model#transformers.PreTrainedModel.save_pretrained
```python
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)
```
- When you are ready to use the model again, reload it with `PreTrainedModel.from_pretrained()`:
   - PreTrainedModel.from_pretrained(): https://huggingface.co/docs/transformers/v4.19.2/en/main_classes/model#transformers.PreTrainedModel.from_pretrained
```python
pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")
```
## (2)TensorFlow version
```python
tf_save_directory = "./tf_save_pretrained"
tokenizer.save_pretrained(tf_save_directory)
tf_model.save_pretrained(tf_save_directory)

tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained")
```
<a name="yreee"></a>
## (3)Switching between TF and PT
- One particularly cool Transformers feature is the ability to save a model and reload it as either a PyTorch or TensorFlow model. The `from_pt` or `from_tf` parameter can convert the model from one framework to the other:
```python
# jy: PyTorch ---------------------------------------------------------
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)

# jy: TensorFlow ------------------------------------------------------
from transformers import TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True)
```