• To download and use any of the pretrained models on your given task, all it takes is three lines of code (see the sketch below).
  • An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate AutoClass for your task and its associated tokenizer with AutoTokenizer.
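A minimal sketch of that three-line pattern, using the same sentiment checkpoint as the rest of this section (which AutoClass to pick depends on your task):

```python
# jy: load a pretrained tokenizer and model in three lines
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
```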

  • A tokenizer is responsible for preprocessing text into a format that is understandable to the model.

  • First, the tokenizer splits the text into units called tokens. There are multiple rules that govern the tokenization process, including how to split a word and at what level (word, subword, or character).
  • The most important thing to remember is that you must instantiate the tokenizer with the same model name, to ensure you're using the same tokenization rules the model was pretrained with.
  • Next, the tokenizer converts the tokens into numbers in order to construct a tensor as input to the model; the mapping from tokens to numbers is known as the model's vocabulary.

```python
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# jy: switch to the bert-base-uncased tokenizer for the examples below
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenize1 = tokenizer.tokenize("We are very happy to show you the Transformers library.")
print(tokenize1)
# jy: ['we', 'are', 'very', 'happy', 'to', 'show', 'you', 'the', 'transformers', 'library', '.']

tokenize2 = tokenizer.tokenize("We are very happy to show you the Transformer model.")
print(tokenize2)
# jy: ['we', 'are', 'very', 'happy', 'to', 'show', 'you', 'the', 'transforme', '##r', 'model', '.']

encoding = tokenizer("We are very happy to show you the Transformers library.")
print(encoding)
"""
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
"""
```
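To see what the `input_ids` encode, you can map them back to text with `tokenizer.decode()` (a quick check; the exact output depends on which tokenizer produced the ids):

```python
# jy: decoding reveals the special tokens ([CLS], [SEP]) the tokenizer added
print(tokenizer.decode(encoding["input_ids"]))
```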

  • The tokenizer will return a dictionary containing:
    • `input_ids`: numerical representations of your tokens ([https://huggingface.co/docs/transformers/glossary#input-ids](https://huggingface.co/docs/transformers/glossary#input-ids))
    • `attention_mask`: indicates which tokens should be attended to.
  • Just like the `pipeline()`, the tokenizer will accept a list of inputs. In addition, the tokenizer can also pad and truncate the text to return a batch with uniform length:
  • More details about tokenization: [https://huggingface.co/docs/transformers/preprocessing](https://huggingface.co/docs/transformers/preprocessing)
```python
# jy: for pytorch ---------------------------------------------------------
pt_batch = tokenizer(
    ["We are very happy to show you the Transformers library.",
     "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

# jy: for tf ------------------------------------------------------------
tf_batch = tokenizer(
    ["We are very happy to show you the Transformers library.",
     "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf",
)
```
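Since the second sentence is shorter, padding fills it up to the length of the first. A quick way to see this is to inspect the returned tensors (the exact shape depends on the tokenizer):

```python
# jy: both sequences now share one length; attention_mask is 1 for real
#     tokens and 0 for padding
print(pt_batch["input_ids"].shape)
print(pt_batch["attention_mask"])
```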

# 2、AutoModel

## (1)PyTorch version

```python
from transformers import AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

# jy: you can pass your preprocessed batch of inputs directly to the model; you
#     just have to unpack the dictionary by adding ** (pt_batch is built in the
#     "(1)AutoTokenizer" section)
pt_outputs = pt_model(**pt_batch)

# jy: the model outputs the final activations in the logits attribute; apply the
#     softmax function to the logits to retrieve the probabilities
from torch import nn

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
print(pt_predictions)
"""
tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)
"""
```
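To turn these probabilities into human-readable labels, you can look up the model's id2label mapping (a small sketch; it assumes the pt_model and pt_predictions from the block above):

```python
# jy: id2label comes from the model's config; argmax picks the most likely class
import torch

pred_ids = torch.argmax(pt_predictions, dim=-1)
print([pt_model.config.id2label[i.item()] for i in pred_ids])
```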

<a name="cS4Sn"></a>
## (2)TF version
```python
from transformers import TFAutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

# jy: tf_batch is built in the "(1)AutoTokenizer" section
tf_outputs = tf_model(tf_batch)


import tensorflow as tf

tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
print(tf_predictions)
```
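The same label lookup works on the TensorFlow side (again a sketch, reusing the tf_model and tf_predictions from above):

```python
# jy: argmax over the class axis, then map ids to labels via the config
pred_ids = tf.math.argmax(tf_predictions, axis=-1)
print([tf_model.config.id2label[int(i)] for i in pred_ids])
```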

## (3)Code example and analysis

```python
# jy: the AutoTokenizer class is defined in /transformers/models/auto/tokenization_auto.py;
#     it knows about the various tokenizer model classes, and its from_pretrained class
#     method loads the tokenizer class for the given model
from transformers import AutoTokenizer

# jy: the AutoModel class is defined in /transformers/models/auto/modeling_auto.py;
#     it knows about the PyTorch versions of the models, and its from_pretrained class
#     method loads the matching model
from transformers import AutoModel

# jy: the TFAutoModel class is defined in /transformers/models/auto/modeling_tf_auto.py;
#     it knows about the TensorFlow versions of the models, and its from_pretrained class
#     method loads the matching model
from transformers import TFAutoModel


# jy: The tokenizer is responsible for all the preprocessing the pretrained model
#     expects, and can be called directly on a single string (as in this example)
#     or a list.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


# jy: 1) PyTorch version ---------------------------------------------
#     The model is a regular PyTorch ``nn.Module``:
#     https://pytorch.org/docs/stable/nn.html#torch.nn.Module
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world!", return_tensors="pt")
# jy: the resulting inputs is a dictionary that you can use in downstream code
#     or simply pass directly to your model using the ** argument unpacking operator
print(inputs)
outputs = model(**inputs)
print(outputs)


# jy: 2) TensorFlow version ---------------------------------------------
#     The model is a TensorFlow ``tf.keras.Model``:
#     https://www.tensorflow.org/api_docs/python/tf/keras/Model
tf_model = TFAutoModel.from_pretrained("bert-base-uncased")
tf_inputs = tokenizer("Hello world!", return_tensors="tf")
# jy: the resulting tf_inputs is a dictionary that you can use in downstream code
#     or simply pass directly to your model using the ** argument unpacking operator
print(tf_inputs)
tf_outputs = tf_model(**tf_inputs)
print(tf_outputs)
```
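For a plain encoder loaded with AutoModel, the main thing in outputs is last_hidden_state; a quick shape check (the exact sequence length depends on the tokenizer) looks like this:

```python
# jy: shape is (batch_size, sequence_length, hidden_size); for bert-base the
#     hidden size is 768
print(outputs.last_hidden_state.shape)
```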
  • See the SimCSE code-flow analysis; the SimCSE code analysis covers the following:
    • loading the tokenizer class (including downloading and caching the related files), and tokenizing the input sentences
    • the loading flow of the PyTorch model under the given namespace ("princeton-nlp/sup-simcse-bert-base-uncased")
  • The loading flow for the TF versions of the models is similar to that of the PyTorch version (not analyzed in detail yet)

## (4)Summary

  • All Transformers models (PyTorch or TensorFlow) output the tensors before the final activation function (like softmax), because the final activation function is often fused with the loss.

  • Models are a standard torch.nn.Module or a tf.keras.Model so you can use them in your usual training loop.
  • However, to make things easier, Transformers provides a Trainer class for PyTorch that adds functionality for distributed training, mixed precision, and more.
  • For TensorFlow, you can use the fit method from Keras.
  • Transformers model outputs are special dataclasses, so their attributes are autocompleted in an IDE. The model outputs also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice, or a string), in which case the attributes that are None are ignored. The sketch below illustrates these points.
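A small illustration of the points above (it reuses pt_model and pt_batch from the AutoModel section; the labels are made up for the example):

```python
# jy: passing labels makes the model also compute and return the loss
import torch

outputs = pt_model(**pt_batch, labels=torch.tensor([4, 0]))  # jy: hypothetical labels
print(outputs.loss)        # jy: attribute access on the output dataclass
print(outputs["logits"])   # jy: dictionary-style access
print(outputs[0])          # jy: tuple-style access; index 0 is the loss here
```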

# 3、Save a model

## (1)PyTorch version

  • Once your model is fine-tuned, you can save it with its tokenizer using PreTrainedModel.save_pretrained():

  • When you are ready to use the model again, reload it with PreTrainedModel.from_pretrained():
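A minimal sketch of the save/reload round trip (it assumes the pt_model and tokenizer from the sections above; the directory name is illustrative):

```python
# jy: save model weights, config, and tokenizer files into one directory
#     (AutoModelForSequenceClassification was imported in the AutoModel section)
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)

# jy: reload later from the same directory
pt_model = AutoModelForSequenceClassification.from_pretrained(pt_save_directory)
```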

## (2)TF version

```python
tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained")
```
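The saving side that would produce ./tf_save_pretrained follows the same pattern (a sketch, assuming the tf_model and tokenizer from the AutoModel section):

```python
# jy: save_pretrained works the same way for TensorFlow models
tf_save_directory = "./tf_save_pretrained"
tokenizer.save_pretrained(tf_save_directory)
tf_model.save_pretrained(tf_save_directory)
```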

<a name="yreee"></a>
## (3)Switching between TF and PT

- One particularly cool Transformers feature is the ability to save a model and reload it as either a PyTorch or TensorFlow model. The `from_pt` or `from_tf` parameter can convert the model from one framework to the other:
```python
# jy: PyTorch ---------------------------------------------------------
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)


# jy: TensorFlow ------------------------------------------------------
from transformers import TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True)
```