# 1、Installation
- Install with pip:
```bash
pip install datasets            # or pin the version used here: pip install datasets==2.2.2
```
- Or with conda:
```bash
conda install -c huggingface -c conda-forge datasets
```
- Confirm that the installation succeeded:
```python
from datasets import load_dataset

# jy: This should download version 1 of the Stanford Question Answering Dataset
#     (SQuAD), load the training split, and print the first training example:
print(load_dataset('squad', split='train')[0])
"""
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}
"""
```
- Stanford Question Answering Dataset (SQuAD):[https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/)
<a name="A6bAT"></a>
# 2、Usage
- The easiest way to load a dataset is from the Hugging Face Hub. There are already over 900 datasets in over 100 languages on the Hub. Choose from a wide category of datasets to use for NLP tasks like question answering, summarization, machine translation, and language modeling.
- Hugging Face Hub:[https://huggingface.co/datasets](https://huggingface.co/datasets)
- For a more in-depth look inside a dataset, use the live Datasets Viewer.
- Datasets Viewer:[https://huggingface.co/datasets/viewer/](https://huggingface.co/datasets/viewer/)
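- To get a quick programmatic view of what is on the Hub, the `datasets` package also ships a `list_datasets()` helper. The sketch below is only illustrative; the exact count and the first identifiers depend on the state of the Hub when you run it:
```python
from datasets import list_datasets

# list_datasets() returns the identifiers of the datasets hosted on the Hub
all_datasets = list_datasets()
print(len(all_datasets))    # the number of datasets currently on the Hub
print(all_datasets[:3])     # e.g. ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa']
```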
<a name="Ak12f"></a>
## (1)Load a dataset
- Before you take the time to download a dataset, it is often helpful to quickly get all the relevant information about a dataset. The `load_dataset_builder()` method allows you to inspect the attributes of a dataset without downloading it.
- `load_dataset_builder()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.load_dataset_builder](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.load_dataset_builder)
```python
from datasets import load_dataset_builder
dataset_builder = load_dataset_builder('imdb')
print(dataset_builder.cache_dir)
"""
/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1
"""
print(dataset_builder.info.features)
"""
{'text': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}
"""
print(dataset_builder.info.splits)
"""
{'train': SplitInfo(name='train', num_bytes=33432835, num_examples=25000,
dataset_name='imdb'),
'test': SplitInfo(name='test', num_bytes=32650697, num_examples=25000,
dataset_name='imdb'),
'unsupervised': SplitInfo(name='unsupervised', num_bytes=67106814, num_examples=50000,
dataset_name='imdb')}
"""
```
- Take a look at `DatasetInfo` for a full list of attributes you can use with `dataset_builder`. Once you are happy with the dataset you want, load it in a single line with `load_dataset()`:
- `DatasetInfo`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.DatasetInfo](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.DatasetInfo)
- `load_dataset()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.load_dataset](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.load_dataset)
```python
from datasets import load_dataset

dataset = load_dataset('glue', 'mrpc', split='train')
```
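- For instance, the `DatasetInfo` object exposes metadata such as the description, citation, and homepage of a dataset. A minimal sketch of inspecting a few of these attributes, using the same `imdb` builder as above (the printed text is whatever the dataset card ships with):
```python
from datasets import load_dataset_builder

builder = load_dataset_builder('imdb')
# description, citation and homepage are a few of the DatasetInfo attributes
print(builder.info.description[:200])
print(builder.info.citation)
print(builder.info.homepage)
```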
<a name="cfzP8"></a>
## (2)Select a configuration
- Some datasets, like the General Language Understanding Evaluation (GLUE) benchmark, are actually made up of several datasets. These sub-datasets are called **configurations**, and you must explicitly select one when you load the dataset. If you don't provide a configuration name, Datasets will raise a ValueError and remind you to select a configuration.
- Use the `get_dataset_config_names()` function to retrieve a list of all the possible configurations available to your dataset:
- `get_dataset_config_names()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.get_dataset_config_names](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.get_dataset_config_names)
```python
from datasets import get_dataset_config_names
configs = get_dataset_config_names("glue")
# ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched',
# 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
print(configs)
```
- Incorrect way to load a configuration:
```python
from datasets import load_dataset

dataset = load_dataset('glue')
"""
ValueError: Config name is missing.
Please pick one among the available configs: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb',
'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
Example of usage: load_dataset('glue', 'cola')
"""
```
- Correct way to load a configuration:
```python
dataset = load_dataset('glue', 'sst2')
"""
Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB,
total: 11.90 MiB) to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0...
Downloading: 100%|██████████████████████████████████████████████████████████████|
7.44M/7.44M [00:01<00:00, 7.03MB/s]
Dataset glue downloaded and prepared to
/Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will
reuse this data.
"""
print(dataset)
"""
{'train': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'},
num_rows: 67349),
'validation': Dataset(schema: {'sentence': 'string', 'label': 'int64',
'idx': 'int32'},
num_rows: 872),
'test': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'},
num_rows: 1821)
}
"""
```
## (3)Select a split
- A split is a specific subset of the dataset like train and test. Make sure you select a split when you load a dataset. If you don't supply a split argument, Datasets will only return a dictionary containing the subsets of the dataset.
```python
from datasets import load_dataset

datasets = load_dataset('glue', 'mrpc')
print(datasets)
"""
{train: Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 3668})
validation: Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 408})
test: Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 1725})}
"""
```
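- Once you have this dictionary (a DatasetDict), an individual split can simply be indexed by name, which is equivalent to passing `split=` when loading. A small sketch:
```python
from datasets import load_dataset

datasets = load_dataset('glue', 'mrpc')   # DatasetDict keyed by split name
train_dataset = datasets['train']         # same as load_dataset('glue', 'mrpc', split='train')
print(train_dataset.num_rows)             # 3668
```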
- You can list the split names for a dataset, or a specific configuration, with the `get_dataset_split_names()` method:
- `get_dataset_split_names()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.get_dataset_split_names](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.get_dataset_split_names)
```python
from datasets import get_dataset_split_names
# ['validation', 'train']
get_dataset_split_names('sent_comp')
# ['test', 'train', 'validation']
get_dataset_split_names('glue', 'cola')
```
# 3、Quick start
- In the quick start, you will walk through all the steps to fine-tune BERT on a paraphrase classification task: loading and processing the MRPC dataset so that BERT can determine whether sentence pairs have the same meaning. Depending on the specific dataset you use, these steps may vary, but the general approach to loading and processing a dataset is the same.
- Detailed information on loading and processing a dataset (covering additional important topics like dynamic padding and fine-tuning with the Trainer API):
- Chapter 3:https://huggingface.co/course/chapter3/1?fw=pt【undo】
## (1)Load the dataset and model
Begin by loading the Microsoft Research Paraphrase Corpus (MRPC) training dataset from the General Language Understanding Evaluation (GLUE) benchmark.
- MRPC:a corpus of human annotated sentence pairs used to train a model to determine whether sentence pairs are semantically equivalent.
- GLUE:[https://huggingface.co/datasets/glue](https://huggingface.co/datasets/glue)
```python
from datasets import load_dataset
from transformers import AutoTokenizer
# jy: used to load the PyTorch version of the model
from transformers import AutoModelForSequenceClassification
# jy: used to load the TensorFlow version of the model
from transformers import TFAutoModelForSequenceClassification

# jy: the dataset is loaded via its corresponding loading script and is cached
#     locally by default (cache path: ~/.cache/huggingface/datasets/)
dataset = load_dataset('glue', 'mrpc', split='train')

# jy: import the pre-trained BERT model and its tokenizer from the Transformers
#     library; only loading the model differs between frameworks, the tokenizer
#     is loaded the same way in both.
# pytorch ==========================================================
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased')
"""
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification:
['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
"""

# TensorFlow =======================================================
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
"""
Some weights of the model checkpoint at bert-base-cased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
"""

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
```
<a name="NV67V"></a>
## (2)Tokenize the dataset
- Tokenize the text in order to build sequences of integers the model can understand.
- Encode the entire dataset with `Dataset.map()`, and truncate and pad the inputs to the maximum length of the model. This ensures the appropriate tensor batches are built.
- `Dataset.map()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.map](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.map)
```python
def encode(examples):
return tokenizer(examples['sentence1'],
examples['sentence2'],
truncation=True,
padding='max_length')
dataset = dataset.map(encode, batched=True)
dataset[0]
"""
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': array([101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102]),
'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
"""
```
- Notice how there are three new columns in the dataset: `input_ids`, `token_type_ids`, and `attention_mask`. These columns are the inputs to the model.
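- A quick way to verify this is to look at the column names after the `map()` call (a minimal sketch; the exact ordering of the columns may differ):
```python
# the original MRPC columns plus the new columns added by the tokenizer
print(dataset.column_names)
# e.g. ['sentence1', 'sentence2', 'label', 'idx',
#       'input_ids', 'token_type_ids', 'attention_mask']
```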
## (3)Format the dataset
- Depending on whether you are using PyTorch, TensorFlow, or JAX, you will need to format the dataset accordingly. There are three changes you need to make to the dataset:
### (a)Pytorch
- Rename the label column to labels, the expected input name in `BertForSequenceClassification`:
- `BertForSequenceClassification`:[https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification.forward](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification.forward)
```python
dataset = dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
```
- Retrieve the actual tensors from the Dataset object instead of using the current Python objects.
- Filter the dataset to only return the model inputs:
   - `input_ids`
   - `token_type_ids`
   - `attention_mask`
- `Dataset.set_format()` completes the last two steps on-the-fly. After you set the format, wrap the dataset in `torch.utils.data.DataLoader`:
- `Dataset.set_format()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.set_format](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.set_format)
```python
import torch

dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
next(iter(dataloader))
"""
{'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], ...,
                           [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0]]),
 'input_ids': tensor([[ 101,  7277,  2180, ..., 0, 0, 0], [ 101, 10684,  2599, ..., 0, 0, 0], [ 101,  1220,  1125, ..., 0, 0, 0], ...,
                      [ 101, 16944,  1107, ..., 0, 0, 0], [ 101,  1109, 11896, ..., 0, 0, 0], [ 101,  1109,  4173, ..., 0, 0, 0]]),
 'label': tensor([1, 0, 1, 0, 1, 1, 0, 1]),
 'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ...,
                           [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]])}
"""
```
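- As a side note, the same renaming can also be done with `Dataset.rename_column()` instead of the `map()` call above; this is a sketch of an alternative, not part of the original walkthrough:
```python
# rename_column returns a new Dataset in which 'label' has been renamed to 'labels'
dataset = dataset.rename_column('label', 'labels')
```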
<a name="ULm5h"></a>
### (b)TensorFlow
- Rename the label column to labels, the expected input name in `TFBertForSequenceClassification`:
- `TFBertForSequenceClassification`:[https://huggingface.co/docs/transformers/model_doc/bert#tfbertforsequenceclassification](https://huggingface.co/docs/transformers/model_doc/bert#tfbertforsequenceclassification)
```python
dataset = dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
import tensorflow as tf
dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']}
tfdataset = tf.data.Dataset.from_tensor_slices((features, dataset["labels"])).batch(32)
next(iter(tfdataset))
"""
({'input_ids': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
array([[ 101, 7277, 2180, ..., 0, 0, 0],
[ 101, 10684, 2599, ..., 0, 0, 0],
[ 101, 1220, 1125, ..., 0, 0, 0],
...,
[ 101, 1109, 2026, ..., 0, 0, 0],
[ 101, 22263, 1107, ..., 0, 0, 0],
[ 101, 142, 1813, ..., 0, 0, 0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}, <tf.Tensor: shape=(32,), dtype=int64, numpy=
array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
0, 1, 1, 1, 0, 0, 1, 1, 1, 0])>)
"""
```
## (4)Train the model
### (a)Pytorch
- Lastly, create a simple training loop and start training:
```python
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.train().to(device)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
for epoch in range(3):
    for i, batch in enumerate(tqdm(dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if i % 10 == 0:
            print(f"loss: {loss}")
```
<a name="SAzrR"></a>
### (b)TensorFlow
- Lastly, compile the model and start training:
```python
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
reduction=tf.keras.losses.Reduction.NONE,
from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt, loss=loss_fn, metrics=["accuracy"])
model.fit(tfdataset, epochs=3)