# 1、Installation

- `pip install datasets` (optionally pin a version, e.g. `pip install datasets==2.2.2`)
   - or: `conda install -c huggingface -c conda-forge datasets`
- Verify that the installation succeeded:
```python
from datasets import load_dataset

# jy: This should download version 1 of the Stanford Question Answering Dataset
#     (SQuAD), load the training split, and print the first training example:
print(load_dataset('squad', split='train')[0])
"""
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}
"""
```

- Stanford Question Answering Dataset (SQuAD):[https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/)
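- To double-check which version ended up installed, a quick sanity check (not in the original notes; any recent 2.x release prints similarly):
```python
import datasets

# print the installed library version
print(datasets.__version__)   # e.g. 2.2.2
```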
# 2、Usage

- The easiest way to load a dataset is from the Hugging Face Hub. There are already over 900 datasets in over 100 languages on the Hub. Choose from a wide category of datasets to use for NLP tasks like question answering, summarization, machine translation, and language modeling.
   - Hugging Face Hub:[https://huggingface.co/datasets](https://huggingface.co/datasets)
- For a more in-depth look inside a dataset, use the live Datasets Viewer.
   - Datasets Viewer:[https://huggingface.co/datasets/viewer/](https://huggingface.co/datasets/viewer/)
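- Besides browsing the Hub in a browser, dataset identifiers can also be listed programmatically. A minimal sketch, assuming a `datasets` 2.x release where `list_datasets()` is still available (the printed values are only illustrative):
```python
from datasets import list_datasets

# fetch the identifiers of all datasets currently hosted on the Hub
all_datasets = list_datasets()
print(len(all_datasets))    # several thousand; the number grows over time
print(all_datasets[:3])     # e.g. ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa']
```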
## (1)Load a dataset

- Before you take the time to download a dataset, it is often helpful to quickly get all the relevant information about a dataset. The `load_dataset_builder()` method allows you to inspect the attributes of a dataset without downloading it.
   - `load_dataset_builder()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.load_dataset_builder](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.load_dataset_builder)
```python
from datasets import load_dataset_builder

dataset_builder = load_dataset_builder('imdb')
print(dataset_builder.cache_dir)
"""
/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1
"""
print(dataset_builder.info.features)
"""
{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}
"""
print(dataset_builder.info.splits)
"""
{'train': SplitInfo(name='train', num_bytes=33432835, num_examples=25000, dataset_name='imdb'),
 'test': SplitInfo(name='test', num_bytes=32650697, num_examples=25000, dataset_name='imdb'),
 'unsupervised': SplitInfo(name='unsupervised', num_bytes=67106814, num_examples=50000, dataset_name='imdb')}
"""
```

- Once you have found the dataset you want, load it with `load_dataset()`:
```python
from datasets import load_dataset

dataset = load_dataset('glue', 'mrpc', split='train')
```

## (2)Select a configuration

- Some datasets, like the General Language Understanding Evaluation (GLUE) benchmark, are actually made up of several datasets. These sub-datasets are called **configurations**, and you must explicitly select one when you load the dataset. If you don't provide a configuration name, Datasets will raise a ValueError and remind you to select a configuration.
- Use the `get_dataset_config_names()` function to retrieve a list of all the possible configurations available to your dataset:
   - `get_dataset_config_names()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.get_dataset_config_names](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.get_dataset_config_names)
```python
from datasets import get_dataset_config_names

configs = get_dataset_config_names("glue")
# ['cola', 'sst2', 'mrpc', 'qqp', 'stsb', 'mnli', 'mnli_mismatched',
#  'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
print(configs)
```
- Incorrect way to load a configuration:
```python
from datasets import load_dataset

dataset = load_dataset('glue')
"""
ValueError: Config name is missing.
Please pick one among the available configs: ['cola', 'sst2', 'mrpc', 'qqp', 'stsb',
'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'rte', 'wnli', 'ax']
Example of usage:
    `load_dataset('glue', 'cola')`
"""
```


- Correct way to load a configuration:
```python
dataset = load_dataset('glue', 'sst2')
"""
Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB,
  total: 11.90 MiB) to /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0...
Downloading: 100%|██████████████████████████████████████████████████████████████| 
  7.44M/7.44M [00:01<00:00, 7.03MB/s]
Dataset glue downloaded and prepared to 
  /Users/thomwolf/.cache/huggingface/datasets/glue/sst2/1.0.0. Subsequent calls will
  reuse this data.
"""

print(dataset)
"""
{'train': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'}, 
                  num_rows: 67349),
 'validation': Dataset(schema: {'sentence': 'string', 'label': 'int64', 
                                'idx': 'int32'}, 
                       num_rows: 872),
 'test': Dataset(schema: {'sentence': 'string', 'label': 'int64', 'idx': 'int32'},
                 num_rows: 1821)
}
"""

```

## (3)Select a split

- A split is a specific subset of the dataset like train and test. Make sure you select a split when you load a dataset. If you don't supply a split argument, Datasets will only return a dictionary containing the subsets of the dataset.
```python
from datasets import load_dataset

datasets = load_dataset('glue', 'mrpc')
print(datasets)
"""
{train: Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 3668
})
validation: Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 408
})
test: Dataset({
    features: ['idx', 'label', 'sentence1', 'sentence2'],
    num_rows: 1725
})}
"""
```


- You can list the split names for a dataset, or a specific configuration, with the `get_dataset_split_names()` method:
   - `get_dataset_split_names()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.get_dataset_split_names](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/loading_methods#datasets.get_dataset_split_names)
```python
from datasets import get_dataset_split_names


# ['validation', 'train']
get_dataset_split_names('sent_comp')

# ['test', 'train', 'validation']
get_dataset_split_names('glue', 'cola')
```
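- The `split` argument also accepts slicing syntax, which is handy for quick experiments. An illustrative sketch (the 100-example cut-off is an arbitrary choice for the example):
```python
from datasets import load_dataset

# load only the first 100 training examples of MRPC
small_train = load_dataset('glue', 'mrpc', split='train[:100]')
print(small_train.num_rows)   # 100
```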

# 3、Quick start

- In the quick start, you will walk through all the steps to fine-tune BERT on a paraphrase classification task: the MRPC dataset is loaded and processed to fine-tune BERT to determine whether sentence pairs have the same meaning. Depending on the specific dataset you use, these steps may vary, but the general steps of how to load a dataset and process it are the same.
   - Detailed information on loading and processing a dataset covers additional important topics like dynamic padding and fine-tuning with the Trainer API (a sketch of that variant follows the PyTorch training loop below).

## (1)Load the dataset and model

- Begin by loading the Microsoft Research Paraphrase Corpus (MRPC) training dataset from the General Language Understanding Evaluation (GLUE) benchmark.
   - MRPC: a corpus of human-annotated sentence pairs used to train a model to determine whether sentence pairs are semantically equivalent.
   - GLUE:[https://huggingface.co/datasets/glue](https://huggingface.co/datasets/glue)
```python
from datasets import load_dataset
from transformers import AutoTokenizer
# jy: used to load the PyTorch version of the model
from transformers import AutoModelForSequenceClassification
# jy: used to load the TensorFlow version of the model
from transformers import TFAutoModelForSequenceClassification

# jy: the dataset is loaded via its corresponding dataset script and cached
#     locally by default (cache path: ~/.cache/huggingface/datasets/)
dataset = load_dataset('glue', 'mrpc', split='train')

# jy: import the pre-trained BERT model and its tokenizer from the Transformers
#     library; the frameworks differ only in how the model is loaded, the
#     tokenizer is loaded the same way.

# ========================== PyTorch ==========================
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased')
"""
Some weights of the model checkpoint at bert-base-cased were not used when
initializing BertForSequenceClassification: ['cls.predictions.bias',
'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias',
'cls.predictions.decoder.weight', 'cls.seq_relationship.weight',
'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight',
'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the
  checkpoint of a model trained on another task or with another architecture (e.g.
  initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from
  the checkpoint of a model that you expect to be exactly identical (initializing a
  BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model
checkpoint at bert-base-cased and are newly initialized:
['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it
for predictions and inference.
"""

# ========================== TensorFlow ==========================
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
"""
Some weights of the model checkpoint at bert-base-cased were not used when
initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the
  checkpoint of a model trained on another task or with another architecture (e.g.
  initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from
  the checkpoint of a model that you expect to be exactly identical (initializing a
  BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForSequenceClassification were not initialized from the model
checkpoint at bert-base-cased and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it
for predictions and inference.
"""

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
```

## (2)Tokenize the dataset

- Tokenize the text in order to build sequences of integers the model can understand. 
- Encode the entire dataset with `Dataset.map()`, and truncate and pad the inputs to the maximum length of the model. This ensures the appropriate tensor batches are built.
   - `Dataset.map()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.map](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.map)
```python
def encode(examples):
    return tokenizer(examples['sentence1'], 
                     examples['sentence2'], 
                     truncation=True, 
                     padding='max_length')

dataset = dataset.map(encode, batched=True)
dataset[0]
"""
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': array([101,  7277,  2180,  5303,  4806,  1117,  1711,   117,  2292, 1119,  1270,   107,  1103,  7737,   107,   117,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102, 11336,  6732, 3384,  1106,  1140,  1112,  1178,   107,  1103,  7737,   107, 117,  7277,  2180,  5303,  4806,  1117,  1711,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102]),
'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
"""
```
- Notice how there are three new columns in the dataset:
   - input_ids
   - token_type_ids
   - attention_mask
- These columns are the inputs to the model.
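- As a quick sanity check (not part of the original quick start, but using the `tokenizer` loaded earlier), you can decode the `input_ids` of an encoded example back into text to see exactly what the model receives; the printed string below is only illustrative:
```python
# decode the first encoded example back into (sub)word tokens; the amount of
# trailing [PAD] tokens depends on the model's maximum length
print(tokenizer.decode(dataset[0]['input_ids']))
# e.g. '[CLS] Amrozi accused his brother , whom he called " the witness " , ... [SEP] Referring to him as only " the witness " , ... [SEP] [PAD] [PAD] ...'
```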

## (3)Format the dataset

- Depending on whether you are using PyTorch, TensorFlow, or JAX, you will need to format the dataset accordingly.
- There are three changes you need to make to the dataset:

### (a)Pytorch

- Rename the `label` column to `labels` (the expected input name in `BertForSequenceClassification`).
- Retrieve the actual tensors from the Dataset object instead of using the current Python objects.
- Filter the dataset to only return the model inputs: `input_ids`, `token_type_ids`, `attention_mask`.
- `Dataset.set_format()` completes the last two steps on-the-fly. After you set the format, wrap the dataset in `torch.utils.data.DataLoader`:
```python
import torch

# rename the label column to labels, the input name expected by
# BertForSequenceClassification (same step as in the TensorFlow section below)
dataset = dataset.map(lambda examples: {'labels': examples['label']}, batched=True)

# keep only the model inputs (plus labels) and return them as torch tensors
dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
next(iter(dataloader))
"""
{'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
                           [1, 1, 1, ..., 0, 0, 0],
                           ...,
                           [1, 1, 1, ..., 0, 0, 0]]),
 'input_ids': tensor([[  101,  7277,  2180, ...,     0,     0,     0],
                      [  101, 10684,  2599, ...,     0,     0,     0],
                      [  101,  1220,  1125, ...,     0,     0,     0],
                      ...,
                      [  101, 16944,  1107, ...,     0,     0,     0],
                      [  101,  1109, 11896, ...,     0,     0,     0],
                      [  101,  1109,  4173, ...,     0,     0,     0]]),
 'label': tensor([1, 0, 1, 0, 1, 1, 0, 1]),
 'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0],
                           [0, 0, 0, ..., 0, 0, 0],
                           ...,
                           [0, 0, 0, ..., 0, 0, 0]])}
"""
```

### (b)TensorFlow

- Rename the label column to labels, the expected input name in `TFBertForSequenceClassification`:
   -  `TFBertForSequenceClassification`:[https://huggingface.co/docs/transformers/model_doc/bert#tfbertforsequenceclassification](https://huggingface.co/docs/transformers/model_doc/bert#tfbertforsequenceclassification)
```python
dataset = dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
import tensorflow as tf

dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']}
tfdataset = tf.data.Dataset.from_tensor_slices((features, dataset["labels"])).batch(32)
next(iter(tfdataset))
"""
({'input_ids': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
array([[  101,  7277,  2180, ...,     0,     0,     0],
   [  101, 10684,  2599, ...,     0,     0,     0],
   [  101,  1220,  1125, ...,     0,     0,     0],
   ...,
   [  101,  1109,  2026, ...,     0,     0,     0],
   [  101, 22263,  1107, ...,     0,     0,     0],
   [  101,   142,  1813, ...,     0,     0,     0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0],
   ...,
   [0, 0, 0, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
   [1, 1, 1, ..., 0, 0, 0],
   [1, 1, 1, ..., 0, 0, 0],
   ...,
   [1, 1, 1, ..., 0, 0, 0],
   [1, 1, 1, ..., 0, 0, 0],
   [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}, <tf.Tensor: shape=(32,), dtype=int64, numpy=
array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
   0, 1, 1, 1, 0, 0, 1, 1, 1, 0])>)
"""

```

## (4)Train the model

### (a)Pytorch

- Lastly, create a simple training loop and start training:
```python
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.train().to(device)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
for epoch in range(3):
    for i, batch in enumerate(tqdm(dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if i % 10 == 0:
            print(f"loss: {loss}")
```
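- The quick start above pads every example to the model's maximum length and writes the training loop by hand. The "dynamic padding + Trainer API" variant mentioned at the top is sketched below. This is only an illustrative sketch, not part of the original quick start: it assumes `Trainer`, `TrainingArguments`, and `DataCollatorWithPadding` from `transformers`, and the hyperparameters (including the hypothetical `output_dir='./mrpc-bert'`) are arbitrary choices for the example:
```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased')

def encode(examples):
    # truncate only; padding is deferred to the data collator, which pads each
    # batch to the length of its longest sequence (dynamic padding)
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True)

train_dataset = load_dataset('glue', 'mrpc', split='train').map(encode, batched=True)
# rename label -> labels, the input name expected by the model
train_dataset = train_dataset.rename_column('label', 'labels')

training_args = TrainingArguments(
    output_dir='./mrpc-bert',          # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```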

### (b)TensorFlow

- Lastly, compile the model and start training:
```python
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE, 
    from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt, loss=loss_fn, metrics=["accuracy"])
model.fit(tfdataset, epochs=3)
```