- You have already seen how to load a dataset from the Hugging Face Hub. But datasets are stored in a variety of places, and sometimes you won’t find the one you want on the Hub.
- A dataset can be on disk on your local machine, in a GitHub repository, or in in-memory data structures like Python dictionaries and Pandas DataFrames.
- Wherever your dataset may be stored, Datasets provides a way for you to load and use it for training.
- This guide will show you how to load a dataset from:
- The Hub without a dataset loading script
- Local loading script
- Local files
- In-memory data
- Offline
- A specific slice of a split
You will also learn how to troubleshoot common errors, and how to load specific configurations of a metric.
# 1、Load from the Hugging Face Hub
- This method relies on a dataset loading script that downloads and builds the dataset. However, you can also load a dataset from any dataset repository on the Hub without a loading script!
- First, create a dataset repository and upload your data files. Then you can use `load_dataset()` like you learned in the tutorial.
- For example, load the files from the demo repository by providing the repository namespace and dataset name:
- demo repository: https://huggingface.co/datasets/lhoestq/demo1
```python
from datasets import load_dataset

# jy: This dataset repository contains CSV files, and this code loads all the data
#     from the CSV files.
dataset = load_dataset("lhoestq/demo1")
```
- Some datasets may have more than one version, based on Git tags, branches or commits. Use the `revision` flag to specify which dataset version you want to load:
```python
# jy: the revision argument takes a tag name, branch name, or commit hash
dataset = load_dataset("lhoestq/custom_squad", revision="main")
```
- Refer to the Upload guide for more instructions on how to create a dataset repository on the Hub, and how to upload your data files.
- Upload guide:https://huggingface.co/docs/datasets/upload_dataset
- If the dataset doesn't have a dataset loading script, then by default all the data will be loaded in the train split. Use the `data_files` parameter to map data files to splits like train, validation and test:
```python
data_files = {"train": "train.csv", "test": "test.csv"}
dataset = load_dataset("namespace/your_dataset_name", data_files=data_files)
```
- If you don't specify which data files to use, `load_dataset()` will return all the data files. This can take a long time if you are loading a large dataset like C4, which is approximately 13TB of data.
- You can also load a specific subset of the files with the `data_files` parameter. The example below loads files from the C4 dataset:
- C4 dataset: https://huggingface.co/datasets/allenai/c4
```python
from datasets import load_dataset

c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
```
- Specify a custom split with the `split` parameter:
```python
data_files = {"validation": "en/c4-validation.*.json.gz"}
c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation")
```
# 2、Local loading script
- You may have a Datasets loading script locally on your computer. You can load this dataset by passing one of the following paths to `load_dataset()`:
- The local path to the loading script file.
- The local path to the directory containing the loading script file (only if the script file has the same name as the directory).
```python
dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train")

# equivalent because the file has the same name as the directory
dataset = load_dataset("path/to/local/loading_script", split="train")
```
# 3、Load local and remote files
- Datasets can be loaded from local files stored on your computer, and also from remote files. The datasets are most likely stored as csv, json, txt or parquet files. The `load_dataset()` method is able to load each of these file types.
- Curious about how to load datasets for vision? See the image loading guide: https://huggingface.co/docs/datasets/image_process
## (1)CSV
- Datasets can read a dataset made up of one or several CSV files:
```python
from datasets import load_dataset

dataset = load_dataset("csv", data_files="my_file.csv")

# jy: If you have more than one CSV file:
dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])

# jy: You can also map the training and test splits to specific CSV files:
dataset = load_dataset("csv", data_files={"train": ["my_train_file_1.csv", "my_train_file_2.csv"],
                                          "test": "my_test_file.csv"})

# jy: To load remote CSV files via HTTP, you can pass the URLs:
base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/"
dataset = load_dataset("csv", data_files={"train": base_url + "train.csv", "test": base_url + "test.csv"})

# jy: To load zipped CSV files:
url = "https://domain.org/train_data.zip"
data_files = {"train": url}
dataset = load_dataset("csv", data_files=data_files)

# jy: usage example -- choose the delimiter from the file name (f_name is assumed to hold the data file path):
dataset = load_dataset("csv", data_files=data_files, cache_dir="/path/to/cache_dir",
                       delimiter="\t" if "tsv" in f_name else ",")
```
<a name="nlXeR"></a>
## (2)JSON
- JSON files are loaded directly with `load_dataset()` as shown below:
```python
from datasets import load_dataset
dataset = load_dataset('json', data_files='my_file.json')
```
- JSON files can have diverse formats, but we think the most efficient format is to have multiple JSON objects, where each line represents an individual row of data. For example:
```json
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
```
- Another JSON format you may encounter is a nested field, in which case you will need to specify the `field` argument as shown in the following:
```json
{"version": "0.1.0",
 "data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
          {"a": 4, "b": -5.5, "c": null, "d": true}]
}
```
```python
from datasets import load_dataset
dataset = load_dataset('json', data_files='my_file.json', field='data')
```
- To load remote JSON files via HTTP, you can pass the URLs:
```python
base_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
dataset = load_dataset("json",
                       data_files={"train": base_url + "train-v1.1.json",
                                   "validation": base_url + "dev-v1.1.json"},
                       field="data")
```
- While these are the most common JSON formats, you will see other datasets that are formatted differently. Datasets recognizes these other formats and will fall back to the Python JSON loading methods to handle them, as in the sketch below.
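- A minimal sketch of one such case (the file name is hypothetical): the file holds a single top-level JSON array instead of one object per line, and it still loads without the `field` argument.
```python
import json
from datasets import load_dataset

# Write a small JSON file that is one top-level array of objects.
rows = [{"a": 1, "b": 2.0, "c": "foo", "d": False},
        {"a": 4, "b": -5.5, "c": None, "d": True}]
with open("my_list_file.json", "w") as f:
    json.dump(rows, f)

# The json loader falls back to plain Python JSON parsing for this layout.
dataset = load_dataset("json", data_files="my_list_file.json")
```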
## (3)Text files
- Text files are one of the most common file types for storing a dataset. Datasets will read the text file line by line to build the dataset.
```python
from datasets import load_dataset

dataset = load_dataset("text", data_files={"train": ["my_text_1.txt", "my_text_2.txt"], "test": "my_test_file.txt"})

# jy: To load remote TXT files via HTTP, you can pass the URLs:
dataset = load_dataset("text", data_files="https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt")
```
<a name="z72Gp"></a>
## (4)Parquet
- Parquet files are stored in a columnar format unlike row-based files like CSV. Large datasets may be stored in a Parquet file because it is more efficient, and faster at returning your query. Load a Parquet file as shown in the following example:
```python
from datasets import load_dataset
dataset = load_dataset("parquet",
data_files={'train': 'train.parquet',
'test': 'test.parquet'})
- To load remote parquet files via HTTP, you can pass the URLs:
```python
base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/"
data_files = {"train": base_url + "wikipedia-train.parquet"}
wiki = load_dataset("parquet", data_files=data_files, split="train")
```
<a name="aTItG"></a>
# 4、Load in-memory data
- Datasets will also allow you to create a `Dataset` directly from in-memory data structures like Python dictionaries and Pandas DataFrames.
<a name="M2c0g"></a>
## (1)Python dictionary
- Load Python dictionaries with `Dataset.from_dict()`:
- `Dataset.from_dict()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.from_dict](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.from_dict)
```python
from datasets import Dataset
my_dict = {"a": [1, 2, 3]}
dataset = Dataset.from_dict(my_dict)
```
## (2)Pandas DataFrame
- Load Pandas DataFrames with `Dataset.from_pandas()`:
- `Dataset.from_pandas()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.from_pandas](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.from_pandas)
```python
from datasets import Dataset
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
dataset = Dataset.from_pandas(df)
```
- An object data type in `pandas.Series` doesn't always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. Avoid potential errors by constructing an explicit schema with `Features` and passing it to the `from_dict` or `from_pandas` methods (see the sketch after this list).
- `pandas.Series`:[https://pandas.pydata.org/docs/reference/api/pandas.Series.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)
- `Features`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Features](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Features)
- For more details on how to explicitly specify your own features, see Section 7 below.
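- A minimal sketch of that workaround (the column names are hypothetical): one column contains only `None`, so instead of letting Arrow infer a null type for it, the schema is declared explicitly with `Features`.
```python
from datasets import Dataset, Features, Value
import pandas as pd

# "score" contains only None, so Arrow on its own would infer a null type for it.
df = pd.DataFrame({"text": ["foo", "bar"], "score": [None, None]})

# Declare the intended types explicitly and pass them to from_pandas().
features = Features({"text": Value("string"), "score": Value("float64")})
dataset = Dataset.from_pandas(df, features=features)
print(dataset.features)
```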
<a name="QEHQ4"></a>
# 5、Loading data offline
- Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub or Datasets GitHub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.
- If you know you won't have internet access, you can run Datasets in full offline mode. This saves time because instead of waiting for the Dataset builder download to time out, Datasets will look directly in the cache. Set the environment variable `HF_DATASETS_OFFLINE` to 1 to enable full offline mode.
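- A minimal sketch of enabling full offline mode from within a script (exporting `HF_DATASETS_OFFLINE=1` in your shell works just as well). The variable must be set before `datasets` is imported, and the dataset below is assumed to be in the local cache already:
```python
import os

# Tell Datasets not to reach out to the network at all.
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# Served straight from the local cache; raises an error if it was never downloaded.
dataset = load_dataset("lhoestq/demo1")
```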
<a name="xS5M7"></a>
# 6、Slice splits
- For even greater control over how to load a split, you can choose to only load specific slices of a split. There are two options for slicing a split: using strings or `ReadInstruction`.
- Strings are more compact and readable for simple cases
- `ReadInstruction` is easier to use with variable slicing parameters (see the sketch after the examples below).
- `ReadInstruction`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/builder_classes#datasets.ReadInstruction](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/builder_classes#datasets.ReadInstruction)
```python
import datasets

# jy: Concatenate the train and test split by:
train_test_ds = datasets.load_dataset('bookcorpus', split='train+test')
# jy: Select specific rows of the train split:
train_10_20_ds = datasets.load_dataset('bookcorpus', split='train[10:20]')
# jy: Or select a percentage of the split with:
train_10pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]')
# jy: You can even select a combination of percentages from each split:
train_10_80pct_ds = datasets.load_dataset('bookcorpus',
split='train[:10%]+train[-80%:]')
# jy: Finally, create cross-validated dataset splits by: 10-fold cross-validation
# (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = datasets.load_dataset('bookcorpus',
                                split=[f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)])
trains_ds = datasets.load_dataset('bookcorpus',
                                  split=[f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)])
```
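- As a point of comparison, here is a minimal sketch of the same kind of slicing expressed with `ReadInstruction` objects, which is handy when the boundaries live in variables (the `start`/`stop` values below are placeholders):
```python
import datasets

# Placeholder, variable slice boundaries.
start, stop = 10, 20
ri = datasets.ReadInstruction('train', from_=start, to=stop, unit='abs')
train_10_20_ds = datasets.load_dataset('bookcorpus', split=ri)

# Instructions can be added together, mirroring the '+' in the string form.
combined = (datasets.ReadInstruction('train', to=10, unit='%')
            + datasets.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = datasets.load_dataset('bookcorpus', split=combined)
```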
## Percent slicing and rounding
- For datasets where the requested slice boundaries do not divide evenly by 100, the default behavior is to round the boundaries to the nearest integer. As a result, some slices may contain more examples than others as shown in the following example:
```python
# Assuming the train split contains 999 records.

# 19 records, from 500 (included) to 519 (excluded).
train_50_52_ds = datasets.load_dataset('bookcorpus', split='train[50%:52%]')

# 20 records, from 519 (included) to 539 (excluded).
train_52_54_ds = datasets.load_dataset('bookcorpus', split='train[52%:54%]')
```
- If you want equal sized splits, use `pct1_dropremainder` rounding instead. This will treat the specified percentage boundaries as multiples of 1%.
```python
# 18 records, from 450 (included) to 468 (excluded).
train_50_52pct1_ds = datasets.load_dataset('bookcorpus',
split=datasets.ReadInstruction(
'train',
from_=50, to=52,
unit='%',
rounding='pct1_dropremainder'))
# 18 records, from 468 (included) to 486 (excluded).
train_52_54pct1_ds = datasets.load_dataset('bookcorpus',
split=datasets.ReadInstruction(
'train',
from_=52, to=54,
unit='%',
rounding='pct1_dropremainder'))
# Or equivalently:
train_50_52pct1_ds = datasets.load_dataset('bookcorpus',
split='train[50%:52%](pct1_dropremainder)')
train_52_54pct1_ds = datasets.load_dataset('bookcorpus',
split='train[52%:54%](pct1_dropremainder)')
```
- Using `pct1_dropremainder` rounding may truncate the last examples in a dataset if the number of examples in your dataset doesn't divide evenly by 100. For example, with 999 records each 1% block holds floor(999/100) = 9 records, so `train[0%:100%](pct1_dropremainder)` yields only 900 records and the final 99 are dropped.
# 7、Troubleshooting
Sometimes, you may get unexpected results when you load a dataset. In this section, you will learn how to solve two common issues you may encounter when you load a dataset: manually download a dataset, and specify features of a dataset.
## (1)Manual download
- Certain datasets require you to manually download the dataset files due to licensing incompatibility, or because the files are hidden behind a login page. This will cause `load_dataset()` to throw an `AssertionError`.
- But Datasets provides detailed instructions for downloading the missing files. After you have downloaded the files, use the `data_dir` argument to specify the path to the files you just downloaded.
- For example, if you try to download a configuration from the MATINF dataset:
- MATINF: https://huggingface.co/datasets/matinf
```python
dataset = load_dataset("matinf", "summarization")
"""
Downloading and preparing dataset matinf/summarization (download: Unknown size, generated: 246.89 MiB, post-processed: Unknown size, total: 246.89 MiB) to /root/.cache/huggingface/datasets/matinf/summarization/1.0.0/82eee5e71c3ceaf20d909bca36ff237452b4e4ab195d3be7ee1c78b53e6f540e...

AssertionError: The dataset matinf with config summarization requires manual data.
Please follow the manual download instructions: To use MATINF you have to download it manually. Please fill this google form (https://forms.gle/nkH4LVE4iNQeDzsc9). You will receive a download link and a password once you complete the form. Please extract all files in one folder and load the dataset with: datasets.load_dataset('matinf', data_dir='path/to/folder/folder_name').
Manual data can be loaded with:
datasets.load_dataset('matinf', data_dir='path/to/folder/folder_name')
"""
```
<a name="iM6vu"></a>
## (2)Specify features
- When you create a dataset from local files, the `Features` are automatically inferred by Apache Arrow. However, the features of the dataset may not always align with your expectations or you may want to define the features yourself.
- `Features`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Features](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Features)
- Apache Arrow:[https://arrow.apache.org/docs/](https://arrow.apache.org/docs/)
- The following example shows how you can add custom labels with `ClassLabel`. First, define your own labels using the `Features` class:
- `ClassLabel`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.ClassLabel](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.ClassLabel)
```python
from datasets import Features, Value, ClassLabel

class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
emotion_features = Features({'text': Value('string'),
                             'label': ClassLabel(names=class_names)})
```
- Next, specify the `features` argument in `load_dataset()` with the features you just created:
```python
dataset = load_dataset('csv', data_files=file_dict, delimiter=';',
                       column_names=['text', 'label'], features=emotion_features)
```
- Now when you look at your dataset features, you can see it uses the custom labels you defined:
```python
dataset['train'].features
"""
{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)}
"""
```
# 8、Metrics
- When the metric you want to use is not supported by Datasets, you can write and use your own metric script. Load your metric by providing the path to your local metric loading script:
```python
from datasets import load_metric

metric = load_metric('PATH/TO/MY/METRIC/SCRIPT')

# Example of typical usage
for batch in dataset:
    inputs, references = batch
    predictions = model(inputs)
    metric.add_batch(predictions=predictions, references=references)
score = metric.compute()
```
- For more details on how to write your own metric loading script, see the Metrics section later on.
<a name="QIg7H"></a>
## (1)Load configurations
- It is possible for a metric to have different configurations. The configurations are stored in the `config_name` parameter in `MetricInfo` attribute. When you load a metric, provide the configuration name as shown in the following:
- `MetricInfo`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.MetricInfo](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.MetricInfo)
```python
from datasets import load_metric
metric = load_metric('bleurt', name='bleurt-base-128')
metric = load_metric('bleurt', name='bleurt-base-512')
```
## (2)Distributed setup
- When you work in a distributed or parallel processing environment, loading and computing a metric can be tricky because these processes are executed in parallel on separate subsets of the data.
- Datasets supports distributed usage with a few additional arguments when you load a metric.
- For example, imagine you are training and evaluating on eight parallel processes. Here’s how you would load a metric in this distributed setting:
- Define the total number of processes with the `num_process` argument.
- Set the process rank as an integer between zero and `num_process - 1`.
- Load your metric with `load_metric()`, passing these arguments:
```python
from datasets import load_metric

metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=rank)
```
- Once you've loaded a metric for distributed usage, you can compute the metric as usual. Behind the scenes, `Metric.compute()` gathers all the predictions and references from the nodes, and computes the final metric.
- `Metric.compute()`:[https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Metric.compute](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Metric.compute)
- In some instances, you may be simultaneously running multiple independent distributed evaluations on the same server and files. To avoid any conflicts, it is important to provide an `experiment_id` to distinguish the separate evaluations:
```python
from datasets import load_metric
metric = load_metric('glue', 'mrpc',
                     num_process=num_process,
                     process_id=process_id,
                     experiment_id="My_experiment_10")
```
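- Putting the pieces above together, a hedged sketch of what the evaluation loop on each process might look like (`model`, `dataloader`, `rank` and `num_process` are placeholders supplied by your own training setup):
```python
from datasets import load_metric

metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=rank)

# Each process only sees and scores its own shard of the evaluation data.
for batch in dataloader:
    predictions = model(batch["inputs"])
    metric.add_batch(predictions=predictions, references=batch["labels"])

# Predictions and references are gathered across processes before scoring.
score = metric.compute()
```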