【07】datasets - （02）The Dataset object - 《【03】Machine Learning and Deep Learning》


This section will familiarize you with the Dataset object. You will learn about the metadata stored inside a Dataset object, and the basics of querying a Dataset object to return rows and columns.
- Dataset：https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset
A `Dataset` object is returned when you load an instance of a dataset. This object behaves like a normal Python container.
```python
from datasets import load_dataset

dataset = load_dataset('glue', 'mrpc', split='train')
```

<a name="qUdVb"></a>
## 1、Metadata
- The `Dataset` object contains a lot of useful information about your dataset. For example, access `DatasetInfo` to return a short description of the dataset, the authors, and even the dataset size. This gives you a quick snapshot of the dataset's most important attributes.
   - [https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.DatasetInfo](https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.DatasetInfo)
```python
dataset.info
"""
DatasetInfo(
    description='GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n', 
    citation='@inproceedings{dolan2005automatically,\n  title={Automatically constructing a corpus of sentential paraphrases},\n  author={Dolan, William B and Brockett, Chris},\n  booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n  year={2005}\n}\n@inproceedings{wang2019glue,\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n  note={In the Proceedings of ICLR.},\n  year={2019}\n}\n', homepage='https://www.microsoft.com/en-us/download/details.aspx?id=52398', 
    license='', 
    features={'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}, post_processed=None, supervised_keys=None, builder_name='glue', config_name='mrpc', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=943851, num_examples=3668, dataset_name='glue'), 'validation': SplitInfo(name='validation', num_bytes=105887, num_examples=408, dataset_name='glue'), 'test': SplitInfo(name='test', num_bytes=442418, num_examples=1725, dataset_name='glue')}, 
    download_checksums={'https://dl.fbaipublicfiles.com/glue/data/mrpc_dev_ids.tsv': {'num_bytes': 6222, 'checksum': '971d7767d81b997fd9060ade0ec23c4fc31cbb226a55d1bd4a1bac474eb81dc7'}, 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt': {'num_bytes': 1047044, 'checksum': '60a9b09084528f0673eedee2b69cb941920f0b8cd0eeccefc464a98768457f89'}, 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt': {'num_bytes': 441275, 'checksum': 'a04e271090879aaba6423d65b94950c089298587d9c084bf9cd7439bd785f784'}}, 
    download_size=1494541, 
    post_processing_size=None, 
    dataset_size=1492156, 
    size_in_bytes=2986697
)
"""
```
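The size fields in this `DatasetInfo` output are related to each other, and a quick arithmetic check makes that visible. The sketch below only uses numbers copied from the output shown here. The variable names are illustrative, and whether the same relationships hold for every dataset is an assumption, not something this output proves.

```python
# Values copied from the DatasetInfo output for glue/mrpc shown in this section.
download_size = 1494541
dataset_size = 1492156
size_in_bytes = 2986697

split_num_bytes = {"train": 943851, "validation": 105887, "test": 442418}
split_num_examples = {"train": 3668, "validation": 408, "test": 1725}

# For this output, the per-split byte counts add up to dataset_size ...
assert sum(split_num_bytes.values()) == dataset_size
# ... and size_in_bytes equals download_size plus dataset_size.
assert download_size + dataset_size == size_in_bytes

# Total number of examples across the three splits.
total_examples = sum(split_num_examples.values())
print(total_examples)  # 5801
```

Note that `dataset.num_rows` for the `train` split alone is 3668, which matches the `train` entry rather than the total.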

You can request specific attributes of the dataset, like `description`, `citation`, and `homepage`, by calling them directly. Take a look at `DatasetInfo` for a complete list of attributes you can return.
```python
dataset.split
# jy: NamedSplit('train')

dataset.description
"""
'GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n'
"""

dataset.citation
"""
'@inproceedings{dolan2005automatically,\n title={Automatically constructing a corpus of sentential paraphrases},\n author={Dolan, William B and Brockett, Chris},\n booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n year={2005}\n}\n@inproceedings{wang2019glue,\n title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n note={In the Proceedings of ICLR.},\n year={2019}\n}\n\nNote that each GLUE dataset has its own citation. Please see the source to see\nthe correct citation for each contained dataset.'
"""

dataset.homepage
"""
'https://www.microsoft.com/en-us/download/details.aspx?id=52398'
"""
```

<a name="QS1z1"></a>
## 2、Features and columns
- A dataset is a table of rows and typed columns. Querying a dataset returns a Python dictionary where the keys correspond to column names, and the values correspond to column values:
```python
dataset[0]
"""
{'idx': 0,
'label': 1,
'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
"""
```

Return the number of rows and columns with the following standard attributes:
```python
dataset.shape          # (3668, 4)
dataset.num_columns    # 4
dataset.num_rows       # 3668
len(dataset)           # 3668
```

List the column names with `Dataset.column_names`:
- Dataset.column_names()：https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.column_names
```
dataset.column_names   # ['idx', 'label', 'sentence1', 'sentence2']
```

Get detailed information about the columns with `Features`:
- Features：https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Features
```python
dataset.features
"""
{'idx': Value(dtype='int32', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
}
"""
```

Return even more specific information about a feature like `ClassLabel` by accessing its attributes `num_classes` and `names`:
- ClassLabel：https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.ClassLabel
```
dataset.features['label'].num_classes  # 2
dataset.features['label'].names        # ['not_equivalent', 'equivalent']
```
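The order of `names` matters: the integer stored in the `label` column is the position of the class name in that list. As a minimal sketch of that mapping, the snippet below uses a plain Python list copied from the output shown here, so it runs without downloading anything. The helper functions `label_to_name` and `name_to_label` are hypothetical names chosen for illustration. In the library itself, `ClassLabel` offers `int2str` and `str2int` for the same purpose.

```python
# Class names copied from dataset.features['label'].names in this section.
names = ['not_equivalent', 'equivalent']


def label_to_name(label_id):
    """Hypothetical helper: map an integer label to its class name."""
    return names[label_id]


def name_to_label(name):
    """Hypothetical helper: map a class name back to its integer label."""
    return names.index(name)


# dataset[0] has 'label': 1 in the example output.
print(label_to_name(1))                 # equivalent
print(name_to_label('not_equivalent'))  # 0
```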
## 3、Rows, slices, batches, and columns
Get several rows of your dataset at a time with slice notation or a list of indices:
```python
dataset[:3]
"""
{'idx': [0, 1, 2],
 'label': [1, 0, 1],
 'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."]
}
"""

dataset[[1, 3, 5]]
"""
{'idx': [1, 3, 5],
 'label': [0, 0, 1],
 'sentence1': ["Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', 'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .'],
 'sentence2': ["Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", 'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .', "With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier ."]
}
"""
```
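A slice or index list comes back column-wise, as one dictionary whose values are lists. If you would rather loop over the result row by row, the batch can be transposed with plain Python. This is a sketch on a trimmed copy of the `dataset[:3]` output (only the `idx` and `label` columns are kept for brevity), and the variable names are illustrative.

```python
# Trimmed copy of the dataset[:3] output: a dictionary of equally long lists.
batch = {"idx": [0, 1, 2], "label": [1, 0, 1]}

# zip(*batch.values()) walks the columns in parallel, yielding one tuple per row.
# zip(batch, values) then pairs each column name with that row's value.
rows = [dict(zip(batch, values)) for values in zip(*batch.values())]

print(rows)
# [{'idx': 0, 'label': 1}, {'idx': 1, 'label': 0}, {'idx': 2, 'label': 1}]
```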


- Querying by the column name will return its values. For example, if you want to only return the first three examples：
```python
dataset['sentence1'][:3]
"""
['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .']
"""
```

Depending on how a Dataset object is queried, the format returned will be different:
- A single row like dataset[0] returns a Python dictionary of values.
- A batch like dataset[5:10] returns a Python dictionary of lists of values.
- A column like dataset['sentence1'] returns a Python list of values.
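To make these three return formats concrete without loading any data, here is a toy model of the indexing rules. `TinyTable` is a hypothetical class written for this note. It is not part of the `datasets` library and says nothing about how `Dataset` is implemented internally. It only mimics the shapes of the results described in the list. The `idx` and `label` values are the first four from the MRPC examples in this section.

```python
class TinyTable:
    """Hypothetical column-oriented table that mimics the return formats only."""

    def __init__(self, columns):
        # columns: dict mapping a column name to a list of values
        self.columns = columns

    def __len__(self):
        # Number of rows, taken from the first column.
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column name -> list of values
            return self.columns[key]
        if isinstance(key, (int, slice)):
            # int -> dict of single values; slice -> dict of lists
            return {name: values[key] for name, values in self.columns.items()}
        # List of indices -> dict of lists
        return {name: [values[i] for i in key] for name, values in self.columns.items()}


table = TinyTable({"idx": [0, 1, 2, 3], "label": [1, 0, 1, 0]})

print(table[0])        # {'idx': 0, 'label': 1}
print(table[1:3])      # {'idx': [1, 2], 'label': [0, 1]}
print(table[[0, 3]])   # {'idx': [0, 3], 'label': [1, 0]}
print(table["label"])  # [1, 0, 1, 0]
```

Storing the data by column is what makes the batch format natural here: a slice is just the same slice applied to every column.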
