# Natural Language Inference and the Dataset

:label:sec_natural-language-inference-and-dataset

In :numref:sec_sentiment, we discussed the problem of sentiment analysis. This task aims to classify a single text sequence into predefined categories, such as a set of sentiment polarities. However, when we need to decide whether one sentence can be inferred from another, or to eliminate redundancy by identifying sentences that are semantically equivalent, knowing how to classify one text sequence is insufficient. Instead, we need to be able to reason over pairs of text sequences.

## Natural Language Inference

Natural language inference studies whether a hypothesis can be inferred from a premise, where both are text sequences. In other words, natural language inference determines the logical relationship between a pair of text sequences. Such relationships usually fall into three types:

  • Entailment: the hypothesis can be inferred from the premise.
  • Contradiction: the negation of the hypothesis can be inferred from the premise.
  • Neutral: all the other cases.

Natural language inference is also known as the recognizing textual entailment task. For example, the following pair will be labeled as entailment because “showing affection” in the hypothesis can be inferred from “hugging each other” in the premise.

Premise: Two women are hugging each other.

Hypothesis: Two women are showing affection.

The following is an example of contradiction as “running the coding example” indicates “not sleeping” rather than “sleeping”.

Premise: A man is running the coding example from Dive into Deep Learning.

Hypothesis: The man is sleeping.

The third example shows a neutral relationship because neither “famous” nor “not famous” can be inferred from the fact that the musicians “are performing for us”.

Premise: The musicians are performing for us.

Hypothesis: The musicians are famous.
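
As a minimal sketch, the three example pairs above can be written down as labeled data. The names `LABEL_TO_ID`, `examples`, and `encoded` are illustrative assumptions, and the integer ids simply follow the mapping that this section uses later when reading the dataset.

```python
# Hypothetical container for labeled premise-hypothesis pairs.
# The ids mirror the label mapping used later in this section.
LABEL_TO_ID = {'entailment': 0, 'contradiction': 1, 'neutral': 2}

examples = [
    ('Two women are hugging each other.',
     'Two women are showing affection.', 'entailment'),
    ('A man is running the coding example from Dive into Deep Learning.',
     'The man is sleeping.', 'contradiction'),
    ('The musicians are performing for us.',
     'The musicians are famous.', 'neutral'),
]

# Each item becomes ((premise, hypothesis), label_id): a pair of inputs
# and a single class label, unlike single-sequence classification.
encoded = [((p, h), LABEL_TO_ID[y]) for p, h, y in examples]
for (p, h), y in encoded:
    print(y, '|', p, '->', h)
```

Keeping the premise and hypothesis together as one input pair reflects that the label describes their relationship, not either sentence alone.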

Natural language inference has been a central topic for understanding natural language. It enjoys wide applications ranging from information retrieval to open-domain question answering. To study this problem, we will begin by investigating a popular natural language inference benchmark dataset.

## The Stanford Natural Language Inference (SNLI) Dataset

The Stanford Natural Language Inference (SNLI) Corpus is a collection of over 500,000 labeled English sentence pairs :cite:Bowman.Angeli.Potts.ea.2015. We download and store the extracted SNLI dataset in the path ../data/snli_1.0.

```{.python .input}
from d2l import mxnet as d2l
from mxnet import gluon, np, npx
import os
import re

npx.set_np()

#@save
d2l.DATA_HUB['SNLI'] = (
    'https://nlp.stanford.edu/projects/snli/snli_1.0.zip',
    '9fcde07509c7e87ec61c640c1b2753d9041758e4')

data_dir = d2l.download_extract('SNLI')
```

```{.python .input}
#@tab pytorch
from d2l import torch as d2l
import torch
from torch import nn
import os
import re

#@save
d2l.DATA_HUB['SNLI'] = (
    'https://nlp.stanford.edu/projects/snli/snli_1.0.zip',
    '9fcde07509c7e87ec61c640c1b2753d9041758e4')

data_dir = d2l.download_extract('SNLI')
```

### Reading the Dataset

The original SNLI dataset contains much richer information than what we really need in our experiments. Thus, we define a function read_snli to only extract part of the dataset, then return lists of premises, hypotheses, and their labels.
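
To illustrate the kind of cleanup involved, here is a small standalone sketch. It assumes that the sentence fields are stored as parenthesized parse strings; the helper name `strip_parse_brackets` and the sample string are made up for this illustration and are not read from the dataset files.

```python
import re

def strip_parse_brackets(s):
    """Drop parentheses and collapse the whitespace they leave behind."""
    s = re.sub(r'\(', '', s)
    s = re.sub(r'\)', '', s)
    # Two or more consecutive whitespace characters become one space
    s = re.sub(r'\s{2,}', ' ', s)
    return s.strip()

# Hypothetical field in a parenthesized parse format
sample = '( ( Two women ) ( ( are ( hugging ( each other ) ) ) . ) )'
cleaned = strip_parse_brackets(sample)
print(cleaned)  # Two women are hugging each other .
```

Note that punctuation stays separated from the preceding word by a space, which is convenient for whitespace-based tokenization.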

```{.python .input}
#@tab all
#@save
def read_snli(data_dir, is_train):
    """Read the SNLI dataset into premises, hypotheses, and labels."""
    def extract_text(s):
        # Remove information that will not be used by us
        s = re.sub('\\(', '', s)
        s = re.sub('\\)', '', s)
        # Substitute two or more consecutive whitespace with space
        s = re.sub('\\s{2,}', ' ', s)
        return s.strip()
    label_set = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
    file_name = os.path.join(data_dir, 'snli_1.0_train.txt'
                             if is_train else 'snli_1.0_test.txt')
    with open(file_name, 'r') as f:
        rows = [row.split('\t') for row in f.readlines()[1:]]
    premises = [extract_text(row[1]) for row in rows if row[0] in label_set]
    hypotheses = [extract_text(row[2]) for row in rows if row[0] in label_set]
    labels = [label_set[row[0]] for row in rows if row[0] in label_set]
    return premises, hypotheses, labels
```

Now let's print the first 3 pairs of premises and hypotheses, as well as their labels ("0", "1", and "2" correspond to "entailment", "contradiction", and "neutral", respectively).

```{.python .input}
#@tab all
train_data = read_snli(data_dir, is_train=True)
for x0, x1, y in zip(train_data[0][:3], train_data[1][:3], train_data[2][:3]):
    print('premise:', x0)
    print('hypothesis:', x1)
    print('label:', y)
```

The training set has about 550000 pairs, and the testing set has about 10000 pairs. The following shows that the three labels “entailment”, “contradiction”, and “neutral” are balanced in both the training set and the testing set.

```{.python .input}
#@tab all
test_data = read_snli(data_dir, is_train=False)
for data in [train_data, test_data]:
    print([[row for row in data[2]].count(i) for i in range(3)])
```

### Defining a Class for Loading the Dataset

Below we define a class for loading the SNLI dataset by inheriting from the `Dataset` class in Gluon (or from `torch.utils.data.Dataset` in PyTorch). The argument `num_steps` in the class constructor specifies the length of a text sequence so that each minibatch of sequences will have the same shape. In other words, tokens after the first `num_steps` ones in longer sequences are trimmed, while the special token “<pad>” is appended to shorter sequences until their length becomes `num_steps`. By implementing the `__getitem__` function, we can arbitrarily access the premise, hypothesis, and label with the index `idx`.

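
The trimming and padding described above can be sketched in plain Python. The helper `truncate_or_pad` is a hypothetical stand-in written for this illustration; it operates on token strings for readability, whereas the class in this section applies the same idea to token indices.

```python
def truncate_or_pad(tokens, num_steps, pad_token='<pad>'):
    """Return a list of exactly num_steps tokens (hypothetical helper)."""
    if len(tokens) > num_steps:
        return tokens[:num_steps]  # Trim tokens beyond num_steps
    # Append padding until the length reaches num_steps
    return tokens + [pad_token] * (num_steps - len(tokens))

long_seq = ['two', 'women', 'are', 'hugging', 'each', 'other']
short_seq = ['a', 'man', 'sleeps']

print(truncate_or_pad(long_seq, 4))   # ['two', 'women', 'are', 'hugging']
print(truncate_or_pad(short_seq, 4))  # ['a', 'man', 'sleeps', '<pad>']
```

Because every sequence ends up with the same length, a list of such sequences can be stacked into a single rectangular tensor for minibatch processing.
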
```{.python .input}
#@save
class SNLIDataset(gluon.data.Dataset):
    """A customized dataset to load the SNLI dataset."""
    def __init__(self, dataset, num_steps, vocab=None):
        self.num_steps = num_steps
        all_premise_tokens = d2l.tokenize(dataset[0])
        all_hypothesis_tokens = d2l.tokenize(dataset[1])
        if vocab is None:
            self.vocab = d2l.Vocab(all_premise_tokens + all_hypothesis_tokens,
                                   min_freq=5, reserved_tokens=['<pad>'])
        else:
            self.vocab = vocab
        self.premises = self._pad(all_premise_tokens)
        self.hypotheses = self._pad(all_hypothesis_tokens)
        self.labels = np.array(dataset[2])
        print('read ' + str(len(self.premises)) + ' examples')

    def _pad(self, lines):
        return np.array([d2l.truncate_pad(
            self.vocab[line], self.num_steps, self.vocab['<pad>'])
                         for line in lines])

    def __getitem__(self, idx):
        return (self.premises[idx], self.hypotheses[idx]), self.labels[idx]

    def __len__(self):
        return len(self.premises)
```

```{.python .input}
#@tab pytorch
#@save
class SNLIDataset(torch.utils.data.Dataset):
    """A customized dataset to load the SNLI dataset."""
    def __init__(self, dataset, num_steps, vocab=None):
        self.num_steps = num_steps
        all_premise_tokens = d2l.tokenize(dataset[0])
        all_hypothesis_tokens = d2l.tokenize(dataset[1])
        if vocab is None:
            self.vocab = d2l.Vocab(all_premise_tokens + all_hypothesis_tokens,
                                   min_freq=5, reserved_tokens=['<pad>'])
        else:
            self.vocab = vocab
        self.premises = self._pad(all_premise_tokens)
        self.hypotheses = self._pad(all_hypothesis_tokens)
        self.labels = torch.tensor(dataset[2])
        print('read ' + str(len(self.premises)) + ' examples')

    def _pad(self, lines):
        return torch.tensor([d2l.truncate_pad(
            self.vocab[line], self.num_steps, self.vocab['<pad>'])
                         for line in lines])

    def __getitem__(self, idx):
        return (self.premises[idx], self.hypotheses[idx]), self.labels[idx]

    def __len__(self):
        return len(self.premises)
```

### Putting All Things Together

Now we can invoke the `read_snli` function and the `SNLIDataset` class to download the SNLI dataset and return `DataLoader` instances for both training and testing sets, together with the vocabulary of the training set. It is noteworthy that we must use the vocabulary constructed from the training set as that of the testing set. As a result, any new token from the testing set will be unknown to the model trained on the training set.

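
The effect of sharing the training vocabulary can be sketched with a toy example. The helpers `build_vocab` and `encode` below are hypothetical simplifications written for this illustration, not the `d2l.Vocab` API; the sketch assumes that unseen tokens are mapped to a reserved “<unk>” entry at index 0.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1):
    """Map tokens to indices; index 0 is reserved for unknown tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(t for t, c in counts.items() if c >= min_freq)
    idx_to_token = ['<unk>', '<pad>'] + kept
    return {tok: i for i, tok in enumerate(idx_to_token)}

def encode(tokens, vocab):
    """Look up each token, falling back to the '<unk>' index."""
    return [vocab.get(tok, vocab['<unk>']) for tok in tokens]

# Vocabulary is built from (toy) training data only
train_tokens = [['two', 'women', 'are', 'hugging'],
                ['the', 'man', 'is', 'sleeping']]
vocab = build_vocab(train_tokens)

# 'musicians' and 'performing' never appeared in training
test_tokens = ['two', 'musicians', 'are', 'performing']
ids = encode(test_tokens, vocab)
print(ids)  # [8, 0, 2, 0]
```

Tokens seen only at test time collapse to the same unknown index, which is why the test set must be encoded with the training vocabulary and not with one of its own.
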
```{.python .input}
#@save
def load_data_snli(batch_size, num_steps=50):
    """Download the SNLI dataset and return data iterators and vocabulary."""
    num_workers = d2l.get_dataloader_workers()
    data_dir = d2l.download_extract('SNLI')
    train_data = read_snli(data_dir, True)
    test_data = read_snli(data_dir, False)
    train_set = SNLIDataset(train_data, num_steps)
    test_set = SNLIDataset(test_data, num_steps, train_set.vocab)
    train_iter = gluon.data.DataLoader(train_set, batch_size, shuffle=True,
                                       num_workers=num_workers)
    test_iter = gluon.data.DataLoader(test_set, batch_size, shuffle=False,
                                      num_workers=num_workers)
    return train_iter, test_iter, train_set.vocab
```

```{.python .input}
#@tab pytorch
#@save
def load_data_snli(batch_size, num_steps=50):
    """Download the SNLI dataset and return data iterators and vocabulary."""
    num_workers = d2l.get_dataloader_workers()
    data_dir = d2l.download_extract('SNLI')
    train_data = read_snli(data_dir, True)
    test_data = read_snli(data_dir, False)
    train_set = SNLIDataset(train_data, num_steps)
    test_set = SNLIDataset(test_data, num_steps, train_set.vocab)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             shuffle=True,
                                             num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(test_set, batch_size,
                                            shuffle=False,
                                            num_workers=num_workers)
    return train_iter, test_iter, train_set.vocab
```

Here we set the batch size to 128 and sequence length to 50, and invoke the `load_data_snli` function to get the data iterators and vocabulary. Then we print the vocabulary size.

```{.python .input}
#@tab all
train_iter, test_iter, vocab = load_data_snli(128, 50)
len(vocab)
```

Now we print the shape of the first minibatch. In contrast to sentiment analysis, we have two inputs, X[0] and X[1], representing pairs of premises and hypotheses.

```{.python .input}
#@tab all
for X, Y in train_iter:
    print(X[0].shape)
    print(X[1].shape)
    print(Y.shape)
    break
```

## Summary

  • Natural language inference studies whether a hypothesis can be inferred from a premise, where both are text sequences.
  • In natural language inference, relationships between premises and hypotheses include entailment, contradiction, and neutral.
  • The Stanford Natural Language Inference (SNLI) Corpus is a popular benchmark dataset for natural language inference.

## Exercises

  1. Machine translation has long been evaluated based on superficial $n$-gram matching between an output translation and a ground-truth translation. Can you design a measure for evaluating machine translation results by using natural language inference?
  2. How can we change hyperparameters to reduce the vocabulary size?
