
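  • Fetch the data and store it in the appropriate location
  • Build a dataset with torchtext
  • Use torchtext to map words to indices, indices to words, and words to word vectors
  • Build the corresponding iterators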


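  1. Define the Fields: declare how the data should be processed
  2. Define the Dataset: build the dataset; at this point each example is a word list that has been preprocessed as declared by its Field
  3. Build the vocab: build the vocabulary and the word vectors (word embeddings)
  4. Construct the iterators: build iterators that feed the model in batches during training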

    1. Download the data:

    Kaggle: Movie Review Sentiment Analysis (Kernels Only)
    train.tsv contains the phrases and their associated sentiment labels. We have additionally provided a SentenceId so that you can track which phrases belong to a single sentence.
    test.tsv contains just phrases. You must assign a sentiment label to each phrase.
    The sentiment labels are:
    0 - negative
    1 - somewhat negative
    2 - neutral
    3 - somewhat positive
    4 - positive

    1.1 Read and inspect the files

        import pandas as pd

        data = pd.read_csv('train.tsv', sep='\t')
        test = pd.read_csv('test.tsv', sep='\t')

    1.2 train.tsv

        data[:5]

    [Figure 1: output of data[:5]]

    1.3 test.tsv

        test[:5]

    [Figure 2: output of test[:5]]

    1.4 Split off a validation set

        from sklearn.model_selection import train_test_split

        # create the train and validation sets (80% / 20%)
        train, val = train_test_split(data, test_size=0.2)
        train.to_csv("train.csv", index=False)
        val.to_csv("val.csv", index=False)
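
    To make the split step concrete, here is a minimal standard-library sketch of what a shuffled 80/20 split does. The helper name split_rows and its seed parameter are hypothetical and only for illustration; it is not how scikit-learn implements train_test_split. With scikit-learn itself, passing random_state to train_test_split should give a similarly reproducible split.

```python
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into (train, validation) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * test_size))
    return shuffled[n_val:], shuffled[:n_val]


# Toy (phrase, sentiment) rows standing in for train.tsv.
rows = [("phrase %d" % i, i % 5) for i in range(10)]
train_rows, val_rows = split_rows(rows, test_size=0.2, seed=0)
print(len(train_rows), len(val_rows))
```

    Every row ends up in exactly one of the two lists, which is the property the validation set relies on.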

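    3. Define the Fields

        import spacy
        import torch
        # On torchtext 0.9 to 0.11 these classes live in torchtext.legacy instead.
        # Note that this import rebinds the name `data` used for the DataFrame above.
        from torchtext import data, datasets
        from torchtext.vocab import Vectors
        from torch.nn import init

        DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    Field accepts the following parameters: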
  • sequential: Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
  • use_vocab: Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • init_token: A token that will be prepended to every example using this field, or None for no initial token. Default: None.
  • eos_token: A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
  • fix_length: A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
  • dtype: The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
  • preprocessing: The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
  • postprocessing: A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
  • lower: Whether to lowercase the text in this field. Default: False.
  • tokenize: The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
  • tokenizer_language: The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
  • include_lengths: Whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch. Default: False.
  • batch_first: Whether to produce tensors with the batch dimension first. Default: False.
  • pad_token: The string token used as padding. Default: "<pad>".
  • unk_token: The string token used to represent OOV words. Default: "<unk>".
  • pad_first: Do the padding of the sequence at the beginning. Default: False.
  • truncate_first: Do the truncating of the sequence at the beginning. Default: False
  • stop_words: Tokens to discard during the preprocessing step. Default: None
  • is_target: Whether this field is a target variable. Affects iteration over batches. Default: False
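
To see how several of these parameters interact, the following is a simplified pure-Python sketch. The function preprocess is hypothetical and is not torchtext's implementation: torchtext tokenizes per example and pads per batch, while this sketch does both in one call. It assumes, as I understand the Field behavior, that fix_length counts the init and eos tokens.

```python
def preprocess(text, tokenize=str.split, lower=False, init_token=None,
               eos_token=None, fix_length=None, pad_token="<pad>",
               pad_first=False, truncate_first=False):
    """Tokenize one string and pad or truncate it, Field-style (simplified)."""
    tokens = tokenize(text)
    if lower:
        tokens = [tok.lower() for tok in tokens]

    if fix_length is not None:
        # Room left for real tokens once the optional special tokens are counted.
        room = max(fix_length - (init_token is not None) - (eos_token is not None), 0)
        if truncate_first:
            tokens = tokens[max(len(tokens) - room, 0):]  # drop from the front
        else:
            tokens = tokens[:room]                        # drop from the back

    seq = list(tokens)
    if init_token is not None:
        seq.insert(0, init_token)
    if eos_token is not None:
        seq.append(eos_token)

    if fix_length is not None:
        padding = [pad_token] * (fix_length - len(seq))
        seq = padding + seq if pad_first else seq + padding
    return seq


print(preprocess("A series of escapades", lower=True,
                 init_token="<sos>", eos_token="<eos>", fix_length=8))
```

Changing pad_first or truncate_first only changes which end of the sequence is padded or cut; the resulting length stays at fix_length.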


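    # Newer spaCy releases no longer accept the 'en' shortcut used originally, so the
    # model is loaded by its full name (python -m spacy download en_core_web_sm).
    spacy_en = spacy.load('en_core_web_sm')

    def tokenizer(text):  # create a tokenizer function
        """Define the tokenization step."""
        return [tok.text for tok in spacy_en.tokenizer(text)]

    # By default a Field expects its input to be a sequence of words and maps each
    # word to an integer; this mapping is called the vocab. If a field is already
    # numerical and is not a sequence, set use_vocab=False and sequential=False.
    LABEL = data.Field(sequential=False, use_vocab=False)
    TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True)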
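
The tokenize argument only needs a callable that takes a string and returns a list of tokens. As an illustration, here is a regex-based stand-in that needs no spaCy model. The name simple_tokenizer is hypothetical, and its splitting rules are much cruder than spaCy's (contractions, for instance, are split at the apostrophe).

```python
import re

# One or more word characters, or a single non-space punctuation character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenizer(text):
    """Split text into word tokens and individual punctuation marks."""
    return _TOKEN_RE.findall(text)


print(simple_tokenizer("Good for the goose."))
# ['Good', 'for', 'the', 'goose', '.']
```

Under the setup above it could be passed as tokenize=simple_tokenizer in place of the spaCy-based tokenizer.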

4. Define the Dataset

The Fields know what to do when given raw data. Now we need to tell the Fields which data to process. This is what Datasets are for.
The official documentation describes TabularDataset as follows: Defines a Dataset of columns stored in CSV, TSV, or JSON format.

    # We do not need the 'PhraseId' and 'SentenceId' columns, so we pass None as their field.
    # If the file has a header row (here 'Phrase', 'Sentiment', ...), set
    # skip_header=True; otherwise the header is treated as a data row.
    train, val = data.TabularDataset.splits(
        path='.', train='train.csv', validation='val.csv', format='csv', skip_header=True,
        fields=[('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT), ('Sentiment', LABEL)])
    test = data.TabularDataset('test.tsv', format='tsv', skip_header=True,
        fields=[('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT)])

Note: the (name, field) pairs must be passed in the same order as the columns in the file.
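
The reason the order matters is that the pairs are matched to columns by position, not by header name. The sketch below imitates that with the standard csv module. The function read_columns is hypothetical, and the strings "TEXT" and "LABEL" merely stand in for Field objects.

```python
import csv
import io


def read_columns(lines, fields, skip_header=True):
    """Pair each (name, field) with the column at the same position."""
    reader = csv.reader(lines)
    if skip_header:
        next(reader)  # drop the header row instead of treating it as data
    examples = []
    for row in reader:
        # Columns whose field is None are skipped entirely.
        examples.append({name: value
                         for (name, field), value in zip(fields, row)
                         if field is not None})
    return examples


raw = ("PhraseId,SentenceId,Phrase,Sentiment\n"
       "1,1,A series of escapades,1\n"
       "2,1,A series,2\n")

fields = [("PhraseId", None), ("SentenceId", None),
          ("Phrase", "TEXT"), ("Sentiment", "LABEL")]
examples = read_columns(io.StringIO(raw), fields)
print(examples[0])

# Listing the pairs in the wrong order silently reads the wrong columns.
wrong = [("Phrase", "TEXT"), ("Sentiment", "LABEL"),
         ("PhraseId", None), ("SentenceId", None)]
swapped = read_columns(io.StringIO(raw), wrong)
print(swapped[1])
```

In the second call the Phrase attribute receives the PhraseId column, and no error is raised, which is why the mistake is easy to miss.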

    print(train[5])
    print(train[5].__dict__.keys())
    print(train[5].Phrase, train[5].Sentiment)

[Figure 3: output of the print statements above]

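5. Build the vocab


    # unk_init controls how tokens that occur in the corpus but not in the pretrained
    # vectors are initialized. It is passed to build_vocab here because
    # TEXT.vocab.vectors is a plain tensor, so assigning unk_init to it afterwards
    # has no effect.
    TEXT.build_vocab(train, vectors='glove.6B.100d', unk_init=torch.Tensor.normal_)  # , max_size=30000)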

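This line makes torchtext iterate over the training-set data bound to the TEXT field, register every word in the vocabulary, and build the embedding matrix automatically.
'glove.6B.100d' is the name of a set of pretrained word vectors supported by torchtext; on first use it is downloaded automatically and cached in the .vector_cache directory under the current directory. The supported names are: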

  • charngram.100d
  • fasttext.en.300d
  • fasttext.simple.300d
  • glove.42B.300d
  • glove.840B.300d
  • glove.twitter.27B.25d
  • glove.twitter.27B.50d
  • glove.twitter.27B.100d
  • glove.twitter.27B.200d
  • glove.6B.50d
  • glove.6B.100d
  • glove.6B.200d
  • glove.6B.300d
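
Conceptually, building the embedding matrix means looking up each vocabulary word in the pretrained table and falling back to an initializer for words that are missing. The sketch below shows that idea with plain lists. The function build_embedding_matrix and the toy two-dimensional vectors are hypothetical; the zero fallback mirrors what I understand torchtext's default to be.

```python
import random


def build_embedding_matrix(itos, pretrained, dim, unk_init=None):
    """Return one vector per vocabulary word, in index order."""
    if unk_init is None:
        unk_init = lambda d: [0.0] * d  # assumed default: all zeros
    matrix = []
    for word in itos:
        if word in pretrained:
            matrix.append(list(pretrained[word]))  # copy the pretrained vector
        else:
            matrix.append(unk_init(dim))           # out-of-vocabulary fallback
    return matrix


# Toy 2-dimensional "pretrained" vectors; real GloVe vectors have 50 to 300 dims.
pretrained = {"a": [0.1, 0.2], "series": [0.3, 0.4]}
itos = ["<unk>", "<pad>", "a", "series", "bore"]

zeros = build_embedding_matrix(itos, pretrained, dim=2)
print(zeros[4])  # "bore" is missing from the toy table, so it gets the fallback

rng = random.Random(0)
randomized = build_embedding_matrix(
    itos, pretrained, dim=2,
    unk_init=lambda d: [rng.gauss(0.0, 1.0) for _ in range(d)])
print(randomized[4])
```

Row i of the matrix belongs to the word with index i, which is what later allows the matrix to be copied straight into an embedding layer.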


    # Any name from the list above can be passed in the same way, for example:
    TEXT.build_vocab(train, vectors='fasttext.en.300d')


    print(TEXT.vocab.itos[1510])    # index -> word
    print(TEXT.vocab.stoi['bore'])  # word -> index
    # embedding matrix: TEXT.vocab.vectors
    print(TEXT.vocab.vectors.shape)
    word_vec = TEXT.vocab.vectors[TEXT.vocab.stoi['bore']]
    print(word_vec.shape)
    print(word_vec)

[Figure 4: output of the print statements above]
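
The stoi and itos mappings used above can be pictured with a small pure-Python sketch. The functions build_vocab and numericalize here are hypothetical illustrations, not torchtext's code. They assume the special tokens come first and the remaining words are ordered by descending frequency with alphabetical tie-breaking, which is my understanding of the default behavior.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>"), max_size=None, min_freq=1):
    """Build word->index (stoi) and index->word (itos) tables."""
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort alphabetically, then stably by descending frequency.
    words = sorted(counter.items(), key=lambda kv: kv[0])
    words.sort(key=lambda kv: kv[1], reverse=True)

    itos = list(specials)
    for word, freq in words:
        too_rare = freq < min_freq
        full = max_size is not None and len(itos) >= max_size + len(specials)
        if too_rare or full:
            break
        itos.append(word)
    stoi = {word: index for index, word in enumerate(itos)}
    return stoi, itos


def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices, sending unknown words to the <unk> index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]


corpus = [["a", "series", "of"], ["a", "series"], ["a"]]
stoi, itos = build_vocab(corpus)
print(itos)
print(numericalize(["a", "bore"], stoi))
```

A word that never appeared in the corpus, such as "bore" in this toy example, maps to the index of the unknown token.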

6. Construct the iterators

The official documentation describes BucketIterator as follows:

  • Defines an iterator that batches examples of similar lengths together.
  • Minimizes amount of padding needed while producing freshly shuffled batches for each new epoch.

        train_iter = data.BucketIterator(train, batch_size=128, sort_key=lambda x: len(x.Phrase),
                                         shuffle=True, device=DEVICE)
        val_iter = data.BucketIterator(val, batch_size=128, sort_key=lambda x: len(x.Phrase),
                                       shuffle=True, device=DEVICE)
        # For test_iter, sort must be False; otherwise torchtext reorders the examples
        # and the predictions no longer line up with the rows of test.tsv.
        test_iter = data.Iterator(dataset=test, batch_size=128, train=False,
                                  sort=False, device=DEVICE)
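
    The effect of bucketing can be shown without torchtext. The sketch below groups examples of similar length and counts how many padding slots each batching strategy needs. The names bucket_batches and padding_cost and the bucket_factor parameter are hypothetical; this approximates the idea rather than reproducing BucketIterator's actual algorithm.

```python
import random


def bucket_batches(examples, batch_size, sort_key=len, bucket_factor=100, seed=0):
    """Shuffle, sort within large chunks, cut into batches, shuffle the batches."""
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    batches = []
    chunk = batch_size * bucket_factor
    for start in range(0, len(examples), chunk):
        # Sorting inside a chunk puts examples of similar length next to each other.
        bucket = sorted(examples[start:start + chunk], key=sort_key)
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    rng.shuffle(batches)  # batch order still varies between epochs
    return batches


def padding_cost(batches):
    """Number of pad slots needed to make every batch rectangular."""
    return sum(max(len(ex) for ex in batch) * len(batch) - sum(len(ex) for ex in batch)
               for batch in batches)


lengths = [1, 12, 2, 11, 3, 10, 4, 9, 5, 8, 6, 7]
examples = [[0] * n for n in lengths]  # dummy token ids

naive = [examples[i:i + 4] for i in range(0, len(examples), 4)]
bucketed = bucket_batches(examples, batch_size=4)
print(padding_cost(bucketed), padding_cost(naive))
```

    On this toy data the bucketed batches need far fewer padding slots than batches taken in file order, which is the saving the iterator is after.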

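    6.1 Usage, method 1

        batch = next(iter(train_iter))
        text = batch.Phrase  # named `text` so the torchtext `data` module is not shadowed
        label = batch.Sentiment
        print(batch.Phrase.shape)
        print(batch.Phrase)

    [Figure 5: output of the print statements above]
    As the output shows, the batch holds word indices. The trailing 128 in the shape is the batch size, because the tensor is laid out as (sequence length, batch size) when batch_first is False.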
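
    The layout can be reproduced with nested lists. In this sketch the helpers pad_batch and shape are hypothetical, and pad index 1 is an assumption based on the pad token being the second special token in the vocabulary.

```python
def pad_batch(sequences, pad_index=1, batch_first=False):
    """Pad index sequences to equal length; return time-major unless batch_first."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]
    if batch_first:
        return padded                            # (batch, seq_len)
    return [list(step) for step in zip(*padded)]  # (seq_len, batch)


def shape(matrix):
    return (len(matrix), len(matrix[0]))


batch = [[5, 8, 2, 9], [7, 3]]  # two phrases, lengths 4 and 2
time_major = pad_batch(batch)
print(shape(time_major), time_major)
print(shape(pad_batch(batch, batch_first=True)))
```

    This is also why a model built with batch_first=True needs the batch transposed before it is fed in.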

    6.2 Usage, method 2

        for batch in train_iter:
            text = batch.Phrase
            label = batch.Sentiment

    7. Complete code

        import spacy
        import torch
        from torchtext import data, datasets
        from torchtext.vocab import Vectors
        from torch.nn import init
        import torch.nn as nn
        import torch.nn.functional as F
        import torch.optim as optim
        import numpy as np
        from sklearn.model_selection import train_test_split
        import pandas as pd

        DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Named train_df / test_df so the torchtext `data` module imported above
        # is not overwritten.
        train_df = pd.read_csv('train.tsv', sep='\t')
        test_df = pd.read_csv('test.tsv', sep='\t')

        # create the train and validation sets
        train, val = train_test_split(train_df, test_size=0.2)
        train.to_csv("train.csv", index=False)
        val.to_csv("val.csv", index=False)

        # Newer spaCy releases no longer accept the 'en' shortcut.
        spacy_en = spacy.load('en_core_web_sm')

        def tokenizer(text):  # create a tokenizer function
            return [tok.text for tok in spacy_en.tokenizer(text)]

        # Field
        TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True)
        LABEL = data.Field(sequential=False, use_vocab=False)

        # Dataset
        train, val = data.TabularDataset.splits(
            path='.', train='train.csv', validation='val.csv', format='csv', skip_header=True,
            fields=[('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT), ('Sentiment', LABEL)])
        test = data.TabularDataset('test.tsv', format='tsv', skip_header=True,
            fields=[('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT)])

        # build vocab; unk_init sets how words missing from GloVe are initialized
        TEXT.build_vocab(train, vectors='glove.6B.100d', unk_init=torch.Tensor.normal_)  # , max_size=30000)

        # Iterator
        train_iter = data.BucketIterator(train, batch_size=128, sort_key=lambda x: len(x.Phrase),
                                         shuffle=True, device=DEVICE)
        val_iter = data.BucketIterator(val, batch_size=128, sort_key=lambda x: len(x.Phrase),
                                       shuffle=True, device=DEVICE)
        # For test_iter, sort must be False; otherwise torchtext reorders the examples.
        test_iter = data.Iterator(dataset=test, batch_size=128, train=False,
                                  sort=False, device=DEVICE)

        # The goal is to learn how to use torchtext, so only a simple model is defined.
        len_vocab = len(TEXT.vocab)

        class Enet(nn.Module):
            def __init__(self):
                super(Enet, self).__init__()
                self.embedding = nn.Embedding(len_vocab, 100)
                self.lstm = nn.LSTM(100, 128, 3, batch_first=True)  # , bidirectional=True)
                # The LSTM is unidirectional, so it outputs 128 features per step
                # (256 would be right only with bidirectional=True).
                self.linear = nn.Linear(128, 5)

            def forward(self, x):
                batch_size, seq_num = x.shape
                vec = self.embedding(x)
                out, (hn, cn) = self.lstm(vec)
                out = self.linear(out[:, -1, :])
                out = F.softmax(out, -1)
                return out

        model = Enet()
        # Copy the embedding matrix built above into the model's embedding layer,
        # so that input word indices are converted to word vectors automatically.
        model.embedding.weight.data.copy_(TEXT.vocab.vectors)
        model.to(DEVICE)

        # training
        optimizer = optim.Adam(model.parameters())  # , lr=0.000001)
        n_epoch = 20
        best_val_acc = 0
        eps = 1e-7  # keeps torch.log away from log(0)

        for epoch in range(n_epoch):
            model.train()
            for batch_idx, batch in enumerate(train_iter):
                text = batch.Phrase.permute(1, 0)  # (seq_len, batch) -> (batch, seq_len)
                target = F.one_hot(batch.Sentiment, num_classes=5).float()  # one-hot labels

                optimizer.zero_grad()
                out = model(text).clamp(eps, 1 - eps)
                # element-wise binary cross-entropy against the one-hot target
                loss = -target * torch.log(out) - (1 - target) * torch.log(1 - out)
                loss = loss.sum(-1).mean()
                loss.backward()
                optimizer.step()

                if (batch_idx + 1) % 200 == 0:
                    _, y_pre = torch.max(out, -1)
                    acc = (y_pre == batch.Sentiment).float().mean()
                    print('epoch: %d \t batch_idx : %d \t loss: %.4f \t train acc: %.4f'
                          % (epoch, batch_idx, loss.item(), acc.item()))

            # validation after every epoch, without tracking gradients
            model.eval()
            val_accs = []
            with torch.no_grad():
                for batch_idx, batch in enumerate(val_iter):
                    text = batch.Phrase.permute(1, 0)
                    out = model(text)
                    _, y_pre = torch.max(out, -1)
                    acc = (y_pre == batch.Sentiment).float().mean()
                    val_accs.append(acc.item())  # plain float, safe for NumPy even on GPU

            acc = np.array(val_accs).mean()
            if acc > best_val_acc:
                print('val acc : %.4f > %.4f saving model' % (acc, best_val_acc))
                torch.save(model.state_dict(), 'params.pkl')
                best_val_acc = acc
            print('val acc: %.4f' % (acc))
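
    The loss in the training loop is an element-wise binary cross-entropy between the softmax output and a one-hot target. To make the arithmetic visible, here is a pure-Python sketch for a single example. All function names are hypothetical, and the eps clamp is an added safeguard against log(0). The ordinary cross-entropy is shown next to it only for comparison.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes=5):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def elementwise_bce(probs, target, eps=1e-7):
    """Sum over classes of -t*log(p) - (1-t)*log(1-p), as in the loop above."""
    loss = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss += -t * math.log(p) - (1.0 - t) * math.log(1.0 - p)
    return loss


def cross_entropy(probs, label):
    """Standard multi-class cross-entropy: only the true class contributes."""
    return -math.log(probs[label])


probs = softmax([0.0, 0.0, 0.0, 0.0, 0.0])  # uniform: 0.2 for each of the 5 classes
print(elementwise_bce(probs, one_hot(3)))
print(cross_entropy(probs, 3))
```

    The element-wise form also penalizes probability mass placed on the wrong classes, so its value is larger than the plain cross-entropy for the same prediction.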