Main contents:

  • Fetch the data and put it in the right place
  • Build datasets with torchtext
  • Use torchtext to convert words to indices, indices to words, and words to word vectors
  • How to construct the corresponding iterators

    The torchtext preprocessing pipeline:

  1. Define the Field: declare how the data should be processed
  2. Define the Dataset: build the dataset; at this point each sample in it is a word list preprocessed as declared by the Field
  3. Build the vocab: in this step we build the vocabulary and the word embeddings
  4. Construct iterators: build iterators used to feed the model batch by batch during training
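    As a quick preview, the four steps map onto four torchtext calls. A minimal sketch, using the legacy torchtext API (torchtext ≤ 0.8) and the field and file names introduced below:

    from torchtext import data

    TEXT = data.Field(sequential=True, lower=True)             # 1. Field
    LABEL = data.Field(sequential=False, use_vocab=False)
    train = data.TabularDataset(                               # 2. Dataset
        'train.csv', format='csv', skip_header=True,
        fields=[('PhraseId', None), ('SentenceId', None),
                ('Phrase', TEXT), ('Sentiment', LABEL)])
    TEXT.build_vocab(train)                                    # 3. vocab
    train_iter = data.BucketIterator(                          # 4. iterator
        train, batch_size=128, sort_key=lambda x: len(x.Phrase))

    The rest of the tutorial walks through each step in detail.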

    1. Download the data

    Kaggle: Movie Review Sentiment Analysis (Kernels Only)
    train.tsv contains the phrases and their associated sentiment labels. We have additionally provided a SentenceId so that you can track which phrases belong to a single sentence.
    test.tsv contains just phrases. You must assign a sentiment label to each phrase.
    The sentiment labels are:
    0 - negative
    1 - somewhat negative
    2 - neutral
    3 - somewhat positive
    4 - positive
    After downloading you have train.tsv and test.tsv.

    1.1 Read and inspect the files

    import pandas as pd
    data = pd.read_csv('train.tsv', sep='\t')
    test = pd.read_csv('test.tsv', sep='\t')

    1.2 train.tsv

    data[:5]
    (Figure 1: the first five rows of train.tsv)

    1.3 test.tsv

    test[:5]
    (Figure 2: the first five rows of test.tsv)

    1.4 Split off a validation set

    from sklearn.model_selection import train_test_split
    # create train and validation set
    train, val = train_test_split(data, test_size=0.2)
    train.to_csv("train.csv", index=False)
    val.to_csv("val.csv", index=False)
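    If the five sentiment classes are imbalanced (in this dataset, neutral phrases dominate), it can help to keep the label distribution identical in both splits. A small variant using train_test_split's stratify argument; random_state=42 is just an arbitrary choice for reproducibility:

    # Preserve the class proportions of the 'Sentiment' column in both splits.
    train, val = train_test_split(data, test_size=0.2,
                                  stratify=data['Sentiment'], random_state=42)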

    2. Define the Field

    First, import the required packages and define the DEVICE on which PyTorch tensors will be placed:

    import spacy
    import torch
    from torchtext import data, datasets
    from torchtext.vocab import Vectors
    from torch.nn import init

    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    Torchtext takes a declarative approach to loading data: you tell torchtext what you want the data to look like, and torchtext handles the rest.
    The Field is what implements this declaration: a Field specifies how you want the data to be processed.
    data.Field(…)
    The Field parameters are as follows:
  • sequential: Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
  • use_vocab: Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • init_token: A token that will be prepended to every example using this field, or None for no initial token. Default: None.
  • eos_token: A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
  • fix_length: A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
  • dtype: The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
  • preprocessing: The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
  • postprocessing: A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
  • lower: Whether to lowercase the text in this field. Default: False.
  • tokenize: The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
  • tokenizer_language: The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
  • include_lengths: Whether to return a tuple of a padded minibatch and a list containing the lengths of each example, or just a padded minibatch. Default: False.
  • batch_first: Whether to produce tensors with the batch dimension first. Default: False.
  • pad_token: The string token used as padding. Default: "<pad>".
  • unk_token: The string token used to represent OOV words. Default: "<unk>".
  • pad_first: Do the padding of the sequence at the beginning. Default: False.
  • truncate_first: Do the truncating of the sequence at the beginning. Default: False.
  • stop_words: Tokens to discard during the preprocessing step. Default: None.
  • is_target: Whether this field is a target variable. Affects iteration over batches. Default: False.

Example:

  spacy_en = spacy.load('en')  # with newer spaCy versions: spacy.load('en_core_web_sm')

  def tokenizer(text):  # create a tokenizer function
      """
      Define the tokenization step.
      """
      return [tok.text for tok in spacy_en.tokenizer(text)]

  """
  By default a field expects its input to be a sequence of words, and it maps
  each word to an integer. This mapping is called the vocab. If a field is
  already numericalized and is not sequential, pass use_vocab=False and
  sequential=False.
  """
  LABEL = data.Field(sequential=False, use_vocab=False)
  TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True)
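Before wiring the fields into a dataset, you can sanity-check them: Field.preprocess runs tokenization and lowercasing on a raw string without numericalizing it, which is exactly what the Dataset step below will store for each sample. For example (the sentence is the opening of the first phrase in train.tsv):

  print(TEXT.preprocess('A series of escapades demonstrating the adage'))
  # ['a', 'series', 'of', 'escapades', 'demonstrating', 'the', 'adage']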

3. Define the Dataset

The fields know what to do when given raw data. Now we need to tell the fields what data they should process. This is where Datasets come in.
Torchtext has many built-in Datasets for handling common data formats.
TabularDataset, per the official docs: Defines a Dataset of columns stored in CSV, TSV, or JSON format.
TabularDataset handles csv/tsv files with ease, so we use it to build our Dataset.

  1. """
  2. 我们不需要 'PhraseId' 'SentenceId'这两列, 所以我们给他们的field传递 None
  3. 如果你的数据有列名,如我们这里的'Phrase','Sentiment',...
  4. 设置skip_header=True,不然它会把列名也当一个数据处理
  5. """
  6. train,val = data.TabularDataset.splits(
  7. path='.', train='train.csv',validation='val.csv', format='csv',skip_header=True,
  8. fields=[('PhraseId',None),('SentenceId',None),('Phrase', TEXT), ('Sentiment', LABEL)])
  9. test = data.TabularDataset('test.tsv', format='tsv',skip_header=True,
  10. fields=[('PhraseId',None),('SentenceId',None),('Phrase', TEXT)])

Note: the (name, field) pairs must be passed in the same order as the columns in the file.
Inspect the resulting dataset:

  print(train[5])
  print(train[5].__dict__.keys())
  print(train[5].Phrase, train[0].Sentiment)

Output:
(Figure 3: the printed Example object, its dict keys, and a Phrase/Sentiment pair)
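Under the hood, TabularDataset turns each row into a data.Example bound to the fields; an Example is also what you get back when indexing the dataset, as above. If your data lives in a pandas DataFrame rather than a file, you can build the same structure by hand. A minimal sketch (the get_dataset helper is a hypothetical name, not part of torchtext):

  def get_dataset(df, text_field, label_field):
      """Build a torchtext Dataset straight from a pandas DataFrame."""
      fields = [('Phrase', text_field), ('Sentiment', label_field)]
      examples = [data.Example.fromlist([row.Phrase, row.Sentiment], fields)
                  for row in df.itertuples()]
      return data.Dataset(examples, fields)

  train_ds = get_dataset(pd.read_csv('train.csv'), TEXT, LABEL)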

4. Build the vocab

Looking at the printed output of train[5] above (the sixth sample), we see an Example object. An Example bundles together all the attributes of one row; notice that the sentence has been tokenized but not yet converted into numbers.
That is because we have not built the vocab yet; we do that in this step.
Torchtext can convert words into numbers, but it must first be told the full range of words it needs to handle. We use the following lines:

  TEXT.build_vocab(train, vectors='glove.6B.100d')  # optionally: max_size=30000
  # Initialization for corpus tokens that are missing from the pretrained
  # vectors (but see the note below: this must be set when loading the vectors).
  TEXT.vocab.vectors.unk_init = init.xavier_uniform
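One caveat about the last line: assigning unk_init after build_vocab is a common snippet, but by that point the vectors tensor has already been filled, with missing tokens zero-initialized, so the assignment has no effect. To actually randomize out-of-vocabulary rows, pass unk_init to a Vectors object when the vectors are loaded. A sketch, assuming glove.6B.100d.txt already sits in .vector_cache; note that unk_init receives a 1-D tensor of size dim, while xavier initializers require at least 2 dimensions:

  from torchtext.vocab import Vectors
  from torch.nn import init

  def unk_init(tensor):
      # Called once per token missing from the pretrained vectors.
      # Reshape to 2-D for xavier_uniform_, then flatten back.
      return init.xavier_uniform_(tensor.view(1, -1)).view(-1)

  vectors = Vectors(name='glove.6B.100d.txt', cache='.vector_cache',
                    unk_init=unk_init)
  TEXT.build_vocab(train, vectors=vectors)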

This makes torchtext traverse all of the training-set data bound to the TEXT field, register the words in the vocabulary, and automatically build the embedding matrix.
'glove.6B.100d' is the name of one of the word-vector sets supported by torchtext; on first use it is downloaded automatically and cached in the .vector_cache directory under the current path.
Word vectors supported by torchtext:

  • charngram.100d
  • fasttext.en.300d
  • fasttext.simple.300d
  • glove.42B.300d
  • glove.840B.300d
  • glove.twitter.27B.25d
  • glove.twitter.27B.50d
  • glove.twitter.27B.100d
  • glove.twitter.27B.200d
  • glove.6B.50d
  • glove.6B.100d
  • glove.6B.200d
  • glove.6B.300d

Example:
To use the fasttext.en.300d vectors instead, simply swap the name passed to vectors='…' in the code above:

  TEXT.build_vocab(train, vectors='fasttext.en.300d')

At this point we can already convert words to numbers, numbers to words, and words to word vectors:

  print(TEXT.vocab.itos[1510])
  print(TEXT.vocab.stoi['bore'])
  # the embedding matrix: TEXT.vocab.vectors
  print(TEXT.vocab.vectors.shape)
  word_vec = TEXT.vocab.vectors[TEXT.vocab.stoi['bore']]
  print(word_vec.shape)
  print(word_vec)

Output:
(Figure 4: the itos/stoi lookups and the 100-dimensional vector for 'bore')
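Two special tokens were added automatically when the vocab was built: '<unk>' for out-of-vocabulary words and '<pad>' for padding, at indices 0 and 1 by default. stoi is a defaultdict, so a word that never appeared in the training corpus maps to the '<unk>' index instead of raising a KeyError. A quick check (the unseen token is made up):

  print(TEXT.vocab.stoi['<unk>'], TEXT.vocab.stoi['<pad>'])  # 0 1
  print(TEXT.vocab.stoi['zzzz-never-seen'])                  # 0, i.e. <unk>
  print(len(TEXT.vocab))                                     # vocabulary size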

5. Construct iterators

When we train a network in PyTorch, each training step consumes one batch. So how do we turn the datasets built above into iterators we can loop over to get batches? This section shows how torchtext does it.
As with Datasets, torchtext has many built-in iterators. Here we choose BucketIterator, which the official docs describe as follows:

  • Defines an iterator that batches examples of similar lengths together.
  • Minimizes amount of padding needed while producing freshly shuffled batches for each new epoch.

    train_iter = data.BucketIterator(train, batch_size=128,
                                     sort_key=lambda x: len(x.Phrase),
                                     shuffle=True, device=DEVICE)
    val_iter = data.BucketIterator(val, batch_size=128,
                                   sort_key=lambda x: len(x.Phrase),
                                   shuffle=True, device=DEVICE)
    # For test_iter, sort must be set to False, otherwise torchtext
    # will scramble the order of the samples.
    test_iter = data.Iterator(dataset=test, batch_size=128, train=False,
                              sort=False, device=DEVICE)
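    A variant worth knowing: if you later feed batches into nn.utils.rnn.pack_padded_sequence, the model needs the true sequence lengths, and older PyTorch versions also require each batch sorted by decreasing length. Both are supported here, assuming TEXT had been declared with include_lengths=True (which changes batch.Phrase into a (padded_tensor, lengths) tuple):

    # Requires: TEXT = data.Field(..., include_lengths=True)
    train_iter = data.BucketIterator(train, batch_size=128,
                                     sort_key=lambda x: len(x.Phrase),
                                     sort_within_batch=True,  # longest first in each batch
                                     shuffle=True, device=DEVICE)
    # phrase, lengths = batch.Phrase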

    5.1 Usage: method one

    batch = next(iter(train_iter))
    data = batch.Phrase   # note: this rebinds the name `data`, shadowing torchtext's data module
    label = batch.Sentiment
    print(batch.Phrase.shape)
    print(batch.Phrase)

    Output:
    (Figure 5: the shape and contents of batch.Phrase)
    As you can see, the tensor holds word indices. The trailing 128 is the batch size: since TEXT was defined with batch_first=False (the default), the shape is [sequence length, batch size].
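    Going the other way, TEXT.vocab.itos decodes a column of the batch back into tokens, a handy check that Field, vocab, and iterator all line up (short phrases show trailing '<pad>' tokens):

    seq = batch.Phrase[:, 0]  # first sample in the batch
    tokens = [TEXT.vocab.itos[int(i)] for i in seq]
    print(' '.join(tokens))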

    5.2 Usage: method two

    for batch in train_iter:
        data = batch.Phrase
        label = batch.Sentiment

    6. Complete code

    import spacy
    import torch
    from torchtext import data, datasets
    from torchtext.vocab import Vectors
    from torch.nn import init
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    import numpy as np
    from sklearn.model_selection import train_test_split
    import pandas as pd

    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Don't name the DataFrame `data`: that would shadow torchtext's `data`
    # module imported above and break data.Field / data.TabularDataset below.
    df = pd.read_csv('train.tsv', sep='\t')

    # create train and validation set
    train, val = train_test_split(df, test_size=0.2)
    train.to_csv("train.csv", index=False)
    val.to_csv("val.csv", index=False)

    spacy_en = spacy.load('en')  # newer spaCy versions: spacy.load('en_core_web_sm')

    def tokenizer(text):  # create a tokenizer function
        return [tok.text for tok in spacy_en.tokenizer(text)]

    # Field
    TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True)
    LABEL = data.Field(sequential=False, use_vocab=False)

    # Dataset
    train, val = data.TabularDataset.splits(
        path='.', train='train.csv', validation='val.csv', format='csv',
        skip_header=True,
        fields=[('PhraseId', None), ('SentenceId', None),
                ('Phrase', TEXT), ('Sentiment', LABEL)])
    test = data.TabularDataset('test.tsv', format='tsv', skip_header=True,
        fields=[('PhraseId', None), ('SentenceId', None), ('Phrase', TEXT)])

    # build vocab
    TEXT.build_vocab(train, vectors='glove.6B.100d')  # optionally: max_size=30000
    TEXT.vocab.vectors.unk_init = init.xavier_uniform

    # Iterators
    train_iter = data.BucketIterator(train, batch_size=128,
                                     sort_key=lambda x: len(x.Phrase),
                                     shuffle=True, device=DEVICE)
    val_iter = data.BucketIterator(val, batch_size=128,
                                   sort_key=lambda x: len(x.Phrase),
                                   shuffle=True, device=DEVICE)
    # For test_iter, sort must be False, otherwise torchtext scrambles the order.
    test_iter = data.Iterator(dataset=test, batch_size=128, train=False,
                              sort=False, device=DEVICE)

    """
    Since the goal here is to learn torchtext, only a very simple model is defined.
    """
    len_vocab = len(TEXT.vocab)

    class Enet(nn.Module):
        def __init__(self):
            super(Enet, self).__init__()
            self.embedding = nn.Embedding(len_vocab, 100)
            self.lstm = nn.LSTM(100, 128, 3, batch_first=True)  # or bidirectional=True
            # 128 matches the LSTM hidden size (it would be 256 with bidirectional=True)
            self.linear = nn.Linear(128, 5)

        def forward(self, x):
            batch_size, seq_num = x.shape
            vec = self.embedding(x)
            out, (hn, cn) = self.lstm(vec)
            out = self.linear(out[:, -1, :])  # classify on the last time step
            out = F.softmax(out, -1)
            return out

    model = Enet()
    """
    Copy the pretrained embedding matrix built above into the model's embedding
    layer, so that input word indices are automatically mapped to word vectors.
    """
    model.embedding.weight.data.copy_(TEXT.vocab.vectors)
    model.to(DEVICE)

    # training
    optimizer = optim.Adam(model.parameters())  # lr=... can be passed here
    n_epoch = 20
    best_val_acc = 0

    for epoch in range(n_epoch):
        for batch_idx, batch in enumerate(train_iter):
            data = batch.Phrase
            target = batch.Sentiment
            # one-hot encode the labels
            target = torch.eye(5).index_select(dim=0, index=target.cpu())
            target = target.to(DEVICE)
            data = data.permute(1, 0)  # [seq_len, batch] -> [batch, seq_len]
            optimizer.zero_grad()
            out = model(data)
            # element-wise binary cross-entropy against the one-hot targets
            loss = -target * torch.log(out) - (1 - target) * torch.log(1 - out)
            loss = loss.sum(-1).mean()
            loss.backward()
            optimizer.step()
            if (batch_idx + 1) % 200 == 0:
                _, y_pre = torch.max(out, -1)
                acc = (y_pre == batch.Sentiment).float().mean()
                print('epoch: %d \t batch_idx: %d \t loss: %.4f \t train acc: %.4f'
                      % (epoch, batch_idx, loss, acc))

        # validation
        val_accs = []
        with torch.no_grad():
            for batch in val_iter:
                data = batch.Phrase.permute(1, 0)
                out = model(data)
                _, y_pre = torch.max(out, -1)
                acc = (y_pre == batch.Sentiment).float().mean()
                val_accs.append(acc.item())
        acc = np.array(val_accs).mean()
        if acc > best_val_acc:
            print('val acc: %.4f > %.4f saving model' % (acc, best_val_acc))
            torch.save(model.state_dict(), 'params.pkl')
            best_val_acc = acc
        print('val acc: %.4f' % (acc))
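
    The script builds test_iter but never uses it. To produce a Kaggle submission, run the saved model over the test set; because test_iter was created with train=False and sort=False, batches come out in file order, so the predictions line up with the PhraseId column of test.tsv. A hedged sketch along those lines (submission.csv is an assumed output name):

    model.load_state_dict(torch.load('params.pkl', map_location=DEVICE))
    model.eval()

    preds = []
    with torch.no_grad():
        for batch in test_iter:
            out = model(batch.Phrase.permute(1, 0))
            preds.extend(out.argmax(dim=-1).cpu().tolist())

    test_df = pd.read_csv('test.tsv', sep='\t')  # re-read: `test` was rebound above
    submission = pd.DataFrame({'PhraseId': test_df['PhraseId'], 'Sentiment': preds})
    submission.to_csv('submission.csv', index=False)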