Experiment 1 - Simple Sentiment Analysis

The network structure is roughly: one-hot encoding => embedding => RNN => fully-connected layer => sigmoid
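
A minimal sketch of that architecture, assuming the same PyTorch/torchtext setup as the later experiments (the class name SimpleRNN and the hidden size are arbitrary choices; the one-hot step is implicit in nn.Embedding, and the sigmoid is usually folded into BCEWithLogitsLoss during training, so the model returns a raw logit):

    import torch
    import torch.nn as nn

    class SimpleRNN(nn.Module):
        def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim)  # token index -> dense vector (implicit one-hot)
            self.rnn = nn.RNN(embedding_dim, hidden_dim)
            self.fc = nn.Linear(hidden_dim, output_dim)

        def forward(self, text):
            # text = [sent len, batch size]
            embedded = self.embedding(text)
            # embedded = [sent len, batch size, emb dim]
            output, hidden = self.rnn(embedded)
            # hidden = [1, batch size, hid dim], the hidden state of the last time step
            return self.fc(hidden.squeeze(0))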

Result: mediocre performance. Test Loss: 0.676 | Test Acc: 60.82%

Experiment 2 - Upgraded Sentiment Analysis

Preparing Data

  • To tell the RNN how long each sentence really is, we set include_lengths = True on the TEXT field; batch.text then becomes a tuple of (padded token tensor, real sentence lengths):

    TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
  • Use pre-trained word vectors

    • By default, TorchText initializes words that are in the vocabulary but not in the pre-trained vectors to zero; we can instead initialize them from a standard normal distribution via unk_init:

      TEXT.build_vocab(train_data,
                       max_size = MAX_VOCAB_SIZE,
                       vectors = "glove.6B.100d",
                       unk_init = torch.Tensor.normal_)
    • The other point is that packed padded sequences require each batch to be sorted by length, which sort_within_batch = True takes care of:

      train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
          (train_data, valid_data, test_data),
          batch_size = BATCH_SIZE,
          sort_within_batch = True,
          device = device)

Build the Model

LSTM

Compared with a vanilla RNN, the LSTM adds a cell state and a gating mechanism (input, forget and output gates), which mitigates the vanishing-gradient problem on long sequences.

Bidirectional RNN

A bidirectional RNN runs one RNN over the sequence front-to-back and another back-to-front, then concatenates the final hidden states of both directions.

Deep RNNs

A deep (multi-layer) RNN stacks several RNN layers, each layer taking the hidden states of the layer below as its input.

Regularization

Generally speaking, the more parameters a model has, the more likely it is to overfit, so we use dropout for regularization. The idea of dropout is to randomly drop (zero out) some units during training, with a hyperparameter controlling the dropout rate. Why does dropout work? One explanation is that the model after dropout can be seen as a weaker learner; at prediction time the full model effectively ensembles these weak learners, which yields better performance.
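
A minimal illustration of the mechanism (a sketch, not from the original notes; the tensor values are arbitrary): during training, nn.Dropout zeroes each element with probability p and rescales the survivors by 1/(1-p), while in eval mode it is a no-op.

    import torch
    import torch.nn as nn

    dropout = nn.Dropout(p=0.5)   # p is the dropout-rate hyperparameter
    x = torch.ones(2, 6)

    dropout.train()               # training mode: randomly zero elements, scale the rest by 1/(1-p)
    print(dropout(x))

    dropout.eval()                # eval mode: dropout does nothing
    print(dropout(x))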

Implementation Details

  • Dropout is applied in the forward method, but not to the input or the output layer, i.e. only between intermediate layers;
  • Before feeding the embeddings into the RNN, pack them with nn.utils.rnn.pack_padded_sequence, so the RNN only processes the real (non-padded) part of each sequence;
  • Afterwards, unpack the output sequence with nn.utils.rnn.pad_packed_sequence. A model sketch covering these points follows below.
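
The bullet points above translate into a model along the following lines. This is a sketch rather than the exact notebook code: it assumes a bidirectional multi-layer LSTM whose final forward and backward hidden states are concatenated before the linear layer, and the lengths tensor is moved to the CPU as recent PyTorch versions require:

    import torch
    import torch.nn as nn

    class RNN(nn.Module):
        def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                     n_layers, bidirectional, dropout, pad_idx):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
            self.rnn = nn.LSTM(embedding_dim, hidden_dim,
                               num_layers = n_layers,
                               bidirectional = bidirectional,
                               dropout = dropout)
            self.fc = nn.Linear(hidden_dim * 2, output_dim)   # * 2 for the two directions
            self.dropout = nn.Dropout(dropout)

        def forward(self, text, text_lengths):
            # text = [sent len, batch size]
            embedded = self.dropout(self.embedding(text))
            # pack so the LSTM only processes the real (non-padded) tokens
            packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
            packed_output, (hidden, cell) = self.rnn(packed_embedded)
            # unpack the output sequence (not needed for the prediction itself, shown for completeness)
            output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
            # concatenate the final forward and backward hidden states of the top layer
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
            return self.fc(hidden)

After instantiating the model, the pre-trained vectors built earlier can be copied into the embedding layer with model.embedding.weight.data.copy_(TEXT.vocab.vectors), and the rows for the unknown and padding tokens are typically zeroed out so they start uninformative.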


Result

Experiment 3 - Faster Sentiment Analysis

This experiment appends bigram features to each tokenized example:

    def generate_bigrams(x):
        n_grams = set(zip(*[x[i:] for i in range(2)]))
        for n_gram in n_grams:
            x.append(' '.join(n_gram))
        return x

    generate_bigrams(['This', 'film', 'is', 'terrible'])
    # output:
    # ['This', 'film', 'is', 'terrible', 'film is', 'is terrible', 'This film']
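
These bigrams can be appended automatically through the field's preprocessing hook, which runs after tokenization but before numericalization; a sketch of that wiring (assuming the same torchtext legacy Field API as in Experiment 2):

    TEXT = data.Field(tokenize = 'spacy', preprocessing = generate_bigrams)
    LABEL = data.LabelField(dtype = torch.float)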

FastText

Reshape the data, then average the word embeddings to represent the whole sentence.

    import torch.nn as nn
    import torch.nn.functional as F

    class FastText(nn.Module):
        def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
            self.fc = nn.Linear(embedding_dim, output_dim)

        def forward(self, text):
            # text = [sent len, batch size]
            embedded = self.embedding(text)
            # embedded = [sent len, batch size, emb dim]
            embedded = embedded.permute(1, 0, 2)
            # embedded = [batch size, sent len, emb dim]
            pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)
            # pooled = [batch size, embedding_dim]
            return self.fc(pooled)
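
A tiny sanity check of the pooling step (a sketch with made-up shapes): avg_pool2d with a kernel of (sent len, 1) averages over the sentence dimension, leaving one embedding-sized vector per example, i.e. simply the mean over tokens.

    import torch
    import torch.nn.functional as F

    embedded = torch.randn(8, 5, 100)                    # [batch size, sent len, emb dim]
    pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)
    print(pooled.shape)                                   # torch.Size([8, 100])
    print(torch.allclose(pooled, embedded.mean(dim=1)))   # True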

Result

  • Epoch: 15 | Train Acc: 93% | Val. Acc: 89% | Test Loss: 0.394 | Test Acc: 86.45%

The case "This film is great" is handled fine, but adding "not" breaks it...
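
A sketch of the kind of helper used to probe these cases (the function name and the spaCy model name are assumptions; TEXT, device and model come from the earlier setup, and for this FastText model the bigrams must be appended before numericalizing):

    import spacy
    import torch

    nlp = spacy.load('en_core_web_sm')

    def predict_sentiment(model, sentence):
        model.eval()
        tokens = generate_bigrams([tok.text for tok in nlp.tokenizer(sentence)])
        indexed = [TEXT.vocab.stoi[t] for t in tokens]
        tensor = torch.LongTensor(indexed).unsqueeze(1).to(device)   # [sent len, 1]
        prediction = torch.sigmoid(model(tensor))
        return prediction.item()                                      # ~0 = negative, ~1 = positive

    predict_sentiment(model, "This film is great")       # close to 1
    predict_sentiment(model, "This film is not great")   # ideally close to 0, but FastText struggles here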

Experiment 5 - Multi-class Sentiment Analysis

Same as Experiment 4 but with a different dataset: it is a 6-class task, so we use softmax (nn.CrossEntropyLoss) instead of BCEWithLogitsLoss + sigmoid, as sketched below.
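
A sketch of the swap (the accuracy helper name is an assumption): nn.CrossEntropyLoss applies log-softmax internally, so the model simply outputs one raw score per class, and accuracy is computed with argmax instead of rounding a sigmoid.

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()        # replaces nn.BCEWithLogitsLoss()

    def categorical_accuracy(preds, y):
        # preds = [batch size, n classes], y = [batch size]
        top_pred = preds.argmax(dim = 1)
        return (top_pred == y).float().mean()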


Experiment 4 - Convolutional Sentiment Analysis

TextCNN

The classic TextCNN: https://arxiv.org/abs/1408.5882


    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CNN(nn.Module):
        def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim,
                     dropout, pad_idx):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
            self.conv_0 = nn.Conv2d(in_channels = 1,
                                    out_channels = n_filters,
                                    kernel_size = (filter_sizes[0], embedding_dim))
            self.conv_1 = nn.Conv2d(in_channels = 1,
                                    out_channels = n_filters,
                                    kernel_size = (filter_sizes[1], embedding_dim))
            self.conv_2 = nn.Conv2d(in_channels = 1,
                                    out_channels = n_filters,
                                    kernel_size = (filter_sizes[2], embedding_dim))
            self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
            self.dropout = nn.Dropout(dropout)

        def forward(self, text):
            # text = [batch size, sent len]
            embedded = self.embedding(text)
            # embedded = [batch size, sent len, emb dim]
            embedded = embedded.unsqueeze(1)
            # embedded = [batch size, 1, sent len, emb dim]
            conved_0 = F.relu(self.conv_0(embedded).squeeze(3))
            conved_1 = F.relu(self.conv_1(embedded).squeeze(3))
            conved_2 = F.relu(self.conv_2(embedded).squeeze(3))
            # conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
            pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
            pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
            pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)
            # pooled_n = [batch size, n_filters]
            cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim = 1))
            # cat = [batch size, n_filters * len(filter_sizes)]
            return self.fc(cat)
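
A sketch of instantiating it (the hyperparameter values are assumptions, following the paper's common setting of 100 filters each of sizes 3/4/5; note that this forward expects batch-first input, so the TEXT field needs batch_first = True):

    INPUT_DIM = len(TEXT.vocab)
    EMBEDDING_DIM = 100
    N_FILTERS = 100
    FILTER_SIZES = [3, 4, 5]
    OUTPUT_DIM = 1
    DROPOUT = 0.5
    PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

    model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX)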

Result

  • Epoch: 5 | Train Acc: 87.80% | Val. Acc: 85.92% | Test Loss: 0.341 | Test Acc: 85.20%

This shows a clearly noticeable improvement.


Colab

Due to factors beyond our control, we need some external help to access sites such as Google normally. If your tool is not very stable, using Colab will not be a pleasant experience. Sometimes you plan to leave a job running overnight, only to discover the next morning that the machine stayed on but the Colab session itself died. The fix is actually simple; a small piece of code solves it:

    function ClickConnect(){
        console.log("Working");
        document.querySelector("colab-toolbar-button#connect").click()
    }
    setInterval(ClickConnect, 60000)

Open your browser, press F12 or Ctrl + Shift + I, paste the code above into the Console panel, and press Enter.

Pre-trained Word Vectors

  • charngram.100d
  • fasttext.en.300d
  • fasttext.simple.300d
  • glove.42B.300d
  • glove.840B.300d
  • glove.twitter.27B.25d
  • glove.twitter.27B.50d
  • glove.twitter.27B.100d
  • glove.twitter.27B.200d
  • glove.6B.50d
  • glove.6B.100d
  • glove.6B.200d
  • glove.6B.300d