- Experiment 1 - Simple Sentiment Analysis
- Experiment 2 - Upgraded Sentiment Analysis
- Experiment 3 - Faster Sentiment Analysis
- Experiment 5 - Multi-class Sentiment Analysis
- Experiment 4 - Convolutional Sentiment Analysis
- Colab
- Pre-trained Word Vectors
Experiment 1 - Simple Sentiment Analysis
The network architecture is roughly: one-hot encoding => embedding => RNN => fully-connected layer => sigmoid.
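A minimal sketch of this architecture, under the usual PyTorch conventions (the class name and dimension arguments are illustrative; the sigmoid is normally folded into BCEWithLogitsLoss during training rather than into the model):

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        #text = [sent len, batch size]
        embedded = self.embedding(text)
        #embedded = [sent len, batch size, emb dim]
        output, hidden = self.rnn(embedded)
        #hidden = [1, batch size, hid dim] - the final hidden state
        return self.fc(hidden.squeeze(0))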
Result: mediocre performance. Test Loss: 0.676 | Test Acc: 60.82%
Experiment 2 - Upgraded Sentiment Analysis
Preparing Data
To tell the RNN how long each sentence actually is, we set include_lengths = True on the TEXT field:
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
Using pre-trained word vectors
By default, TorchText initializes tokens that are in the vocabulary but not in the pre-trained word vectors to zero; we can instead initialize them from a standard normal distribution:
TEXT.build_vocab(train_data,
                 max_size = MAX_VOCAB_SIZE,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_)
One other thing: with packed padded sequences, all the examples within a batch need to be sorted by their lengths, which we get by setting sort_within_batch = True on the iterator:
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)
Build the Model
LSTM
Bidirectional RNN
Deep RNNs
Regularization
Generally, the more parameters a model has, the more likely it is to overfit, so we regularize with dropout. The idea of dropout is to randomly drop (zero out) some neurons during training, with a hyperparameter controlling the dropout rate. Why does dropout work? One explanation is that a model with dropout applied can be viewed as a weaker learner, and prediction then amounts to ensembling these weak learners to obtain better performance.
Implementation Details
- Dropout is applied in the forward method, but it should not be applied to the input or output layers, only to the intermediate layers;
- Before feeding the embeddings into the RNN, pack them with nn.utils.rnn.pack_padded_sequence so the RNN only processes the real (non-padded) part of each sequence;
- Afterwards, unpack the output sequence again with nn.utils.rnn.pad_packed_sequence. A sketch of the full model follows this list.
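Putting these pieces together, a minimal sketch of such a model (assuming a bidirectional, multi-layer LSTM as listed above; the class name and hyperparameters are illustrative rather than the exact tutorial code):

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim,
                           num_layers = n_layers,
                           bidirectional = bidirectional,
                           dropout = dropout)
        #hidden_dim * 2 because the final forward and backward states are concatenated
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):
        #text = [sent len, batch size]
        embedded = self.dropout(self.embedding(text))
        #pack so the LSTM only processes the non-padded tokens
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        #unpack back to a padded tensor (padding positions become zeros)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        #concatenate the final forward and backward hidden states, with dropout in between
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        return self.fc(hidden)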
Result
Experiment 3 - Faster Sentiment Analysis
def generate_bigrams(x):
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x
generate_bigrams(['This', 'film', 'is', 'terrible'])
output:
['This', 'film', 'is', 'terrible', 'film is', 'is terrible', 'This film']
FastText
The model reshapes the embedded input and then averages the word embeddings to represent the whole sentence.
import torch.nn as nn
import torch.nn.functional as F

class FastText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.fc = nn.Linear(embedding_dim, output_dim)

    def forward(self, text):
        #text = [sent len, batch size]
        embedded = self.embedding(text)
        #embedded = [sent len, batch size, emb dim]
        embedded = embedded.permute(1, 0, 2)
        #embedded = [batch size, sent len, emb dim]
        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)
        #pooled = [batch size, embedding_dim]
        return self.fc(pooled)
Result
- Epoch: 15 | Train Acc: 93% | Val. Acc: 89% | Test Loss: 0.394 | Test Acc: 86.45%
The case "This film is great" is handled fine, but adding "not" breaks the prediction.
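As a rough illustration, this kind of spot check can be written as follows (a sketch that assumes the TEXT field, device, generate_bigrams and trained model defined above; the spaCy model name en_core_web_sm is an assumption):

import spacy
import torch

nlp = spacy.load('en_core_web_sm')

def predict_sentiment(model, sentence):
    model.eval()
    #tokenize, append bigrams, then map tokens to vocabulary indices
    tokenized = generate_bigrams([tok.text for tok in nlp.tokenizer(sentence)])
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    #shape into [sent len, 1] since the model expects [sent len, batch size]
    tensor = torch.LongTensor(indexed).to(device).unsqueeze(1)
    return torch.sigmoid(model(tensor)).item()

predict_sentiment(model, "This film is great")      #close to 1 (positive)
predict_sentiment(model, "This film is not great")  #the negated case described above as failing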
Experiment 5 - Multi-class Sentiment Analysis
Same as Experiment 4, but with a different dataset: a 6-class classification task, so we use softmax rather than BCEWithLogitsLoss + sigmoid.
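A minimal sketch of what changes (the batch size and tensors below are purely illustrative): output_dim becomes 6 and the criterion becomes nn.CrossEntropyLoss, which applies log-softmax internally.

import torch
import torch.nn as nn

OUTPUT_DIM = 6
criterion = nn.CrossEntropyLoss()  #log-softmax + negative log-likelihood

logits = torch.randn(64, OUTPUT_DIM)          #model output: [batch size, 6] raw scores
labels = torch.randint(0, OUTPUT_DIM, (64,))  #targets: [batch size] class indices
loss = criterion(logits, labels)

#accuracy is now an argmax over classes instead of rounding a sigmoid output
acc = (logits.argmax(dim = 1) == labels).float().mean()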
References
- Enriching Word Vectors with Subword Information
- Bag of Tricks for Efficient Text Classification
- cat, stack, transpose, permute and unsqueeze in PyTorch
- Text sentiment analysis with PyTorch
Experiment 4 - Convolutional Sentiment Analysis
TextCNN
The classic TextCNN paper: https://arxiv.org/abs/1408.5882
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim,
                 dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        self.conv_0 = nn.Conv2d(in_channels = 1,
                                out_channels = n_filters,
                                kernel_size = (filter_sizes[0], embedding_dim))
        self.conv_1 = nn.Conv2d(in_channels = 1,
                                out_channels = n_filters,
                                kernel_size = (filter_sizes[1], embedding_dim))
        self.conv_2 = nn.Conv2d(in_channels = 1,
                                out_channels = n_filters,
                                kernel_size = (filter_sizes[2], embedding_dim))
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        #text = [batch size, sent len]
        embedded = self.embedding(text)
        #embedded = [batch size, sent len, emb dim]
        embedded = embedded.unsqueeze(1)
        #embedded = [batch size, 1, sent len, emb dim]
        conved_0 = F.relu(self.conv_0(embedded).squeeze(3))
        conved_1 = F.relu(self.conv_1(embedded).squeeze(3))
        conved_2 = F.relu(self.conv_2(embedded).squeeze(3))
        #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)
        #pooled_n = [batch size, n_filters]
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim = 1))
        #cat = [batch size, n_filters * len(filter_sizes)]
        return self.fc(cat)
Result
- Epoch: 5 | Train Acc: 87.80% | Val. Acc: 85.92% | Test Loss: 0.341 | Test Acc: 85.20%
This shows the convolutional approach is clearly effective.
References
- https://cs231n.github.io/convolutional-networks/
- https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Colab
Due to circumstances beyond our control, some extra tooling is needed to access Google and similar sites, and if yours is not very stable, using Colab will not be a pleasant experience. Sometimes you plan to leave a job running overnight, only to find the next morning that the computer is still on but the Colab session has died. The fix is simple; a small snippet of code solves the problem:
function ClickConnect(){
    console.log("Working");
    document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(ClickConnect, 60000)
Open your browser's developer tools with F12 or Ctrl + Shift + I, paste the code above into the Console tab, and press Enter.
Pre-trained Word Vectors
- charngram.100d
- fasttext.en.300d
- fasttext.simple.300d
- glove.42B.300d
- glove.840B.300d
- glove.twitter.27B.25d
- glove.twitter.27B.50d
- glove.twitter.27B.100d
- glove.twitter.27B.200d
- glove.6B.50d
- glove.6B.100d
- glove.6B.200d
- glove.6B.300d
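Any of these names can be passed as the vectors argument of build_vocab; the loaded vectors are then typically copied into the model's embedding layer. A minimal sketch, assuming the TEXT field, train_data, MAX_VOCAB_SIZE and a model with an embedding attribute as in the experiments above:

import torch

TEXT.build_vocab(train_data,
                 max_size = MAX_VOCAB_SIZE,
                 vectors = "glove.6B.300d",
                 unk_init = torch.Tensor.normal_)

#copy the pre-trained vectors into the model's embedding layer
model.embedding.weight.data.copy_(TEXT.vocab.vectors)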