From Artificial Intelligence to NLP

Artificial intelligence: giving machines human intelligence

As early as the summer conference of 1956, the pioneers of artificial intelligence dreamed of using the newly emerging computer to build complex machines with the same essential qualities as human intelligence. This is what we now call "strong AI" (general AI): an all-capable machine that has all of our senses (perhaps even more), all of our reason, and can think just as we do.
What we can actually build today is generally called "weak AI" (narrow AI): technology that performs specific tasks as well as, or better than, humans. Image classification on Pinterest and face recognition on Facebook are examples of narrow AI in practice. These technologies reproduce specific, local pieces of human intelligence. But how are they implemented, and where does this intelligence come from? That brings us to the next layer down: machine learning.

Machine learning: one approach to achieving artificial intelligence

Machine learning (ML) is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and more. It is the core of artificial intelligence and the fundamental route to making computers intelligent; its applications span every area of AI, and it relies mainly on induction and synthesis rather than deduction.
At its most basic, machine learning uses algorithms to parse data, learn from it, and then make decisions and predictions about real-world events. Unlike traditional software that is hard-coded to solve one specific task, a machine learning system is "trained" on large amounts of data, and various algorithms learn from that data how to accomplish the task.
Computer vision has been one of machine learning's most successful application areas, although it long required a great deal of hand-written code: people had to hand-craft classifiers and edge-detection filters so a program could tell where an object begins and ends, write shape detectors to check whether an object has eight sides, and write classifiers to recognize the letters "S-T-O-P". Only with all of these hand-built components could one assemble an algorithm that perceives an image and decides whether it shows a stop sign.

Machine learning methods fall into three broad categories (a minimal sketch contrasting the first two follows this list):
The first is unsupervised learning, which discovers regularities in the data on its own and groups the data into categories; this is often called clustering.
The second is supervised learning, where historical data comes with labels and a model is trained to predict them. For example, deciding whether a fruit is a banana or an apple from its shape and color is a supervised learning problem.
The last is reinforcement learning, a learning paradigm that supports decision making and planning: the agent's actions receive rewards as feedback, and this feedback loop drives learning. Because this resembles how humans learn, reinforcement learning is currently one of the important research directions.
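
As a concrete illustration, the sketch below contrasts unsupervised clustering with supervised classification on the same toy points. It is a minimal example assuming scikit-learn is available; the data and model choices are illustrative, not taken from these notes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Two blobs of 2-D points; y holds the "true" labels
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 4.0])
y = np.array([0] * 50 + [1] * 50)

# Unsupervised: group the points without ever looking at y
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: learn a mapping from features to the given labels
clf = LogisticRegression().fit(X, y)
print(clusters[:5], clf.predict(X[:5]))
```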

Deep learning: one technique for implementing machine learning

Machine learning and deep learning are not the same thing. Machine learning refers to algorithms that, like people, can extract information from data and learn regularities from it. Deep learning is one kind of machine learning, but it uses deep neural networks to build more complex models and thereby gain a deeper understanding of the data.
Deep learning is a family of machine learning methods based on representation learning of data. It is a comparatively new area of machine learning research, motivated by building neural networks that simulate the human brain's analytical learning; it imitates the brain's mechanisms to interpret data such as images, sound, and text.
Like other machine learning methods, deep learning methods come in supervised and unsupervised variants, and the models built under different learning frameworks differ considerably. For example, convolutional neural networks (CNNs) are deep models trained with supervised learning, whereas deep belief networks (DBNs) are models trained with unsupervised learning.

Natural language processing

Natural language processing (NLP) is the technology that lets machines interact with people through the natural languages humans use to communicate; by processing natural language, we make it readable and understandable to computers. Research on NLP began with early attempts at machine translation. Although NLP spans speech, syntax, semantics, pragmatics, and other levels of analysis, its most basic task, simply put, is to segment the input corpus, using ontology dictionaries, word-frequency statistics, contextual semantic analysis, and similar techniques, into minimal, semantically meaningful token units.
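
For concreteness, the sketch below segments a Chinese sentence into word units using the third-party jieba library (an assumption; the notes do not prescribe a particular tokenizer).

```python
import jieba  # third-party Chinese word-segmentation library (assumed installed)

sentence = "自然语言处理的基本任务是对语料进行分词"
tokens = jieba.lcut(sentence)  # split the sentence into a list of word units
print(tokens)
```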

Development stages

  • Early NLP: rule-based
  • Statistical NLP: machine learning based on statistics
  • Neural NLP: features and models computed with deep learning

    Related techniques

    Neural network language model (NNLM)

    The neural network language model (NNLM) was first proposed by Bengio et al. [27]. Its core idea is to represent each word as a K-dimensional vector, called a word embedding, so that semantically similar words lie close together in the vector space; a neural network then maps the input sequence of context word vectors into a fixed-length hidden context vector. The language model therefore no longer needs to store information about every possible combination of words, which overcomes the vocabulary-size limitation of traditional language models.
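
A minimal NNLM sketch in PyTorch is shown below; the vocabulary, context size, and layer widths are illustrative assumptions. The n-1 context words are embedded, concatenated into a fixed-length vector, and passed through a hidden layer to score the next word.

```python
import torch
import torch.nn as nn

class NNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, context_size=2, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)           # word vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):                # context: [batch, context_size]
        e = self.embed(context).flatten(1)     # concatenated context embeddings
        h = torch.tanh(self.hidden(e))         # fixed-length context vector
        return self.out(h)                     # scores over the whole vocabulary

logits = NNLM(vocab_size=1000)(torch.randint(0, 1000, (4, 2)))
print(logits.shape)  # torch.Size([4, 1000])
```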

    Autoencoder (AE)

    The autoencoder (AE) is an unsupervised learning model first proposed by Rumelhart et al. [28]. It consists of an encoder and a decoder: the encoder compresses the input, mapping high-dimensional data into a low-dimensional space, and the decoder decompresses it to reconstruct the input, so that the output reproduces the input.
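
A minimal autoencoder sketch in PyTorch (the dimensions are illustrative assumptions): the encoder compresses the input, the decoder reconstructs it, and the reconstruction error is the unsupervised training signal.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())  # compress
        self.decoder = nn.Linear(code_dim, in_dim)                            # reconstruct

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(16, 784)
loss = nn.MSELoss()(AutoEncoder()(x), x)  # reconstruction loss, no labels needed
print(loss.item())
```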

    Convolutional neural network (CNN)

    The core idea is to design local feature extractors and apply them across the whole input, sharing parameters by exploiting spatial structure, which improves training efficiency.
    Convolutional layers and pooling layers are the main building blocks of a CNN. A convolutional layer reads the input through fixed-size windows and extracts features via the convolution operation; the convolution kernels are shared within a layer, which keeps the number of parameters, and hence the model complexity, under control. A pooling layer abstracts the feature signal, shrinking the input by compressing the features according to some rule; common choices include sum pooling, max pooling, average pooling, min pooling, and stochastic pooling. The last pooling layer is usually followed by a fully connected layer that computes the final output.
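
A minimal convolution-plus-pooling sketch in PyTorch (the channel counts and input size are illustrative assumptions): a shared 3x3 kernel slides over the image, 2x2 max pooling halves the spatial size, and a fully connected layer produces the final output.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # local feature extractor with shared weights
    nn.ReLU(),
    nn.MaxPool2d(2),                            # abstract / shrink the feature map
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # final fully connected layer
)
print(net(torch.randn(4, 1, 28, 28)).shape)     # torch.Size([4, 10])
```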

    Recurrent neural network (RNN)

    The recurrent connection lets the network remember earlier information and use it when computing the current output: the hidden layer's input at the current time step consists of the current input together with the hidden state from the previous time step.
    Because the recurrence can be unrolled indefinitely, an RNN can in principle process sequences of any length.
    In practice, however, RNNs suffer from problems such as vanishing gradients; LSTM and GRU were later proposed to address this.
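
A minimal recurrence sketch in PyTorch (the sizes are illustrative assumptions): at each time step the new hidden state is computed from the current input and the previous hidden state.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=10, hidden_size=20)
x = torch.randn(5, 3, 10)        # seq_len=5, batch=3, feature=10
h = torch.zeros(3, 20)           # initial hidden state
for t in range(x.size(0)):
    h = cell(x[t], h)            # h depends on x[t] and the previous h
print(h.shape)                   # torch.Size([3, 20])
```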

Seq2Seq

The encoder-decoder model: it takes one sequence as input and produces another sequence as output.
The basic model uses two RNNs: one recurrent network acts as the encoder, converting the input sequence into a fixed-length vector that is treated as the semantic representation of the input; the other acts as the decoder, generating the output sequence from that semantic representation.
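
A minimal sketch of this two-RNN setup (vocabulary sizes and dimensions are illustrative assumptions; a full attention-based implementation appears in the Seq2Seq code later in these notes):

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, emb, hid = 100, 120, 32, 64
enc_emb, dec_emb = nn.Embedding(src_vocab, emb), nn.Embedding(tgt_vocab, emb)
encoder, decoder = nn.GRU(emb, hid), nn.GRU(emb, hid)
out_proj = nn.Linear(hid, tgt_vocab)

src = torch.randint(0, src_vocab, (7, 2))     # [src_len, batch]
tgt = torch.randint(0, tgt_vocab, (5, 2))     # [tgt_len, batch]

_, context = encoder(enc_emb(src))            # fixed-length semantic vector: [1, batch, hid]
dec_out, _ = decoder(dec_emb(tgt), context)   # decode conditioned on that vector
print(out_proj(dec_out).shape)                # torch.Size([5, 2, 120])
```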

Attention mechanism

The attention mechanism can be understood as a look-back strategy. At the current decoding step it relates the decoder RNN's hidden vector from the previous step to the input sequence, computing how strongly each input position influences the current step and using these values as weights. The weights are normalized into a probability distribution with a softmax and used to take a weighted sum over the inputs, so that decoding focuses on the input words that matter most at the current step.
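
A minimal dot-product attention sketch (shapes are illustrative assumptions): each encoder state is scored against the previous decoder state, the scores are normalized with a softmax, and the weighted sum of encoder states becomes the context vector.

```python
import torch
import torch.nn.functional as F

src_len, batch, hid = 7, 2, 16
enc_states = torch.randn(src_len, batch, hid)    # encoder hidden states
s_prev = torch.randn(batch, hid)                 # decoder state from the previous step

scores = (enc_states * s_prev).sum(dim=-1)       # [src_len, batch] dot-product scores
weights = F.softmax(scores, dim=0)               # attention distribution over the input
context = (weights.unsqueeze(-1) * enc_states).sum(dim=0)  # [batch, hid] context vector
print(context.shape)
```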

Self-attention and multi-head attention

Reinforcement learning

Observations, rewards, and actions together make up the agent's experience.
The agent's goal is to obtain the maximum cumulative reward on the basis of that experience.
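
A minimal tabular Q-learning sketch on a toy five-state chain (the environment, rewards, and hyperparameters are illustrative assumptions, not part of the notes): the agent improves its value estimates from (observation, action, reward) experience so as to maximize cumulative reward.

```python
import random

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, eps = 0.1, 0.9, 0.2

for episode in range(200):
    s = 0
    while s != n_states - 1:          # reaching the last state ends the episode
        # epsilon-greedy action selection
        a = random.randrange(n_actions) if random.random() < eps else max(range(n_actions), key=lambda i: Q[s][i])
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        # update the value estimate from this (observation, action, reward) experience
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
print(Q)
```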

The Transformer model


Getting started with PyTorch

Implementing a two-layer fully connected network from scratch

  1. **Pure NumPy implementation**, computing the forward and backward passes by hand:

```python
# Two-layer fully connected network, implemented with NumPy only
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate training data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

lr = 1e-6
for it in range(500):
    # Forward pass
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)      # ReLU activation
    y_pred = h_relu.dot(w2)        # predicted y

    # Loss
    loss = np.square(y_pred - y).sum()
    print(it, loss)

    # Backward pass: compute the gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update the weights
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
```
  2. **Replace the NumPy operations with the corresponding torch functions** (`mm` for matrix multiplication, `clamp(min=0)` for the ReLU activation):

```python
# The same network, using torch tensors instead of NumPy
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H)
w2 = torch.randn(H, D_out)

lr = 1e-6
for it in range(500):
    # Forward pass
    h = x.mm(w1)                 # matrix multiplication
    h_relu = h.clamp(min=0)      # ReLU activation
    y_pred = h_relu.mm(w2)       # predicted y

    # Loss
    loss = (y_pred - y).pow(2).sum().item()
    print(it, loss)

    # Backward pass: compute the gradients
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update the weights
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
```
  3. **Use PyTorch autograd to compute the gradients automatically**, remembering to zero them after each update:

```python
# The same network, using autograd
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

lr = 1e-6
for it in range(500):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Loss
    loss = (y_pred - y).pow(2).sum()
    print(it, loss.item())

    # Backward pass: compute the gradients
    loss.backward()

    # Update the weights outside the computation graph to save memory
    with torch.no_grad():
        w1 -= lr * w1.grad
        w2 -= lr * w2.grad
        w1.grad.zero_()   # zero the gradients
        w2.grad.zero_()
```
  4. **Use the prebuilt `nn.Sequential` model:**

```python
# The same network, built with nn.Sequential
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = nn.Sequential(
    nn.Linear(D_in, H, bias=False),
    nn.ReLU(),
    nn.Linear(H, D_out),
)
# Without re-initializing the weights from a normal distribution, training works much worse here
nn.init.normal_(model[0].weight)
nn.init.normal_(model[2].weight)

loss_fn = nn.MSELoss(reduction='sum')
# model = model.cuda()

lr = 1e-6
for it in range(500):
    # Forward pass
    y_pred = model(x)

    # Loss
    loss = loss_fn(y_pred, y)
    print(it, loss.item())

    # Backward pass: compute the gradients
    loss.backward()

    # Update the weights
    with torch.no_grad():
        for param in model.parameters():
            param -= lr * param.grad
    model.zero_grad()   # zero the gradients of all parameters
```
  5. **Use an optimizer from `torch.optim` (e.g. Adam or SGD) together with the built-in MSELoss**: all trainable parameters are handed to the optimizer and updated automatically, so no manual update is needed:

```python
# The same network, trained with torch.optim
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = nn.Sequential(
    nn.Linear(D_in, H, bias=False),
    nn.ReLU(),               # activation
    nn.Linear(H, D_out),
)
nn.init.normal_(model[0].weight)
nn.init.normal_(model[2].weight)

loss_fn = nn.MSELoss(reduction='sum')
# model = model.cuda()

# With Adam a larger learning rate works well; the parameters are all placed
# in the optimizer, which updates them automatically.
lr = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# Alternatively, plain SGD with a smaller learning rate:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)

for it in range(500):
    # Forward pass
    y_pred = model(x)

    # Loss
    loss = loss_fn(y_pred, y)
    print(it, loss.item())

    optimizer.zero_grad()    # zero the gradients before backward
    # Backward pass
    loss.backward()
    # Update the parameters
    optimizer.step()
```
  6. **Wrap the model in a class**: subclass `nn.Module` and define the `__init__()` and `forward()` methods:

```python
# Declare the model as a class: subclassing nn.Module allows more complex models
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# Randomly generate training data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = nn.Linear(D_in, H, bias=False)
        self.linear2 = nn.Linear(H, D_out, bias=False)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))

model = TwoLayerNet(D_in, H, D_out)
loss_fn = nn.MSELoss(reduction='sum')
# model = model.cuda()

lr = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for it in range(500):
    # Forward pass
    y_pred = model(x)

    # Loss
    loss = loss_fn(y_pred, y)
    print(it, loss.item())

    optimizer.zero_grad()    # zero the gradients before backward
    # Backward pass
    loss.backward()
    # Update the parameters
    optimizer.step()
```

Training a neural network to play FizzBuzz

```python
# FizzBuzz: encode the game as a 4-class classification problem
import numpy as np
import torch
import torch.nn as nn

def fizz_buzz_encode(i):
    if i % 15 == 0: return 3
    elif i % 5 == 0: return 2
    elif i % 3 == 0: return 1
    else: return 0

def fizz_buzz_decode(i, predict):
    return [str(i), 'fizz', 'buzz', 'fizzbuzz'][predict]

def helper(i):
    print(fizz_buzz_decode(i, fizz_buzz_encode(i)))

for i in range(1, 30):
    helper(i)

NUM_DIGITS = 10

def binary_encode(i, num_digits):
    # Represent the integer i as a binary feature vector
    return np.array([i >> d & 1 for d in range(num_digits)][::-1])

# Training data: numbers from 101 upwards (1..100 are held out for testing)
trX = torch.Tensor([binary_encode(i, NUM_DIGITS) for i in range(101, 2 ** NUM_DIGITS)])
trY = torch.LongTensor([fizz_buzz_encode(i) for i in range(101, 2 ** NUM_DIGITS)])

NUM_HIDDEN = 100
model = nn.Sequential(
    nn.Linear(NUM_DIGITS, NUM_HIDDEN),
    nn.ReLU(),
    nn.Linear(NUM_HIDDEN, 4),
)
if torch.cuda.is_available():
    model = model.cuda()

loss_fn = torch.nn.CrossEntropyLoss()   # 4-class problem, so use cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

BATCH_SIZE = 128
for epoch in range(10000):
    for start in range(0, len(trX), BATCH_SIZE):
        end = start + BATCH_SIZE
        batchX = trX[start:end]
        batchY = trY[start:end]
        if torch.cuda.is_available():
            batchX = batchX.cuda()
            batchY = batchY.cuda()
        y_pred = model(batchX)           # forward pass
        loss = loss_fn(y_pred, batchY)
        print("Epoch", epoch, loss.item())
        optimizer.zero_grad()
        loss.backward()                  # backward pass
        optimizer.step()                 # gradient descent

# Evaluate: numbers 1..100 were held out as test data
testX = torch.Tensor([binary_encode(i, NUM_DIGITS) for i in range(1, 101)])
if torch.cuda.is_available():
    testX = testX.cuda()
with torch.no_grad():
    testY = model(testX)
predictions = zip(range(1, 101), testY.max(1)[1].tolist())
print([fizz_buzz_decode(i, x) for i, x in predictions])
```

Recognizing the MNIST dataset

A simple MNIST recognition program

```python
import numpy as np
import torch
from torch import nn, optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Training set
train_dataset = datasets.MNIST(root='./',
                               train=True,
                               transform=transforms.ToTensor(),
                               download=True)
# Test set
test_dataset = datasets.MNIST(root='./',
                              train=False,
                              transform=transforms.ToTensor(),
                              download=True)

# Batch size: 64 images are fed to the network per step
batch_size = 64

# Wrap the training set in a DataLoader: a data generator that can shuffle the
# data and yields one batch at a time when iterated over
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True)
# Wrap the test set
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=batch_size,
                         shuffle=True)

# enumerate() pairs each batch with its index; peek at one batch to check shapes
for i, data in enumerate(train_loader):
    inputs, labels = data
    print(inputs.shape)
    print(labels.shape)
    break

# Define the network structure
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # A simple network with no hidden layer: input layer straight to output layer
        self.fc1 = nn.Linear(784, 10)
        # Softmax activation over dim=1; the output has shape (64, 10)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # ([64, 1, 28, 28]) -> (64, 784); one channel because the images are grayscale.
        # The fully connected layer needs 2-D input; -1 infers the remaining size.
        x = x.view(x.size()[0], -1)
        x = self.fc1(x)
        x = self.softmax(x)
        return x

LR = 0.5
# Define the model
model = Net()
# Define the cost function
mse_loss = nn.MSELoss()
# Define the optimizer
optimizer = optim.SGD(model.parameters(), LR)

def train():
    for i, data in enumerate(train_loader):
        # Get one batch of data and labels
        inputs, labels = data
        # Model predictions, shape (64, 10)
        out = model(inputs)
        # Convert the labels to one-hot encoding: (64) -> (64, 1)
        labels = labels.reshape(-1, 1)
        # tensor.scatter(dim, index, src)
        #   dim:   which dimension to one-hot encode
        #   index: where in the tensor to place src
        #   src:   the value to insert at index
        one_hot = torch.zeros(inputs.shape[0], 10).scatter(1, labels, 1)
        # Compute the loss; MSELoss needs both tensors to have the same shape
        loss = mse_loss(out, one_hot)
        # Zero the gradients
        optimizer.zero_grad()
        # Compute the gradients
        loss.backward()
        # Update the weights
        optimizer.step()

def test():
    correct = 0
    for i, data in enumerate(test_loader):
        # Get one batch of data and labels
        inputs, labels = data
        # Model predictions, shape (64, 10)
        out = model(inputs)
        # Get the maximum value and its position along dim 1 of out
        _, predicted = torch.max(out, 1)
        # Count the correct predictions (elementwise comparison gives a boolean tensor)
        correct += (predicted == labels).sum()
    print("Test acc:{0}".format(correct.item() / len(test_dataset)))

for epoch in range(10):
    print('epoch:', epoch)
    train()
    test()
```

Adding dropout

```python
# Define the network structure
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Fully connected layers; p=0.5 means 50% of the neurons are dropped during training
        self.layer1 = nn.Sequential(nn.Linear(784, 500), nn.Dropout(p=0.5), nn.Tanh())
        self.layer2 = nn.Sequential(nn.Linear(500, 300), nn.Dropout(p=0.5), nn.Tanh())
        self.layer3 = nn.Sequential(nn.Linear(300, 10), nn.Softmax(dim=1))

    def forward(self, x):
        # ([64, 1, 28, 28]) -> (64, 784)
        x = x.view(x.size()[0], -1)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        return x
```

Adding regularization

```python
LR = 0.5
# Define the model
model = Net()
# Define the cost function (cross-entropy here, despite the variable name)
mse_loss = nn.CrossEntropyLoss()
# Define the optimizer with L2 regularization (weight decay)
optimizer = optim.SGD(model.parameters(), LR, weight_decay=0.001)
```

Using a convolutional neural network

```python
# Define the network structure
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 32, 5, 1, 2), nn.ReLU(), nn.MaxPool2d(2, 2))
        self.conv2 = nn.Sequential(nn.Conv2d(32, 64, 5, 1, 2), nn.ReLU(), nn.MaxPool2d(2, 2))
        self.fc1 = nn.Sequential(nn.Linear(64 * 7 * 7, 1000), nn.Dropout(p=0.4), nn.ReLU())
        self.fc2 = nn.Sequential(nn.Linear(1000, 10), nn.Softmax(dim=1))

    def forward(self, x):
        # The convolutions produce 4-D data ([64, 1, 28, 28]), but the fully
        # connected layers expect 2-D input, so flatten before fc1
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.view(x.size()[0], -1)
        x = self.fc1(x)
        x = self.fc2(x)
        return x
```

Using an LSTM

```python
# Define the network structure
# input_size:  size of each input feature
# hidden_size: number of LSTM units
# num_layers:  number of stacked LSTM layers
# By default the LSTM expects input of shape (seq_len, batch, feature);
# with batch_first=True both input and output are (batch, seq_len, feature)
class LSTM(nn.Module):
    def __init__(self):
        super(LSTM, self).__init__()
        self.lstm = torch.nn.LSTM(
            input_size=28,
            hidden_size=64,
            num_layers=1,
            batch_first=True
        )
        self.out = torch.nn.Linear(in_features=64, out_features=10)
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):
        # Reshape each image into a sequence: (batch, seq_len, feature)
        x = x.view(-1, 28, 28)
        # output: [batch, seq_len, hidden_size], the outputs at every time step.
        # Even with batch_first=True, dim 0 of h_n and c_n is still num_layers.
        # h_n: [num_layers, batch, hidden_size], hidden state of the last time step only
        # c_n: [num_layers, batch, hidden_size], cell state of the last time step only
        output, (h_n, c_n) = self.lstm(x)
        output_in_last_timestep = h_n[-1, :, :]
        x = self.out(output_in_last_timestep)
        x = self.softmax(x)
        return x
```

Code reproductions

Transformers

  1. import math
  2. import torch
  3. import numpy as np
  4. import torch.nn as nn
  5. import torch.optim as optim
  6. import torch.utils.data as Data
  7. # S: Symbol that shows starting of decoding input
  8. # E: Symbol that shows starting of decoding output
  9. # P: Symbol that will fill in blank sequence if current batch data size is short than time steps
  10. sentences = [
  11. # enc_input dec_input dec_output
  12. ['ich mochte ein bier P', 'S i want a beer .', 'i want a beer . E'],
  13. ['ich mochte ein cola P', 'S i want a coke .', 'i want a coke . E']
  14. ]
  15. # Padding Should be Zero
  16. src_vocab = {'P' : 0, 'ich' : 1, 'mochte' : 2, 'ein' : 3, 'bier' : 4, 'cola' : 5}
  17. src_vocab_size = len(src_vocab)
  18. tgt_vocab = {'P' : 0, 'i' : 1, 'want' : 2, 'a' : 3, 'beer' : 4, 'coke' : 5, 'S' : 6, 'E' : 7, '.' : 8}
  19. idx2word = {i: w for i, w in enumerate(tgt_vocab)}
  20. tgt_vocab_size = len(tgt_vocab)
  21. src_len = 5 # enc_input max sequence length
  22. tgt_len = 6 # dec_input(=dec_output) max sequence length
  23. # Transformer Parameters
  24. d_model = 512 # Embedding Size
  25. d_ff = 2048 # FeedForward dimension
  26. d_k = d_v = 64 # dimension of K(=Q), V
  27. n_layers = 6 # number of Encoder of Decoder Layer
  28. n_heads = 8 # number of heads in Multi-Head Attention
  29. def make_data(sentences):
  30. enc_inputs, dec_inputs, dec_outputs = [], [], []
  31. for i in range(len(sentences)):
  32. enc_input = [[src_vocab[n] for n in sentences[i][0].split()]] # [[1, 2, 3, 4, 0], [1, 2, 3, 5, 0]]
  33. dec_input = [[tgt_vocab[n] for n in sentences[i][1].split()]] # [[6, 1, 2, 3, 4, 8], [6, 1, 2, 3, 5, 8]]
  34. dec_output = [[tgt_vocab[n] for n in sentences[i][2].split()]] # [[1, 2, 3, 4, 8, 7], [1, 2, 3, 5, 8, 7]]
  35. enc_inputs.extend(enc_input)
  36. dec_inputs.extend(dec_input)
  37. dec_outputs.extend(dec_output)
  38. return torch.LongTensor(enc_inputs), torch.LongTensor(dec_inputs), torch.LongTensor(dec_outputs)
  39. enc_inputs, dec_inputs, dec_outputs = make_data(sentences)
  40. class MyDataSet(Data.Dataset):
  41. def __init__(self, enc_inputs, dec_inputs, dec_outputs):
  42. super(MyDataSet, self).__init__()
  43. self.enc_inputs = enc_inputs
  44. self.dec_inputs = dec_inputs
  45. self.dec_outputs = dec_outputs
  46. def __len__(self):
  47. return self.enc_inputs.shape[0]
  48. def __getitem__(self, idx):
  49. return self.enc_inputs[idx], self.dec_inputs[idx], self.dec_outputs[idx]
  50. loader = Data.DataLoader(MyDataSet(enc_inputs, dec_inputs, dec_outputs), 2, True)
  51. class PositionalEncoding(nn.Module):
  52. def __init__(self, d_model, dropout=0.1, max_len=5000):
  53. super(PositionalEncoding, self).__init__()
  54. self.dropout = nn.Dropout(p=dropout)
  55. pe = torch.zeros(max_len, d_model)
  56. position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
  57. div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
  58. pe[:, 0::2] = torch.sin(position * div_term)
  59. pe[:, 1::2] = torch.cos(position * div_term)
  60. pe = pe.unsqueeze(0).transpose(0, 1)
  61. self.register_buffer('pe', pe)
  62. def forward(self, x):
  63. '''
  64. x: [seq_len, batch_size, d_model]
  65. '''
  66. x = x + self.pe[:x.size(0), :]
  67. return self.dropout(x)
  68. def get_attn_pad_mask(seq_q, seq_k):
  69. batch_size, len_q = seq_q.size()
  70. batch_size, len_k = seq_k.size()
  71. # eq(zero) is PAD token
  72. pad_attn_mask = seq_k.data.eq(0).unsqueeze(1) # [batch_size, 1, len_k], False is masked
  73. return pad_attn_mask.expand(batch_size, len_q, len_k) # [batch_size, len_q, len_k]
  74. def get_attn_subsequence_mask(seq):
  75. attn_shape = [seq.size(0), seq.size(1), seq.size(1)]
  76. subsequence_mask = np.triu(np.ones(attn_shape), k=1) # Upper triangular matrix
  77. subsequence_mask = torch.from_numpy(subsequence_mask).byte()
  78. return subsequence_mask # [batch_size, tgt_len, tgt_len]
  79. class ScaledDotProductAttention(nn.Module):
  80. def __init__(self):
  81. super(ScaledDotProductAttention, self).__init__()
  82. def forward(self, Q, K, V, attn_mask):
  83. '''
  84. Q: [batch_size, n_heads, len_q, d_k]
  85. K: [batch_size, n_heads, len_k, d_k]
  86. V: [batch_size, n_heads, len_v(=len_k), d_v]
  87. attn_mask: [batch_size, n_heads, seq_len, seq_len]
  88. '''
  89. scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k) # scores : [batch_size, n_heads, len_q, len_k]
  90. scores.masked_fill_(attn_mask, -1e9) # Fills elements of self tensor with value where mask is True.
  91. attn = nn.Softmax(dim=-1)(scores)
  92. context = torch.matmul(attn, V) # [batch_size, n_heads, len_q, d_v]
  93. return context, attn
  94. class MultiHeadAttention(nn.Module):
  95. def __init__(self):
  96. super(MultiHeadAttention, self).__init__()
  97. self.W_Q = nn.Linear(d_model, d_k * n_heads, bias=False)
  98. self.W_K = nn.Linear(d_model, d_k * n_heads, bias=False)
  99. self.W_V = nn.Linear(d_model, d_v * n_heads, bias=False)
  100. self.fc = nn.Linear(n_heads * d_v, d_model, bias=False)
  101. def forward(self, input_Q, input_K, input_V, attn_mask):
  102. '''
  103. input_Q: [batch_size, len_q, d_model]
  104. input_K: [batch_size, len_k, d_model]
  105. input_V: [batch_size, len_v(=len_k), d_model]
  106. attn_mask: [batch_size, seq_len, seq_len]
  107. '''
  108. residual, batch_size = input_Q, input_Q.size(0)
  109. # (B, S, D) -proj-> (B, S, D_new) -split-> (B, S, H, W) -trans-> (B, H, S, W)
  110. Q = self.W_Q(input_Q).view(batch_size, -1, n_heads, d_k).transpose(1,2) # Q: [batch_size, n_heads, len_q, d_k]
  111. K = self.W_K(input_K).view(batch_size, -1, n_heads, d_k).transpose(1,2) # K: [batch_size, n_heads, len_k, d_k]
  112. V = self.W_V(input_V).view(batch_size, -1, n_heads, d_v).transpose(1,2) # V: [batch_size, n_heads, len_v(=len_k), d_v]
  113. attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1) # attn_mask : [batch_size, n_heads, seq_len, seq_len]
  114. # context: [batch_size, n_heads, len_q, d_v], attn: [batch_size, n_heads, len_q, len_k]
  115. context, attn = ScaledDotProductAttention()(Q, K, V, attn_mask)
  116. context = context.transpose(1, 2).reshape(batch_size, -1, n_heads * d_v) # context: [batch_size, len_q, n_heads * d_v]
  117. output = self.fc(context) # [batch_size, len_q, d_model]
  118. return nn.LayerNorm(d_model).cuda()(output + residual), attn
  119. class PoswiseFeedForwardNet(nn.Module):
  120. def __init__(self):
  121. super(PoswiseFeedForwardNet, self).__init__()
  122. self.fc = nn.Sequential(
  123. nn.Linear(d_model, d_ff, bias=False),
  124. nn.ReLU(),
  125. nn.Linear(d_ff, d_model, bias=False)
  126. )
  127. def forward(self, inputs):
  128. '''
  129. inputs: [batch_size, seq_len, d_model]
  130. '''
  131. residual = inputs
  132. output = self.fc(inputs)
  133. return nn.LayerNorm(d_model).cuda()(output + residual) # [batch_size, seq_len, d_model]
  134. class EncoderLayer(nn.Module):
  135. def __init__(self):
  136. super(EncoderLayer, self).__init__()
  137. self.enc_self_attn = MultiHeadAttention()
  138. self.pos_ffn = PoswiseFeedForwardNet()
  139. def forward(self, enc_inputs, enc_self_attn_mask):
  140. '''
  141. enc_inputs: [batch_size, src_len, d_model]
  142. enc_self_attn_mask: [batch_size, src_len, src_len]
  143. '''
  144. # enc_outputs: [batch_size, src_len, d_model], attn: [batch_size, n_heads, src_len, src_len]
  145. enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask) # enc_inputs to same Q,K,V
  146. enc_outputs = self.pos_ffn(enc_outputs) # enc_outputs: [batch_size, src_len, d_model]
  147. return enc_outputs, attn
  148. class DecoderLayer(nn.Module):
  149. def __init__(self):
  150. super(DecoderLayer, self).__init__()
  151. self.dec_self_attn = MultiHeadAttention()
  152. self.dec_enc_attn = MultiHeadAttention()
  153. self.pos_ffn = PoswiseFeedForwardNet()
  154. def forward(self, dec_inputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask):
  155. '''
  156. dec_inputs: [batch_size, tgt_len, d_model]
  157. enc_outputs: [batch_size, src_len, d_model]
  158. dec_self_attn_mask: [batch_size, tgt_len, tgt_len]
  159. dec_enc_attn_mask: [batch_size, tgt_len, src_len]
  160. '''
  161. # dec_outputs: [batch_size, tgt_len, d_model], dec_self_attn: [batch_size, n_heads, tgt_len, tgt_len]
  162. dec_outputs, dec_self_attn = self.dec_self_attn(dec_inputs, dec_inputs, dec_inputs, dec_self_attn_mask)
  163. # dec_outputs: [batch_size, tgt_len, d_model], dec_enc_attn: [batch_size, h_heads, tgt_len, src_len]
  164. dec_outputs, dec_enc_attn = self.dec_enc_attn(dec_outputs, enc_outputs, enc_outputs, dec_enc_attn_mask)
  165. dec_outputs = self.pos_ffn(dec_outputs) # [batch_size, tgt_len, d_model]
  166. return dec_outputs, dec_self_attn, dec_enc_attn
  167. class Encoder(nn.Module):
  168. def __init__(self):
  169. super(Encoder, self).__init__()
  170. self.src_emb = nn.Embedding(src_vocab_size, d_model)
  171. self.pos_emb = PositionalEncoding(d_model)
  172. self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])
  173. def forward(self, enc_inputs):
  174. enc_outputs = self.src_emb(enc_inputs) # [batch_size, src_len, d_model]
  175. enc_outputs = self.pos_emb(enc_outputs.transpose(0, 1)).transpose(0, 1) # [batch_size, src_len, d_model]
  176. enc_self_attn_mask = get_attn_pad_mask(enc_inputs, enc_inputs) # [batch_size, src_len, src_len]
  177. enc_self_attns = []
  178. for layer in self.layers:
  179. # enc_outputs: [batch_size, src_len, d_model], enc_self_attn: [batch_size, n_heads, src_len, src_len]
  180. enc_outputs, enc_self_attn = layer(enc_outputs, enc_self_attn_mask)
  181. enc_self_attns.append(enc_self_attn)
  182. return enc_outputs, enc_self_attns
  183. class Decoder(nn.Module):
  184. def __init__(self):
  185. super(Decoder, self).__init__()
  186. self.tgt_emb = nn.Embedding(tgt_vocab_size, d_model)
  187. self.pos_emb = PositionalEncoding(d_model)
  188. self.layers = nn.ModuleList([DecoderLayer() for _ in range(n_layers)])
  189. def forward(self, dec_inputs, enc_inputs, enc_outputs):
  190. dec_outputs = self.tgt_emb(dec_inputs) # [batch_size, tgt_len, d_model]
  191. dec_outputs = self.pos_emb(dec_outputs.transpose(0, 1)).transpose(0, 1).cuda() # [batch_size, tgt_len, d_model]
  192. dec_self_attn_pad_mask = get_attn_pad_mask(dec_inputs, dec_inputs).cuda() # [batch_size, tgt_len, tgt_len]
  193. dec_self_attn_subsequence_mask = get_attn_subsequence_mask(dec_inputs).cuda() # [batch_size, tgt_len, tgt_len]
  194. dec_self_attn_mask = torch.gt((dec_self_attn_pad_mask + dec_self_attn_subsequence_mask), 0).cuda() # [batch_size, tgt_len, tgt_len]
  195. dec_enc_attn_mask = get_attn_pad_mask(dec_inputs, enc_inputs) # [batc_size, tgt_len, src_len]
  196. dec_self_attns, dec_enc_attns = [], []
  197. for layer in self.layers:
  198. # dec_outputs: [batch_size, tgt_len, d_model], dec_self_attn: [batch_size, n_heads, tgt_len, tgt_len], dec_enc_attn: [batch_size, h_heads, tgt_len, src_len]
  199. dec_outputs, dec_self_attn, dec_enc_attn = layer(dec_outputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask)
  200. dec_self_attns.append(dec_self_attn)
  201. dec_enc_attns.append(dec_enc_attn)
  202. return dec_outputs, dec_self_attns, dec_enc_attns
  203. class Transformer(nn.Module):
  204. def __init__(self):
  205. super(Transformer, self).__init__()
  206. self.encoder = Encoder().cuda()
  207. self.decoder = Decoder().cuda()
  208. self.projection = nn.Linear(d_model, tgt_vocab_size, bias=False).cuda()
  209. def forward(self, enc_inputs, dec_inputs):
  210. '''
  211. enc_inputs: [batch_size, src_len]
  212. dec_inputs: [batch_size, tgt_len]
  213. '''
  214. # tensor to store decoder outputs
  215. # outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)
  216. # enc_outputs: [batch_size, src_len, d_model], enc_self_attns: [n_layers, batch_size, n_heads, src_len, src_len]
  217. enc_outputs, enc_self_attns = self.encoder(enc_inputs)
  218. # dec_outpus: [batch_size, tgt_len, d_model], dec_self_attns: [n_layers, batch_size, n_heads, tgt_len, tgt_len], dec_enc_attn: [n_layers, batch_size, tgt_len, src_len]
  219. dec_outputs, dec_self_attns, dec_enc_attns = self.decoder(dec_inputs, enc_inputs, enc_outputs)
  220. dec_logits = self.projection(dec_outputs) # dec_logits: [batch_size, tgt_len, tgt_vocab_size]
  221. return dec_logits.view(-1, dec_logits.size(-1)), enc_self_attns, dec_self_attns, dec_enc_attns
  222. model = Transformer().cuda()
  223. criterion = nn.CrossEntropyLoss(ignore_index=0)
  224. optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.99)
  225. for epoch in range(1000):
  226. for enc_inputs, dec_inputs, dec_outputs in loader:
  227. '''
  228. enc_inputs: [batch_size, src_len]
  229. dec_inputs: [batch_size, tgt_len]
  230. dec_outputs: [batch_size, tgt_len]
  231. '''
  232. enc_inputs, dec_inputs, dec_outputs = enc_inputs.cuda(), dec_inputs.cuda(), dec_outputs.cuda()
  233. # outputs: [batch_size * tgt_len, tgt_vocab_size]
  234. outputs, enc_self_attns, dec_self_attns, dec_enc_attns = model(enc_inputs, dec_inputs)
  235. loss = criterion(outputs, dec_outputs.view(-1))
  236. print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss))
  237. optimizer.zero_grad()
  238. loss.backward()
  239. optimizer.step()
  240. def greedy_decoder(model, enc_input, start_symbol):
  241. enc_outputs, enc_self_attns = model.encoder(enc_input)
  242. dec_input = torch.zeros(1, 0).type_as(enc_input.data)
  243. terminal = False
  244. next_symbol = start_symbol
  245. while not terminal:
  246. dec_input=torch.cat([dec_input.detach(),torch.tensor([[next_symbol]],dtype=enc_input.dtype)],-1)
  247. dec_outputs, _, _ = model.decoder(dec_input, enc_input, enc_outputs)
  248. projected = model.projection(dec_outputs)
  249. prob = projected.squeeze(0).max(dim=-1, keepdim=False)[1]
  250. next_word = prob.data[-1]
  251. next_symbol = next_word
  252. if next_symbol == tgt_vocab["."]:
  253. terminal = True
  254. print(next_word)
  255. return dec_input
  256. # Test
  257. enc_inputs, _, _ = next(iter(loader))
  258. for i in range(len(enc_inputs)):
  259. greedy_dec_input = greedy_decoder(model, enc_inputs[i].view(1, -1), start_symbol=tgt_vocab["S"])
  260. predict, _, _, _ = model(enc_inputs[i].view(1, -1), greedy_dec_input)
  261. predict = predict.data.max(1, keepdim=True)[1]
  262. print(enc_inputs[i], '->', [idx2word[n.item()] for n in predict.squeeze()])

Seq2Seq

  1. import torch
  2. import torch.nn as nn
  3. import torch.optim as optim
  4. import torch.nn.functional as F
  5. from torchtext.datasets import Multi30k
  6. from torchtext.data import Field, BucketIterator
  7. import spacy
  8. import numpy as np
  9. import random
  10. import math
  11. import time
  12. """Set the random seeds for reproducability."""
  13. SEED = 1234
  14. random.seed(SEED)
  15. np.random.seed(SEED)
  16. torch.manual_seed(SEED)
  17. torch.cuda.manual_seed(SEED)
  18. torch.backends.cudnn.deterministic = True
  19. """Load the German and English spaCy models."""
  20. ! python -m spacy download de
  21. spacy_de = spacy.load('de')
  22. spacy_en = spacy.load('en')
  23. """We create the tokenizers."""
  24. def tokenize_de(text):
  25. # Tokenizes German text from a string into a list of strings
  26. return [tok.text for tok in spacy_de.tokenizer(text)]
  27. def tokenize_en(text):
  28. # Tokenizes English text from a string into a list of strings
  29. return [tok.text for tok in spacy_en.tokenizer(text)]
  30. """The fields remain the same as before."""
  31. SRC = Field(tokenize = tokenize_de,
  32. init_token = '<sos>',
  33. eos_token = '<eos>',
  34. lower = True)
  35. TRG = Field(tokenize = tokenize_en,
  36. init_token = '<sos>',
  37. eos_token = '<eos>',
  38. lower = True)
  39. """Load the data."""
  40. train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),fields = (SRC, TRG))
  41. """Build the vocabulary."""
  42. SRC.build_vocab(train_data, min_freq = 2)
  43. TRG.build_vocab(train_data, min_freq = 2)
  44. """Define the device."""
  45. device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  46. """Create the iterators."""
  47. BATCH_SIZE = 128
  48. train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
  49. (train_data, valid_data, test_data),
  50. batch_size = BATCH_SIZE,
  51. device = device)
  52. class Encoder(nn.Module):
  53. def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
  54. super().__init__()
  55. self.embedding = nn.Embedding(input_dim, emb_dim)
  56. self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
  57. self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
  58. self.dropout = nn.Dropout(dropout)
  59. def forward(self, src):
  60. '''
  61. src = [src_len, batch_size]
  62. '''
  63. src = src.transpose(0, 1) # src = [batch_size, src_len]
  64. embedded = self.dropout(self.embedding(src)).transpose(0, 1) # embedded = [src_len, batch_size, emb_dim]
  65. # enc_output = [src_len, batch_size, hid_dim * num_directions]
  66. # enc_hidden = [n_layers * num_directions, batch_size, hid_dim]
  67. enc_output, enc_hidden = self.rnn(embedded) # if h_0 is not give, it will be set 0 acquiescently
  68. # enc_hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
  69. # enc_output are always from the last layer
  70. # enc_hidden [-2, :, : ] is the last of the forwards RNN
  71. # enc_hidden [-1, :, : ] is the last of the backwards RNN
  72. # initial decoder hidden is final hidden state of the forwards and backwards
  73. # encoder RNNs fed through a linear layer
  74. # s = [batch_size, dec_hid_dim]
  75. s = torch.tanh(self.fc(torch.cat((enc_hidden[-2,:,:], enc_hidden[-1,:,:]), dim = 1)))
  76. return enc_output, s
  77. class Attention(nn.Module):
  78. def __init__(self, enc_hid_dim, dec_hid_dim):
  79. super().__init__()
  80. self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim, bias=False)
  81. self.v = nn.Linear(dec_hid_dim, 1, bias = False)
  82. def forward(self, s, enc_output):
  83. # s = [batch_size, dec_hid_dim]
  84. # enc_output = [src_len, batch_size, enc_hid_dim * 2]
  85. batch_size = enc_output.shape[1]
  86. src_len = enc_output.shape[0]
  87. # repeat decoder hidden state src_len times
  88. # s = [batch_size, src_len, dec_hid_dim]
  89. # enc_output = [batch_size, src_len, enc_hid_dim * 2]
  90. s = s.unsqueeze(1).repeat(1, src_len, 1)
  91. enc_output = enc_output.transpose(0, 1)
  92. # energy = [batch_size, src_len, dec_hid_dim]
  93. energy = torch.tanh(self.attn(torch.cat((s, enc_output), dim = 2)))
  94. # attention = [batch_size, src_len]
  95. attention = self.v(energy).squeeze(2)
  96. return F.softmax(attention, dim=1)
  97. class Decoder(nn.Module):
  98. def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
  99. super().__init__()
  100. self.output_dim = output_dim
  101. self.attention = attention
  102. self.embedding = nn.Embedding(output_dim, emb_dim)
  103. self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
  104. self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
  105. self.dropout = nn.Dropout(dropout)
  106. def forward(self, dec_input, s, enc_output):
  107. # dec_input = [batch_size]
  108. # s = [batch_size, dec_hid_dim]
  109. # enc_output = [src_len, batch_size, enc_hid_dim * 2]
  110. dec_input = dec_input.unsqueeze(1) # dec_input = [batch_size, 1]
  111. embedded = self.dropout(self.embedding(dec_input)).transpose(0, 1) # embedded = [1, batch_size, emb_dim]
  112. # a = [batch_size, 1, src_len]
  113. a = self.attention(s, enc_output).unsqueeze(1)
  114. # enc_output = [batch_size, src_len, enc_hid_dim * 2]
  115. enc_output = enc_output.transpose(0, 1)
  116. # c = [1, batch_size, enc_hid_dim * 2]
  117. c = torch.bmm(a, enc_output).transpose(0, 1)
  118. # rnn_input = [1, batch_size, (enc_hid_dim * 2) + emb_dim]
  119. rnn_input = torch.cat((embedded, c), dim = 2)
  120. # dec_output = [src_len(=1), batch_size, dec_hid_dim]
  121. # dec_hidden = [n_layers * num_directions, batch_size, dec_hid_dim]
  122. dec_output, dec_hidden = self.rnn(rnn_input, s.unsqueeze(0))
  123. # embedded = [batch_size, emb_dim]
  124. # dec_output = [batch_size, dec_hid_dim]
  125. # c = [batch_size, enc_hid_dim * 2]
  126. embedded = embedded.squeeze(0)
  127. dec_output = dec_output.squeeze(0)
  128. c = c.squeeze(0)
  129. # pred = [batch_size, output_dim]
  130. pred = self.fc_out(torch.cat((dec_output, c, embedded), dim = 1))
  131. return pred, dec_hidden.squeeze(0)
  132. class Seq2Seq(nn.Module):
  133. def __init__(self, encoder, decoder, device):
  134. super().__init__()
  135. self.encoder = encoder
  136. self.decoder = decoder
  137. self.device = device
  138. def forward(self, src, trg, teacher_forcing_ratio = 0.5):
  139. # src = [src_len, batch_size]
  140. # trg = [trg_len, batch_size]
  141. # teacher_forcing_ratio is probability to use teacher forcing
  142. batch_size = src.shape[1]
  143. trg_len = trg.shape[0]
  144. trg_vocab_size = self.decoder.output_dim
  145. # tensor to store decoder outputs
  146. outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
  147. # enc_output is all hidden states of the input sequence, back and forwards
  148. # s is the final forward and backward hidden states, passed through a linear layer
  149. enc_output, s = self.encoder(src)
  150. # first input to the decoder is the <sos> tokens
  151. dec_input = trg[0,:]
  152. for t in range(1, trg_len):
  153. # insert dec_input token embedding, previous hidden state and all encoder hidden states
  154. # receive output tensor (predictions) and new hidden state
  155. dec_output, s = self.decoder(dec_input, s, enc_output)
  156. # place predictions in a tensor holding predictions for each token
  157. outputs[t] = dec_output
  158. # decide if we are going to use teacher forcing or not
  159. teacher_force = random.random() < teacher_forcing_ratio
  160. # get the highest predicted token from our predictions
  161. top1 = dec_output.argmax(1)
  162. # if teacher forcing, use actual next token as next input
  163. # if not, use predicted token
  164. dec_input = trg[t] if teacher_force else top1
  165. return outputs
  166. """## Training the Seq2Seq Model
  167. The rest of this tutorial is very similar to the previous one.
  168. We initialise our parameters, encoder, decoder and seq2seq model (placing it on the GPU if we have one).
  169. """
  170. INPUT_DIM = len(SRC.vocab)
  171. OUTPUT_DIM = len(TRG.vocab)
  172. ENC_EMB_DIM = 256
  173. DEC_EMB_DIM = 256
  174. ENC_HID_DIM = 512
  175. DEC_HID_DIM = 512
  176. ENC_DROPOUT = 0.5
  177. DEC_DROPOUT = 0.5
  178. attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
  179. enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
  180. dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)
  181. model = Seq2Seq(enc, dec, device).to(device)
  182. TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
  183. criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX).to(device)
  184. optimizer = optim.Adam(model.parameters(), lr=1e-3)
  185. """We then create the training loop..."""
  186. def train(model, iterator, optimizer, criterion):
  187. model.train()
  188. epoch_loss = 0
  189. for i, batch in enumerate(iterator):
  190. src = batch.src
  191. trg = batch.trg # trg = [trg_len, batch_size]
  192. # pred = [trg_len, batch_size, pred_dim]
  193. pred = model(src, trg)
  194. pred_dim = pred.shape[-1]
  195. # trg = [(trg len - 1) * batch size]
  196. # pred = [(trg len - 1) * batch size, pred_dim]
  197. trg = trg[1:].view(-1)
  198. pred = pred[1:].view(-1, pred_dim)
  199. loss = criterion(pred, trg)
  200. optimizer.zero_grad()
  201. loss.backward()
  202. optimizer.step()
  203. epoch_loss += loss.item()
  204. return epoch_loss / len(iterator)
  205. """...and the evaluation loop, remembering to set the model to `eval` mode and turn off teaching forcing."""
  206. def evaluate(model, iterator, criterion):
  207. model.eval()
  208. epoch_loss = 0
  209. with torch.no_grad():
  210. for i, batch in enumerate(iterator):
  211. src = batch.src
  212. trg = batch.trg # trg = [trg_len, batch_size]
  213. # output = [trg_len, batch_size, output_dim]
  214. output = model(src, trg, 0) # turn off teacher forcing
  215. output_dim = output.shape[-1]
  216. # trg = [(trg_len - 1) * batch_size]
  217. # output = [(trg_len - 1) * batch_size, output_dim]
  218. output = output[1:].view(-1, output_dim)
  219. trg = trg[1:].view(-1)
  220. loss = criterion(output, trg)
  221. epoch_loss += loss.item()
  222. return epoch_loss / len(iterator)
  223. """Finally, define a timing function."""
  224. def epoch_time(start_time, end_time):
  225. elapsed_time = end_time - start_time
  226. elapsed_mins = int(elapsed_time / 60)
  227. elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
  228. return elapsed_mins, elapsed_secs
  229. """Then, we train our model, saving the parameters that give us the best validation loss."""
  230. best_valid_loss = float('inf')
  231. for epoch in range(10):
  232. start_time = time.time()
  233. train_loss = train(model, train_iterator, optimizer, criterion)
  234. valid_loss = evaluate(model, valid_iterator, criterion)
  235. end_time = time.time()
  236. epoch_mins, epoch_secs = epoch_time(start_time, end_time)
  237. if valid_loss < best_valid_loss:
  238. best_valid_loss = valid_loss
  239. torch.save(model.state_dict(), 'tut3-model.pt')
  240. print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
  241. print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
  242. print(f'\t Val. Loss: {valid_loss:.3f} | Val. PPL: {math.exp(valid_loss):7.3f}')
  243. """Finally, we test the model on the test set using these "best" parameters."""
  244. model.load_state_dict(torch.load('tut3-model.pt'))
  245. test_loss = evaluate(model, test_iterator, criterion)
  246. print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')
  247. """We've improved on the previous model, but this came at the cost of doubling the training time.
  248. In the next notebook, we'll be using the same architecture but using a few tricks that are applicable to all RNN architectures - packed padded sequences and masking. We'll also implement code which will allow us to look at what words in the input the RNN is paying attention to when decoding the output.
  249. """

Dialogue systems

Knowledge-based dialogue systems (open domain)

Giving machines the ability to converse with people is an important and highly challenging task in artificial intelligence. In 1950, Turing's paper "Computing Machinery and Intelligence" proposed using human-machine conversation to test a machine's intelligence, which attracted wide attention from researchers. Since then, researchers have tried many approaches to building dialogue systems. According to their purpose, dialogue systems are divided into task-oriented systems for restricted domains and open-domain systems with no specific task. Restricted-domain dialogue systems are designed to complete particular tasks, such as website customer service or in-car assistants. Open-domain dialogue systems, also known as chatbots, are not driven by a task: they are built for casual chat or entertainment, and their goal is to generate meaningful and relevant responses.

Task-oriented dialogue systems