title: Word embeddings in 2020
subtitle: A brief overview of current word embedding methods: from Word2vec to Transformers
date: 2020-08-10
author: NSX
Reposted from https://colab.research.google.com/drive/1N7HELWImK9xCYheyozVP3C_McbiRo1nb

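This article gives a (very) short description of each word embedding method, links for further study, and code examples in Python. All code is packaged as a Google Colab Notebook.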


One-hot or CountVectorizing



    from sklearn.feature_extraction.text import CountVectorizer
    import seaborn as sns

    # create CountVectorizer object
    vectorizer = CountVectorizer()
    corpus = [
        'Text of the very first new sentence with the first words in sentence.',
        'Text of the second sentence.',
        'Number three with lot of words words words.',
        'Short text, less words.',
    ]
    # learn the vocabulary and store the CountVectorizer sparse matrix in term_frequencies
    term_frequencies = vectorizer.fit_transform(corpus)
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    vocab = vectorizer.get_feature_names_out()
    # convert sparse matrix to numpy array
    term_frequencies = term_frequencies.toarray()
    # visualize term frequencies
    sns.heatmap(term_frequencies, annot=True, cbar=False, xticklabels=vocab)

    # binary=True records only whether a word is present, not how often it occurs
    one_hot_vectorizer = CountVectorizer(binary=True)
    one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
    sns.heatmap(one_hot, annot=True, cbar=False, xticklabels=vocab)

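In a large collection of documents, words such as "a", "the", and "is" occur very often but carry little information. One-hot encoding cannot express how important such words are. One remedy is stop-word filtering, but that solution is discrete and inflexible: a word is either kept or dropped.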
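As a rough illustration of why stop-word filtering is an all-or-nothing decision, here is a minimal sketch in plain Python. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names made up for this example; real stop-word lists are much longer.

```python
# A hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "with"}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop tokens found in STOP_WORDS."""
    tokens = [token.strip(".,") for token in text.lower().split()]
    return [token for token in tokens if token and token not in STOP_WORDS]


filtered = remove_stop_words("Text of the second sentence.")
print(filtered)  # ['text', 'second', 'sentence']
```

Every word is either removed completely or kept with full weight; there is no way to say that a word matters "a little".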

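TF-IDF (term frequency-inverse document frequency) handles this problem better. It lowers the weight of common words and raises the weight of rare words that appear only in the current document. The TF-IDF formula is: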

$$\mathrm{tfidf}(term, document) = \mathrm{tf}(term, document) \cdot \mathrm{idf}(term)$$

Here TF (term frequency) is the number of occurrences $n_i$ of the term in the document, divided by the total count of all $W$ terms in that document:

$$\mathrm{tf}(term, document) = \frac{n_i}{\sum_{k=1}^{W} n_k}$$

IDF (inverse document frequency) can be read as the inverse of the number of documents in which the term occurs: $N$ is the total number of documents and $n_t$ is the number of documents that contain the current term $t$.

$$\mathrm{idf}(term) = \log \frac{N}{n_t}$$
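To make the formulas above concrete, the following sketch computes them by hand in plain Python. It assumes the natural logarithm and pre-tokenized documents; the helper names `tf`, `idf`, and `tfidf` are chosen for this example only. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math


def tf(term, document):
    """Occurrences of `term` divided by the total number of words in `document`."""
    return document.count(term) / len(document)


def idf(term, documents):
    """log(N / n_t), where n_t is the number of documents containing `term`.

    Raises ZeroDivisionError if no document contains the term.
    """
    n_t = sum(1 for document in documents if term in document)
    return math.log(len(documents) / n_t)


def tfidf(term, document, documents):
    return tf(term, document) * idf(term, documents)


documents = [
    "time flies like an arrow".split(),
    "fruit flies like a banana".split(),
]

print(tfidf("flies", documents[0], documents))            # 0.0
print(round(tfidf("arrow", documents[0], documents), 4))  # 0.1386
```

"flies" occurs in both documents, so its IDF is log(2/2) = 0 and its weight vanishes, while "arrow" occurs in only one document and keeps a positive weight.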

    from sklearn.feature_extraction.text import TfidfVectorizer
    import seaborn as sns

    corpus = [
        'Time flies like an arrow.',
        'Fruit flies like a banana.'
    ]
    tfidf_vectorizer = TfidfVectorizer()
    tfidf = tfidf_vectorizer.fit_transform(corpus).toarray()
    # Take the column labels from the fitted vectorizer instead of hard-coding them.
    # The default tokenizer drops single-character tokens such as 'a', which gives
    # ['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time'].
    vocab = tfidf_vectorizer.get_feature_names_out()
    sns.heatmap(tfidf, annot=True, cbar=False, xticklabels=vocab)

Word2Vec and GloVe


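Word2Vec word embeddings are vector representations of words that an unsupervised model learns from a large amount of input text (e.g. Wikipedia, science, news, articles). These representations capture the semantic similarity between words. The embeddings are learned so that the vectors of words with close meanings (e.g. "king" and "queen") lie closer together than the vectors of words with completely different meanings (e.g. "king" and "carpet").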

    # Download Google Word2Vec embeddings https://code.google.com/archive/p/word2vec/
    !wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
    !gunzip GoogleNews-vectors-negative300.bin.gz
    # Try Word2Vec with Gensim
    import gensim
    # Load pretrained vectors from Google
    model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    king = model['king']
    # king - man + woman = queen
    print(model.most_similar(positive=['woman', 'king'], negative=['man']))
    print(model.similarity('woman', 'man'))
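The similarity and analogy queries above boil down to cosine similarity between vectors. The sketch below shows the idea in plain Python on made-up 3-dimensional vectors; `toy_vectors`, `cosine_similarity`, and `analogy` are hypothetical names for this illustration and are not part of Gensim, and real Word2Vec vectors have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings"; axes loosely mean: royalty, male, female.
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "carpet": [0.0, 0.1, 0.1],
}


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


print(analogy(["king", "woman"], ["man"], toy_vectors))  # queen
```

With these toy values, "king" is also more similar to "queen" than to "carpet", which mirrors the behaviour described for the learned embeddings.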

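Another word embedding method is GloVe ("Global Vectors"). It is a matrix factorization technique based on the word-context matrix. It first builds a large matrix of (words x contexts) co-occurrence information: for each "word" (row), you count how often that word appears in some "context" (column) across a large corpus. This matrix is then factorized into a lower-dimensional (words x features) matrix, in which each row holds the vector representation of one word. This is usually done by minimizing a "reconstruction loss", which tries to find the lower-dimensional representation that explains most of the variance in the high-dimensional data.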
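The first step described above, building the (words x contexts) co-occurrence counts, can be sketched in a few lines of plain Python. This is only the counting step under simple assumptions (whitespace tokenization, a symmetric window, unweighted counts); the factorization step is omitted. `cooccurrence_counts` and `toy_corpus` are hypothetical names for this example.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context) pair occurs within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts


toy_corpus = [
    "the king sits on the throne",
    "the queen sits on the throne",
]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("king", "sits")])  # 1
print(counts[("sits", "the")])   # 4
```

Each distinct word corresponds to one row of the matrix and each context word to one column; `counts` stores only the non-zero cells.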

    # Try GloVe word embeddings with spaCy
    !python3 -m spacy download en_core_web_lg
    import spacy
    # Load the spaCy model that you have installed
    import en_core_web_lg
    nlp = en_core_web_lg.load()
    # process a sentence using the model
    doc = nlp("man king stands on the carpet and sees woman queen")

Find the similarity between King and Queen (higher is better).

    doc[1].similarity(doc[9])
    # 0.72526103

Find similarity between King and carpet.

    doc[1].similarity(doc[5])
    # 0.20431946

Check if king - man + woman = queen. We multiply the vectors for "man" and "woman" by two, because subtracting the vector for "man" once and adding the vector for "woman" once changes the original vector for "king" very little, likely because "man" and "woman" are themselves closely related.

    import numpy as np
    from scipy.spatial import distance

    # king - 2 * man + 2 * woman
    v = doc[1].vector - (doc[0].vector * 2) + (doc[8].vector * 2)
    # Format the vocabulary for use in the distance function
    vectors = [token.vector for token in doc]
    vectors = np.array(vectors)
    # Find the closest word below
    closest_index = distance.cdist(np.expand_dims(v, axis=0), vectors, metric='cosine').argmin()
    output_word = doc[closest_index].text
    print(output_word)
    # queen


FastText is an extension of word2vec developed by Tomas Mikolov's team (he created the word2vec framework in 2013).


    # The original command passed --install-option="--no-cython-compile";
    # pip 23.1 removed --install-option, so a plain install is used here.
    !pip install Cython
    !pip install fasttext
    # download pre-trained word vectors for one of 157 languages https://fasttext.cc/docs/en/crawl-vectors.html
    # it will take some time, about 5 minutes
    import fasttext
    import fasttext.util
    fasttext.util.download_model('en', if_exists='ignore')  # English
    ft = fasttext.load_model('cc.en.300.bin')


    ft.get_word_vector('king')


    ft.get_nearest_neighbors('king')

    ft.get_nearest_neighbors('king-warrior')
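The last query works even for a compound such as 'king-warrior' that is unlikely to be stored as a single vocabulary entry, because fastText composes word vectors from character n-grams. The sketch below only illustrates how such n-grams can be enumerated; `char_ngrams` is a hypothetical helper, not a function of the `fasttext` package, and the 3 to 6 character range is an assumption based on the library's usual defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


ngrams = char_ngrams("king", min_n=3, max_n=4)
print(ngrams)
# ['<ki', 'kin', 'ing', 'ng>', '<kin', 'king', 'ing>']
```

A word that was never seen during training still shares many of these n-grams with known words, so a vector can be assembled for it.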

ELMo (Embeddings from Language Models)



ELMo is a deep contextualized word representation that models both (1) complex characteristics of the word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.

    # use TensorFlow 1.x for ELMo, because there is still no ELMo for TensorFlow 2.0
    %tensorflow_version 1.x
    import tensorflow_hub as hub
    import tensorflow as tf
    # Download pretrained ELMo model from TensorFlow Hub https://tfhub.dev/google/elmo/3
    elmo = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True)
    # Adjacent string literals are joined, so the first sentence stays one string.
    sentences = [
        'king arthur, also called arthur or aathur pendragon, legendary british king who appears in a cycle of '
        'medieval romances (known as the matter of britain) as the sovereign of a knightly fellowship of the round table.',
        'it is not certain how these legends originated or whether the figure of arthur was based on a historical person.',
        'the legend possibly originated either in wales or in those parts of northern britain inhabited by brythonic-speaking celts.',
        'for a fuller treatment of the stories about king arthur, see also arthurian legend.',
    ]

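To feed the sentences to the model, we split them into arrays of words and pad the arrays to the same length. We also create a "mask" array that records whether each element is a real word or a padding symbol ("_" in our example). We will use the mask array later, during visualization, to show only the words that really exist.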

    words = []
    mask = []
    masked_words = []
    # Pad every sentence to 36 words, the length of the longest sentence.
    for sent in sentences:
        splitted = sent.split()
        for i in range(36):
            try:
                words.append(splitted[i])
            except IndexError:
                words.append('_')
    # Mark real words with True and padding symbols with False.
    for word in words:
        if word == "_":
            mask.append(False)
        else:
            mask.append(True)
            masked_words.append(word)


    embeddings = elmo(
        sentences,
        signature="default",
        as_dict=True)["elmo"]


    %%time
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        x = sess.run(embeddings)
    # Flatten to one 1024-dimensional row per (sentence, position) and drop the padding rows.
    embs = x.reshape(-1, 1024)
    masked_embs = embs[mask]


    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
    import plotly.graph_objs as go

    # Reduce to 10 dimensions with PCA first, then to 2 dimensions with t-SNE.
    pca = PCA(n_components=10)
    y = pca.fit_transform(masked_embs)
    y = TSNE(n_components=2).fit_transform(y)

    data = [
        go.Scatter(
            x=[i[0] for i in y],
            y=[i[1] for i in y],
            mode='markers',
            text=[i for i in masked_words],
            marker=dict(
                size=16,
                color=[len(i) for i in masked_words],  # set color equal to a variable
                opacity=0.8,
                colorscale='Viridis',
                showscale=False
            )
        )
    ]
    layout = dict(
        yaxis=dict(zeroline=False),
        xaxis=dict(zeroline=False)
    )
    fig = go.Figure(data=data, layout=layout)
    fig.show()

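Finally, it is time for the current state of the art: Transformers. The well-known GPT-2, BERT, and CTRL are all based on Transformers and produce context-dependent word embeddings. Unlike ELMo, however, Transformers do not use RNNs, so they do not need to process the words of a sentence sequentially, one after another. All words in a sentence are processed in parallel, which speeds up processing and addresses the vanishing gradient problem.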

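Transformers use the attention mechanism to describe the connections and dependencies of each specific word with all other words in the sentence. Jay Alammar explains this mechanism and the main principles of Transformers in detail in a beautifully illustrated article.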
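As a rough sketch of how attention relates every word to all other words, the following plain-Python example implements scaled dot-product self-attention on made-up 2-dimensional token vectors. It is heavily simplified: queries, keys, and values are all the raw vectors, whereas real Transformers apply learned projections and several attention heads. `softmax`, `self_attention`, and `toy_tokens` are hypothetical names for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = `vectors`."""
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # One score per token in the sentence, including the query token itself.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        output = [sum(w * value[d] for w, value in zip(weights, vectors))
                  for d in range(len(query))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(toy_tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one token attends to every token in the sentence. No step depends on the previous token, which is why all positions can be computed in parallel.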

For our example we will use the open-source Transformers library from Hugging Face, which contains up-to-date Transformer-based models (e.g. BERT, XLNet, DialoGPT, GPT-2).


    !pip install transformers

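Now we import PyTorch, the pretrained BERT model, and a BERT tokenizer, which does all the work needed to convert sentences into a format suitable for BERT (tokenizing the text and adding special tokens such as [SEP] and [CLS]).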

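    import logging

    import matplotlib.pyplot as plt
    import torch
    from transformers import BertTokenizer, BertModel

    %matplotlib inline
    torch.manual_seed(0)
    # Load pre-trained model tokenizer (vocabulary)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    # Load the pre-trained model weights. This step is missing from the listing, but the
    # later cells call `model`; output_hidden_states=True is assumed because they read outputs[2].
    model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
    model.eval()  # inference mode, disables dropout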


    # Adjacent string literals are joined, so the first sentence stays one string.
    sentences = [
        'king arthur, also called arthur or aathur pendragon, legendary british king who appears in a cycle of '
        'medieval romances (known as the matter of britain) as the sovereign of a knightly fellowship of the round table.',
        'it is not certain how these legends originated or whether the figure of arthur was based on a historical person.',
        'the legend possibly originated either in wales or in those parts of northern britain inhabited by brythonic-speaking celts.',
        'for a fuller treatment of the stories about king arthur, see also arthurian legend.',
    ]
    # Print the original sentence.
    print(' Original: ', sentences[0][:99])
    # Print the sentence split into tokens.
    print('Tokenized: ', tokenizer.tokenize(sentences[0])[:15])
    # Print the sentence mapped to token ids.
    print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0]))[:15])

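Note that some tokens may look like this: ['aa', '##th', '##ur', 'pen', '##dra', '##gon']. This is because the BERT tokenizer was created with a WordPiece model. This model greedily builds a fixed-size vocabulary of the individual characters, subwords, and words that best fit the language data. The vocabulary used by the BERT tokenizer contains all English characters plus roughly 30,000 of the most common words and subwords found in the English corpus the model was trained on. A word that is not in the vocabulary is therefore split into subwords and characters. The two hash signs (##) before some subwords indicate that the subword is part of a larger word and is preceded by another subword.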
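The way an out-of-vocabulary word falls apart into pieces can be sketched with a greedy longest-match-first lookup. This is a simplified illustration of the lookup side only, not of how the vocabulary is built, and not the library's actual implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical, and the real BERT vocabulary is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of `word` into pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this sketch).
toy_vocab = {"king", "aa", "##th", "##ur", "pen", "##dra", "##gon"}
print(wordpiece_tokenize("aathur", toy_vocab))     # ['aa', '##th', '##ur']
print(wordpiece_tokenize("pendragon", toy_vocab))  # ['pen', '##dra', '##gon']
print(wordpiece_tokenize("king", toy_vocab))       # ['king']
```

A word that is in the vocabulary stays whole, while the misspelled "aathur" is covered by three smaller pieces.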

We will use the tokenizer.encode_plus function, which:

  • splits the sentence into tokens
  • adds the special [CLS] and [SEP] tokens
  • maps the tokens to their IDs
  • pads or truncates all sentences to the same length

    # Tokenize all of the sentences and map tokens to word IDs.
    input_ids = []
    attention_masks = []
    tokenized_texts = []
    for sent in sentences:
        encoded_dict = tokenizer.encode_plus(
            sent,
            add_special_tokens=True,
            truncation=True,
            max_length=48,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_tensors='pt',
        )
        # Save tokens from sentence as a separate array.
        marked_text = "[CLS] " + sent + " [SEP]"
        tokenized_texts.append(tokenizer.tokenize(marked_text))
        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])
    # Convert the list into tensor.
    input_ids = torch.cat(input_ids, dim=0)

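Segment IDs. BERT is trained on sentence pairs and uses 1s and 0s to tell the two sentences apart. We encode each sentence separately, so we mark every token in every sentence with 1.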
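For a sentence pair, the segment IDs follow the layout sketched below: 0 for the first sentence (including [CLS] and its [SEP]) and 1 for the second. `build_segment_ids` is a hypothetical helper written for this illustration, not a function of the Transformers library. For a lone sentence this sketch assigns 0 throughout, whereas the article's code uses 1; in both cases all tokens share a single ID.

```python
def build_segment_ids(tokens_a, tokens_b=None):
    """Return tokens with [CLS]/[SEP] added and one segment ID per token."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_segment_ids(["king", "arthur"], ["round", "table"])
print(tokens)       # ['[CLS]', 'king', 'arthur', '[SEP]', 'round', 'table', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 1]
```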

    segments_ids = torch.ones_like(input_ids)


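    with torch.no_grad():
        # Pass the segment IDs by name: the second positional argument of
        # BertModel.forward is attention_mask, not token_type_ids.
        outputs = model(input_ids, token_type_ids=segments_ids)
        # The third element holds the hidden states of all layers
        # (requires a model loaded with output_hidden_states=True).
        hidden_states = outputs[2]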

Let’s examine what we’ve got

    print("Number of layers:", len(hidden_states), " (initial embeddings + 12 BERT layers)")
    print("Number of batches:", len(hidden_states[0]))
    print("Number of tokens:", len(hidden_states[0][0]))
    print("Number of hidden units:", len(hidden_states[0][0][0]))
    # 13
    # 4
    # 48
    # 768

    # Stack the tensors for all layers into one tensor: [layers, sentences, tokens, features]
    token_embeddings = torch.stack(hidden_states, dim=0)
    # Swap dimensions, so we get tensors in format: [sentence, tokens, hidden layers, features]
    token_embeddings = token_embeddings.permute(1, 2, 0, 3)


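    # Keep only the last four layers (indices 9 to 12 of the 13 hidden states).
    processed_embeddings = token_embeddings[:, :, 9:, :]


    # Concatenate the four layers of each token: 4 * 768 = 3072 features per token.
    embeddings = torch.reshape(processed_embeddings, (4, 48, -1))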
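The slice and reshape above amount to concatenating, for every token, the vectors of the last four layers. A minimal plain-Python sketch of that operation for a single token follows; `concat_last_layers` and `toy_layers` are hypothetical, and the toy vectors have 3 features instead of BERT-base's 768.

```python
def concat_last_layers(layers_for_token, last_n=4):
    """Concatenate the vectors of the final `last_n` layers into one flat list."""
    combined = []
    for layer_vector in layers_for_token[-last_n:]:
        combined.extend(layer_vector)
    return combined


# Hypothetical token with 13 layers of 3 features each; layer k holds the value k.
toy_layers = [[float(layer)] * 3 for layer in range(13)]
token_vector = concat_last_layers(toy_layers)
print(len(token_vector))                    # 12
print(token_vector[:3], token_vector[-3:])  # [9.0, 9.0, 9.0] [12.0, 12.0, 12.0]
```

With 768 features per layer the same operation yields 4 x 768 = 3072 values per token.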


    # List the tokens of the first sentence with their positions.
    for i, token_str in enumerate(tokenized_texts[0]):
        print(i, token_str)

    from scipy.spatial.distance import cosine

    # Positions in the first sentence: 1 and 17 are the two "king" tokens, 2 is "arthur", 46 is "table".
    kings = cosine(embeddings[0][1], embeddings[0][17])
    king_table = cosine(embeddings[0][1], embeddings[0][46])
    king_arthur = cosine(embeddings[0][2], embeddings[0][1])
    print('Distance for two kings: %.2f' % kings)
    print('Distance from king to table: %.2f' % king_table)
    print('Distance from Arthur to king: %.2f' % king_arthur)
    # Output reported in the original notebook (exact values may vary with library versions):
    # 0.21
    # 0.73
    # 0.40


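Using the simplerepresentations module may be simpler. This module does all the work we did above, extracting the required hidden states from BERT, and creates the embeddings in a few lines of code.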

    !pip install simplerepresentations
    import torch
    from simplerepresentations import RepresentationModel
    torch.manual_seed(0)
    model_type = 'bert'
    model_name = 'bert-base-uncased'
    representation_model = RepresentationModel(
        model_type=model_type,
        model_name=model_name,
        batch_size=4,
        max_seq_length=48,
        combination_method='cat',
        last_hidden_to_use=4
    )
    text_a = sentences
    all_sentences_representations, all_tokens_representations = representation_model(text_a=text_a)

Check the distances between Arthur, king and table.

    from scipy.spatial.distance import cosine

    kings = cosine(all_tokens_representations[0][1], all_tokens_representations[0][17])
    king_table = cosine(all_tokens_representations[0][1], all_tokens_representations[0][46])
    king_arthur = cosine(all_tokens_representations[0][2], all_tokens_representations[0][1])
    print('Distance for two kings: %.2f' % kings)
    print('Distance from king to table: %.2f' % king_table)
    print('Distance from Arthur to king: %.2f' % king_arthur)
    # Output reported in the original notebook (exact values may vary with library versions):
    # 0.21
    # 0.73
    # 0.40



