Recently, Word2Vec models have become very popular for their ability to represent word meaning as a vector of values learned from a word's associations with other context words, partially replacing earlier approaches such as latent semantic analysis.
One important training algorithm is Continuous Bag of Words (CBOW), in which a neural network classifier is trained on a window of the words surrounding a target word in the corpus to predict the missing word.
The other algorithm, Continuous Skip-gram, does the reverse: it predicts the surrounding words from a single focus word.
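For readers who want to see the two objectives in code, here is a minimal sketch using the gensim library (an assumption, not something the text prescribes); the toy corpus and parameter values are purely illustrative, and the `sg` flag switches between CBOW and Skip-gram.

```python
# A minimal sketch of training Word2Vec with gensim (assumes `pip install gensim`;
# the toy corpus and parameter values are illustrative, not from the original text).
from gensim.models import Word2Vec

# Each document is a list of tokens; a real corpus would be much larger.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

# sg=0 selects the Continuous Bag of Words objective (predict the target word
# from its context window); sg=1 selects Continuous Skip-gram (predict the
# context words from the target word).
cbow_model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

# The learned vectors live in model.wv, keyed by word.
print(cbow_model.wv["king"][:5])                  # first few dimensions of the "king" vector
print(skipgram_model.wv.similarity("king", "queen"))
```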
It turns out that this training causes the neural network to represent meaning, in the sense that the vectors learned for words can be arithmetically manipulated to derive the representations of related words. Famously,
vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")
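This analogy can be checked directly with pretrained vectors. The sketch below assumes gensim and its downloadable Google News embeddings, neither of which is mandated by the text, so treat it as one possible illustration.

```python
# A sketch of the "King - Man + Woman ≈ Queen" analogy using gensim's KeyedVectors.
# Assumes the pretrained Google News vectors are fetched via gensim's downloader;
# any sufficiently large pretrained embedding should behave similarly.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # a large download on first use

# most_similar adds the "positive" vectors, subtracts the "negative" ones, and
# returns the nearest words to the result (excluding the query words themselves).
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Expected to rank "queen" at or near the top.
```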
ACTION: If you wish to learn more about Word2Vec, read this 2013 paper, Efficient Estimation of Word Representations in Vector Space
Language Models
The success of word2vec sparked intense interest in building language models, which aim to provide a domain-independent representation of words (that is, of language) that can be used in many different language tasks. These language models typically comprise both a trained neural network and *embeddings*, where embeddings are vectors of the form produced by word2vec: for each word, a vector of values that represents the word's meaning via its relationship to other words in the training text.
- The new representation (the trained neural network and embeddings) has features that make it easy to feed into ML pipelines. In particular, labelled training data is not required.
- A language model predicts the next word given a sequence of words (a minimal sketch follows this list).
    - Input: A sequence of words x_1, …, x_{n-1}
    - Output: Predict P(x_n | x_1, …, x_{n-1})
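To make the prediction task concrete, here is a deliberately simple sketch that estimates next-word probabilities from bigram counts. Real language models use neural networks over the whole prefix, so this only illustrates the interface, not how modern models work; the toy corpus is an assumption for the example.

```python
# A minimal sketch of the P(x_n | x_1, ..., x_{n-1}) idea using bigram counts.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each preceding word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probs(prev_word):
    """Probability of each candidate next word, given only the previous word."""
    counts = following[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# This toy model only conditions on the last word, whereas a full language model
# conditions on the whole prefix x_1 ... x_{n-1}.
print(next_word_probs("the"))   # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```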
Applications of language models include:
- Completing search queries
- Suggesting the next word on keyboards
How language models are built:
- Deep learning algorithms, especially Recurrent Neural Networks and Transformers, are common methods for building language models (a minimal sketch follows this list).
- Pre-trained language models are available, which can then be tuned for use in different downstream tasks.
- The off-the-shelf pretrained models vary with respect to:
    - Data that was used for training them
    - Learning network architectures
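As promised above, here is a minimal sketch of the recurrent flavour of such a model in PyTorch; the layer sizes, vocabulary, and input tokens are illustrative assumptions, and a Transformer-based model would expose essentially the same interface.

```python
# A minimal sketch of a recurrent language model in PyTorch (assumes `pip install torch`;
# vocabulary size, dimensions, and input data are illustrative placeholders).
import torch
import torch.nn as nn

class TinyRNNLanguageModel(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)      # word id -> vector
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)             # hidden state -> next-word scores

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        hidden_states, _ = self.rnn(embedded)         # (batch, seq_len, hidden_dim)
        return self.head(hidden_states)               # (batch, seq_len, vocab_size) logits

model = TinyRNNLanguageModel()
tokens = torch.randint(0, 1000, (1, 5))               # one sequence of 5 token ids
logits = model(tokens)
# A distribution over the next word after the last position, i.e. P(x_n | x_1, ..., x_{n-1}).
next_word_probs = torch.softmax(logits[:, -1, :], dim=-1)
print(next_word_probs.shape)                          # torch.Size([1, 1000])
```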
Language models are domain-independent, which means they are not trained for any specific task. The learned (pre-trained) language model can be the input to a variety of specific problems such as sentiment analysis and language translation.
For a specific problem, we might have some labelled data that we use to learn a domain-dependent model. This data can be much smaller than the data used to pre-train the language model. This task-dependent stage is called few-shot learning, to recognise the small data requirement, or tuning, to recognise the minor adjustments made to the pre-trained model for the task at hand.
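The sketch below illustrates this task-dependent stage under some assumptions not in the original text: it uses the Hugging Face transformers library with bert-base-uncased, keeps the pre-trained encoder frozen rather than fully fine-tuning it, and trains only a small scikit-learn classifier on a handful of made-up labelled sentiment examples.

```python
# A sketch of adapting a pre-trained language model to a downstream task with a
# small labelled dataset. Requires `transformers`, `torch`, and `scikit-learn`;
# the model name, example sentences, and labels are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

texts = ["I loved this film", "What a waste of time", "Brilliant acting", "Terribly boring"]
labels = [1, 0, 1, 0]   # 1 = positive sentiment, 0 = negative

def embed(sentences):
    """Return one pooled vector per sentence from the frozen pre-trained encoder."""
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**batch)
    return outputs.last_hidden_state[:, 0, :].numpy()   # the [CLS] token vector

# The task-specific ("domain-dependent") part is just this small classifier.
classifier = LogisticRegression(max_iter=1000).fit(embed(texts), labels)
# Likely [1] (positive), though with so few training examples results may vary.
print(classifier.predict(embed(["An absolute delight"])))
```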
Google BERT and GPT-3 from OpenAI are two state-of-the-art language models.
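For a hands-on feel, the sketch below loads small, freely available relatives of these models through the Hugging Face transformers library (an assumption; GPT-3 itself is only reachable through OpenAI's API, so GPT-2 stands in for it here).

```python
# A short sketch of using off-the-shelf pre-trained models via Hugging Face
# transformers (assumed installed; models are downloaded on first use).
from transformers import pipeline

# BERT is trained to fill in masked words using both left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])   # e.g. "paris"

# GPT-style models repeatedly predict the next word to generate text.
generator = pipeline("text-generation", model="gpt2")
print(generator("Language models can be used to", max_new_tokens=20)[0]["generated_text"])
```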