Goal: assign a probability to a word sequence
− Speech recognition:
• P(I ate a cherry) > P(Eye eight Jerry)
− Spelling correction:
• P(Australian National University) > P(Australian National Univerisity)
− Collocation error correction:
• P(high wind) > P(large wind)
− Machine Translation:
• P(The magic of Usain Bolt on show…) > P(The magic of Usain Bolt at the show …)
− Question-answering, summarization, etc.

Probabilistic Language Modelling

▪ A language model computes the probability of a sequence of words:
P(W) = P(w_1, w_2, …, w_n)
▪ Related task: the probability of an upcoming word:
P(w_n | w_1, w_2, …, w_{n-1})
▪ LM: a model that computes either P(W) or P(w_n | w_1, …, w_{n-1})

How to Compute P(W)

Apply the chain rule:
P(w_1, w_2, …, w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) … P(w_n | w_1, …, w_{n-1})
Compute:

P(John Smith’s hotel room bugged) = P(John)P(Smith’s | John)P(hotel | John Smith’s) … P(bugged | John Smith’s hotel room)

Estimate the Probabilities

▪ P(bugged | John Smith’s hotel room) will most likely be estimated as 0, because such a specific long sequence is unlikely to appear in the training set!
▪ P(bugged | John Smith’s hotel room) =
count(John Smith’s hotel room bugged) / count(John Smith’s hotel room)
▪ Most likely to result in 0 counts!
▪ So, we need to simplify the calculations!

Markov Assumption (simplifying assumption)

▪ Simplification:
P(bugged | John Smith’s hotel room) ≈ P(bugged | room)
OR
P(bugged | John Smith’s hotel room) ≈ P(bugged | hotel room)
▪ First-order Markov assumption:
P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-1})
▪ Assumes only the most recent word(s) are relevant!

Unigram Model

▪ Zero-order Markov assumption:
P(w_1, w_2, …, w_n) ≈ ∏_i P(w_i)
▪ Examples generated from a unigram model read like words drawn independently at random (see “Approximating Shakespeare” below)
▪ Not very good!

Bigram Model

▪ First-order assumption:
P(w_1, w_2, …, w_n) ≈ ∏_i P(w_i | w_{i-1})
▪ P(I want to eat Chinese food) ≈ P(I) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
▪ Estimate bigram probabilities from a training corpus

Bigram Counts from the Berkeley Restaurant Corpus

[table of bigram counts from the Berkeley Restaurant Project corpus]

Compute Bigram Probabilities

P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
In practice, work with log probabilities: logs help numerical stability, because products of many small probabilities underflow.
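A minimal Python sketch (not from the notes) of estimating MLE bigram probabilities from counts and scoring a sentence in log space; the toy corpus and function names are invented for illustration:

    from collections import Counter
    import math

    # Toy corpus for illustration; in practice the counts come from a large training corpus.
    corpus = [
        "i want to eat chinese food",
        "i want chinese food",
        "i want to spend money",
    ]

    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def bigram_logprob(prev, word):
        """MLE estimate: log P(word | prev) = log count(prev, word) - log count(prev)."""
        if bigrams[(prev, word)] == 0:
            return float("-inf")  # unseen bigram: probability 0 under plain MLE
        return math.log(bigrams[(prev, word)]) - math.log(unigrams[prev])

    def sentence_logprob(sentence):
        """Sum of bigram log-probabilities; summing logs avoids numerical underflow."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(bigram_logprob(p, w) for p, w in zip(tokens, tokens[1:]))

    print(sentence_logprob("i want to eat chinese food"))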

Trigram Models

▪ Second-order Markov assumption:
P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})
▪ Long-distance dependencies of language are still not captured
[example sentence illustrating a long-distance dependency]
▪ We can extend to 4-grams, 5-grams, …

Sequence Generation

▪ Compute conditional probabilities:
− P(want | I) = 0.32
− P(to | I) = 0
− P(eat | I) = 0.004
− P(Chinese | I) = 0
− P(I | I) = 0.002
− …
▪ Sampling:
− Ensures you don’t just get the same sentence all the time!
− Generate a random number in [0, 1] from a uniform distribution and map it onto the cumulative distribution of the P’s
− Words that are a better fit are more likely to be selected! (See the sketch below.)
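A sketch (not from the notes) of sampling the next word from a bigram distribution; the counts below are hypothetical and the helper name is invented:

    import random
    from collections import Counter

    # Hypothetical bigram counts, keyed by the previous word (illustration only).
    bigram_counts = {
        "I": Counter({"want": 32, "eat": 4, "I": 2, "am": 12}),
    }

    def sample_next(prev):
        """Draw u ~ Uniform[0, 1] and walk the cumulative distribution over next words."""
        counts = bigram_counts[prev]
        total = sum(counts.values())
        u = random.random()
        cumulative = 0.0
        for word, c in counts.items():
            cumulative += c / total  # higher-probability words cover a larger interval
            if u <= cumulative:
                return word
        return word  # guard against floating-point rounding

    print(sample_next("I"))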

Approximating Shakespeare

▪ Generate sentences from a unigram model:
− Every enter now severally so, let
− Hill he late speaks; or! a more to leg less first you enter
▪ From a bigram model:
− What means, sir. I confess she? then all sorts, he is trim, captain
− Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry
▪ From a trigram model:
− Sweet prince, Falstaff shall die
− This shall forbid it should be branded, if renown made it empty

The Perils of Overfitting

▪ P(I want to eat Chinese food) = ? when count(Chinese food) = 0: if any n-gram has a zero count, the whole sentence gets probability 0
▪ In practice, the test corpus is often different from the training corpus
▪ Unknown words!
▪ Unseen n-grams!

Interpolation

▪ Key idea: mix lower-order n-gram probabilities
▪ Lambdas are hyperparameters
▪ For bigram models:
P̂(w_i | w_{i-1}) = λ P(w_i | w_{i-1}) + (1 − λ) P(w_i)
▪ For trigram models:
P̂(w_i | w_{i-2}, w_{i-1}) = λ_1 P(w_i | w_{i-2}, w_{i-1}) + λ_2 P(w_i | w_{i-1}) + λ_3 P(w_i),  where λ_1 + λ_2 + λ_3 = 1
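A sketch of linear interpolation for a trigram model; the `counts` layout (Counters under the keys 'unigram', 'bigram', 'trigram' plus a token total 'N') and the lambda values are assumptions for the example:

    def interpolated_trigram_prob(w2, w1, w, counts, lambdas=(0.6, 0.3, 0.1)):
        """Mix trigram, bigram and unigram MLE estimates; the lambdas must sum to 1."""
        l3, l2, l1 = lambdas
        tri = counts["trigram"][(w2, w1, w)] / counts["bigram"][(w2, w1)] if counts["bigram"][(w2, w1)] else 0.0
        bi = counts["bigram"][(w1, w)] / counts["unigram"][w1] if counts["unigram"][w1] else 0.0
        uni = counts["unigram"][w] / counts["N"]
        return l3 * tri + l2 * bi + l1 * uni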

How to Set the Lambdas?

▪ Estimate on held-out data
[diagram: training data / held-out data / test data split]
▪ Typically, Expectation Maximization (EM) is used
▪ One crude approach (Collins, course notes, 2013), sketched in code below:
λ(w_{i-1}) = count(w_{i-1}) / (count(w_{i-1}) + γ)
• Ensures λ is larger when the count is larger.
• A different lambda for each n-gram context.
• Only one parameter to estimate (γ).
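A one-line sketch of the crude approach; the value of γ shown is an arbitrary placeholder, not a recommendation from the notes:

    def crude_lambda(context_count, gamma=5.0):
        """Collins-style heuristic: lambda grows with the context count; gamma is the
        single hyperparameter to estimate (5.0 here is just a placeholder)."""
        return context_count / (context_count + gamma)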

Absolute-Discounting Interpolation

We tend to systematically overestimate n-gram counts:
[table comparing n-gram counts in a training corpus with their average counts in a held-out corpus; the held-out counts are consistently lower]
We can use this observation to get better estimates and to set the lambdas, using discounting.

Absolute-Discounting Interpolation

▪ Involves the interpolation of lower- and higher-order models
▪ Aims to deal with sequences that occur infrequently
▪ Subtract d (the discount) from each of the observed counts:
P_AbsDiscount(w_i | w_{i-1}) = (count(w_{i-1}, w_i) − d) / count(w_{i-1}) + λ(w_{i-1}) P(w_i)
▪ The discount reserves some probability mass for the lower-order n-grams
▪ Thus, the discount determines the lambda; for 0 < d < 1 we have:
λ(w_{i-1}) = (d / count(w_{i-1})) × |{w : count(w_{i-1}, w) > 0}|
▪ Typically, d = 0.75 is used
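A sketch of absolute-discounting interpolation, assuming the `bigrams` and `unigrams` Counters from the earlier sketch and d = 0.75 as on the slide:

    def abs_discount_prob(prev, word, bigrams, unigrams, total_tokens, d=0.75):
        """Subtract d from each observed bigram count and give the reserved mass
        to the unigram distribution, weighted by lambda(prev)."""
        prev_count = unigrams[prev]
        if prev_count == 0:
            return unigrams[word] / total_tokens  # nothing seen after prev: fall back to unigram
        discounted = max(bigrams[(prev, word)] - d, 0.0) / prev_count
        followers = sum(1 for (p, _w) in bigrams if p == prev)  # distinct word types seen after prev
        lam = (d / prev_count) * followers
        return discounted + lam * (unigrams[word] / total_tokens)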

Kneser-Ney Smoothing

▪ If there are only a few words that come after a context, then a novel word in that context should be less likely
▪ But we also expect that if a word appears after a small number of contexts, then it should be less likely to appear in a novel context…

Consider the word: Francisco
Might be common because: San Francisco is a common term.
But Francisco occurs in very few contexts.

▪ Provides better estimates for probabilities of lower-order unigrams:
− I can’t see without my reading __?
− “Francisco” is more common than “glasses”, say, … but “Francisco” always follows “San”!
▪ P_continuation(w): how likely is w to appear as a continuation in a new context?
− For each word w, count the number of bigram types it completes:
|{w_{i-1} : count(w_{i-1}, w) > 0}|
i.e. the continuation (unigram) probability is proportional to the number of different words it follows:
P_continuation(w) = |{w_{i-1} : count(w_{i-1}, w) > 0}| / |{(w_{j-1}, w_j) : count(w_{j-1}, w_j) > 0}|
P_KN(w_i | w_{i-1}) = max(count(w_{i-1}, w_i) − d, 0) / count(w_{i-1}) + λ(w_{i-1}) P_continuation(w_i)
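A sketch of the continuation probability, again assuming a `bigrams` Counter over (previous word, word) pairs as in the earlier sketches:

    def continuation_prob(word, bigrams):
        """Kneser-Ney continuation probability:
        (# distinct left contexts of word) / (# distinct bigram types overall)."""
        contexts_of_word = sum(1 for (prev, w) in bigrams if w == word)
        total_bigram_types = len(bigrams)
        return contexts_of_word / total_bigram_types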

▪ “Smoothing” means adjusting low probabilities (such as zero probabilities) upwards and high probabilities downwards, i.e. adjusting the MLE to produce more accurate probabilities

▪ Works very well; also used to improve neural-network approaches

https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

Smoothing for Web-scale N-grams

▪ “Stupid Backoff” (Brants et al. 2007)
▪ No discounting, just use relative frequencies!
▪ Does not give a probability distribution
S(w_i | w_{i-k+1}, …, w_{i-1}) = count(w_{i-k+1}, …, w_i) / count(w_{i-k+1}, …, w_{i-1}) if count(w_{i-k+1}, …, w_i) > 0; otherwise 0.4 × S(w_i | w_{i-k+2}, …, w_{i-1}). At the unigram level, S(w_i) = count(w_i) / N.

Stupid Backoff

▪ A simplistic type of smoothing
▪ Inexpensive to train on large data sets, and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases
▪ Try to use a higher-order n-gram; otherwise drop to a shorter sequence, i.e. one word less of context, hence “backoff”! (See the sketch below.)
− Every time you back off, e.g. reduce from trigram to bigram because you encountered a zero count, you multiply by 0.4
− i.e. use the trigram if good evidence is available, otherwise the bigram, otherwise the unigram!
▪ S = scores, not probabilities!
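A recursive sketch of Stupid Backoff; the `counts` layout (a dict mapping n-gram tuples to counts) is an assumption, and the back-off weight 0.4 is the value given on the slide:

    ALPHA = 0.4  # back-off weight from the slide

    def stupid_backoff(words, counts, total_tokens, alpha=ALPHA):
        """Score S for the last word in `words` given the preceding context.
        Returns scores, not probabilities (they do not sum to 1)."""
        if len(words) == 1:
            return counts.get(words, 0) / total_tokens  # unigram base case
        context = words[:-1]
        if counts.get(words, 0) > 0 and counts.get(context, 0) > 0:
            return counts[words] / counts[context]  # relative frequency, no discounting
        return alpha * stupid_backoff(words[1:], counts, total_tokens, alpha)  # drop one context word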

Evaluating Language Models

▪ Extrinsic evaluation:
− Put each model into a downstream task
• Spelling correction, machine translation, etc.
− Time consuming
▪ Intrinsic evaluation:
− Perplexity
• Weighted average number of possible next words that can follow a given word
• Normalised (inverse) likelihood of the test set: the lower, the better
• L = length of the sequence
• Works well when test and training data are similar
• Pre-processing and vocabulary matter
PP(W) = P(w_1, w_2, …, w_L)^(−1/L)
      = (∏_{i=1}^{L} 1 / P(w_i | w_1, …, w_{i-1}))^(1/L)
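A sketch of computing perplexity from per-token log-probabilities; `logprob_fn` is assumed to behave like the `bigram_logprob` sketch earlier (returning natural-log probabilities):

    import math

    def perplexity(sentences, logprob_fn):
        """exp(-(1/L) * total log-probability), where L is the number of predicted tokens.
        Unseen n-grams under plain MLE give -inf log-probability, i.e. infinite
        perplexity, which is why smoothing matters."""
        total_logprob = 0.0
        total_tokens = 0
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            for prev, word in zip(tokens, tokens[1:]):
                total_logprob += logprob_fn(prev, word)
                total_tokens += 1
        return math.exp(-total_logprob / total_tokens)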