- Probabilistic Language Modelling
- Absolute-Discounting Interpolation
- Kneser-Ney Smoothing
- Smoothing for Web-scale N-grams
- Stupid Backoff
- Evaluating Language Models
Goal: assign a probability to a word sequence 
− Speech recognition:
 • P(I ate a cherry) > P(Eye eight Jerry) 
− Spelling correction: 
• P(Australian National University) > P(Australian National Univerisity) 
− Collocation error correction: 
• P(high wind) > P(large wind) 
− Machine Translation: 
• P(The magic of Usain Bolt on show…) > P(The magic of Usain Bolt at the show…) 
− Question-answering, summarization, etc.
Probabilistic Language Modelling
▪ A language model computes the probability of a sequence of words: P(W) = P(w1, w2, …, wn)
▪ Related task: the probability of an upcoming word: P(wn | w1, w2, …, wn-1)
▪ A model that computes either of these is called a language model (LM)
How to Compute P(W)
Apply the chain rule: P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1)
Compute, for example:
P(John Smith’s hotel room bugged) = P(John)P(Smith’s | John)P(hotel | John Smith’s) … P(bugged | John Smith’s hotel room)
Estimate the Probabilities
▪ P(bugged | John Smith’s hotel room) will most likely result in 0, as such specific long sequences are not likely to appear in the training set!
▪ P(bugged | John Smith’s hotel room) = count(John Smith’s hotel room bugged) / count(John Smith’s hotel room)
▪ Most likely to result in 0 counts!
▪ So, we need to simplify the calculations!
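A minimal sketch of the problem, assuming a toy corpus and hypothetical helper names (count, mle) that are not from the slides: estimating P(word | full history) directly from counts collapses to 0 as soon as the exact long history never occurs in the training data.

    tokens = "john read a book . mary read a different book .".split()

    def count(seq):
        # how many times the exact word sequence occurs in the training tokens
        n = len(seq)
        return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == list(seq))

    def mle(history, word):
        # P(word | full history) = count(history + word) / count(history)
        denom = count(history)
        return count(list(history) + [word]) / denom if denom else 0.0

    print(mle(["john", "read"], "a"))          # 1.0: short history, continuation seen
    print(mle(["mary", "read", "a"], "book"))  # 0.0: this exact long sequence never occurs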
Markov Assumption (simplifying assumption)
▪ Simplification:
P(bugged | John Smith’s hotel room) ~ P(bugged | room)  
OR 
P(bugged | John Smith’s hotel room) ~ P(bugged | hotel room) 
▪ First-order Markov assumption: P(wi | w1, …, wi-1) ≈ P(wi | wi-1)
▪ Assumes only the most recent words are relevant!
Unigram Model
▪ Zero-order Markov assumption: P(w1, w2, …, wn) ≈ P(w1) P(w2) … P(wn)
▪ Examples generated from a unigram model
▪ Not very good!
Bigram Model
▪ First-order assumption: P(w1, w2, …, wn) ≈ P(w1) P(w2 | w1) P(w3 | w2) … P(wn | wn-1)
▪ P(I want to eat Chinese food) = ?  
▪ Estimate bigram probabilities from a training corpus: P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
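A minimal sketch of maximum-likelihood bigram estimation, P(wi | wi-1) = count(wi-1, wi) / count(wi-1); the toy sentences and the sentence markers <s>, </s> are assumptions, not the Berkeley Restaurant data.

    from collections import Counter

    sentences = [
        "<s> i want to eat chinese food </s>",
        "<s> i want chinese food </s>",
        "<s> i want to spend less </s>",
    ]
    tokens = [w for s in sentences for w in s.split()]

    unigram = Counter(tokens)                    # count(wi-1)
    bigram = Counter(zip(tokens, tokens[1:]))    # count(wi-1, wi)

    def p_bigram(prev, word):
        # MLE estimate of P(word | prev)
        return bigram[(prev, word)] / unigram[prev]

    print(p_bigram("i", "want"))    # 1.0 in this toy corpus
    print(p_bigram("want", "to"))   # 2/3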
Bigram Counts from the Berkeley Restaurant Corpus

Compute Bigram Probabilities

▪ Logs help with numerical stability (the probabilities become very small): add log probabilities instead of multiplying raw probabilities
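A short sketch of scoring a sentence in log space, summing log P(wi | wi-1) rather than multiplying raw probabilities, which avoids numerical underflow; the hard-coded probability table is purely illustrative.

    import math

    def sentence_logprob(words, p_next):
        # log P(w1 … wn) ≈ sum of log P(wi | wi-1) under a bigram model
        logp = 0.0
        for prev, word in zip(words, words[1:]):
            p = p_next(prev, word)
            if p == 0.0:
                return float("-inf")   # one unseen bigram zeroes out the whole sentence
            logp += math.log(p)
        return logp

    # illustrative, hard-coded bigram probabilities (assumed values)
    toy = {("<s>", "i"): 0.25, ("i", "want"): 0.33, ("want", "food"): 0.1, ("food", "</s>"): 0.4}
    print(sentence_logprob("<s> i want food </s>".split(), lambda a, b: toy.get((a, b), 0.0)))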
Trigram Models
▪ Second-order Markov assumption: P(wi | w1, …, wi-1) ≈ P(wi | wi-2, wi-1)
▪ Language has long-distance dependencies that short contexts cannot fully capture
▪ We can extend to 4-grams, 5-grams, …
Sequence Generation
▪ Compute conditional probabilities:
− P(want | I) = 0.32 
− P(to | I) = 0 
− P(eat | I) = 0.004  
− P(Chinese | I) = 0 
− P(I | I) = 0.002
− …
▪ Sampling:
− Ensures you don’t just get the same sentence all the time!
− Generate a random number in [0, 1] and map it to a word according to the conditional probabilities P
• Based on a uniform distribution
− Words that are a better fit are more likely to be selected!
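A minimal sketch of the sampling step: draw a uniform random number in [0, 1) and walk through the cumulative conditional probabilities, so better-fitting words own larger intervals. The distribution below loosely echoes the P(· | I) values above and is an assumption, not exact corpus figures.

    import random

    # illustrative conditional distribution P(next word | "I"); values are assumptions
    next_given_i = {"want": 0.32, "eat": 0.004, "I": 0.002, "<other>": 0.674}

    def sample_next(dist):
        r = random.random()            # uniform draw in [0, 1)
        cumulative = 0.0
        for word, p in dist.items():
            cumulative += p
            if r < cumulative:
                return word            # higher-probability words cover larger intervals
        return word                    # guard against floating-point round-off

    print(sample_next(next_given_i))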
Approximating Shakespeare
▪ Generate sentences from a unigram model:
− Every enter now severally so, let     
− Hill he late speaks; or! a more to leg less first you enter 
▪ From a bigram model: 
− What means, sir. I confess she? then all sorts, he is trim,  captain 
− Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry 
▪ From a trigram model: 
− Sweet prince, Falstaff shall die 
− This shall forbid it should be branded, if renown made it empty
The Perils of Overfitting
▪ P(I want to eat Chinese food) = ? when count(Chinese food) = 0: a single unseen n-gram drives the whole sentence probability to 0
▪ In practice, the test corpus is often different from the training corpus
▪ Unknown words!
▪ Unseen n-grams!
Interpolation
▪ Key idea: mix lower-order n-gram probabilities
▪ The lambdas are hyperparameters
▪ For bigram models: P_interp(wi | wi-1) = λ P(wi | wi-1) + (1 − λ) P(wi)
▪ For trigram models: P_interp(wi | wi-2, wi-1) = λ1 P(wi | wi-2, wi-1) + λ2 P(wi | wi-1) + λ3 P(wi), with λ1 + λ2 + λ3 = 1
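A minimal sketch of linear interpolation for a trigram model, mixing trigram, bigram and unigram MLE estimates with fixed lambdas; the toy corpus and the lambda values (0.6, 0.3, 0.1) are assumptions.

    from collections import Counter

    tokens = "<s> i want to eat chinese food </s> <s> i want chinese food </s>".split()
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    N = len(tokens)

    def p_interp(w2, w1, w, lambdas=(0.6, 0.3, 0.1)):
        l1, l2, l3 = lambdas                                          # must sum to 1
        p_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
        p_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
        p_uni = uni[w] / N
        return l1 * p_tri + l2 * p_bi + l3 * p_uni

    print(p_interp("i", "want", "to"))         # seen trigram: dominated by the trigram estimate
    print(p_interp("want", "chinese", "eat"))  # unseen trigram and bigram: unigram mass keeps it nonzero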
How to Set the Lambdas?
▪ Estimate on held-out data
▪ Typically, Expectation Maximization (EM) is used
▪ One crude approach (Collins, course notes, 2013):
λ1 = count(wi-2, wi-1) / (count(wi-2, wi-1) + γ)
λ2 = (1 − λ1) × count(wi-1) / (count(wi-1) + γ)
λ3 = 1 − λ1 − λ2
• Ensures each λ is larger when its count is larger. 
• Different lambda for each n-gram. 
• Only one parameter to estimate (gamma).
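A short sketch of this count-based weighting; the functional form follows the reconstruction above and should be treated as an approximation, and gamma = 5 is an arbitrary illustrative value.

    def lambdas_from_counts(c_trigram_context, c_bigram_context, gamma=5.0):
        # lambda_1 grows with the count of the trigram context (wi-2, wi-1),
        # lambda_2 with the count of wi-1; lambda_3 takes whatever remains
        l1 = c_trigram_context / (c_trigram_context + gamma)
        l2 = (1 - l1) * c_bigram_context / (c_bigram_context + gamma)
        l3 = 1 - l1 - l2
        return l1, l2, l3

    print(lambdas_from_counts(0, 3))       # unseen trigram context: weight shifts to lower orders
    print(lambdas_from_counts(100, 500))   # frequent context: trust the trigram estimate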
Absolute-Discounting Interpolation
We tend to systematically overestimate n-gram counts.
We can use discounting to correct for this and to set the lambdas.
Absolute-Discounting Interpolation
▪ Involves the interpolation of lower- and higher-order models
▪ Aims to deal with sequences that occur infrequently
▪ Subtract d (the discount) from each of the counts
▪ The discount reserves some probability mass for the lower-order n-grams
▪ Thus, the discount determines the lambda; for 0 < d < 1 we have:
P_AbsDiscount(wi | wi-1) = (count(wi-1, wi) − d) / count(wi-1) + λ(wi-1) P(wi), where λ(wi-1) = (d / count(wi-1)) × |{w : count(wi-1, w) > 0}|
▪ Typically, d = 0.75 is used
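A minimal sketch of absolute-discounting interpolation for bigrams, following the formula above: subtract d from each observed bigram count and give the reserved mass λ(wi-1) to the unigram distribution. The toy corpus and identifiers are assumptions.

    from collections import Counter

    tokens = "<s> i want to eat chinese food </s> <s> i want chinese food </s>".split()
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)

    def p_absdiscount(prev, word, d=0.75):
        followers = len({w for (p, w) in bi if p == prev})   # distinct words seen after prev
        lam = d * followers / uni[prev]                      # reserved probability mass
        p_hi = max(bi[(prev, word)] - d, 0) / uni[prev]      # discounted bigram estimate
        p_lo = uni[word] / N                                 # unigram fallback
        return p_hi + lam * p_lo

    print(p_absdiscount("want", "to"))     # seen bigram: mostly the discounted count
    print(p_absdiscount("want", "food"))   # unseen bigram: only the reserved unigram mass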
Kneser-Ney Smoothing
▪ If there are only a few words that come after a context, then a novel word in that context should be less likely 
▪ But we also expect that if a word appears after a small number of contexts, then it should be less likely to appear in a novel context… 
Consider the word: Francisco 
Might be common because San Francisco is a common term. 
But Francisco occurs in very few contexts.
▪ Provides better estimates for the probabilities of lower-order unigrams: 
− I can’t see without my reading __? 
− “Francisco” is more common than “glasses”, say, … but “Francisco” always follows “San”! 
▪ P_continuation(w): how likely is the word w to appear as a novel continuation, i.e. in a new context? 
− For each word w, count the number of bigram types it completes 
− i.e. the unigram probability is proportional to the number of different words it follows:
P_continuation(w) = |{w' : count(w', w) > 0}| / |{(w', w'') : count(w', w'') > 0}|
▪ Interpolated Kneser-Ney: P_KN(wi | wi-1) = max(count(wi-1, wi) − d, 0) / count(wi-1) + λ(wi-1) P_continuation(wi)
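A minimal sketch of the continuation probability and of interpolated Kneser-Ney for bigrams, following the formulas above; the toy corpus is chosen so that "francisco" has only one left context while "glasses" has several, and all identifiers are assumptions.

    from collections import Counter

    tokens = ("<s> i live in san francisco </s> "
              "<s> she wears reading glasses </s> "
              "<s> he wears glasses </s>").split()
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))

    def p_continuation(word):
        # distinct left contexts `word` completes, over all distinct bigram types
        return len({prev for (prev, w) in bi if w == word}) / len(bi)

    def p_kneser_ney(prev, word, d=0.75):
        # discounted bigram estimate plus continuation-weighted back-off mass
        followers = len({w for (p, w) in bi if p == prev})
        lam = d * followers / uni[prev]
        return max(bi[(prev, word)] - d, 0) / uni[prev] + lam * p_continuation(word)

    # "francisco" only ever follows "san"; "glasses" follows several different words
    print(p_continuation("francisco"), p_continuation("glasses"))
    print(p_kneser_ney("reading", "glasses"), p_kneser_ney("reading", "francisco"))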
▪ “Smoothing” ≈ adjusting low probabilities (such as zero probabilities) upwards and high probabilities downwards, i.e. adjusting the MLE to produce more accurate probabilities
▪ Works very well; also used to improve neural network (NN) approaches
▪ https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
Smoothing for Web-scale N-grams
▪ “Stupid Backoff” (Brants et al. 2007) 
▪ No discounting, just use relative frequencies! 
▪ Does not give a probability distribution
Stupid Backoff
▪ A simplistic type of smoothing 
▪ Inexpensive to train on large data sets, and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases 
▪ Try to use higher-order n-grams; otherwise drop to a shorter sequence, i.e. one word less of context, hence “backoff”! 
− Every time you “back off”, e.g. reduce from a trigram to a bigram because you’ve encountered a zero count, you multiply by 0.4 
− i.e. use the trigram if good evidence is available, otherwise the bigram, otherwise the unigram! 
▪ S(wi | wi-k+1, …, wi-1) = count(wi-k+1, …, wi) / count(wi-k+1, …, wi-1) if that count is > 0, otherwise 0.4 × S(wi | wi-k+2, …, wi-1); base case S(wi) = count(wi) / N
▪ S = scores, not probabilities!
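A minimal recursive sketch of Stupid Backoff matching the score above: use the relative frequency of the longest n-gram with a nonzero count, multiply by 0.4 at each back-off step, and bottom out at the unigram relative frequency count(w)/N. The toy corpus is an assumption, and the result is a score, not a probability.

    from collections import Counter

    tokens = "<s> i want to eat chinese food </s> <s> i want chinese food </s>".split()
    N = len(tokens)
    counts = Counter()
    for n in (1, 2, 3):                       # unigram, bigram and trigram counts
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def stupid_backoff(context, word, alpha=0.4):
        # S(word | context): relative frequency if the full n-gram was seen,
        # otherwise back off to a shorter context and multiply by alpha
        if not context:
            return counts[(word,)] / N        # unigram base case
        full, ctx = tuple(context) + (word,), tuple(context)
        if counts[full] > 0:
            return counts[full] / counts[ctx]
        return alpha * stupid_backoff(context[1:], word, alpha)

    print(stupid_backoff(["i", "want"], "to"))        # seen trigram: plain relative frequency
    print(stupid_backoff(["eat", "chinese"], "tea"))  # never seen at any order: score 0.0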
Evaluating Language Models
▪ Extrinsic evaluation: 
− Put each model in a task 
• Spelling correction, machine translation, etc. 
− Time consuming 
▪ Intrinsic evaluation: 
− Perplexity 
− Weighted average number of possible next words that can follow a given word 
− Normalised likelihood: perplexity is the inverse probability of the test set, normalised by length, PP(W) = P(w1 w2 … wL)^(−1/L), where L is the length of the sequence
− The lower the perplexity, the better
− Works well when test and training data are similar
− Pre-processing and vocabulary matter
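A minimal sketch of perplexity as length-normalised inverse probability, PP(W) = P(w1 … wL)^(−1/L), computed in log space for stability; the probability function and test sentence are illustrative assumptions.

    import math

    def perplexity(words, p_next):
        # PP = exp(-(1/L) * sum_i log P(wi | wi-1)); lower is better
        logp, L = 0.0, 0
        for prev, word in zip(words, words[1:]):
            p = p_next(prev, word)
            if p == 0.0:
                return float("inf")          # any zero-probability step makes PP infinite
            logp += math.log(p)
            L += 1
        return math.exp(-logp / L)

    # illustrative hard-coded bigram probabilities (assumed values)
    toy = {("<s>", "i"): 0.25, ("i", "want"): 0.33, ("want", "food"): 0.1, ("food", "</s>"): 0.4}
    print(perplexity("<s> i want food </s>".split(), lambda a, b: toy.get((a, b), 0.0)))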