
08.C2W3.Auto-complete and Language Models

For previous articles in this series, click here.

目录

  • N-Grams: Overview
  • N-grams and Probabilities
    • N-grams
    • Sequence notation
    • Unigram probability
    • Bigram probability
    • Trigram Probability
    • N-gram probability
    • Quiz
  • Sequence Probabilities
    • Probability of a sequence
    • Sequence probability shortcomings
    • Approximation by N-gram probabilities
    • Quiz
  • Starting and Ending Sentences
    • Start of sentence token <s>
    • End of sentence token </s>: motivation
    • End of sentence token </s>: solution
    • Example: bigram
    • Quiz
  • The N-gram Language Model
    • Count matrix
    • Probability matrix
    • Language model
    • Log probability
    • Generative Language model
  • Language Model Evaluation
    • Test data
    • Perplexity
    • Perplexity for bigram models
    • Log perplexity
    • Example
  • Out of Vocabulary Words
    • Out of vocabulary words
    • Using <UNK> in corpus
    • How to create vocabulary V
    • Quiz
  • Smoothing
    • Missing N-grams in training corpus
    • Smoothing
    • Backoff
    • Interpolation
    • Quiz


N-Grams: Overview

● Create a language model (LM) from a text corpus to
○ Estimate the probability of word sequences
○ Estimate the probability of a word following a sequence of words
● Apply this concept to autocomplete a sentence with the most likely suggestions
Language models are used throughout natural language processing (NLP) and AI. Some typical applications:
Speech recognition:
A speech recognition system converts human speech into written text. Language models play a key role here because they capture the context and grammatical structure of the words in a speech segment. With a language model, the system can estimate how likely a given word sequence is, which improves the accuracy of the speech-to-text conversion.

Spelling correction:
Language models can detect and correct spelling errors. Because a language model knows which word sequences are common or grammatical in a language, it can flag words that do not fit and suggest, or automatically apply, the correct spelling. For example, if a user types "recieve", the model can recognize that this is not a common word and suggest "receive" instead.

Augmentative and Alternative Communication (AAC):
AAC devices and systems help people with speech or communication impairments express themselves. A language model integrated into such a system can provide personalized predictions and suggestions, helping the user build sentences and express ideas faster. For example, the model can predict the next word or phrase the user is likely to want, improving communication efficiency.

Main objectives:
● Process a text corpus into an N-gram language model
● Handle out-of-vocabulary words
● Apply smoothing for previously unseen N-grams
● Evaluate the language model

N-grams and Probabilities

N-grams

An N-gram is a statistical unit used to describe text data in natural language processing. Simply put, an N-gram is a sequence of N consecutive words; each word in the sequence is a "gram", and the sequence captures local context in the text.

Examples for different values of N:
Unigram (1-gram): N=1, a single word, e.g. "cat".
Bigram (2-gram): N=2, two consecutive words, e.g. "cat sat".
Trigram (3-gram): N=3, three consecutive words, e.g. "cat sat on".
N-gram models matter for language modeling because they can predict the probability of the next word in a sequence. In a bigram model, for example, given the first word, the model predicts the probability of the second word. Such models are essential for applications like spelling correction, parsing, machine translation, and speech recognition.

N-gram models also have limitations. When N is large, the model runs into data sparsity, because many word sequences occur rarely or never in the training data. N-gram models also ignore context beyond the fixed window, such as syntax and semantics.

The key point is that N-grams offer a simple but effective way to capture and represent local dependencies in text data.
Another example:
Corpus: I am happy because I am learning
Unigrams: {I, am, happy, because, learning}
Bigrams: {I am, am happy, happy because, ...}. Note that "I happy" is not a bigram, because the two words must be consecutive; "I am" occurs twice in the corpus but is listed only once in the set.
Trigrams: {I am happy, am happy because, ...}

Sequence notation

Suppose a corpus contains 500 words, $w_1, w_2, \ldots, w_{500}$.
A word sequence can then be written as:
$$w_1^m = w_1 w_2 \cdots w_m$$
For example, the sequence from the first to the third word:
$$w_1^3 = w_1 w_2 w_3$$
The last three words of the corpus can be written as:
$$w_{m-2}^m = w_{m-2} w_{m-1} w_m$$

Unigram probability

Suppose the corpus is: I am happy because I am learning
Corpus size: $m = 7$
For the word I: $P(I) = \frac{2}{7}$
For the word happy: $P(happy) = \frac{1}{7}$
The unigram probability formula is:
$$P(w) = \frac{C(w)}{m}$$

Bigram probability

Suppose the corpus is: I am happy because I am learning
The probability that "am" follows "I" is: $P(am \mid I) = \frac{C(I\ am)}{C(I)} = \frac{2}{2} = 1$
The probability that "happy" follows "I" is: $P(happy \mid I) = \frac{C(I\ happy)}{C(I)} = \frac{0}{2} = 0$
The probability that "learning" follows "am" is: $P(learning \mid am) = \frac{C(am\ learning)}{C(am)} = \frac{1}{2}$
The bigram probability formula is:
$$P(y \mid x) = \frac{C(x\ y)}{\sum_w C(x\ w)} = \frac{C(x\ y)}{C(x)}$$

Trigram Probability

Suppose the corpus is: I am happy because I am learning
The probability that "happy" follows "I am" is: $P(happy \mid I\ am) = \frac{C(I\ am\ happy)}{C(I\ am)} = \frac{1}{2}$
The trigram probability formula is:
$$P(w_3 \mid w_1^2) = \frac{C(w_1^2\ w_3)}{C(w_1^2)}$$

N-gram probability

The general formula:
$$P(w_N \mid w_1^{N-1}) = \frac{C(w_1^{N-1}\ w_N)}{C(w_1^{N-1})}$$
Numerator: $C(w_1^{N-1}\ w_N) = C(w_1^N)$
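As a rough illustration of these formulas, here is a minimal Python sketch (not from the course materials; the function name is made up) that counts unigrams and bigrams in the toy corpus above and computes conditional probabilities from the counts:

```python
from collections import Counter

corpus = "I am happy because I am learning".split()

# Count unigrams and bigrams by sliding a window over the token list.
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """P(word | prev_word) = C(prev_word word) / C(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(unigram_counts["I"] / len(corpus))   # P(I) = 2/7
print(bigram_prob("I", "am"))              # P(am | I) = 2/2 = 1
print(bigram_prob("am", "learning"))       # P(learning | am) = 1/2
```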

Quiz

Corpus:
"In every place of great resort the monster was the fashion. They sang of it in the cafes, ridiculed it in the papers, and represented it on the stage." (Jules Verne, Twenty Thousand Leagues under the Sea)
In the context of our corpus, what is the probability of the word "papers" following the phrase "it in the"?
Answer: 1/2
Explanation: "it in the" appears twice in total, and it is followed by "papers" once.

Sequence Probabilities

Probability of a sequence

Given a sentence, how do we compute its probability?
By the chain rule:
$$P(A, B, C, D) = P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C)$$
This follows from the definition of conditional probability:
$$P(B \mid A) = \frac{P(A, B)}{P(A)} \;\Rightarrow\; P(A, B) = P(A)\,P(B \mid A)$$
So the probability of a sentence is:
P(the teacher drinks tea) = P(the) P(teacher | the) P(drinks | the teacher) P(tea | the teacher drinks)

Sequence probability shortcomings

The main problem: the corpus almost never contains the exact sentence we are interested in, or even its longer subsequences!
For example, the last factor above is:
$$P(tea \mid the\ teacher\ drinks) = \frac{C(the\ teacher\ drinks\ tea)}{C(the\ teacher\ drinks)}$$
Both the numerator and denominator counts are likely to be 0 in the corpus, which would make the whole product for P(the teacher drinks tea) equal to 0.

Approximation by N-gram probabilities

To avoid this problem, we restrict the conditioning context to the previous word only:
$$P(tea \mid the\ teacher\ drinks) \approx P(tea \mid drinks)$$
$$P(the\ teacher\ drinks\ tea) = P(the)\,P(teacher \mid the)\,P(drinks \mid the\ teacher)\,P(tea \mid the\ teacher\ drinks) \approx P(the)\,P(teacher \mid the)\,P(drinks \mid teacher)\,P(tea \mid drinks)$$
More generally, this is the Markov assumption: only the last N words matter.
Bigram probability of a word:
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$$
N-gram probability of a word:
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$
Bigram probability of a whole sentence:
$$P(w_1^n) \approx P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$$

Quiz

Given these conditional probabilities
P(Mary)=0.1;
P(likes)=0.2;
P(cats)=0.3
P(Mary|likes) =0.2;
P(likes|Mary) =0.3;
P(cats|likes)=0.1;
P(likes|cats)=0.4
Approximate the probability of the following sentence with bigrams: “Mary likes cats”
Answer:0.003
Explanation: P(Mary likes cats) ≈ P(Mary) P(likes|Mary) P(cats|likes) = 0.1 × 0.3 × 0.1 = 0.003
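A minimal sketch of the same bigram approximation in Python, using the quiz's probabilities as a hard-coded table (the dictionary structure is just one possible way to store them):

```python
# Given probabilities from the quiz; only the entries needed for this sentence are listed.
P_unigram = {"Mary": 0.1}
P_bigram = {("Mary", "likes"): 0.3, ("likes", "cats"): 0.1}

def approx_sentence_prob(words):
    """P(w_1..w_n) ≈ P(w_1) * Π P(w_i | w_{i-1}) under the bigram (Markov) assumption."""
    prob = P_unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= P_bigram[(prev, cur)]
    return prob

print(approx_sentence_prob(["Mary", "likes", "cats"]))  # 0.1 * 0.3 * 0.1 ≈ 0.003
```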

Starting and Ending Sentences

Start of sentence token <s>

$$P(the\ teacher\ drinks\ tea) \approx P(the)\,P(teacher \mid the)\,P(drinks \mid teacher)\,P(tea \mid drinks)$$
Notice that the first word has no preceding word, so its bigram conditional probability cannot be computed. We therefore prepend a special start token so that every factor on the right-hand side becomes a bigram: "the teacher drinks tea" becomes "<s> the teacher drinks tea", and the probability is computed as:
$$P(<s>\ the\ teacher\ drinks\ tea) \approx P(the \mid <s>)\,P(teacher \mid the)\,P(drinks \mid teacher)\,P(tea \mid drinks)$$

For a trigram model:
$$P(the\ teacher\ drinks\ tea) \approx P(the)\,P(teacher \mid the)\,P(drinks \mid the\ teacher)\,P(tea \mid teacher\ drinks)$$
we need to prepend two start tokens, giving: <s> <s> the teacher drinks tea

Generalizing to an N-gram model, we prepend N-1 start tokens <s>.

End of sentence token </s>: motivation

The first motivation:
Consider the formula:
$$P(y \mid x) = \frac{C(x, y)}{\sum_w C(x, w)} = \frac{C(x, y)}{C(x)}$$
For the last word of a sentence, the two denominators are no longer necessarily equal, i.e. $\sum_w C(x, w) \neq C(x)$.
For example, with the corpus:
<s> Lyn drinks chocolate
<s> John drinks
the number of times "drinks" is followed by another word is 1:
$$\sum_w C(drinks, w) = 1$$
but "drinks" itself appears 2 times:
$$C(drinks) = 2$$
The second motivation:
Suppose the corpus is:
<s> yes no
<s> yes yes
<s> no no
First generate all sentences of length 2:
<s> yes yes
<s> yes no
<s> no no
<s> no yes
Taking <s> yes yes as an example, its probability is:
$$P(<s>\ yes\ yes) = P(yes \mid <s>) \times P(yes \mid yes) = \frac{C(<s>, yes)}{\sum_w C(<s>, w)} \times \frac{C(yes, yes)}{\sum_w C(yes, w)} = \frac{2}{3} \times \frac{1}{2} = \frac{1}{3}$$
Similarly, <s> yes no has probability 1/3, <s> no no has probability 1/3, and <s> no yes has probability 0.
So the probabilities of all two-word sentences already sum to 1: $\sum_{2\ word} P(\cdots) = 1/3 + 1/3 + 1/3 + 0 = 1$
The same calculation shows that the probabilities of all three-word sentences also sum to 1.
This contradicts what we want from a probability model: the probabilities of all sentences the model can generate, over all lengths, should sum to 1, rather than summing to 1 for each individual length:
$$\sum_{2\ word} P(\cdots) + \sum_{3\ word} P(\cdots) + \cdots = 1$$

End of sentence token </s>: solution

The solution is to append </s> at the end of each sentence, e.g. <s> the teacher drinks tea </s>, whose probability is:
$$P(the \mid <s>)\,P(teacher \mid the)\,P(drinks \mid teacher)\,P(tea \mid drinks)\,P(</s> \mid tea)$$
Note that, unlike the start token, only one </s> is needed regardless of N. For a trigram model:
the teacher drinks tea => <s> <s> the teacher drinks tea </s>

Revisiting motivation 1 with the end token added:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>
the number of times "drinks" is followed by another word is now 2:
$$\sum_w C(drinks, w) = 2$$
and "drinks" itself appears 2 times:
$$C(drinks) = 2$$
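A tiny helper to illustrate the padding rule (a sketch, not from the course code; the function name is made up):

```python
def pad_sentence(tokens, n):
    """Prepend N-1 start tokens and append a single end token for an N-gram model."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

print(pad_sentence("the teacher drinks tea".split(), 2))
# ['<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']
print(pad_sentence("the teacher drinks tea".split(), 3))
# ['<s>', '<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']
```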

Example: bigram

Suppose the corpus is:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Some bigram probabilities computed from it:
$$P(John \mid <s>) = \frac{1}{3} \qquad P(</s> \mid tea) = \frac{1}{1}$$
$$P(chocolate \mid eats) = \frac{1}{1} \qquad P(Lyn \mid <s>) = \frac{2}{3}$$
The probability of the first sentence is:
$$P(sentence) = \frac{2}{3} \times \frac{1}{2} \times \frac{1}{2} \times \frac{2}{2} = \frac{1}{6}$$
Note that this is lower than the 1/3 we would get if the probability mass were spread only over the three corpus sentences. The remaining mass is distributed over the other sentences the bigram model can generate from this corpus, which is how the model generalizes.

Quiz

Question:
Given these conditional probabilities
P(Mary)=0.1;
P(likes)=0.2;
P(cats)=0.3
P(Mary|<s>)=0.2;
P(</s>|cats)=0.6
P(likes|Mary) =0.3;
P(cats|likes)=0.1
Approximate the probability of the following sentence with bigrams: “<s> Mary likes cats </s>”
Answer: 0.0036
Explanation: P(Mary|<s>) P(likes|Mary) P(cats|likes) P(</s>|cats) = 0.2 × 0.3 × 0.1 × 0.6 = 0.0036

The N-gram Language Model

Count matrix

In the N-gram formula:
$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}, w_n)}{C(w_{n-N+1}^{n-1})}$$
the numerator is $C(w_{n-N+1}^{n-1}, w_n)$.
The count matrix records all of these co-occurrence counts in the corpus:
its rows are the unique (N-1)-word prefixes (the previous words), and
its columns are the unique current words.
Example of a bigram count matrix:
Corpus: <s> I study I learn </s>
[Count matrix figure: rows are the previous words <s>, I, study, learn; columns are the current words; each cell holds the bigram count]
For instance, the bigram "study I" appears once in the corpus, so the cell in row "study", column "I" is 1.

Probability matrix

The count matrix gives the numerator; dividing by the denominator yields the probability matrix.
Divide each cell by its row sum:
$$sum(row) = \sum_{w \in V} C(w_{n-N+1}^{n-1}, w) = C(w_{n-N+1}^{n-1})$$
So: compute each row's sum from the count matrix, then divide every cell in that row by the sum to obtain the probability matrix.
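A minimal Python sketch of building the count matrix and probability matrix for this toy corpus (plain dictionaries are used here instead of an actual matrix; this is an illustration, not the course implementation):

```python
from collections import Counter, defaultdict

tokens = "<s> I study I learn </s>".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Count matrix: rows are previous words, columns are current words.
count_matrix = defaultdict(dict)
for (prev, cur), c in bigram_counts.items():
    count_matrix[prev][cur] = c

# Probability matrix: divide each cell by its row sum.
prob_matrix = {}
for prev, row in count_matrix.items():
    row_sum = sum(row.values())
    prob_matrix[prev] = {cur: c / row_sum for cur, c in row.items()}

print(prob_matrix["<s>"])  # {'I': 1.0}
print(prob_matrix["I"])    # {'study': 0.5, 'learn': 0.5}
```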

Language model

With the probability matrix, the language model can compute:
○ Sentence probability
○ Next word prediction
For example, using the probability matrix from the previous section, the probability of the sentence <s> I learn </s> is:
$$P(sentence) = P(I \mid <s>)\,P(learn \mid I)\,P(</s> \mid learn) = 1 \times 0.5 \times 1 = 0.5$$

Log probability

As before, we end up multiplying many probabilities together, so we work with logarithms to avoid numerical underflow.
$$P(w_1^n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$
Taking the logarithm:
$$\log P(w_1^n) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-1})$$
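A quick numerical illustration of why the log form is used (the specific numbers are arbitrary):

```python
import math

# 400 bigram probabilities of 0.1 each: the direct product underflows to exactly 0.0
# in double precision, while the sum of logs stays a perfectly usable number.
probs = [0.1] * 400

product = 1.0
for p in probs:
    product *= p

log_prob = sum(math.log(p) for p in probs)
print(product)   # 0.0 (underflow)
print(log_prob)  # about -921.03
```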

Generative Language model

A generative language model produces new sentences from the probability matrix. The algorithm roughly follows these steps (a small sketch follows the list):

  1. Choose sentence start
  2. Choose next bigram starting with previous word
  3. Continue until </s> is picked
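A minimal sketch of that loop in Python, reusing a toy probability matrix like the one built above (the matrix values and function name are illustrative assumptions, not the course code):

```python
import random

prob_matrix = {
    "<s>":   {"I": 1.0},
    "I":     {"study": 0.5, "learn": 0.5},
    "study": {"I": 1.0},
    "learn": {"</s>": 1.0},
}

def generate(prob_matrix, max_len=20):
    """1) start from <s>; 2) sample the next word from the current word's row;
    3) stop when </s> is picked (or a length cap is reached)."""
    word, sentence = "<s>", []
    for _ in range(max_len):
        candidates, weights = zip(*prob_matrix[word].items())
        word = random.choices(candidates, weights=weights)[0]
        if word == "</s>":
            break
        sentence.append(word)
    return " ".join(sentence)

print(generate(prob_matrix))  # e.g. "I learn" or "I study I learn"
```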

Language Model Evaluation

Test data

| Split | Smaller corpora | Large corpora (typical for text) |
| --- | --- | --- |
| Train | 80% | 98% |
| Validation | 10% | 1% |
| Test | 10% | 1% |

● Split method:
○ Continuous text: take contiguous segments of the corpus for the train, validation, and test sets.
○ Random short sequences: sample short word sequences at random and assign them to each set.

Perplexity

Perplexity is a standard metric for evaluating how well a language model predicts text. It is defined as:
$$\text{PP}(W) = P(w_1, w_2, \ldots, w_m)^{-\frac{1}{m}}$$
where:
$P(w_1, w_2, \ldots, w_m)$ is the probability the language model assigns to the observed word sequence, and
$m$ is the total number of words in the sequence.
Concretely, $P(w_1, w_2, \ldots, w_m)$ expands by the chain rule as:
$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$
where:
$w_i$ is the i-th word of the sequence, and
$P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ is the probability of the i-th word given the preceding $i-1$ words.
The exponent $-\frac{1}{m}$ makes perplexity the inverse of the geometric mean of the word probabilities: the geometric mean is the m-th root of the product, and taking the inverse puts the value back on an interpretable scale.
A lower perplexity means the model predicts the data better, i.e. it is less "perplexed" by the word sequence. In practice, a low-perplexity language model predicts the next word better and tends to generate more natural, coherent sentences.
Smaller perplexity = better model.
Character-level models have lower perplexity than word-based models.

Perplexity for bigram models

$$PP(W) = \sqrt[m]{\prod_{i=1}^{m} \prod_{j=1}^{|s_i|} \frac{1}{P(w_j^{(i)} \mid w_{j-1}^{(i)})}}$$
where $w_j^{(i)}$ is the j-th word of the i-th sentence.

If we concatenate all sentences in W, the bigram perplexity is the product of the inverse bigram probabilities over the whole test set, raised to the power $\frac{1}{m}$ (equivalently, the probability of the test set raised to $-\frac{1}{m}$):
$$PP(W) = \sqrt[m]{\prod_{i=1}^{m} \frac{1}{P(w_i \mid w_{i-1})}}$$
where $w_i$ is the i-th word of the test set.

Log perplexity

Again, we turn the product into a sum by taking logarithms (the minus sign comes from the $\frac{1}{P}$ terms):
$$\log PP(W) = -\frac{1}{m} \sum_{i=1}^{m} \log_2 P(w_i \mid w_{i-1})$$
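A small sketch of computing log perplexity for a bigram model (the probability lookup here is a hypothetical hard-coded table, just to exercise the formula):

```python
import math

def log_perplexity(test_tokens, bigram_prob):
    """log2 PP(W) = -(1/m) * Σ log2 P(w_i | w_{i-1}) over the concatenated test set."""
    m = len(test_tokens) - 1  # number of bigram terms
    log_sum = sum(math.log2(bigram_prob(prev, cur))
                  for prev, cur in zip(test_tokens, test_tokens[1:]))
    return -log_sum / m

toy_probs = {("<s>", "I"): 1.0, ("I", "learn"): 0.5, ("learn", "</s>"): 1.0}
lp = log_perplexity(["<s>", "I", "learn", "</s>"], lambda a, b: toy_probs[(a, b)])
print(lp, 2 ** lp)  # ≈ 0.333 (log perplexity) and ≈ 1.26 (perplexity)
```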

Example

Training on 38 million words and testing on 1.5 million words from the WSJ corpus gives:
Perplexity: Unigram 962, Bigram 170, Trigram 109
The WSJ corpus (Wall Street Journal corpus) is a widely used text corpus built from Wall Street Journal articles. It is well known in NLP, particularly for training and evaluating language models.

Out of Vocabulary Words

Out of vocabulary words

Closed vs. open vocabularies
A closed vocabulary simplifies text processing at the cost of handling new words poorly; an open vocabulary is more flexible and adapts better to diverse language use, but can increase model complexity and computational cost.
Closed vocabulary:
In a closed-vocabulary system, a fixed word list is defined before training; it contains every word or token the model will use during training and prediction.
Any word outside the vocabulary is typically ignored or replaced by a special unknown token such as <UNK>.
A closed vocabulary reduces model complexity because it limits the number of words the model must learn and predict.
The drawback is that it cannot handle new or rare words outside the vocabulary well, which can hurt the model's understanding of new text.

Open vocabulary:
An open-vocabulary system does not limit the set of words the model can use; it can process any word it encounters, whether or not it appeared in the training data.
In this setting, models usually rely on subword segmentation techniques such as Byte Pair Encoding (BPE) or WordPiece to handle words not seen during training.
An open vocabulary handles diverse text better, including technical terms and newly coined words, because they are not simply collapsed into an unknown token.
However, it can increase model complexity, since the model must learn more units and the relationships between them.

Unknown word = out-of-vocabulary word (OOV)
Use a special tag <UNK> in the corpus and in the input.

Using <UNK> in corpus

Steps:
● Create a vocabulary V
● Replace any word in the corpus that is not in V with <UNK>
● Estimate the probabilities, treating <UNK> like any other word
Example:
Corpus
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Set the vocabulary threshold to a minimum frequency of 2 (min frequency f = 2). The corpus becomes:
<s> Lyn drinks chocolate </s>
<s> <UNK> drinks <UNK> </s>
<s> Lyn <UNK> chocolate </s>
The resulting vocabulary is:
Vocabulary
Lyn, drinks, chocolate
At query time, any out-of-vocabulary word in the input must also be replaced with <UNK>:
<s> Adam drinks chocolate </s>
<s> <UNK> drinks chocolate </s>
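A minimal sketch of this preprocessing in Python (the helper name and the handling of <s>/</s> are illustrative choices, not prescribed by the course):

```python
from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

min_freq = 2
word_counts = Counter(w for sent in corpus for w in sent if w not in ("<s>", "</s>"))
vocab = {w for w, c in word_counts.items() if c >= min_freq}

def replace_oov(tokens, vocab):
    """Replace any token outside the vocabulary (other than the boundary tokens) with <UNK>."""
    return [w if w in vocab or w in ("<s>", "</s>") else "<UNK>" for w in tokens]

print(vocab)  # {'Lyn', 'drinks', 'chocolate'}
print(replace_oov("<s> Adam drinks chocolate </s>".split(), vocab))
# ['<s>', '<UNK>', 'drinks', 'chocolate', '</s>']
```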

How to create vocabulary V

Two common criteria:

  1. Set a minimum word frequency: words at or above the threshold go into the vocabulary, the rest are mapped to <UNK>.
  2. Set a maximum vocabulary size $|V|$: sort words by frequency, keep the top $|V|$ words, and map the rest to <UNK>.

Although <UNK> can reduce perplexity, avoid mapping too many words to it; otherwise generated sentences will contain many <UNK> tokens.
When comparing perplexity, only compare language models that use the same vocabulary V.

Quiz

Given the training corpus and a minimum word frequency of 2, what would the vocabulary look like after the corpus is preprocessed with <UNK>?
"<s> I am happy I am learning </s> <s> I am happy I can study </s>"
Answer:
V = (I, am, happy)

Smoothing

Missing N-grams in training corpus

Problem: N-grams made of known words may still be missing from the training corpus.
How do we assign a probability to an N-gram whose individual words appear in the corpus but which never occurs as a sequence?
For example, the corpus may contain "John" and "eats" but not "John eats". The count of "John eats" is then 0, its bigram probability is 0, and any sentence containing it gets probability 0.

Smoothing

Add-one smoothing (Laplacian smoothing)

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{\sum_{w \in V} (C(w_{n-1}, w) + 1)} = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + |V|}$$
Add-one smoothing is only appropriate when the corpus is large relative to the vocabulary; otherwise unseen bigrams receive too much probability mass.
With a very large corpus, add-k smoothing can be used instead (and it also applies to higher-order N-grams such as trigrams and 4-grams):
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + k}{\sum_{w \in V} (C(w_{n-1}, w) + k)} = \frac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + k|V|}$$
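A small sketch of add-k smoothing for bigram probabilities (the counts below come from the toy corpus "I am happy I am learning"; the function name is made up):

```python
def add_k_bigram_prob(prev_word, word, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """P(word | prev_word) = (C(prev_word word) + k) / (C(prev_word) + k * |V|)."""
    numer = bigram_counts.get((prev_word, word), 0) + k
    denom = unigram_counts.get(prev_word, 0) + k * vocab_size
    return numer / denom

bigram_counts = {("I", "am"): 2, ("am", "happy"): 1, ("happy", "I"): 1, ("am", "learning"): 1}
unigram_counts = {"I": 2, "am": 2, "happy": 1, "learning": 1}

# "I happy" never occurs, yet it now gets a small non-zero probability.
print(add_k_bigram_prob("I", "happy", bigram_counts, unigram_counts, vocab_size=4, k=1))
# (0 + 1) / (2 + 1 * 4) = 1/6 ≈ 0.167
```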
Advanced methods:
Kneser-Ney smoothing:
Proposed by Reinhard Kneser and Hermann Ney, Kneser-Ney smoothing is a technique for estimating conditional probability distributions.
It adjusts the distribution so that low-frequency or unseen words receive more sensible probability, improving the model's ability to generalize.
It takes a word's context into account, updating probabilities with weights that depend on relative frequencies in the corpus.
It is particularly well suited to large corpora, because it makes effective use of the corpus statistics.

Good-Turing smoothing:
Good-Turing smoothing, introduced by I. J. Good, estimates the probability of words that never appear in the corpus.
Its core idea is to use the frequency of words seen exactly once to estimate the probability mass that should be reserved for unseen words.
It shifts probability mass from high-frequency words to low-frequency ones, especially to words that never appear in the training corpus.

Good-Turing is simple and computationally cheap, but less flexible than Kneser-Ney because it does not distinguish between words in different contexts.
Both methods have strengths and limitations: Kneser-Ney usually performs better in practice because it uses contextual information, at a higher computational cost, while Good-Turing is sometimes preferred for its simplicity and efficiency, especially when resources are limited.

Backoff

If an N-gram is missing, back off to the (N-1)-gram, then the (N-2)-gram, and so on. There are two common variants:
The first uses probability discounting, e.g. Katz backoff, which redistributes the discounted probability mass to the lower-order estimate.
The second simply multiplies the lower-order probability by a constant (0.4 works well in practice): "stupid" backoff.
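A rough sketch of "stupid" backoff for a trigram model (the names and toy counts are made up; the 0.4 constant follows the value mentioned above):

```python
def stupid_backoff_prob(trigram, trigram_counts, bigram_counts, unigram_counts,
                        total_words, alpha=0.4):
    """Back off from trigram to bigram to unigram, multiplying by alpha at each step."""
    w1, w2, w3 = trigram
    if trigram_counts.get((w1, w2, w3), 0) > 0:
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts.get((w2, w3), 0) > 0:
        return alpha * bigram_counts[(w2, w3)] / unigram_counts[w2]
    return alpha * alpha * unigram_counts.get(w3, 0) / total_words

unigram_counts = {"John": 1, "eats": 1, "tea": 1}
bigram_counts = {}    # "John eats" was never observed
trigram_counts = {}
print(stupid_backoff_prob(("<s>", "John", "eats"),
                          trigram_counts, bigram_counts, unigram_counts, total_words=3))
# falls back to the unigram: 0.4 * 0.4 * 1/3 ≈ 0.053
```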

Interpolation

Interpolation is another smoothing technique used in language modeling: it combines the probability estimates of several models (or N-gram orders) to reduce uncertainty and overfitting while improving generalization. The most common form is linear interpolation, a weighted average of the different models' probabilities. For a trigram model, for example:
$$\hat{P}(w_n \mid w_{n-2}\, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}\, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n), \qquad \sum_i \lambda_i = 1$$
The coefficients $\lambda_i$ are learned from data, typically tuned on a validation set.
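A minimal sketch of linear interpolation (the λ values here are placeholders; in practice they are tuned on held-out data):

```python
def interpolated_trigram_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.7, 0.2, 0.1)):
    """P_hat(w_n | w_{n-2} w_{n-1}) = λ1*P_trigram + λ2*P_bigram + λ3*P_unigram."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "lambda weights must sum to 1"
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# Even if the trigram probability is 0, the lower-order terms keep the estimate non-zero.
print(interpolated_trigram_prob(0.0, 0.5, 0.1))  # 0.7*0.0 + 0.2*0.5 + 0.1*0.1 = 0.11
```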

Quiz

Question:
Corpus: "I am happy I am learning"
In the context of our corpus, what is the estimated probability of the word "can" following the word "I", using the bigram model with add-k smoothing where k = 3?
Answer:
P(can|I) = (0 + 3) / (2 + 3 × 4) = 3/14
