09.C2W4.Word Embeddings with Neural Networks

news 2025/10/23 10:24:31

For earlier articles in this series, click here.

Overview
Basic Word Representations
- Integers
- One-hot vectors
Word Embeddings
- Meaning as vectors
- Word embedding vectors
Word embedding process
Word Embedding Methods
- Basic word embedding methods
- Advanced word embedding methods
Continuous Bag-of-Words Model
- Center word prediction: rationale
- Creating a training example
- From corpus to training
Cleaning and Tokenization
- Cleaning and tokenization matters
- Example in Python
  - corpus
  - libraries
  - code
Sliding Window of Words in Python
Transforming Words into Vectors
- Transforming center words into vectors
- Transforming context words into vectors
- Final prepared training set
Architecture of the CBOW Model
Dimensions
- single input
- batch input
Activation Functions
- Rectified Linear Unit (ReLU)
- Softmax
- Softmax: example
Training a CBOW Model: Cost Function
- Loss
- Cross-entropy loss
Training a CBOW Model: Forward Propagation
- Forward propagation
- Cost
Training a CBOW Model: Backpropagation and Gradient Descent
- Backpropagation
- Gradient descent
Extracting Word Embedding Vectors
- option 1
- option 2
- option 3
Evaluating Word Embeddings
- Intrinsic evaluation
- Extrinsic Evaluation

Overview

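This part introduces some basic applications of word embeddings:
[Figure: basic applications of word embeddings]
It also touches on more advanced applications.

Learning objectives (assumes familiarity with neural networks):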
●Identify the key concepts of word representations
●Generate word embeddings
●Prepare text for machine learning
●Implement the continuous bag-of-words model

Basic Word Representations

Integers

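Encode each word with a unique integer. The advantage is simplicity:
[Figure: words mapped to unique integers]
The drawback is that the integers carry no semantic information about the words.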

One-hot vectors

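Represent each word with a 0-1 vector whose length equals the vocabulary size:
[Figure: one-hot vectors, one column per vocabulary word]
Each word is represented by a vector that has 1 in the column for that word and 0 in every other column.

Integers and one-hot vectors can be converted into each other: the integer code is the position of the 1.

One-hot encoding is simple and does not imply any ordering among words,
but it still carries no semantic information.

When the vocabulary is large, the vectors also become very long:
[Figure: one-hot vector for a large vocabulary]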
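
The conversion between integers and one-hot vectors can be sketched in a few lines of NumPy. The five-word vocabulary and the helper names `one_hot` and `from_one_hot` are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

# Hypothetical toy vocabulary; any fixed, ordered word list would do.
vocab = ['am', 'because', 'happy', 'i', 'learning']
word_to_index = {word: idx for idx, word in enumerate(vocab)}

def one_hot(index, size):
    """Return a vector of zeros with a single 1 at position `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def from_one_hot(vec):
    """Recover the integer code: the position of the 1."""
    return int(np.argmax(vec))

happy_id = word_to_index['happy']          # integer representation
happy_vec = one_hot(happy_id, len(vocab))  # one-hot representation
```

Note that the one-hot vector grows with the vocabulary, while the integer code stays a single number.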

Word Embeddings

Meaning as vectors

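Can a vector carry meaning? Yes. A low-dimensional example makes this concrete:
[Figure: words placed on a one-dimensional sentiment axis]
The figure is a sentiment-scoring example: it places words on a single axis according to their sentiment score.

It contains 8 words: spider, boring, kitten, happy, anger, paper, excited, rage.
The scores shown are -2.52, -2.08, -1.53, -0.91, 0.03, 1.09 and 2.31 (only seven scores are given).
Words with negative connotations, such as rage and anger, fall on the negative side of the axis.
A neutral word such as paper sits near 0.
Words with positive connotations, such as happy and excited, fall on the positive side.
The ruler at the bottom of the figure runs from -2 to 2 and is divided into three regions: negative, 0 (neutral), and positive.
A y-axis can be added to represent how abstract or concrete a word is, for example:
[Figure: words placed on a two-dimensional sentiment and abstractness plane]
Such a low-dimensional representation loses precision: for example, spider and snake end up at the same point, which is unreasonable.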

Word embedding vectors

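Word embedding vectors have two advantages:
●Low dimension (compared with one-hot vectors)
●Embed meaning
[Figure: word embedding vectors]
Note:
One-hot vectors and word embedding vectors are both word vectors, but in many contexts "word vectors" refers specifically to the latter, which are also called word embeddings.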

Word embedding process

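The corpus matters a great deal for generating word embeddings. To embed the vocabulary of a specific domain, include as much text from that domain as possible, because a word's meaning depends heavily on its context: apple is a fruit in an agricultural corpus and a company in a technology corpus.
The embedding method is usually a machine learning model trained in a self-supervised way.
The overall process looks roughly like this:
[Figure: word embedding process]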

Word Embedding Methods

Basic word embedding methods

●word2vec (Google, 2013)
○Continuous bag of words (CBOW)
○Continuous skip gram / Skip gram with negative sampling (SGNS)
●Global Vectors (GloVe) (Stanford, 2014)
●fastText (Facebook, 2016)
○Supports out of vocabulary (OOV) words
○Very fast to train

Advanced word embedding methods

Deep learning, contextual embeddings
●BERT (Google, 2018)
●ELMo (Allen Institute for AI, 2018)
●GPT-2 (OpenAI, 2019)
…
These are all pre-trained models that can be fine-tuned.

Continuous Bag-of-Words Model

Center word prediction: rationale

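Word vectors are a by-product of the CBOW task. The main task is prediction: predicting the center word from its context:
[Figure: predicting a missing center word from its context]
A word is related to its context, so, as in the figure above, a model trained on a large enough corpus learns to predict that the missing word is related to dogs.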

Creating a training example

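[Figure: a training example taken from "I am happy because I am learning"]
Center word: in this example the center word is "happy".
Context words: the words surrounding the center word, which supply the context. Here they are "I", "am", "because" and "I" ("I" appears twice).
Window size: the total number of words in the window, including the center word. Here the window size is 5.
Context half-size: the number of context words on each side of the center word. Here it is 2, so the window extends 2 words to the left and 2 words to the right of the center word.
Window: the center word together with the context words on both sides of it, so window size = 2 × context half-size + 1.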

From corpus to training

Applying the scheme above to the sentence "I am happy because I am learning" with a window size of 5 gives the following training examples:

Context words	Center word
I am because I	happy
am happy I am	because
happy because am learning	I

Cleaning and Tokenization

Cleaning and tokenization matters

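Data cleaning is an important preprocessing step. Its purpose is to improve the quality of the text so that it is better suited to later analysis and model training.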
●Letter case
●Punctuation
●Numbers
●Special characters
●Special words
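
Letter case:
Cleaning may convert all text to lowercase (or uppercase) to remove differences caused by letter case.
For example, "Hello" and "hello" both become "hello", so the model does not treat them as two different words.

Punctuation:
Cleaning may delete or replace the punctuation in the text, because it is unimportant for some NLP tasks or can interfere with the model.
For example, removing the exclamation mark and question mark from "Hello! How are you?" gives "Hello How are you".

Numbers:
Cleaning usually replaces or deletes the numbers in the text, because they carry no meaning for some text-analysis tasks or introduce noise.
For example, deleting the "3" from "I have 3 apples" gives "I have apples".

Special characters:
Special characters are non-alphanumeric symbols such as @, #, $ and %. Removing them simplifies the text and keeps them from interfering with the model.
For example, deleting "@" and "." from "email@example.com" gives "emailexamplecom".

Special words:
Cleaning may remove words that are common but unhelpful for the analysis, such as stop words ("and", "the", etc.) or particular domain-specific terms.
For example, removing stop words such as "the" and "over" from "The quick brown fox jumps over the lazy dog".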

Example in Python

corpus

Who ❤️"word embeddings" in 2020? I do!!!

libraries

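# pip install nltk
# pip install emoji
import re  # regular expressions, used by re.sub in the code below
import nltk
from nltk.tokenize import word_tokenize
import emoji
nltk.download('punkt')  # download the pre-trained Punkt tokenizer for English
# newer NLTK releases may also require nltk.download('punkt_tab')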

code

corpus = 'Who ❤️"word embeddings" in 2020? I do!!!'
data = re.sub(r'[,!?;-]+', '.', corpus)  # replace each run of , ! ? ; - with a single '.'

Result:
Who ❤️"word embeddings" in 2020. I do.

data = nltk.word_tokenize(data)  # tokenize string to words

Result:
['Who', '❤️', '``', 'word', 'embeddings', "''", 'in', '2020', '.', 'I', 'do', '.']

# keep alphabetic tokens, '.', and emoji; lowercase everything
# emoji.get_emoji_regexp() was removed in emoji 2.0, so emoji.emoji_count() is used here instead
data = [ch.lower() for ch in data
        if ch.isalpha() or ch == '.' or emoji.emoji_count(ch) > 0]

Result:
['who', '❤️', 'word', 'embeddings', 'in', '.', 'i', 'do', '.']

Sliding Window of Words in Python

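def get_windows(words, C):
    i = C  # index of the first word that has C words on its left
    while i < len(words) - C:
        center_word = words[i]
        # C words before the center word plus C words after it
        context_words = words[(i - C):i] + words[(i + 1):(i + C + 1)]
        yield context_words, center_word
        i += 1

[Figure: the window sliding over the sentence]
The index i starts at i = C = 2, which is the index of the first center word, happy. The loop stops once i reaches len(words) - C, so the last center word is the third word from the end, and i advances one word per iteration.
yield turns the function into a generator, so it hands back one (context_words, center_word) pair per iteration.

for x, y in get_windows(['i', 'am', 'happy', 'because', 'i', 'am', 'learning'], 2):
    print(f'{x}\t{y}')

Result:
['i', 'am', 'because', 'i']	happy
['am', 'happy', 'i', 'am']	because
['happy', 'because', 'am', 'learning']	i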

Transforming Words into Vectors

With the context words and center words in hand, the next step is to turn them into vectors.

Transforming center words into vectors

Corpus: I am happy because I am learning
Vocabulary: am, because, happy, I, learning
Each center word is represented by its one-hot vector:
[Figure: one-hot vectors of the vocabulary words]

Transforming context words into vectors

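The context is represented by the average of the one-hot vectors of its words. For the center word happy:
[Figure: averaging the one-hot vectors of the context words I, am, because, I]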

Final prepared training set

Context words	Context words vector	Center word	Center word vector
I am because I	[0.25; 0.25; 0; 0.5; 0]	happy	[0; 0; 1; 0; 0]
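
The row above can be reproduced with a short sketch. It assumes the vocabulary is lowercased and sorted alphabetically, and the helper names `word_to_one_hot` and `context_to_vector` are hypothetical, introduced here only for illustration.

```python
import numpy as np

# Vocabulary from the text, lowercased and sorted alphabetically (an assumption).
vocab = ['am', 'because', 'happy', 'i', 'learning']
word2ind = {word: idx for idx, word in enumerate(vocab)}
V = len(vocab)

def word_to_one_hot(word):
    """One-hot vector of a single word."""
    vec = np.zeros(V)
    vec[word2ind[word]] = 1.0
    return vec

def context_to_vector(context_words):
    """Average the one-hot vectors of all context words."""
    return np.mean([word_to_one_hot(w) for w in context_words], axis=0)

x = context_to_vector(['i', 'am', 'because', 'i'])  # context words vector
y = word_to_one_hot('happy')                        # center word vector
```

Because "i" appears twice among the four context words, its entry in `x` is 0.5 while "am" and "because" each get 0.25.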

Architecture of the CBOW Model

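[Figure: architecture of the CBOW model]
CBOW is a shallow feed-forward neural network with an input layer, one hidden layer, and an output layer. The hidden and output layers each have weights, a bias, and an activation function that applies a non-linear transformation.
Input layer: receives the context words vector prepared in the previous section, that is, the average of the one-hot vectors of the context words taken from "I am happy because I am learning".

Context words and center word: the context words are the input of the model, and the center word is the target that the model tries to predict.

W1, W2 (weights): the weight matrices of the network. W1 connects the input layer to the hidden layer, and W2 connects the hidden layer to the output layer.

b1, b2 (biases): the bias vectors added in the hidden layer and the output layer before the activation function is applied.

Hidden layer: comes after the input layer. Its neurons process the input and extract features.

Output layer: comes after the hidden layer. It has one neuron per vocabulary word and produces the final prediction.

Vector: words and contexts are converted into fixed-size numeric vectors so that the network can process them.

ReLU (Rectified Linear Unit): a common activation function that adds non-linearity and helps the model learn more complex features. It is used in the hidden layer.

softmax: the activation function of the output layer. It turns the output into a probability distribution, as in any multi-class classification task.

V = 5: the vocabulary size. Because the inputs are built from one-hot vectors, it is also the dimension of the input vector.

X: the input, either a single context words vector or a matrix of such vectors.

Other hyperparameters can be configured as well, for example N: word embedding size.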

Dimensions

single input

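[Figure: matrix dimensions for a single input]
If the input is a row vector rather than a column vector, use the transposes of the matrices and reverse the order of the factors in each matrix product.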

batch input

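The walkthrough above used a single example as input to show the dimensions of each part of CBOW. In practice, to speed up computation, a batch of examples is usually passed in at once. batch_size is a hyperparameter; the figure below shows the case batch_size = m:
[Figure: matrix dimensions for a batch input of m examples]
The column vectors of the m examples are placed side by side to form the input matrix.
The bias is now written as a capital B. The earlier bias b has size N×1; when it is added to the N×m matrix, NumPy broadcasts it automatically, expanding it to size N×m.

Note how the vectors in the input matrix correspond to the predictions in the output matrix (the green parts in the figure):
[Figure: correspondence between input columns and predicted columns]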
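
The broadcasting step can be checked with a small sketch. The sizes V = 5, N = 3, m = 4 and the random values are placeholders chosen only for illustration, and the column-stacked layout is assumed.

```python
import numpy as np

V, N, m = 5, 3, 4                 # vocabulary size, embedding size, batch size (toy values)
rng = np.random.default_rng(0)

X = rng.random((V, m))            # m context vectors stacked as columns
W1 = rng.random((N, V))           # input-to-hidden weights
b1 = rng.random((N, 1))           # hidden-layer bias, a single column

Z1 = W1 @ X + b1                  # b1 is broadcast across the m columns
B1 = np.tile(b1, (1, m))          # the explicit N x m matrix that broadcasting stands for
```

`Z1` has shape (N, m), and adding the explicit matrix `B1` gives the same result as adding `b1` directly.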

Activation Functions

Rectified Linear Unit (ReLU)

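ReLU is straightforward and has many variants, for example Leaky ReLU.
The input is first transformed by $W_1$ and $b_1$, and the result is passed through ReLU:
$z_1 = W_1 x + b_1\\ h= ReLU(z_1)$
[Figure: position of ReLU in the network]
The ReLU formula is:
$ReLU(x)=\max(0,x)$
Its graph is zero for negative inputs and the identity line for positive inputs.

Applied to a vector $z_1$, ReLU works element-wise: negative entries of $z_1$ become 0 in h and all other entries are unchanged.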
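
A minimal sketch of ReLU in NumPy; the values in `z1` are made up for illustration.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

z1 = np.array([[5.1], [-0.3], [-4.6], [0.2]])   # made-up hidden-layer inputs
h = relu(z1)                                    # negative entries are clipped to 0
```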

Softmax

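Softmax takes a linear transformation of the hidden-layer output as its input:
$z = W_2 h + b_2\\ \hat y=softmax(z)$
Softmax maps a set of real numbers to a set of numbers between 0 and 1 that sum to 1, so they can be read as probabilities.
[Figure: softmax turning scores into probabilities]
For the CBOW model, the outputs are the predicted probabilities of each vocabulary word being the center word.

The formula for $\hat y_i$ is given below. It amounts to normalizing each $e^{z_i}$ so that the probabilities sum to 1.
$\hat y_i=\cfrac{e^{z_i}}{\sum_{j=1}^Ve^{z_j}}$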

Softmax: example

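The final prediction is happy, because it has the largest probability.
[Figure: softmax output in which happy has the highest probability]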
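
The softmax formula and the arg-max prediction can be sketched as follows. The score vector `z` is made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the formula in the text.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D score vector; subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

vocab = ['am', 'because', 'happy', 'i', 'learning']
z = np.array([1.0, 0.5, 3.0, 1.5, 0.2])      # made-up scores, one per vocabulary word
y_hat = softmax(z)                           # probabilities that sum to 1
predicted = vocab[int(np.argmax(y_hat))]     # word with the largest probability
```

Subtracting a constant from every score does not change the result, because the factor cancels between numerator and denominator.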

Training a CBOW Model: Cost Function

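Loss

In machine learning, "loss" refers to the difference, or error, between the model's prediction and the actual value. Training aims to minimize the loss function so that the predictions move closer to the true values.

More precisely, the loss function is a measure of model performance that quantifies the gap between predicted and true values. Different tasks use different loss functions. For example:
For classification problems, a common choice is cross-entropy loss.
For regression problems, a common choice is mean squared error (MSE).
[Figure: loss as the gap between prediction and true value]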

Cross-entropy loss

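The loss function used by CBOW has the form:
$J=-\sum_{k=1}^Vy_k\log \hat y_k$
The true value $y$ is the one-hot vector of the actual center word, and the prediction $\hat y$ is the softmax output:
[Figure: form of the true value and the prediction]
For the corpus:
I am happy because I am learning
the center word of the first five words is happy. Suppose the model's prediction and the true value are given for this example.

Following the formula, take the logarithm of the prediction, multiply it element-wise by the true value, and sum the result.

When the prediction is close to the true value, the loss is small.
Next, consider the case where the model predicts am as the center word:
[Figure: loss when the model predicts am instead of happy]
Because $y_k$ is 0 for every word except the actual center word, the loss computation simplifies to:
$J=-\log \hat y_{actual\space word}$
For example, if the predicted probability of the actual center word is 0.01:

J = -log 0.01 = 4.61. Note that log here denotes the natural logarithm (ln).
The simplified formula can be plotted as a function of the predicted probability.

The larger the predicted probability of the correct center word, the smaller the loss, and vice versa.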
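
The two cases can be compared numerically with a short sketch. Both prediction vectors are made-up values chosen for illustration; the second one assigns probability 0.01 to the true center word, matching the example above.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """J = -sum_k y_k * log(y_hat_k), using the natural logarithm."""
    return float(-np.sum(y * np.log(y_hat)))

# Vocabulary order: am, because, happy, i, learning
y = np.array([0, 0, 1, 0, 0])                         # true center word: happy
y_hat_good = np.array([0.1, 0.05, 0.6, 0.15, 0.1])    # made-up prediction favouring happy
y_hat_bad = np.array([0.96, 0.01, 0.01, 0.01, 0.01])  # made-up prediction favouring am

loss_good = cross_entropy(y, y_hat_good)   # equals -ln(0.6)
loss_bad = cross_entropy(y, y_hat_bad)     # equals -ln(0.01)
```

Only the entry of the prediction at the position of the true word contributes, which is exactly the simplified form of the loss.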

Training a CBOW Model: Forward Propagation

The full training process consists of:
●Forward propagation
●Cost
●Backpropagation and gradient descent

Forward propagation

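Forward propagation was already covered in the section on the CBOW architecture. Try to describe the figure below in your own words (note that it uses batch mode):
[Figure: forward propagation in batch mode]
Can you write out the corresponding formulas?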
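
One way to write the batch-mode formulas, consistent with the single-example formulas in the Activation Functions section, is $Z_1 = W_1X + B_1$, $H = ReLU(Z_1)$, $Z_2 = W_2H + B_2$, $\hat Y = softmax(Z_2)$. The sketch below implements them under the assumption that examples are stacked as columns; the function name `forward_prop`, the toy sizes and the random values are illustrative only.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def softmax(Z):
    """Column-wise softmax: each column of Z holds the scores of one example."""
    E = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return E / np.sum(E, axis=0, keepdims=True)

def forward_prop(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1        # (N, m), b1 broadcast over the columns
    H = relu(Z1)            # (N, m)
    Z2 = W2 @ H + b2        # (V, m), b2 broadcast over the columns
    Y_hat = softmax(Z2)     # (V, m), one probability distribution per column
    return Y_hat, H

V, N, m = 5, 3, 4           # toy sizes
rng = np.random.default_rng(1)
X = rng.random((V, m))
W1, b1 = rng.random((N, V)), rng.random((N, 1))
W2, b2 = rng.random((V, N)), rng.random((V, 1))
Y_hat, H = forward_prop(X, W1, b1, W2, b2)
```

`H` is returned as well because the gradient of the cost with respect to $W_2$ needs it.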

Cost

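"Cost" and "loss" are both used for functions that measure the difference between the model's predictions and the actual values. They are often used interchangeably, but strictly speaking they differ: the loss function usually applies to a single example, while the cost function applies to the whole dataset. In practice, "minimizing the loss" usually means minimizing the cost function, because that is the overall objective optimized during training.
In this section the cost is the average loss over a batch. For a batch of m examples:
$J_{batch}=-\cfrac{1}{m}\sum_{i=1}^m\sum_{j=1}^Vy_j^{(i)}\log \hat y_j^{(i)}$
Writing $J^{(i)}$ for the loss of example i, this simplifies to:
$J_{batch}=\cfrac{1}{m}\sum_{i=1}^mJ^{(i)}$
[Figure: batch cost as the mean of the per-example losses]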
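
A minimal sketch of the batch cost, assuming the true values and predictions are stored with one example per column. The two-example batch below is made up for illustration.

```python
import numpy as np

def batch_cost(Y, Y_hat):
    """Mean cross-entropy over the m columns (examples) of Y and Y_hat."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(Y_hat)) / m)

# Made-up batch with V = 3 words and m = 2 examples (one example per column).
Y = np.array([[0, 1],
              [1, 0],
              [0, 0]])
Y_hat = np.array([[0.2, 0.5],
                  [0.7, 0.25],
                  [0.1, 0.25]])

cost = batch_cost(Y, Y_hat)
per_example = -np.log(np.array([0.7, 0.5]))   # J^(i) for each example
```

The cost equals the mean of the per-example losses, which is the simplified form of the formula.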

Training a CBOW Model: Backpropagation and Gradient Descent

The goal of training is to minimize the cost. The batch cost is a function of four variables:
$J_{batch}=f(W_1,W_2,b_1,b_2)$
●Backpropagation: calculate partial derivatives of cost with respect to weights and biases
●Gradient descent: update weights and biases

Backpropagation

$\cfrac{\partial J_{batch}}{\partial W_1}=\cfrac{1}{m}ReLU\left(W_2^\intercal (\hat Y-Y)\right)X^\intercal$
$\cfrac{\partial J_{batch}}{\partial W_2}=\cfrac{1}{m}(\hat Y-Y)H^\intercal$
$\cfrac{\partial J_{batch}}{\partial b_1}=\cfrac{1}{m}ReLU\left(W_2^\intercal (\hat Y-Y)\right)1_m^\intercal$
$\cfrac{\partial J_{batch}}{\partial b_2}=\cfrac{1}{m}(\hat Y-Y)1_m^\intercal$
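Here $1_m$ is a row vector of m ones. Its transpose is a column vector, and multiplying a matrix by it on the right sums each row of the matrix:
[Figure: multiplying a matrix by $1_m^\intercal$ sums each row]
In practice this is implemented with NumPy's sum function:

import numpy as np
# code to initialize matrix a omitted
np.sum(a, axis=1, keepdims=True)  # sum over the columns, giving one value per row

Backpropagation computes the partial derivatives with the chain rule. The derivation is not expanded here; existing functions can be used to carry out the computation.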
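
The four gradient formulas can be transcribed into NumPy as a sketch. It follows the formulas exactly as they are written above (including the ReLU term) and assumes examples are stacked as columns; the function name `back_prop` and all toy inputs are illustrative assumptions.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def back_prop(X, Y, Y_hat, H, W2):
    """Gradients of the batch cost, transcribed from the formulas in the text."""
    m = X.shape[1]
    diff = Y_hat - Y                                       # (V, m)
    hidden = relu(W2.T @ diff)                             # (N, m)
    grad_W1 = hidden @ X.T / m                             # (N, V)
    grad_W2 = diff @ H.T / m                               # (V, N)
    grad_b1 = np.sum(hidden, axis=1, keepdims=True) / m    # (N, 1), row sums replace 1_m^T
    grad_b2 = np.sum(diff, axis=1, keepdims=True) / m      # (V, 1)
    return grad_W1, grad_W2, grad_b1, grad_b2

V, N, m = 5, 3, 4                      # toy sizes
rng = np.random.default_rng(2)
X = rng.random((V, m))
Y = np.eye(V)[:, :m]                   # one-hot targets, one per column
Y_hat = np.full((V, m), 1.0 / V)       # a uniform prediction, just to have valid inputs
H = rng.random((N, m))
W2 = rng.random((V, N))
grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(X, Y, Y_hat, H, W2)
```

Each gradient has the same shape as the parameter it belongs to, which is what the update step in gradient descent requires.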

Gradient descent

Hyperparameter: learning rate $\alpha$
$W_1:= W_1-\alpha\cfrac{\partial J_{batch}}{\partial W_1}$
$W_2:= W_2-\alpha\cfrac{\partial J_{batch}}{\partial W_2}$
$b_1:= b_1-\alpha\cfrac{\partial J_{batch}}{\partial b_1}$
$b_2:= b_2-\alpha\cfrac{\partial J_{batch}}{\partial b_2}$
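
The update rule is the same for all four parameters, so it can be sketched with one helper. The dictionary layout, the name `gradient_descent_step` and the toy values are assumptions made for illustration.

```python
import numpy as np

def gradient_descent_step(params, grads, alpha):
    """Return updated parameters: theta := theta - alpha * dJ/dtheta."""
    return {name: params[name] - alpha * grads[name] for name in params}

# Made-up parameters and gradients, only to show the arithmetic.
params = {'W1': np.ones((3, 5)), 'b1': np.zeros((3, 1))}
grads = {'W1': np.full((3, 5), 0.5), 'b1': np.full((3, 1), -1.0)}

updated = gradient_descent_step(params, grads, alpha=0.1)
```

With a learning rate of 0.1, every entry of `W1` moves from 1 to 0.95 and every entry of `b1` moves from 0 to 0.1.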

Extracting Word Embedding Vectors

There are three options.

option 1

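Use each column of $W_1$ as the embedding column vector of one vocabulary word. $W_1$ has V columns, one per vocabulary word, in the same order as the entries of the input X (see the blue parts of the figure):
[Figure: columns of $W_1$ as word embeddings]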

option 2

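Use each row of $W_2$ as the embedding row vector of one vocabulary word. $W_2$ has V rows, one per vocabulary word, in the same vocabulary order as the entries of the input X (see the blue parts of the figure):
[Figure: rows of $W_2$ as word embeddings]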

option 3

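Combine the two options above into an N×V matrix $W_3$, and use each of its columns as the embedding column vector of one vocabulary word:
$W_3=0.5(W_1+W_2^T)$
[Figure: columns of $W_3$ as word embeddings]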
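
The three options can be sketched as follows. The random matrices stand in for trained weights, and the sizes and vocabulary are toy assumptions made for illustration.

```python
import numpy as np

vocab = ['am', 'because', 'happy', 'i', 'learning']
V, N = len(vocab), 3
rng = np.random.default_rng(3)
W1 = rng.random((N, V))     # stand-in for trained input-to-hidden weights
W2 = rng.random((V, N))     # stand-in for trained hidden-to-output weights

emb_option1 = W1                     # option 1: columns of W1, shape (N, V)
emb_option2 = W2.T                   # option 2: rows of W2, transposed into columns
emb_option3 = 0.5 * (W1 + W2.T)      # option 3: average of the two, shape (N, V)

# Embedding of 'happy' under option 3: the column at its vocabulary index.
happy_vec = emb_option3[:, vocab.index('happy')]
```

Transposing $W_2$ first puts all three options in the same layout, with one column per word.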

Evaluating Word Embeddings

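There are two main kinds: intrinsic evaluation and extrinsic evaluation. Intrinsic evaluation tests how well the embeddings themselves capture relationships between words, while extrinsic evaluation tests how well they work in a real application. Both matter: embeddings may do well on intrinsic tests, but if they do not effectively support the end application they are not a success. In practice the two methods are usually combined to get a full picture of embedding quality.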

Intrinsic evaluation

●Analogies
●Clustering
●Visualization
Analogies test relationships between words. Three common cases:

Analogies	example
Semantic analogies	“France” is to “Paris” as “Italy” is to <?>
Syntactic analogies	“seen” is to “saw” as “been” is to <?>
Ambiguity	“wolf” is to “pack” as “bee” is to <?> → swarm? colony?
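
A common way to run an analogy test is vector arithmetic followed by a cosine-similarity search. The sketch below assumes that approach; the 4-dimensional embeddings are made up for illustration and are not trained vectors, and `solve_analogy` is a hypothetical helper name.

```python
import numpy as np

# Made-up embeddings: three "country" dimensions plus one "capital" dimension.
embeddings = {
    'France':  np.array([1.0, 0.0, 0.0, 0.0]),
    'Paris':   np.array([1.0, 0.0, 0.0, 1.0]),
    'Italy':   np.array([0.0, 1.0, 0.0, 0.0]),
    'Rome':    np.array([0.0, 1.0, 0.0, 1.0]),
    'Germany': np.array([0.0, 0.0, 1.0, 0.0]),
    'Berlin':  np.array([0.0, 0.0, 1.0, 1.0]),
}

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c):
    """Find d such that a is to b as c is to d, excluding the three input words."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

answer = solve_analogy('France', 'Paris', 'Italy')
```

With these toy vectors the answer is "Rome"; with real embeddings the nearest word may be ambiguous, as in the wolf/bee row of the table.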
Clustering

[Figure: clustering example]
Visualization

Extrinsic Evaluation

Use another task to test the performance of the word vectors:
e.g. named entity recognition, parts of speech tagging
[Figure: extrinsic evaluation on a downstream task]
