
C2W4.LAB.Word_Embedding.Part1

news 2025/7/11 8:11:13

Theory lecture: C2W4.Word Embeddings with Neural Networks

Table of contents

Word Embeddings First Steps: Data Preparation
- Cleaning and tokenization
- Sliding window of words
- Transforming words into vectors for the training set
  - Mapping words to indices and indices to words
  - Getting one-hot word vectors
  - Getting context word vectors
- Building the training set
Intro to CBOW model
- The continuous bag-of-words model
- Activation functions
  - ReLU
  - Softmax
- Dimensions: 1-D arrays vs 2-D column vectors


Word Embeddings First Steps: Data Preparation

First, import the required packages.

import re
import nltk
# nltk.download('punkt')
import emoji
#pip install emoji -i https://pypi.tuna.tsinghua.edu.cn/simple
import numpy as np
from nltk.tokenize import word_tokenize
from utils2 import get_dict

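In the data preparation stage, start from a text corpus and:

- Clean and tokenize the corpus.
- Extract pairs of context words and center words, which make up the training set for the CBOW model. The context words are the features fed into the model, and the center word is the target the model learns to predict.
- Create simple vector representations of the context words (features) and center words (targets) for the CBOW model's neural network to consume.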

Cleaning and tokenization

To demonstrate cleaning and tokenization, first create a corpus that contains emoji and a variety of punctuation marks.

# Define a corpus
corpus = 'Who ❤️ "word embeddings" in 2020? I do!!!'

First, replace all interrupting punctuation marks (such as commas and exclamation marks) with periods.

# Print original corpus
print(f'Corpus:  {corpus}')

# Do the substitution
data = re.sub(r'[,!?;-]+', '.', corpus)

# Print cleaned corpus
print(f'After cleaning punctuation:  {data}')

Output:

Corpus:  Who ❤️ "word embeddings" in 2020? I do!!!
After cleaning punctuation:  Who ❤️ "word embeddings" in 2020. I do.
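
The `+` quantifier in the pattern is what collapses a run such as `!!!` into a single period. The following standalone check illustrates this; the sample strings are made up for illustration and are not part of the lab.

```python
import re

# Same pattern as above: one or more of , ! ? ; -
PUNCT = re.compile(r'[,!?;-]+')

# Example strings chosen for illustration only
samples = ['I do!!!', 'Wait, what?!', 'state-of-the-art']
cleaned = [PUNCT.sub('.', s) for s in samples]

for before, after in zip(samples, cleaned):
    print(f'{before!r} -> {after!r}')
```

Because the hyphen is also in the character class, hyphenated words are split at each hyphen.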

Next, use NLTK to split the corpus into individual tokens.

# Print cleaned corpus
print(f'Initial string:  {data}')

# Tokenize the cleaned corpus
data = nltk.word_tokenize(data)

# Print the tokenized version of the corpus
print(f'After tokenization:  {data}')

Output:

Initial string:  Who ❤️ "word embeddings" in 2020. I do.
After tokenization:  ['Who', '❤️', '``', 'word', 'embeddings', "''", 'in', '2020', '.', 'I', 'do', '.']

Finally, remove numbers and all punctuation other than periods, and convert the remaining tokens to lowercase.

# emoji.get_emoji_regexp() has been removed from the emoji package, so an equivalent helper is defined here
def get_emoji_regexp():
    # Sort emoji by length to make sure multi-character emojis are
    # matched first
    emojis = sorted(emoji.EMOJI_DATA, key=len, reverse=True)
    pattern = u'(' + u'|'.join(re.escape(u) for u in emojis) + u')'
    return re.compile(pattern)

# Print the tokenized version of the corpus
print(f'Initial list of tokens:  {data}')

# Filter tokenized corpus using list comprehension
# (get_emoji_regexp is the helper defined above)
data = [ ch.lower() for ch in data
         if ch.isalpha()
         or ch == '.'
         or get_emoji_regexp().search(ch)
       ]

# Print the tokenized and filtered version of the corpus
print(f'After cleaning:  {data}')

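Output:
[Image in the original post: output of the cell above]
Note that emoji are treated as tokens just like ordinary words.
The steps above can be wrapped in a single function to simplify cleaning and tokenization.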

# Define the 'tokenize' function that will include the steps previously seen
def tokenize(corpus):
    data = re.sub(r'[,!?;-]+', '.', corpus)
    data = nltk.word_tokenize(data)  # tokenize string to words
    data = [ ch.lower() for ch in data
             if ch.isalpha()
             or ch == '.'
             or get_emoji_regexp().search(ch)
           ]
    return data

Test the function:

# Define new corpus
corpus = 'I am happy because I am learning'

# Print new corpus
print(f'Corpus:  {corpus}')

# Save tokenized version of corpus into 'words' variable
words = tokenize(corpus)

# Print the tokenized version of the corpus
print(f'Words (tokens):  {words}')

Output:
Corpus:  I am happy because I am learning
Words (tokens):  ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']

# Run this with any sentence
tokenize("Now it's your turn: try with your own sentence!")

Output:
['now', 'it', 'your', 'turn', 'try', 'with', 'your', 'own', 'sentence', '.']

Sliding window of words

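The function above turns the corpus into a clean list of tokens. The next step is to slide a window of words over this list so that the center word and context words can be extracted for each window.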

# Define the 'get_windows' function
def get_windows(words, C):
    i = C
    while i < len(words) - C:
        center_word = words[i]
        context_words = words[(i - C):i] + words[(i+1):(i+C+1)]
        yield context_words, center_word
        i += 1

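The first argument of this function is a list of words (or tokens). The second argument, C, is the context half-size. For a given center word, the context words are the C words to its left and the C words to its right.
The example below shows how to use this function to extract context words and center words from a list of tokens. These context words and center words make up the training set used to train the CBOW model.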

# Print 'context_words' and 'center_word' for the new corpus with a 'context half-size' of 2
for x, y in get_windows(['i', 'am', 'happy', 'because', 'i', 'am', 'learning'], 2):
    print(f'{x}\t{y}')

Output:
['i', 'am', 'because', 'i']    happy
['am', 'happy', 'i', 'am']    because
['happy', 'because', 'am', 'learning']    i
For the first example, the context words are: 'i', 'am', 'because', 'i'
The center word to predict is: 'happy'
Now try an example with a context half-size of 1:

# Print 'context_words' and 'center_word' for any sentence with a 'context half-size' of 1
for x, y in get_windows(tokenize("Now it's your turn: try with your own sentence!"), 1):
    print(f'{x}\t{y}')

Output:
['now', 'your']    it
['it', 'turn']    your
['your', 'try']    turn
['turn', 'with']    try
['try', 'your']    with
['with', 'own']    your
['your', 'sentence']    own
['own', '.']    sentence
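
The loop bounds in `get_windows` imply that a list of `len(words)` tokens yields `len(words) - 2*C` windows (or none when the list is too short), because the first and last C tokens never become center words. A minimal sketch that checks this on the example sentence; the function is restated compactly here so the block runs on its own.

```python
def get_windows(words, C):
    """Yield (context_words, center_word) pairs, as defined above."""
    i = C
    while i < len(words) - C:
        yield words[i - C:i] + words[i + 1:i + C + 1], words[i]
        i += 1

words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']

# Count the windows produced for several context half-sizes
window_counts = {C: len(list(get_windows(words, C))) for C in (1, 2, 3, 4)}
print(window_counts)
```

With 7 tokens, a half-size of 4 would need 9 tokens for a single window, so nothing is produced in that case.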

Transforming words into vectors for the training set

Mapping words to indices and indices to words

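Both the context words and the center word are represented as one-hot vectors. To create one-hot vectors, first map each unique word to a unique integer (or index). The helper function get_dict is provided for this; it builds and returns Python dictionaries that map words to indices and indices to words.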

# Get 'word2Ind' and 'Ind2word' dictionaries for the tokenized corpus
word2Ind, Ind2word = get_dict(words)
print(word2Ind)

Output:
{'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
This dictionary can be used to look up the index of a word:

# Print value for the key 'i' within word2Ind dictionary
print("Index of the word 'i':  ",word2Ind['i'])

Output:
Index of the word 'i':   3
Printing the Ind2word dictionary gives:
{0: 'am', 1: 'because', 2: 'happy', 3: 'i', 4: 'learning'}
To look up a word by its index:

# Print value for the key '2' within Ind2word dictionary
print("Word which has index 2:  ",Ind2word[2] )

Output:
Word which has index 2:   happy

Finally, save the size of the vocabulary.

# Save length of word2Ind dictionary into the 'V' variable
V = len(word2Ind)

# Print length of word2Ind dictionary
print("Size of vocabulary: ", V)

Output:
Size of vocabulary: 5
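
The source of `get_dict` in `utils2` is not shown in this post. The sketch below is a hypothetical stand-in (`get_dict_sketch` is an invented name) that assumes indices are assigned in sorted order of the unique words, which is consistent with the dictionaries printed above but is not confirmed to be the course implementation.

```python
def get_dict_sketch(words):
    """Hypothetical stand-in for utils2.get_dict (an assumption, not the course code)."""
    vocab = sorted(set(words))  # unique words in alphabetical order
    word2Ind = {word: idx for idx, word in enumerate(vocab)}
    Ind2word = {idx: word for idx, word in enumerate(vocab)}
    return word2Ind, Ind2word

words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']
word2Ind, Ind2word = get_dict_sketch(words)
print(word2Ind)
print(Ind2word)
```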

Getting one-hot word vectors

An integer $n$ is easy to convert into a one-hot vector. Here $n$ is the index of a word; for example, the index of the word happy is 2.

# Save index of word 'happy' into the 'n' variable
n = word2Ind['happy']

# Print index of word 'happy'
n

Building the one-hot vector is straightforward. First create an array with the same length as the vocabulary, initialized to 0:

# Create vector with the same length as the vocabulary, filled with zeros
center_word_vector = np.zeros(V)

# Print vector
center_word_vector

Output:
array([0., 0., 0., 0., 0.])
Check that the length of the array (or vector) matches the size of the vocabulary:

# Assert that the length of the vector is the same as the size of the vocabulary
len(center_word_vector) == V

Output: True
Then set the element at the word's index to 1:

# Replace element number 'n' with a 1
center_word_vector[n] = 1
# Print vector
center_word_vector

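Output: array([0., 0., 1., 0., 0.])
These steps can be combined into a convenient function whose parameters are the word to encode, the dictionary that maps words to indices, and the size of the vocabulary.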

# Define the 'word_to_one_hot_vector' function that will include the steps previously seen
def word_to_one_hot_vector(word, word2Ind, V):
    one_hot_vector = np.zeros(V)
    one_hot_vector[word2Ind[word]] = 1
    return one_hot_vector

Test the function:

# Print output of 'word_to_one_hot_vector' function for word 'happy'
word_to_one_hot_vector('happy', word2Ind, V)

Output: array([0., 0., 1., 0., 0.])
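
One way to read the result: the one-hot vector for index n is row n of the V × V identity matrix. The sketch below checks this with `np.eye`; the dictionary is hard-coded to the values printed earlier so the block runs on its own, and this is offered as an illustration, not as the lab's approach.

```python
import numpy as np

word2Ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
V = len(word2Ind)

def word_to_one_hot_vector(word, word2Ind, V):
    one_hot_vector = np.zeros(V)
    one_hot_vector[word2Ind[word]] = 1
    return one_hot_vector

# Row n of the V x V identity matrix is the one-hot vector for index n
identity = np.eye(V)
happy_from_eye = identity[word2Ind['happy']]
print(happy_from_eye)
```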

Getting context word vectors

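To create the vector that represents the context words, compute the average of the one-hot vectors of the individual words.
Start with a list of context words.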

# Define list containing context words
context_words = ['i', 'am', 'because', 'i']

Use the function above to build the one-hot vector for each word:

# Create one-hot vectors for each context word using list comprehension
context_words_vectors = [word_to_one_hot_vector(w, word2Ind, V) for w in context_words]

# Print one-hot vectors for each context word
context_words_vectors

Output:
[array([0., 0., 0., 1., 0.]),
 array([1., 0., 0., 0., 0.]),
 array([0., 1., 0., 0., 0.]),
 array([0., 0., 0., 1., 0.])]
Use numpy's mean function to average the context word vectors:

# Compute mean of the vectors using numpy
np.mean(context_words_vectors, axis=0)

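Output: array([0.25, 0.25, 0.  , 0.5 , 0.  ])
Note: axis=0 averages across the rows (one row per context word), giving one value per vocabulary position.
Now define the context_words_to_vector function to perform these steps. It takes the list of context words, the word-to-index dictionary, and the vocabulary size, and returns the vector representation of the context words.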

# Define the 'context_words_to_vector' function that will include the steps previously seen
def context_words_to_vector(context_words, word2Ind, V):
    context_words_vectors = [word_to_one_hot_vector(w, word2Ind, V) for w in context_words]
    context_words_vectors = np.mean(context_words_vectors, axis=0)
    return context_words_vectors

Test:

# Print output of 'context_words_to_vector' function for context words: 'i', 'am', 'because', 'i'
context_words_to_vector(['i', 'am', 'because', 'i'], word2Ind, V)

Output: array([0.25, 0.25, 0.  , 0.5 , 0.  ])
Test another set of context words:

# Print output of 'context_words_to_vector' function for context words: 'am', 'happy', 'i', 'am'
context_words_to_vector(['am', 'happy', 'i', 'am'], word2Ind, V)

Output: array([0.5 , 0.  , 0.25, 0.25, 0.  ])
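
Averaging one-hot vectors is the same as counting how often each word occurs in the context and dividing by the number of context words, which is why 'i' (two of four words) gets 0.5 above. The sketch below computes the vector that way; `context_vector_from_counts` is a name invented for this illustration and is not part of the lab.

```python
from collections import Counter

import numpy as np

word2Ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
V = len(word2Ind)

def context_vector_from_counts(context_words, word2Ind, V):
    """Illustrative alternative (not from the lab): relative word frequencies."""
    vector = np.zeros(V)
    for word, count in Counter(context_words).items():
        vector[word2Ind[word]] = count / len(context_words)
    return vector

context_vector = context_vector_from_counts(['i', 'am', 'because', 'i'], word2Ind, V)
print(context_vector)
```

A consequence is that the entries of a context vector always sum to 1.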

Building the training set

To build the training set for the CBOW model, start from the tokenized corpus.

# Print corpus
words

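Output:
['i', 'am', 'happy', 'because', 'i', 'am', 'learning']
Use the sliding window function (get_windows) to extract the context words and center words, then use word_to_one_hot_vector and context_words_to_vector to convert these sets of words into basic vector representations.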

# Print vectors associated to center and context words for corpus
for context_words, center_word in get_windows(words, 2):  # reminder: 2 is the context half-size
    print(f'Context words:  {context_words} -> {context_words_to_vector(context_words, word2Ind, V)}')
    print(f'Center word:  {center_word} -> {word_to_one_hot_vector(center_word, word2Ind, V)}')
    print()

Output:
Context words:  ['i', 'am', 'because', 'i'] -> [0.25 0.25 0. 0.5 0. ]
Center word:  happy -> [0. 0. 1. 0. 0.]

Context words:  ['am', 'happy', 'i', 'am'] -> [0.5 0. 0.25 0.25 0. ]
Center word:  because -> [0. 1. 0. 0. 0.]

Context words:  ['happy', 'because', 'am', 'learning'] -> [0.25 0.25 0.25 0. 0.25]
Center word:  i -> [0. 0. 0. 1. 0.]

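This lab performs a single training iteration on a single example, but in this week's assignment the CBOW model is trained over multiple iterations with batches of examples. A Python generator function makes it easy to iterate over the examples: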

# Define the generator function 'get_training_example'
def get_training_example(words, C, word2Ind, V):
    for context_words, center_word in get_windows(words, C):
        yield context_words_to_vector(context_words, word2Ind, V), word_to_one_hot_vector(center_word, word2Ind, V)

Iterating over the output of this function yields successive context word vectors and center word vectors:

# Print vectors associated to center and context words for corpus using the generator function
for context_words_vector, center_word_vector in get_training_example(words, 2, word2Ind, V):
    print(f'Context words vector:  {context_words_vector}')
    print(f'Center word vector:  {center_word_vector}')
    print()

Output:
Context words vector: [0.25 0.25 0. 0.5 0. ]
Center word vector: [0. 0. 1. 0. 0.]

Context words vector: [0.5 0. 0.25 0.25 0. ]
Center word vector: [0. 1. 0. 0. 0.]

Context words vector: [0.25 0.25 0.25 0. 0.25]
Center word vector: [0. 0. 0. 1. 0.]
The training data is now ready.
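
The assignment's batching code is not shown in this post. As a rough sketch of what a batch could look like, the examples can be stacked as the columns of two matrices. `build_batch` and the one-example-per-column layout are assumptions made for illustration, and the helpers are restated (with `one_hot` as a shortened name) so the block runs on its own.

```python
import numpy as np

def get_windows(words, C):
    i = C
    while i < len(words) - C:
        yield words[i - C:i] + words[i + 1:i + C + 1], words[i]
        i += 1

def one_hot(word, word2Ind, V):
    vector = np.zeros(V)
    vector[word2Ind[word]] = 1
    return vector

def build_batch(words, C, word2Ind, V):
    """Hypothetical helper: stack every example as a column of X and Y."""
    xs, ys = [], []
    for context_words, center_word in get_windows(words, C):
        xs.append(np.mean([one_hot(w, word2Ind, V) for w in context_words], axis=0))
        ys.append(one_hot(center_word, word2Ind, V))
    return np.column_stack(xs), np.column_stack(ys)

words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']
word2Ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
V = len(word2Ind)

X, Y = build_batch(words, 2, word2Ind, V)
print(X.shape, Y.shape)  # one column per training example
```

This sketch assumes at least one window exists; `np.column_stack` raises an error on an empty list.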

Intro to CBOW model

This section introduces the continuous bag-of-words (CBOW) model and its activation functions.

The continuous bag-of-words model

The structure of the CBOW model is shown in the figure below:
[Figure: CBOW model architecture]
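Input layer: takes the vector representation of the context words as input. In the CBOW model, the context words are used to predict the center word.

Hidden layer: receives the input vector and transforms it with a weight matrix and a bias vector. The weight matrix $W_1$ and the bias vector $b_1$ map the input vector to the hidden-layer representation.

Activation function (ReLU): the hidden layer's output passes through a non-linear activation function, here ReLU (Rectified Linear Unit), which gives the model its non-linearity.

Output layer: the hidden layer's output is passed to the output layer, where a softmax function converts it into a probability distribution. This distribution gives, for every word in the vocabulary, the probability that it is the center word.

Weights and biases: in the figure, $W_1$ and $W_2$ are the weight matrices, and $b_1$ and $b_2$ are the bias vectors. These parameters are optimized during training to minimize the prediction error.

Output: $\hat{y}$ in the figure is the model's output. It is obtained by multiplying the input vector by $W_1$ and adding $b_1$, applying ReLU, then multiplying by $W_2$, adding $b_2$, and applying softmax.

Training example: the sentence "I am happy because I am learning" in the figure is a training example in which "happy" is the center word and "I", "am", "because", "I" are the context words (context half-size of 2).

Loss function: although it is not shown in the figure, during training the model uses the cross-entropy loss to measure the difference between the predicted probability distribution and the true distribution, and updates the weights and biases accordingly.

Optimization: during training, the weights and biases are updated with gradient descent or another optimization algorithm to minimize the loss function.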
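
The layers described above can be sketched as a single forward pass for one training example. This is a minimal illustration under stated assumptions, not the assignment's implementation: the hidden size N = 3, the seed, and the random parameter values are arbitrary choices, and the parameters are untrained.

```python
import numpy as np

V, N = 5, 3                      # vocabulary size; N = 3 is an arbitrary hidden size
rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only

# Randomly initialised parameters (illustrative values, not trained)
W1, b1 = rng.random((N, V)), rng.random((N, 1))
W2, b2 = rng.random((V, N)), rng.random((V, 1))

x = np.array([[0.25], [0.25], [0.0], [0.5], [0.0]])  # context vector for center word 'happy'
y = np.array([[0.0], [0.0], [1.0], [0.0], [0.0]])    # one-hot vector of 'happy'

z1 = W1 @ x + b1                           # hidden layer pre-activation
h = np.maximum(z1, 0)                      # ReLU
z2 = W2 @ h + b2                           # output layer scores
y_hat = np.exp(z2) / np.sum(np.exp(z2))    # softmax
loss = -np.sum(y * np.log(y_hat))          # cross-entropy for this one example

print(y_hat.shape, float(y_hat.sum()), float(loss))
```

Training would then adjust W1, b1, W2 and b2 to reduce `loss`; that step is not shown here.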

Activation functions

ReLU

ReLU is used to compute the values of the hidden layer, according to the following formulas:
$\begin{align} \mathbf{z_1} &= \mathbf{W_1}\mathbf{x} + \mathbf{b_1} \tag{1} \\ \mathbf{h} &= \mathrm{ReLU}(\mathbf{z_1}) \tag{2} \\ \end{align}$
To see it in action, fix a random seed and compute an example:

import numpy as np
# Define a random seed so all random outcomes can be reproduced
np.random.seed(10)

# Define a 5X1 column vector using numpy
z_1 = 10*np.random.rand(5, 1)-5

# Print the vector
z_1

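Output:

array([[ 2.71320643],
       [-4.79248051],
       [ 1.33648235],
       [ 2.48803883],
       [-0.01492988]])

Note that numpy's random.rand function returns an array whose values are drawn from a uniform distribution over [0, 1). Numpy operations are vectorized, so every value is multiplied by 10 and then has 5 subtracted from it, giving values in [-5, 5).
ReLU maps negative values to 0. First copy z_1, then find which entries are negative: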

# Create copy of vector and save it in the 'h' variable
h = z_1.copy()
# Determine which values met the criteria (this is possible because of vectorization)
h < 0

Output:
array([[False],
       [ True],
       [False],
       [False],
       [ True]])
Then set the negative values to 0 and leave the others unchanged:

# Slice the array or vector. This is the same as applying ReLU to it
h[h < 0] = 0
# Print the vector after ReLU
h

Output:
array([[2.71320643],
       [0.        ],
       [1.33648235],
       [2.48803883],
       [0.        ]])
Now wrap these steps in a function:

# Define the 'relu' function that will include the steps previously seen
def relu(z):
    result = z.copy()
    result[result < 0] = 0
    return result

Test:

# Define a new vector and save it in the 'z' variable
z = np.array([[-1.25459881], [ 4.50714306], [ 2.31993942], [ 0.98658484], [-3.4398136 ]])

# Apply ReLU to it
relu(z)

Output:
array([[0.        ],
       [4.50714306],
       [2.31993942],
       [0.98658484],
       [0.        ]])
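
Element-wise, ReLU is just max(z, 0), so the copy-and-mask function should agree with numpy's `np.maximum`. A short check, offered as an equivalent formulation and not as the lab's code:

```python
import numpy as np

def relu(z):
    result = z.copy()
    result[result < 0] = 0
    return result

z = np.array([[-1.25459881], [4.50714306], [2.31993942], [0.98658484], [-3.4398136]])

# np.maximum compares element-wise against 0 and returns a new array
relu_via_maximum = np.maximum(z, 0)
print(np.array_equal(relu(z), relu_via_maximum))
```

Both versions leave the input array `z` unmodified.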

Softmax

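The second activation function is softmax. It is used to compute the values of the neural network's output layer, according to the following formulas: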
$\begin{align} \mathbf{z_2} &= \mathbf{W_2}\mathbf{h} + \mathbf{b_2} \tag{3} \\ \mathbf{\hat y} &= \mathrm{softmax}(\mathbf{z_2}) \tag{4} \\ \end{align}$
To compute the softmax of a vector $\mathbf{z}$, the $i$-th component of the resulting vector is given by:
$\textrm{softmax}(\textbf{z})_i = \frac{e^{z_i} }{\sum\limits_{j=1}^{V} e^{z_j} } \tag{5}$
Here is a worked example:

# Define a new vector and save it in the 'z' variable
z = np.array([9, 8, 11, 10, 8.5])

# Print the vector
z

Output:
array([ 9. ,  8. , 11. , 10. ,  8.5])
Compute the exponential of each element; these values are needed for both the numerator and the denominator.

# Save exponentials of the values in a new vector
e_z = np.exp(z)

# Print the vector with the exponential values
e_z

Output:
array([ 8103.08392758,  2980.95798704, 59874.1417152 , 22026.46579481,
        4914.7688403 ])
The denominator is the sum of these exponentials.

# Save the sum of the exponentials
sum_e_z = np.sum(e_z)

# Print sum of exponentials
sum_e_z

Output:
97899.41826492078
Finally, compute the value of $\textrm{softmax}(\textbf{z})$ for the first element:

# Print softmax value of the first element in the original vector
e_z[0]/sum_e_z

Output:
0.08276947985173956
To compute all the elements at once, use numpy's vectorized operations:

# Define the 'softmax' function that will include the steps previously seen
def softmax(z):
    e_z = np.exp(z)
    sum_e_z = np.sum(e_z)
    return e_z / sum_e_z

Test the function:

# Print softmax values for original vector
softmax([9, 8, 11, 10, 8.5])

array([0.08276948, 0.03044919, 0.61158833, 0.22499077, 0.05020223])

The softmax outputs sum to 1.
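
For large inputs, `np.exp` can overflow. A common variant, which is not what the lab uses, subtracts the maximum of z before exponentiating; the common factor cancels in the ratio, so the result is mathematically unchanged. A minimal sketch, with `softmax_stable` as a name chosen for this illustration:

```python
import numpy as np

def softmax_stable(z):
    """Softmax with the maximum subtracted first (a variant, not the lab's version)."""
    z = np.asarray(z, dtype=float)
    e_z = np.exp(z - np.max(z))   # largest exponent is now exp(0) = 1
    return e_z / np.sum(e_z)

probs = softmax_stable([9, 8, 11, 10, 8.5])
large = softmax_stable([1000, 1001, 1002])   # inputs this large overflow a plain np.exp
print(probs)
print(probs.sum(), large.sum())
```

On the example vector this should reproduce the values shown above up to floating-point rounding.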

Dimensions: 1-D arrays vs 2-D column vectors

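Before moving on to forward propagation, backpropagation, and gradient descent, get familiar with the dimensions of the vectors involved.
First create a vector of length $V$ filled with zeros: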

# Define V. Remember this was the size of the vocabulary in the previous lecture notebook
V = 5

# Define vector of length V filled with zeros
x_array = np.zeros(V)

# Print vector
x_array

Output: array([0., 0., 0., 0., 0.])
The array's .shape attribute shows that this is a 1-D array (vector).

# Print vector's shape
x_array.shape

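Output: (5,)
To perform matrix multiplication in the following steps, column vectors need to be represented as matrices with a single column. In numpy, such a matrix is a 2-D array. The simplest way to convert a 1-D vector into a 2-D column matrix is to set its .shape attribute to the number of rows and columns: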

# Copy vector
x_column_vector = x_array.copy()

# Reshape copy of vector
x_column_vector.shape = (V, 1)  # alternatively ... = (x_array.shape[0], 1)

# Print vector
x_column_vector

Output:
array([[0.],
       [0.],
       [0.],
       [0.],
       [0.]])
The shape of this array is:

# Print vector's shape
x_column_vector.shape

Output: (5, 1)
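
One reason the distinction matters: when a bias stored as a 2-D column is added to the product of a matrix and a 1-D array, numpy broadcasting silently produces a square matrix instead of a column. The sketch below shows the difference; the hidden size N = 3 and the all-ones parameters are arbitrary values for illustration, and `reshape(-1, 1)` is used as an alternative to setting `.shape` directly.

```python
import numpy as np

V, N = 5, 3                      # N = 3 is an arbitrary hidden size for illustration
W1 = np.ones((N, V))
b1 = np.ones((N, 1))

x_array = np.zeros(V)                       # shape (V,)
x_column_vector = x_array.reshape(-1, 1)    # shape (V, 1); -1 lets numpy infer V

shape_1d = (W1 @ x_array + b1).shape           # (N,) + (N, 1) broadcasts to (N, N)
shape_2d = (W1 @ x_column_vector + b1).shape   # (N, 1), as intended
print(shape_1d, shape_2d)
```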