
Lecture 20 Topic Modelling

news 2025/7/11 15:09:00

Topic Modelling

making sense of text
- English Wikipedia: 6M articles
- Twitter: 500M tweets per day
- New York Times: 15M articles
- arXiv: 1M articles
- What can we do if we want to learn something about these document collections?
questions
- What are the less popular topics on Wikipedia?
- What are the big trends on Twitter in the past month?
- How do the social issues evolve over time in New York Times from 1900s to 2000s?
- What are some influential research areas?
topic models to the rescue
- Topic models learn common, overlapping themes in a document collection
- Unsupervised model
  - No labels; input is just the documents!
- What’s the output of a topic model?
  - Topics: each topic associated with a list of words
  - Topic assignments: each document associated with a list of topics
what do topics look like
- A list of words
- Collectively describes a concept or subject
- Words of a topic typically appear in the same set of documents in the corpus (their occurrences overlap across documents)
- Wikipedia topics (broad)
- Twitter topics (short, conversational)
- New York Times topics
applications of topic models
- Personalised advertising (e.g. types of products bought)
- Search engine
- Discover senses of polysemous words (e.g. apple: fruit, company, two different clusters)

A Brief History of Topic Models

latent semantic analysis
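- LSA: truncated singular value decomposition (SVD) of the term-document matrix, $A \approx U_k \Sigma_k V_k^T$
- issues
  - Positive and negative values in $U$ and $V^T$
  - Difficult to interpret (what does a negative value mean for a word or a document?)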
probabilistic LSA
- Based on a probabilistic model, so there are no more negative values!
- PLSA can learn topics and topic assignments for documents in the training corpus
- issues
  - It is unable to infer the topic distribution of new documents
  - PLSA needs to be re-trained for new documents
latent dirichlet allocation(LDA)
- Introduces a prior on the document-topic and topic-word distributions
- Fully generative: trained LDA model can infer topics on unseen documents!
- LDA is a Bayesian version of PLSA

LDA

LDA
- Core idea: assume each document contains a mix of topics
- But the topic structure is hidden (latent)
- LDA infers the topic structure given the observed words and documents
- LDA produces soft clusters of documents (based on topic overlap), rather than hard clusters
- Given a trained LDA model, it can infer topics on new documents (not part of train data)
input
- A collection of documents
- Bag-of-words
- Good preprocessing practice:
  - Remove stopwords
  - Remove low and high frequency word types
  - Lemmatisation
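The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not a recommended pipeline: the stopword list, the function name `preprocess` and the thresholds `min_df` and `max_df_ratio` are all assumptions made for the example, and lemmatisation is left out because it needs an external library.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in", "on", "is", "to"}

def preprocess(docs, min_df=2, max_df_ratio=0.6):
    """Tokenise, drop stopwords, then drop very rare and very common word types."""
    tokenised = [[w for w in doc.lower().split() if w not in STOPWORDS]
                 for doc in docs]
    # Document frequency: number of documents each word type appears in.
    df = Counter(w for doc in tokenised for w in set(doc))
    max_df = max_df_ratio * len(docs)
    keep = {w for w, n in df.items() if min_df <= n <= max_df}
    # Bag-of-words: word order is discarded, only counts remain.
    return [Counter(w for w in doc if w in keep) for doc in tokenised]

docs = [
    "the farm grows rice and wheat",
    "rice is a staple food on the farm",
    "the mouse ran in the farm kitchen",
    "a computer mouse and a keyboard",
    "the keyboard of the computer is on the farm desk",
]
bags = preprocess(docs)
```

In this toy corpus "farm" appears in four of five documents and is dropped as a high-frequency type, while words that appear in only one document are dropped as low-frequency types.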
output
- Topics: distribution over words in each topic
- Topic assignment: distribution over topics in each document
learning
- How do we learn the latent topics?
- Two main families of algorithms:
  - Variational methods
  - Sampling-based methods
- sampling method (Gibbs)
  1. Randomly assign topics to all tokens in documents
  2. Collect topic-word and document-topic co-occurrence statistics based on the assignments
    - first add pseudo-counts to every cell of the two matrices (smoothing, so that no event has zero probability)
    - collect co-occurrence statistics
  3. Go through every word token in corpus and sample a new topic:
    - delete current topic assigned to a word
    - update two matrices
    - compute the probability distribution to sample: $P(t_i|w,d) \propto P(t_i|w)P(t_i|d)$ ( $P(t_i|w) \to$ topic-word, $P(t_i|d) \to$ document-topic)
      - e.g. for the word "mouse" in document $d_1$: $P(t_1|w,d) \propto P(t_1|\text{mouse})\times{P(t_1|d_1)}=\frac{0.01}{0.01+0.01+2.01}\times{\frac{1.1}{1.1+1.1+2.1}}$
    - sample randomly based on the probability distribution
  4. Go to step 2 and repeat until convergence
  - when to stop
    - Train until convergence
    - Convergence = model probability of training set becomes stable
    - How to compute model probability?
      - $\log P(w_1,w_2,...,w_m)=\log\sum_{j=1}^TP(w_1|t_j)P(t_j|d_{w_1})+...+\log\sum_{j=1}^TP(w_m|t_j)P(t_j|d_{w_m})$
      - m = #word tokens
      - $P(w_1|t_j) \to$ based on the topic-word co-occurrence matrix
      - $P(t_j|d_{w_1}) \to$ based on the document-topic co-occurrence matrix
  - infer topics for new documents
    1. Randomly assign topics to all tokens in new/test documents
    2. Update document-topic matrix based on the assignments; but use the trained topic-word matrix (kept fixed)
    3. Go through every word in the test documents and sample topics: $P(t_i|w,d) \propto P(t_i|w)P(t_i|d)$
    4. Go to step 2 and repeat until convergence
  - hyper-parameters
    - $T$ : number of topics
    - $\beta$ : prior on the topic-word distribution
    - $\alpha$ : prior on the document-topic distribution
    - $\alpha$ and $\beta$ are the pseudo-counts used to initialise the two co-occurrence matrices
    - Analogous to k in add-k smoothing in N-gram LM
    - High prior values $\to$ flatter distribution
      - a very very large value would lead to a uniform distribution
    - Low prior values $\to$ peaky distribution
    - $\beta$ : generally small (< 0.01)
      - Large vocabulary, but we want each topic to focus on specific themes
    - $\alpha$ : generally larger (> 0.1)
      - Multiple topics within a document
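The sampling procedure described in this section can be sketched in plain Python as below. It follows the simplified update from the notes, $P(t_i|w,d) \propto P(t_i|w)P(t_i|d)$, and should be read as an illustration under assumptions rather than a reference implementation: the names `gibbs_lda` and `log_prob`, the default values of `alpha` and `beta`, and the fixed number of sweeps (instead of a real convergence test) are all choices made for this example. The log-probability is recorded after every sweep so that its stabilisation can be inspected.

```python
import math
import random

def log_prob(docs, tw, dt, T, alpha, beta):
    """Model log-probability: sum over tokens of log sum_t P(w|t) P(t|d)."""
    V = len(tw[0])
    topic_totals = [sum(tw[t].values()) + V * beta for t in range(T)]
    total = 0.0
    for d, doc in enumerate(docs):
        doc_total = len(doc) + T * alpha
        for w in doc:
            total += math.log(sum(
                (tw[t][w] + beta) / topic_totals[t] * (dt[d][t] + alpha) / doc_total
                for t in range(T)))
    return total

def gibbs_lda(docs, T, alpha=0.5, beta=0.01, sweeps=50, seed=0):
    """docs: list of token lists. Returns counts, assignments and log-prob history."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Raw counts. The pseudo-counts alpha and beta are added whenever a cell is
    # read, which is equivalent to initialising every cell with them.
    tw = [{w: 0 for w in vocab} for _ in range(T)]  # topic-word counts
    dt = [[0] * T for _ in docs]                    # document-topic counts
    # Step 1: random topic for every token. Step 2: collect the statistics.
    z = [[rng.randrange(T) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            tw[t][w] += 1
            dt[d][t] += 1
    history = []
    for _ in range(sweeps):
        # Step 3: resample the topic of every token in turn.
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = z[d][i]
                tw[t_old][w] -= 1  # delete the current assignment
                dt[d][t_old] -= 1
                # Unnormalised P(t|w) * P(t|d); the normalisers do not depend on t.
                weights = [(tw[t][w] + beta) * (dt[d][t] + alpha) for t in range(T)]
                t_new = rng.choices(range(T), weights=weights)[0]
                z[d][i] = t_new
                tw[t_new][w] += 1
                dt[d][t_new] += 1
        history.append(log_prob(docs, tw, dt, T, alpha, beta))
    return tw, dt, z, history

docs = [["farm", "rice", "food", "farm"],
        ["mouse", "keyboard", "computer"],
        ["rice", "food", "farm", "mouse"]]
tw, dt, z, history = gibbs_lda(docs, T=2)
```

Inference on new documents would reuse the same loop with `tw` kept fixed and only `dt` and `z` updated for the new documents.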

Evaluation

how to evaluate topic models
- Unsupervised learning $\to$ no labels
- Intrinsic evaluation:
  - model logprob / perplexity on test documents
  - $\log L=\sum_{w}\log\sum_{t}P(w|t)P(t|d_w)$, where the outer sum runs over all $W$ word tokens and the inner sum over all $T$ topics
  - $ppl=\exp\left(\frac{-\log L}{W}\right)$
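As a rough illustration of the perplexity computation, the sketch below takes the log-likelihood to be the sum, over all word tokens, of the log of the topic-marginalised word probability, and then exponentiates the negative per-token average. The function name `perplexity` and the toy probability tables are assumptions made for this example; in practice the two tables would come from a trained model.

```python
import math

def perplexity(docs, p_w_given_t, p_t_given_d):
    """docs: list of token lists.
    p_w_given_t[t][w] = P(w|t); p_t_given_d[d][t] = P(t|d)."""
    n_topics = len(p_w_given_t)
    log_l = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginalise over topics, then take the log of the token probability.
            log_l += math.log(sum(p_w_given_t[t][w] * p_t_given_d[d][t]
                                  for t in range(n_topics)))
            n_tokens += 1
    return math.exp(-log_l / n_tokens)

# Toy model: two topics, both uniform over a four-word vocabulary.
vocab = ["farm", "rice", "mouse", "keyboard"]
p_w_given_t = [{w: 0.25 for w in vocab} for _ in range(2)]
p_t_given_d = [[0.5, 0.5], [0.5, 0.5]]
test_docs = [["farm", "rice"], ["mouse", "keyboard", "farm"]]
ppl = perplexity(test_docs, p_w_given_t, p_t_given_d)
```

With a model that is uniform over four words, every token has probability 0.25, so the perplexity equals the vocabulary size of 4. This is one way to see why a smaller vocabulary tends to give a lower perplexity.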
issues with perplexity
- More topics = better (lower) perplexity
- Smaller vocabulary = better perplexity
  - Perplexity not comparable for different corpora, or different tokenisation/preprocessing methods
- Does not correlate with human perception of topic quality
- Extrinsic evaluation is the way to go:
  - Evaluate topic models based on downstream task
topic coherence
- A better intrinsic evaluation method
- Measure how coherent the generated topics are (blue more coherent than red)
- A good topic model is one that generates more coherent topics
word intrusion
- Idea: inject one random word to a topic
  - {farmers, farm, food, rice, agriculture} $\to$ {farmers, farm, food, rice, cat, agriculture}
- Ask users to guess which is the intruder word
- Correct guess $\to$ topic is coherent
- Try guessing the intruder word in:
  - {choice, count, village, i.e., simply, unionist}
- Manual effort; does not scale
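A word intrusion task could be set up along the following lines. This is only a sketch: the helper names `make_intrusion_task` and `intrusion_accuracy` and the pool of candidate intruder words are hypothetical, and the human guesses still have to be collected manually.

```python
import random

def make_intrusion_task(topic_words, intruder_pool, rng):
    """Inject one random word from another source into a topic and shuffle."""
    candidates = [w for w in intruder_pool if w not in topic_words]
    intruder = rng.choice(candidates)
    shown = list(topic_words) + [intruder]
    rng.shuffle(shown)
    return shown, intruder

def intrusion_accuracy(guesses, intruders):
    """Fraction of tasks in which the annotator found the intruder."""
    return sum(g == i for g, i in zip(guesses, intruders)) / len(intruders)

rng = random.Random(0)
topic = ["farmers", "farm", "food", "rice", "agriculture"]
pool = ["cat", "keyboard", "farm"]  # hypothetical words drawn from other topics
shown, intruder = make_intrusion_task(topic, pool, rng)
```

A topic whose intruder is found by most annotators is taken to be coherent; averaging `intrusion_accuracy` over topics gives a score for the whole model.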
PMI $\approx$ coherence?
- High PMI for a pair of words $\to$ words are correlated
  - PMI(farm, rice) $\uparrow$
  - PMI(choice, village) $\downarrow$
- If all word pairs in a topic have high PMI $\to$ topic is coherent
- If most topics have high PMI $\to$ good topic model
- Where to get word co-occurrence statistics for PMI?
  - Can use same corpus for topic model
  - A better way is to use an external corpus (e.g. Wikipedia)
PMI
- Compute pairwise PMI of top-N words in a topic
  - $PMI(t)=\sum_{j=2}^N\sum_{i=1}^{j-1}\log\frac{P(w_i,w_j)}{P(w_i)P(w_j)}$
- Given topic: {farmers, farm, food, rice, agriculture}
- Coherence = sum PMI for all word pairs:
  - PMI(farmers, farm) + PMI(farmers, food) + … + PMI(rice, agriculture)
- variants
  - Normalised PMI
    - $NPMI(t)=\sum_{j=2}^N\sum_{i=1}^{j-1}\frac{\log\frac{P(w_i,w_j)}{P(w_i)P(w_j)}}{-\log P(w_i,w_j)}$
  - Log conditional probability (LCP; found to be not as good as PMI)
    - $LCP(t)=\sum_{j=2}^N\sum_{i=1}^{j-1}\log\frac{P(w_i,w_j)}{P(w_i)}$
- example: PMI tends to favour rarer words; NPMI mitigates this problem
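The PMI and NPMI coherence scores can be sketched as follows. Several choices here are assumptions made for the example rather than part of the definitions above: probabilities are estimated from document-level co-occurrence in a reference corpus, the function name `coherence` is hypothetical, and word pairs that never co-occur (or that co-occur in every document) are simply skipped, whereas a real implementation would more likely smooth or penalise them.

```python
import math
from itertools import combinations

def coherence(topic_words, ref_docs, normalise=False):
    """Sum of pairwise PMI (or NPMI) over the given top-N topic words."""
    n_docs = len(ref_docs)
    doc_sets = [set(doc) for doc in ref_docs]

    def prob(*words):
        # Fraction of reference documents that contain all the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets) / n_docs

    score = 0.0
    for w_i, w_j in combinations(topic_words, 2):
        p_ij = prob(w_i, w_j)
        # Assumption: skip pairs for which PMI or NPMI is undefined.
        if p_ij == 0.0 or p_ij == 1.0:
            continue
        pmi = math.log(p_ij / (prob(w_i) * prob(w_j)))
        score += pmi / -math.log(p_ij) if normalise else pmi
    return score

# Toy reference corpus; in practice this might be Wikipedia.
ref_docs = [["farm", "rice"], ["farm", "rice"], ["cat"], ["dog"]]
pmi_score = coherence(["farm", "rice"], ref_docs)
npmi_score = coherence(["farm", "rice"], ref_docs, normalise=True)
```

In this toy corpus "farm" and "rice" always appear together, so their PMI is $\log 2$ and their NPMI reaches its maximum of 1.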

Conclusion

Topic model: an unsupervised model for learning latent concepts in a document collection
LDA: a popular topic model
- Learning
- Hyper-parameters
How to evaluate topic models?
- Topic coherence
