
Video Understanding with Large Models

news 2026/2/8 11:02:13

Datasets

Learning Video Context as Interleaved Multimodal Sequences

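Motivation:
The target is narrative videos (movie clips, TV series, etc.), because they are comparatively complex.
Most top-performing video perception models focus on atomic actions, people, or objects.
Understanding video contexts covers many tasks, and the models that solve these tasks are too task-specific and not general enough.
This leads to the question:
can we develop a general solution that handles these diverse contexts and needs in videos?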

Our work
Similar models exist, but when applied to narrative videos, which encompass informative contexts, these models with a pre-defined visual-textual template still exhibit limitations due to inflexibility. On this basis the paper makes the following contributions:

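It proposes a new multimodal model for this kind of video. Because these videos have a complex structure, the core idea is to embed the videos as interleaved multi-modal sequences.
It aims to unify multimodal contexts and tasks in a user-friendly way.
It collects an instruction-tuning dataset, using a package of solutions to convert existing datasets into interleaved multimodal instruction-following data, and trains a decoder-only model on this dataset.
In addition, the model lets users interact with videos in a more free-form way.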

Model
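Overall the model is not complicated: each frame is just a single token. The authors hope that encoding interleaved multimodal information this way helps the model answer questions better.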
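To make the "one frame, one token" idea concrete, here is a minimal sketch of how text and frames could be flattened into one interleaved sequence. It is not the paper's implementation: the `<frame>` placeholder, the `Text`/`Frame` classes, the `interleave` helper, and the whitespace tokenizer are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

FRAME_TOKEN = "<frame>"  # hypothetical placeholder token, not taken from the paper


@dataclass
class Text:
    content: str


@dataclass
class Frame:
    frame_id: int  # index into the sampled video frames


Segment = Union[Text, Frame]


def interleave(segments: List[Segment]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Flatten text and frames into one token sequence, preserving their order.

    Returns the token list and a list of (position, frame_id) pairs that records
    where each frame's single token sits, so a visual embedding could be
    written into that slot later.
    """
    tokens: List[str] = []
    frame_slots: List[Tuple[int, int]] = []
    for seg in segments:
        if isinstance(seg, Text):
            # Toy whitespace tokenizer; a real model would use its own tokenizer.
            tokens.extend(seg.content.split())
        else:
            # Each frame occupies exactly one position in the sequence.
            frame_slots.append((len(tokens), seg.frame_id))
            tokens.append(FRAME_TOKEN)
    return tokens, frame_slots


segments = [
    Text("Subtitle: Where are you going?"),
    Frame(0),
    Text("Subtitle: Home."),
    Frame(1),
    Text("Question: who is leaving?"),
]
tokens, frame_slots = interleave(segments)
```

In a real decoder-only model the embedding at each recorded slot would presumably come from a visual encoder rather than the text embedding table; the sketch only shows the ordering and bookkeeping.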
DATA
Several templates are defined; the main concern is how to collect the corresponding tuning data for each type of interleaved prompt.
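As a rough illustration of converting an existing annotation into an interleaved instruction-following sample, the sketch below orders frames and subtitles by timestamp and appends the instruction. The record fields (`frame_times`, `subtitles`, `question`, `answer`), the `<frame>` placeholder, and the template layout are hypothetical and are not the templates used in the paper.

```python
from typing import Dict, List, Tuple


def to_instruction_sample(record: Dict) -> Dict[str, str]:
    """Turn one annotated clip into a prompt/response pair (illustrative only)."""
    # Each event is (timestamp, tie-break order, text piece).
    events: List[Tuple[float, int, str]] = []
    for t in record["frame_times"]:
        events.append((t, 0, "<frame>"))  # one placeholder per sampled frame
    for sub in record["subtitles"]:
        events.append((sub["time"], 1, f'{sub["speaker"]}: {sub["text"]}'))

    # Sort by time so visual and textual context are interleaved chronologically;
    # at equal timestamps the frame comes before the subtitle.
    events.sort(key=lambda e: (e[0], e[1]))

    context = "\n".join(piece for _, _, piece in events)
    prompt = f"{context}\nInstruction: {record['question']}"
    return {"prompt": prompt, "response": record["answer"]}


record = {
    "frame_times": [0.0, 2.0],
    "subtitles": [{"time": 1.0, "speaker": "Anna", "text": "I have to go."}],
    "question": "Who is speaking?",
    "answer": "Anna",
}
sample = to_instruction_sample(record)
```

A different prompt type would only need a different set of events or a different instruction line, which is one way to read the note about collecting data per prompt type.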
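Experiments
The experiments cover many tasks, all among the most popular in video understanding, and the results are mostly SOTA. The section opens with several meaningful questions and examines them in depth. In addition, easily confused settings are marked with small symbols, which makes the presentation clearer.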

Multi-task learning enhances individual capabilities.
This highlights the language model's ability to acquire commonsense across diverse objectives and contexts.
Different kinds of interleaved multimodal instruction.
