当前位置: 首页 > news >正文

Next-Scale Prediction、InstantStyle、Co-Speech Gesture Generation

本文首发于公众号:机器感知

Next-Scale Prediction、InstantStyle、Co-Speech Gesture Generation

图片

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale  Prediction

图片

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes AR models surpass diffusion transformers in image generation. On ImageNet 256x256 benchmark, VAR significantly improve AR baseline by improving Frechet inception distance (FID) from 18.65 to 1.80, inception score (IS) from 80.4 to 356.4, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, wit......

BAdam: A Memory Efficient Full Parameter Training Method for Large  Language Models

图片

This work presents BAdam, an optimizer that leverages the block coordinate optimization framework with Adam as the inner solver. BAdam offers a memory efficient approach to the full parameter finetuning of large language models and reduces running time of the backward process thanks to the chain rule property. Experimentally, we apply BAdam to instruction-tune the Llama 2-7B model on the Alpaca-GPT4 dataset using a single RTX3090-24GB GPU. The results indicate that BAdam exhibits superior convergence behavior in comparison to LoRA and LOMO. Furthermore, our downstream performance evaluation of the instruction-tuned models using the MT-bench shows that BAdam modestly surpasses LoRA and more substantially outperforms LOMO. Finally, we compare BAdam with Adam on a medium-sized task, i.e., finetuning RoBERTa-large on the SuperGLUE benchmark. The results demonstrate that BAdam is capable of narrowing the performance gap with Adam. Our code is available at https://github.com/Ledzy/......

MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image  Generation

图片

Text-to-image generation has achieved astonishing results, yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering, scene layout conditioning, or image editing techniques which often require hand drawn masks. Nonetheless, pre-existing works struggle to take advantage of the natural instance-level compositionality of scenes due to the typically flat nature of rasterized RGB output images. Towards adressing this challenge, we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multilayer, instance-wise RGBA decompositions, and over 100K instance images. To build MuLAn, we developed a training free pipeline which decomposes a monocular RGB image into a stack of RGBA layers comprising of background and isolated instances. We achieve this through the use of pretrained general-purpose models, and by developing three modules: image decompo......

Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion  Models

图片

This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after few inference steps. Accordingly, the time point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which, the model relies on cross-attention to plan text-oriented visual semantics, and a subsequent fidelity-improving stage, during which the model tries to generate images from previously planned semantics. Surprisingly, ignoring text conditions in the fidelity-improving stage not only reduces computation complexity, but also maintains model performance. This yields a simple and training-free method called TGATE for efficient generation, which caches the cross-attention output once it converges and keeps it fixed during the remaining inference steps. Our empirical study on the MS-COCO validation set confirms its effectiveness. The sourc......

InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image  Generation

图片

Tuning-free diffusion-based models have demonstrated significant potential in the realm of image personalization and customization. However, despite this notable progress, current models continue to grapple with several complex challenges in producing style-consistent image generation. Firstly, the concept of style is inherently underdetermined, encompassing a multitude of elements such as color, material, atmosphere, design, and structure, among others. Secondly, inversion-based methods are prone to style degradation, often resulting in the loss of fine-grained details. Lastly, adapter-based approaches frequently require meticulous weight tuning for each reference image to achieve a balance between style intensity and text controllability. In this paper, we commence by examining several compelling yet frequently overlooked observations. We then proceed to introduce InstantStyle, a framework designed to address these issues through the implementation of two key strategies: 1)......

Language Models as Compilers: Simulating Pseudocode Execution Improves  Algorithmic Reasoning in Language Models

图片

Algorithmic reasoning refers to the ability to understand the complex patterns behind the problem and decompose them into a sequence of reasoning steps towards the solution. Such nature of algorithmic reasoning makes it a challenge for large language models (LLMs), even though they have demonstrated promising performance in other reasoning tasks. Within this context, some recent studies use programming languages (e.g., Python) to express the necessary logic for solving a given instance/question (e.g., Program-of-Thought) as inspired by their strict and precise syntaxes. However, it is non-trivial to write an executable code that expresses the correct logic on the fly within a single inference call. Also, the code generated specifically for an instance cannot be reused for others, even if they are from the same task and might require identical logic to solve. This paper presents Think-and-Execute, a novel framework that decomposes the reasoning process of language models into ......

A Unified Editing Method for Co-Speech Gesture Generation via Diffusion  Inversion

图片

Diffusion models have shown great success in generating high-quality co-speech gestures for interactive humanoid robots or digital avatars from noisy input with the speech audio or text as conditions. However, they rarely focus on providing rich editing capabilities for content creators other than high-level specialized measures like style conditioning. To resolve this, we propose a unified framework utilizing diffusion inversion that enables multi-level editing capabilities for co-speech gesture generation without re-training. The method takes advantage of two key capabilities of invertible diffusion models. The first is that through inversion, we can reconstruct the intermediate noise from gestures and regenerate new gestures from the noise. This can be used to obtain gestures with high-level similarities to the original gestures for different speech conditions. The second is that this reconstruction reduces activation caching requirements during gradient calculation, makin......

Prompts As Programs: A Structure-Aware Approach to Efficient  Compile-Time Prompt Optimization

图片

Large language models (LLMs) can now handle longer and more complex inputs, which facilitate the use of more elaborate prompts. However, prompts often require some tuning to improve performance for deployment. Recent work has proposed automatic prompt optimization methods, but as prompt complexity and LLM strength increase, many prompt optimization techniques are no longer sufficient and a new approach is needed to optimize {\em meta prompt programs}. To address this, we introduce SAMMO, a framework for {\em compile-time} optimizations of metaprompt programs, which represent prompts as structured objects that allows for a rich set of transformations that can be searched over during optimization. We show that SAMMO generalizes previous methods and improves the performance of complex prompts on (1) instruction tuning, (2) RAG pipeline tuning, and (3) prompt compression, across several different LLMs. We make all code available open-source at https://github.com/microsoft/sammo .......

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion  Models Better

图片

Diffusion Models (DM) and Consistency Models (CM) are two types of popular generative models with good generation quality on various tasks. When training DM and CM, intermediate weight checkpoints are not fully utilized and only the last converged checkpoint is used. In this work, we find that high-quality model weights often lie in a basin which cannot be reached by SGD but can be obtained by proper checkpoint averaging. Based on these observations, we propose LCSC, a simple but effective and efficient method to enhance the performance of DM and CM, by combining checkpoints along the training trajectory with coefficients deduced from evolutionary search. We demonstrate the value of LCSC through two use cases: $\textbf{(a) Reducing training cost.}$ With LCSC, we only need to train DM/CM with fewer number of iterations and/or lower batch sizes to obtain comparable sample quality with the fully trained model. For example, LCSC achieves considerable training speedups for CM (23$......

相关文章:

Next-Scale Prediction、InstantStyle、Co-Speech Gesture Generation

本文首发于公众号:机器感知 Next-Scale Prediction、InstantStyle、Co-Speech Gesture Generation Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction We present Visual AutoRegressive modeling (VAR), a new generation p…...

class中 padding和margin的用法;

如果我们想要移动盒子等的位置 ,除了可以用相对定位和绝对定位还可以用margin 和paddinng; 结构如图所示 margin和padding的用法: padding和margin后面可以跟1或2或3或4个数,按照顺序分别是上,右,下&…...

单独使用YOLOV9的backbone网络

前言 YOLO系列的网络结构都是通过.yaml来进行配置的,当要单独想使用其中的backbone网络时,可以通过yaml配置文件来进行网络搭建。 backbone的yaml配置文件与网络结构 backbone:[[-1, 1, Silence, []], # conv<...

WordPress JS Support Ticket插件 RCE漏洞复现

0x01 产品简介 WordPress和WordPress plugin都是WordPress基金会的产品。JS Support Ticket是使用在其中的一套开源票务系统插件。 0x02 漏洞概述 WordPress中的JS Support Ticket插件存在未经上传漏洞,未经身份验证的攻击者可以上传恶意脚本的服务器,执行任意指令,从而获…...

加盟代理短视频无人直播项目,开启互联网线上经营新模式

随着短视频行业的快速发展和用户数量的不断增长&#xff0c;短视频无人直播项目成为了近年来备受关注的创业机会。本文将分享如何加盟代理短视频无人直播项目&#xff0c;开启属于自己的经营新模式。 一、了解无人直播项目的核心优势 短视频无人直播项目是结合了短视频与直播的…...

spring高级篇(一)

1、ApplicationContext与BeanFactory BeanFactory是ApplicationContext的父级接口&#xff1a;&#xff08;citlaltu查看类关系图&#xff09; 在springboot的启动类中&#xff0c;我们通过SpringApplication.run方法拿到的是继承了ApplicationContext的ConfigurableApplicatio…...

免费的GPT-3.5 API服务aurora

什么是 aurora &#xff1f; aurora 是利用免登录 ChatGPT Web 提供的无限制免费 GPT-3.5-Turbo API 的服务&#xff0c;支持使用 3.5 的 access 调用。 【注意】&#xff1a;仅 IP 属地支持免登录使用 ChatGPT的才可以使用&#xff08;也可以自定义 Baseurl 来绕过限制&#x…...

突破编程_C++_网络编程(Windows 套接字(处理 TCP 粘包问题))

1 TCP 协议与粘包问题概述 1.1 TCP 粘包的产生原因 TCP粘包问题的产生原因涉及多个方面&#xff0c;主要的原因如下&#xff1a; 首先&#xff0c;发送方在发送数据时&#xff0c;由于TCP协议为提高传输效率而采用的Nagle算法&#xff0c;可能会将多个小数据包合并成一个大数…...

【训练营】DateWhale——动手学大模型应用开发(更新中)

文章目录 写在前面大模型简介LLM简介RAG简介LangChain开发框架开发LLM应用的整体流程 写在前面 大模型时代从GPT爆发开始到现在已有一年多了&#xff0c;深度学习发展之快无法想象&#xff0c;一味感叹技术发展速度超越个人学习速度是没用的&#xff0c;倒不如花点时间参加一些…...

【学习笔记十九】EWM Yard Management概述及后台配置

一、EWM Yard堆场管理业务概述 1.Yard Management基本概念 YARD管理针对的是库房以外的区域&#xff0c;可以理解为入大门开始到库门之前的这部分的区域 堆场结构 像在仓库中一样&#xff0c;将相应仓位映射为堆场仓位&#xff0c;可将其分组到堆场分区。场地中可能具有以下结…...

【环境搭建】(五)Ubuntu22.04安装cuda_11.8.0+cudnn_8.6.0

一个愿意伫立在巨人肩膀上的农民...... 设备配置&#xff1a; 一、安装GCC 安装cuda之前&#xff0c;首先应该安装GCC&#xff0c;安装cuda需要用到GCC&#xff0c;否则报错。可以先使用下方指令在终端查看是否已经安装GCC。 gcc --version 如果终端打印如下则说明已经安装…...

【UE5.1】使用MySQL and MariaDB Integration插件——(3)表格形式显示数据

在上一篇&#xff08;【UE5.1】使用MySQL and MariaDB Integration插件——&#xff08;2&#xff09;查询&#xff09;基础上继续实现以表格形式显示查询到的数据的功能 效果 步骤 1. 在“WBP_Query”中将多行文本框替换未网格面板控件&#xff0c;该控件可以用表格形式布局…...

JVM复习

冯诺依曼模型与计算机处理数据过程相关联&#xff1a; 冯诺依曼模型&#xff1a; 输入/输出设备存储器输出设备运算器控制器处理过程&#xff1a; 提取阶段&#xff1a;输入设备传入原始数据&#xff0c;存储到存储器解码阶段&#xff1a;由CPU的指令集架构ISA将数值解…...

63、ARM/STM32中IIC相关学习20240417

完成温湿度传感器数据采集实验。 【思路&#xff1a;1.通过IIC通信原理&#xff0c;理解其通信过程&#xff0c;通过调用封装的IIC函数达成主机和从机之间&#xff1a;起始信号、终止信号、读、写数据的操作&#xff1b; 2.了解温湿度传感器控制芯片SI7006的工作原理&#…...

离岸人民币与人民币国际化

参考 什么是离岸人民币&#xff1f;它有什么用&#xff1f; - 知乎 “人民币就是人民币&#xff0c;为什么要在它前面加上离岸二字&#xff1f;” “既然有离岸人民币&#xff0c;是否有在岸人民币&#xff1f;” 今天我们就简单了解一下什么是离岸人民币。 离岸/在岸人民币…...

Linux平台上部署和运行Ollama的全面指南

Ollama的安装与配置 Ollama提供了一种简单的安装方法&#xff0c;只需一行命令即可完成安装&#xff0c;但是对于想要更深入了解和自定义安装的用户&#xff0c;我们也提供了手动安装的步骤。 快速安装 Ollama的安装极为简单&#xff0c;只需在终端中执行以下命令&#xff1…...

Web---robots协议详解

在Web中&#xff0c;robots协议&#xff08;也称为robots.txt&#xff09;是一种文本文件&#xff0c;用于向搜索引擎机器人&#xff08;通常称为爬虫&#xff09;提供指导&#xff0c;以指示它们哪些页面可以抓取&#xff0c;哪些页面应该忽略。robots.txt文件位于网站的根目录…...

华为海思校园招聘-芯片-数字 IC 方向 题目分享——第四套

华为海思校园招聘-芯片-数字 IC 方向 题目分享——第四套 (共9套&#xff0c;有答案和解析&#xff0c;答案非官方&#xff0c;仅供参考&#xff09;&#xff08;共九套&#xff0c;每套四十个选择题&#xff09; 部分题目分享&#xff0c;完整版获取&#xff08;WX:didadida…...

clipper一些数据结构(入门初识(一))

clipper一些数据结构&#xff08;一&#xff09; Clipper库是一个用于执行多边形裁剪&#xff08;clipping&#xff09;和偏移&#xff08;offsetting&#xff09;操作的开源C库。在Clipper库中&#xff0c;点和多边形&#xff08;polygon&#xff09;是基本的数据结构。Clipp…...

读《SQL基础教程 第二版 上》的一些总结

1. 数据库语言 DDL: Data Definition Language&#xff0c;数据定义语言&#xff08;库、表的操作&#xff09; DML: Data Manipulation Language&#xff0c; 数据操控语言&#xff08;对表中数据的增删改&#xff09; DQL: Data Query Language&#xff0c;数据库查询语言…...

EDI是什么:EDI系统功能介绍

EDI全称Electronic Data Interchange&#xff0c;中文名称是电子数据交换&#xff0c;也被称为“无纸化贸易”。EDI实现企业间&#xff08;B2B&#xff09;自动化通信&#xff0c;帮助贸易伙伴和组织完成更多的工作、加快物流时间并消除人为错误。 目前国内企业实现EDI通信大多…...

64B/66B GT Transceiver 配置

一、前言 前一篇文章已经讲述了64B/66B的编码原理&#xff0c;此篇文章来配置一下7系列GT的64B/66B编码。并讲述所对应的例子工程的架构&#xff0c;以及部分代码的含义。 二、IP核配置 1、打开7 Series FPGAs Transceiver Wizards&#xff0c;选择将共享逻辑放置在example …...

ES6: promise对象与回调地狱

ES6&#xff1a; promise对象与回调地狱 一、回调地狱二、Promise概述三、Promise的组成四、用函数封装Promise读取文件操作 一、回调地狱 在js中大量使用回调函数进行异步操作&#xff0c;而异步操作什么时候返回结果是不可控的&#xff0c;所以希望一段程序按我们制定的顺序执…...

Qt事件处理机制2-事件函数的传播

所有继承自QObject的类都有event函数&#xff0c;该函数用来处理自身的事件&#xff0c;函数定义如下&#xff1a; virtual bool QObject::event(QEvent *e)&#xff1b;Qt帮助文档&#xff1a; This virtual function receives events to an object and should return true i…...

【PDF.js】PDF文件预览

【PDF.js】PDF文件预览 一、PDF.js二、PDF.js 下载1、下载PDF.js2、在项目中引入3、屏蔽跨域错误 三、项目中使用四、说明五、实现效果 使用PDFJS实现pdf文件的预览&#xff0c;支持预览指定页、关键词搜索、缩略图、页面尺寸调整等等。 一、PDF.js 官方地址 文档地址 二、PD…...

从建表语句带你学习doris_表索引

1、doris建表概述 1.1、doris建表模板 CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [DATABASE.]table_name (column_definition1[,column_deinition2,......][,index_definition1,[,index_definition2,]] ) [ENGINE [olap|mysql|broker|hive]] [key_desc] [COMMENT "tabl…...

Linux CentOS 安装 MySQL 服务教程

Linux CentOS 安装 MySQL 服务教程 1. 查看系统和GNU C库(glibc)版本信息 1.1 查询机器 glibc 版本信息 glibc&#xff0c;全名GNU C Library&#xff0c;是大多数Linux发行版中使用的C库&#xff0c;为系统和应用程序提供核心的API接口。在Linux系统中&#xff0c;特别是在…...

MSSQL 命令行操作说明 sql server 2022 命令行下进行配置管理

说明&#xff1a;本文的内容是因为我在导入Access2019的 *.accdb 格式的数据时&#xff0c;总是出错的背景下&#xff0c;不得已搜索和整理了一下&#xff0c;如何用命令行进行sql server 数据库和用户管理的方法&#xff0c;作为从Access2019 直接导出数据到sql server 数据库…...

【系统分析师】系统安全分析与设计

文章目录 1、安全基础技术1.1 密码相关1.1.1对称加密1.1.2非对称加密1.1.3信息摘要1.1.4数字签名1.1.5数字信封 1.2 PKI公钥体系 2、信息系统安全2.1 保障层次2.2 网络安全2.2.1WIFI2.2.2 网络威胁与攻击2.2.3 安全保护等级 2.3计算机病毒与木马2.4安全防范体系 1、安全基础技术…...

ActiveMQ 07 集群配置

Active MQ 07 集群配置 官方文档 http://activemq.apache.org/clustering 主备集群 http://activemq.apache.org/masterslave.html Master Slave TypeRequirementsProsConsShared File System Master SlaveA shared file system such as a SANRun as many slaves as requ…...

大连开发区论坛网/seo优化技术培训中心

今天这篇文章来分析一下什么是前后端分离的相关知识&#xff0c;很多小伙伴不清楚到底什么是前端&#xff0c;什么是后端&#xff0c;什么是前后端分离。在说前后端分离之前&#xff0c;我们先要弄清楚这几个概念&#xff0c;大家可能经常听到前端&#xff0c;后端或者是大前端…...

域名网站可以做多个品牌产品吗/品牌推广与传播方案

安装一个类&#xff0c;该类扩展 ServiceBase来实现服务。在安装服务应用程序时由安装实用工具调用该类。 命名空间:System.ServiceProcess程序集:System.ServiceProcess&#xff08;在 system.serviceprocess.dll 中&#xff09; 语法 VBC#CF#JScript复制public class Service…...

做网站如何选择颜色/网页开发教程

github下载的Demo&#xff0c;很多时候使用到CocoaPods&#xff0c;有的时候因为依赖关系或者版本问题不能编译运行。出现例如The sandbox is not sync with the Podfile.lock问题时候 解决方案:关闭当前的工作空间&#xff0c;删除掉文件夹中的workspace&#xff0c;然后重新p…...

网站设置为起始页/软文文案范文

1 #以下代码是利用方法1 2 #-*- coding: utf-8 -*- 3 importrequests;4 importsys;5 importio;6 #重点&#xff1a;标准解析库 7 from bs4 importBeautifulSoup;8 sys.stdout io.TextIOWrapper(sys.stdout.buffer,encodingutf8); #改变标准输出的默认编码 9 #根据cookies访问后…...

网站正常打开速度慢/如何建立网站平台的步骤

概述&#xff1a; BOM&#xff1a;浏览器对象模型&#xff08;Browser Object Model&#xff09; BOM包含内容&#xff1a; window对象&#xff08;核心&#xff09;    location对象    navigator对象    screen对象    history对象 一、window对象 双重身份&…...

网站流量到底怎样赚钱的/做seo要投入什么

C语言提供了scanf函数&#xff0c;用于给程序输入数据。用户可以通过键盘&#xff0c;给指定的变量输入数据。printf函数是给终端输出数据&#xff0c;scanf函数是从终端接收&#xff08;获取&#xff09;用户的输入数据。 scanf函数的格式如下&#xff1a; int scanf (const…...