~ similar to 2605.30409· 14 results
VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…
Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo +8 more
The paper introduces X-Stream, a new benchmark for multi-stream video understanding, and finds that current state-of-the-art MLLMs perform poorly when required to process multiple concurrent video str…
Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai +5 more
TunerDiT introduces a training-free progressive steering method to enhance multi-event video generation using Diffusion Transformers, achieving state-of-the-art performance by explicitly managing even…
Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler +1 more
Real2SAM2Real introduces a framework that uses explicit 3D caches, derived from 3D lifting models, to provide robust geometric guidance to Video Diffusion Models, significantly improving spatiotempora…
Jingyun Liang, Min Wei, Shikai Li, Yizeng Han +4 more
The paper proposes a novel render-free framework that conditions video diffusion models directly on compressed 3D human mesh tokens, enabling robust 3D-aware human motion control without relying on re…
The paper systematically characterizes column-level activation sparsity across various diffusion model architectures, demonstrating that element-level sparsity metrics significantly overestimate the a…
Fei Deng, Yanwu Xu, Zhipeng Bao, Zhixing Zhang +3 more
BlazeEdit is a highly efficient, generalist image-to-image diffusion model designed for on-device deployment, consolidating multiple editing tasks into a compact 195M parameter model that runs quickly…
Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong +4 more
The paper proposes a compression pipeline combining few-step distillation and low-bit quantization to significantly reduce the deployment cost and parameter footprint of large dual-expert video diffus…
Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu +8 more
Lumos-Nexus is a training-efficient framework that enhances video generation quality by progressively bridging generation from a lightweight model to a high-fidelity generator in a shared latent space…
Qixin Hu, Shuai Yang, Wei Huang, Song Han +1 more
LongLive-RAG proposes a novel Retrieval-Augmented Generation (RAG) framework to stabilize and improve the quality of long-horizon video generation by treating the entire generated history as a searcha…
The paper proposes a novel cross-axis feature fusion architecture and an auxiliary joint-difference prediction task to significantly improve text-based 3D human motion editing by better understanding…
The paper introduces NaRA, a noise-aware LoRA technique that dynamically adapts fine-tuning parameters based on the noise level during diffusion, significantly improving the performance of Diffusion L…
The paper proposes VRPO, a reinforcement learning-based optimization strategy that replaces static alignment losses in diffusion models, significantly improving both convergence and image fidelity.
Chatterbox-Flash introduces a prior-calibrated block diffusion model for zero-shot TTS that achieves high-fidelity, streaming synthesis with significantly lower computational overhead than existing me…