~ similar to 2605.27891· 14 results
The paper introduces Knowledge-Intensive Video Generation (KIVI) as a challenging benchmark for evaluating video models on factuality and practical usefulness, showing that current state-of-the-art sy…
JenBridge is a novel, adaptive framework that generates high-fidelity, long-form video soundtracks, significantly improving narrative coherence and naturalness across scene transitions.
Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai +5 more
TunerDiT introduces a training-free progressive steering method to enhance multi-event video generation using Diffusion Transformers, achieving state-of-the-art performance by explicitly managing even…
The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…
Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen +14 more
The paper introduces MTAVG-Bench 2.0, a new benchmark designed to diagnose high-level failure modes of cinematic expressiveness in multi-talker audio-video generation, showing that even advanced model…
Qixin Hu, Shuai Yang, Wei Huang, Song Han +1 more
LongLive-RAG proposes a novel Retrieval-Augmented Generation (RAG) framework to stabilize and improve the quality of long-horizon video generation by treating the entire generated history as a searcha…
VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…
The paper proposes a sequence-alignment framework using Soft Dynamic Time Warping to evaluate audio-driven talking-head generation, demonstrating that this approach provides more robust and fair compa…
Jingyun Liang, Min Wei, Shikai Li, Yizeng Han +4 more
The paper proposes a novel render-free framework that conditions video diffusion models directly on compressed 3D human mesh tokens, enabling robust 3D-aware human motion control without relying on re…
The paper introduces SPAWN, a training-free method that allows users to inject specified visual concepts into existing autoregressive world models, enabling controllable scene composition beyond the i…
Junjie Ye, Rong Xue, Basile Van Hoorick, Runhao Li +5 more
RoboDream introduces an embodiment-centric world model that synthesizes photorealistic, physically feasible robot demonstrations by decoupling motion generation from environment synthesis, significant…
Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He +2 more
The paper proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that effectively balances high-level semantic control and low-level identity fidelity for text-to-video generat…
Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang +1 more
The paper introduces STORYLENSWRITER, a novel framework that significantly improves personalized story rewriting by incorporating context-aware narrative enrichment, outperforming style-only adaptatio…
Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more
The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.