ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2605.27891· 14 results

cs.CVcs.AIRecentMay 31, 2026

Knowledge-Intensive Video Generation

Chenxu Wang, Mingda Chen

The paper introduces Knowledge-Intensive Video Generation (KIVI) as a challenging benchmark for evaluating video models on factuality and practical usefulness, showing that current state-of-the-art sy…

View →
cs.SDcs.AIcs.CVRecentJun 1, 2026

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

JenBridge is a novel, adaptive framework that generates high-fidelity, long-form video soundtracks, significantly improving narrative coherence and naturalness across scene transitions.

View →
cs.CVcs.AIRecentMay 29, 2026

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai +5 more

TunerDiT introduces a training-free progressive steering method to enhance multi-event video generation using Diffusion Transformers, achieving state-of-the-art performance by explicitly managing even…

View →
cs.CVcs.AIcs.LGRecentMay 29, 2026

Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation

Nan Bao, Yifan Zhao, Wenzhuang Wang, Jia Li

The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…

View →
cs.AIcs.MMcs.SDRecentMay 27, 2026

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen +14 more

The paper introduces MTAVG-Bench 2.0, a new benchmark designed to diagnose high-level failure modes of cinematic expressiveness in multi-talker audio-video generation, showing that even advanced model…

View →
cs.CVRecentJun 1, 2026

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Qixin Hu, Shuai Yang, Wei Huang, Song Han +1 more

LongLive-RAG proposes a novel Retrieval-Augmented Generation (RAG) framework to stabilize and improve the quality of long-horizon video generation by treating the entire generated history as a searcha…

View →
cs.CVcs.AIRecentMay 28, 2026

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan +3 more

VideoMLA introduces a novel Multi-Head Latent Attention (MLA) mechanism that replaces per-head KV caches with a shared low-rank content latent, significantly reducing memory and improving throughput f…

View →
cs.GRcs.AIcs.CVRecentMay 31, 2026

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

The paper proposes a sequence-alignment framework using Soft Dynamic Time Warping to evaluate audio-driven talking-head generation, demonstrating that this approach provides more robust and fair compa…

View →
cs.CVcs.AIeess.IVRecentJun 1, 2026

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

Jingyun Liang, Min Wei, Shikai Li, Yizeng Han +4 more

The paper proposes a novel render-free framework that conditions video diffusion models directly on compressed 3D human mesh tokens, enabling robust 3D-aware human motion control without relying on re…

View →
cs.CVRecentJun 1, 2026

From Zero to Hero: Training-Free Custom Concept Spawning in World Models

Kiymet Akdemir, Pinar Yanardag

The paper introduces SPAWN, a training-free method that allows users to inject specified visual concepts into existing autoregressive world models, enabling controllable scene composition beyond the i…

View →
cs.ROcs.CVRecentJun 1, 2026

RoboDream: Compositional World Models for Scalable Robot Data Synthesis

Junjie Ye, Rong Xue, Basile Van Hoorick, Runhao Li +5 more

RoboDream introduces an embodiment-centric world model that synthesizes photorealistic, physically feasible robot demonstrations by decoupling motion generation from environment synthesis, significant…

View →
cs.CVRecentJun 1, 2026

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He +2 more

The paper proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that effectively balances high-level semantic control and low-level identity fidelity for text-to-video generat…

View →
cs.CLcs.AIRecentMay 27, 2026

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang +1 more

The paper introduces STORYLENSWRITER, a novel framework that significantly improves personalized story rewriting by incorporating context-aware narrative enrichment, outperforming style-only adaptatio…

View →
cs.CVcs.AIRecentMay 28, 2026

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more

The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.

View →