Similar to 2605.31603 · 14 results
V-LynX is a framework that enhances Video LLMs by integrating new modalities into their existing token interface, achieving state-of-the-art performance across diverse video understanding tasks.
Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more
The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.
The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more
The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…
Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu +2 more
The paper introduces MOV-Bench, a challenging benchmark for multi-hop audio-visual reasoning, and proposes AOP-Agent, an agentic framework that significantly improves open-source Omni-LLMs' ability to…
Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo +8 more
The paper introduces X-Stream, a new benchmark for multi-stream video understanding, and finds that current state-of-the-art MLLMs perform poorly when required to process multiple concurrent video str…
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more
The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…
Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim +2 more
MORPHOS is a novel autoregressive framework that generates dynamic 3D assets (such as meshes and radiance fields) from videos by using a unified 4D representation to ensure temporal consistency and handl…
Chong Bao, Shichen Liu, Lijun Yu, David Futschik +8 more
The paper introduces Archon, a unified, fully pretrained multimodal model that addresses the challenge of generating holistic digital humans by integrating seven modalities (including text, audio, mot…
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more
The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.
Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang +2 more
The paper introduces TRON, an online, rule-verifiable environment substrate that generates an unbounded stream of fresh, controllable visual reasoning training instances, significantly improving RL pe…
Qixin Hu, Shuai Yang, Wei Huang, Song Han +1 more
LongLive-RAG proposes a novel Retrieval-Augmented Generation (RAG) framework to stabilize and improve the quality of long-horizon video generation by treating the entire generated history as a searcha…
The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…
The paper introduces pause-and-think-T, a reasoning-centric dataset and benchmark that enables compact Vision-Language Models to perform visually grounded, context-aware action suggestion, matching la…