~ similar to 2606.02522· 16 results
Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more
The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.
The paper introduces pause-and-think-T, a reasoning-centric dataset and benchmark that enables compact Vision-Language Models to perform visually grounded, context-aware action suggestion, matching la…
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more
The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.
The paper introduces Multi-Clip Video (MCV) SafetyBench, a dataset demonstrating that the vulnerability of Multimodal Large Language Models (MLLMs) to jailbreaking increases with the diversity and num…
Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more
The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…
Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more
The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…
Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more
The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…
Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng +3 more
The paper introduces StemBind, a diagnostic benchmark that separates perception, rule induction, and answer selection in abstract visual reasoning, revealing that the primary failure point for MLLMs i…
Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu +3 more
The paper introduces PaSBench-Video, a comprehensive streaming video benchmark designed to rigorously test multimodal LLMs' ability to issue proactive safety warnings, finding that current models stru…
Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo +8 more
The paper introduces X-Stream, a new benchmark for multi-stream video understanding, and finds that current state-of-the-art MLLMs perform poorly when required to process multiple concurrent video str…
Yue Feng, Jingjing Li, Qijia Lu, Wei Ji +8 more
This paper addresses the challenge of detecting and explaining AI-manipulated segments within long, untrimmed videos by proposing a new benchmark and a coarse-to-fine forensic detection framework.
Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si +7 more
AdaCodec introduces a predictive visual coding scheme for video MLLMs, significantly improving efficiency and performance by transmitting only inter-frame changes and full reference frames when necess…
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more
The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…
Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang +8 more
VLA-Trace is a diagnostic framework that analyzes Vision-Language-Action (VLA) models by tracing their internal representations and external behaviors, revealing that while these models are good at vi…
Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai +8 more
VISUALTHINK-VLA introduces a visual intermediate-reasoning framework that guides action prediction using compact visual evidence, achieving high accuracy and significantly low latency for real-time Vi…
NeuroLip proposes an event-based spatiotemporal framework for visual speaker recognition that achieves robust cross-scene generalization by capturing fine-grained lip dynamics, outperforming existing…