ArXivCSExplorer
Built by Teycir Ben Soltane

Similar to 2606.02522 · 16 results

cs.CV, cs.AI · Recent · May 28, 2026

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more

The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.

cs.CV, cs.AI · Recent · May 30, 2026

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

Shivam Singh, Saptarshi Majumdar, Pratik Prabhanjan, Zicheng Liu +1 more

The paper introduces pause-and-think-T, a reasoning-centric dataset and benchmark that enables compact Vision-Language Models to perform visually grounded, context-aware action suggestion, matching la…

cs.CV · Recent · Jun 1, 2026

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more

The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.

cs.CV, cs.AI, cs.CL · Recent · Jun 1, 2026

Jailbreaking Multimodal Large Language Models using Multi-Clip Video

Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim

The paper introduces Multi-Clip Video (MCV) SafetyBench, a dataset demonstrating that the vulnerability of Multimodal Large Language Models (MLLMs) to jailbreaking increases with the diversity and num…

cs.CL, cs.CV · Recent · Jun 1, 2026

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more

The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…

cs.CV, cs.AI · Recent · May 31, 2026

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more

The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…

cs.AI · Recent · May 27, 2026

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more

The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…

cs.CV, cs.AI · Recent · May 29, 2026

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng +3 more

The paper introduces StemBind, a diagnostic benchmark that separates perception, rule induction, and answer selection in abstract visual reasoning, revealing that the primary failure point for MLLMs i…

cs.CL, cs.AI, cs.CV · Recent · Jun 1, 2026

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu +3 more

The paper introduces PaSBench-Video, a comprehensive streaming video benchmark designed to rigorously test multimodal LLMs' ability to issue proactive safety warnings, finding that current models stru…

cs.CV · Recent · Jun 1, 2026

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo +8 more

The paper introduces X-Stream, a new benchmark for multi-stream video understanding, and finds that current state-of-the-art MLLMs perform poorly when required to process multiple concurrent video str…

cs.CV · Recent · Jun 1, 2026

Explainable Forensics of Manipulated Segments in Untrimmed Long Videos

Yue Feng, Jingjing Li, Qijia Lu, Wei Ji +8 more

This paper addresses the challenge of detecting and explaining AI-manipulated segments within long, untrimmed videos by proposing a new benchmark and a coarse-to-fine forensic detection framework.

cs.CV, cs.AI, cs.CL · Recent · Jun 1, 2026

AdaCodec: A Predictive Visual Code for Video MLLMs

Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si +7 more

AdaCodec introduces a predictive visual coding scheme for video MLLMs, significantly improving efficiency and performance by transmitting only inter-frame changes and full reference frames when necess…

cs.CV · Recent · Jun 1, 2026

Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more

The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…

cs.AI · Recent · May 28, 2026

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang +8 more

VLA-Trace is a diagnostic framework that analyzes Vision-Language-Action (VLA) models by tracing their internal representations and external behaviors, revealing that while these models are good at vi…

cs.CV, cs.AI · Recent · May 28, 2026

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai +8 more

VISUALTHINK-VLA introduces a visual intermediate-reasoning framework that guides action prediction using compact visual evidence, achieving high accuracy and significantly low latency for real-time Vi…

cs.CV, cs.AI, cs.CR · Recent · Apr 17, 2026

NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition

Junguang Yao, Wenye Liu, Stjepan Picek, Yue Zheng

NeuroLip proposes an event-based spatiotemporal framework for visual speaker recognition that achieves robust cross-scene generalization by capturing fine-grained lip dynamics, outperforming existing…
