~ similar to 2606.02120· 17 results
The paper introduces pause-and-think-T, a reasoning-centric dataset and benchmark that enables compact Vision-Language Models to perform visually grounded, context-aware action suggestion, matching la…
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more
The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…
The paper proposes Continuous Reasoning for Vision-Language-Action (VLA) models, arguing that effective reasoning must be a shared, verifiable internal latent space rather than discrete text tokens, l…
Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more
The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.
The paper introduces AgenticVBench, a comprehensive benchmark of 100 real-world video post-production tasks, and finds that even the best AI agents perform significantly worse than human experts on th…
Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai +8 more
VISUALTHINK-VLA introduces a visual intermediate-reasoning framework that guides action prediction using compact visual evidence, achieving high accuracy and significantly low latency for real-time Vi…
The paper proposes a zero-shot reason-then-retrieve pipeline using Qwen3.5-27B to solve the challenging task of composed video retrieval (CoVR-R), achieving high performance on both validation and bli…
Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang +5 more
The paper introduces SpatialAct, a challenging benchmark that reveals a significant 'reasoning-to-action gap,' showing that current VLMs struggle to maintain coherent spatial understanding and perform…
The paper demonstrates that LLM performance in zero-shot annotation is significantly limited by the alignment between the model's internal understanding and the task definition, showing that prompt-ba…
The paper introduces ConTrans, a novel local-global multi-scale encoder that combines convolutional and transformer features to significantly improve zero-shot temporal action localization by capturin…
Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar +2 more
The paper introduces RoboTrustBench, a comprehensive benchmark that evaluates the trustworthiness of video world models for robotic manipulation across challenging scenarios, finding that current mode…
The paper introduces a diagnostic framework to determine if World-Action Models (WAMs) provide genuinely actionable behavioral improvements beyond simply achieving task success, finding that WAMs ofte…
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
The paper formally addresses the challenging question of cross-domain transferability of latent predictive models by proposing a structured framework that quantifies the relationship between source an…
Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang +6 more
The paper introduces MiraBench, a new benchmark that evaluates the action-conditioned reliability of robotic world models, finding that visual fidelity is insufficient and that optimism bias is a perv…
The paper addresses the difficulty of using general vision-language models (VLMs) for fine-grained driver behavior recognition by creating a new, richly described dataset and demonstrating that fine-t…