~ similar to 2605.30557· 20 results
Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang +5 more
The paper introduces SpatialAct, a challenging benchmark that reveals a significant 'reasoning-to-action gap,' showing that current VLMs struggle to maintain coherent spatial understanding and perform…
Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma +2 more
The paper proposes GASP, a framework that injects fundamental geometric priors directly into Vision-Language Models (VLMs) using ground-truth video geometry, significantly enhancing 3D spatial reasoni…
Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more
The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…
Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao +3 more
The paper proposes Astra, an agentic framework that equips Vision-Language Models (VLMs) with the ability to perform spatial reasoning by actively generating and utilizing imagined visual evidence fro…
The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…
Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai +4 more
The paper introduces EASE, a method that enhances multimodal Reinforcement Learning with Verifiable Rewards (RLVR) by providing spatial attention supervision anchored to visual evidence, significantly…
Reasmory introduces a structured programming framework that uses explicit 3D memory and a Domain-Specific Language (DSL) to reliably enhance Vision-Language Models' spatial reasoning capabilities, ach…
Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su +7 more
This paper proposes SpatialClaw, a training-free framework for spatial reasoning that enables open-ended, complex 3D/4D spatial reasoning.
The paper proposes an agentic pipeline for spatial reasoning by introducing a dynamic cognitive map and Spatial Assertion Codes (SAC), achieving state-of-the-art performance on complex spatial tasks.
Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma +7 more
The paper proposes EAGLE, a novel evidence-aligned multi-agent framework, demonstrating that requiring shared visual evidence among agents is crucial for achieving reliable and trustworthy consensus i…
Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more
The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…
Mahtab Bigverdi, Lindsey Li, Weikai Huang, Yiming Liu +7 more
This paper introduces Imaginative Perception Tokens (IPT) to improve spatial reasoning in vision language models.
Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more
The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…
The paper argues that benchmarking Vision-Language Models (VLMs) for urban perception must treat human disagreement and non-response as key measurement outcomes, rather than assuming perfect consensus…
Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more
The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…
ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu +2 more
The paper introduces MOV-Bench, a challenging benchmark for multi-hop audio-visual reasoning, and proposes AOP-Agent, an agentic framework that significantly improves open-source Omni-LLMs' ability to…
Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou +4 more
The paper introduces ERGeoBench, a comprehensive diagnostic benchmark designed to evaluate the fine-grained capabilities of multimodal large language models (MLLMs) for embodied geo-localization acros…