Papers similar to 2605.29563

~ similar to 2605.29563· 17 results

cs.CVcs.AIcs.CLRecentMay 29, 2026

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang +5 more

The paper introduces SpatialAct, a challenging benchmark that reveals a significant 'reasoning-to-action gap,' showing that current VLMs struggle to maintain coherent spatial understanding and perform…

View →

cs.CVcs.AIRecentMay 28, 2026

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma +2 more

The paper proposes GASP, a framework that injects fundamental geometric priors directly into Vision-Language Models (VLMs) using ground-truth video geometry, significantly enhancing 3D spatial reasoni…

View →

cs.ROcs.AIRecentMay 29, 2026

TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues

Tianle Zeng, Hanjing Ye, Jianwei Peng, Jingwen Yu +2 more

The paper proposes a memory-augmented, traversability-aware framework for outdoor VLN that maintains stable, goal-consistent guidance even when semantic cues are interrupted or unavailable.

View →

cs.CVRecentJun 4, 2026

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao +3 more

The paper proposes Astra, an agentic framework that equips Vision-Language Models (VLMs) with the ability to perform spatial reasoning by actively generating and utilizing imagined visual evidence fro…

View →

cs.ROcs.AIcs.LGRecentMay 29, 2026

Continuous Reasoning for Vision-Language-Action

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

The paper proposes Continuous Reasoning for Vision-Language-Action (VLA) models, arguing that effective reasoning must be a shared, verifiable internal latent space rather than discrete text tokens, l…

View →

cs.LGcs.AIRecentMay 31, 2026

MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing

Shiyan Liu, Bohan Tan, Yaoxin Wu, Yan Jin

MViewRouter proposes a multi-view framework that internalizes geometric equivariance using a Multi-view Alternating Attention mechanism to improve generalization and stabilize training for combinatori…

View →

cs.CVcs.AIEmpiricalRecentJun 11, 2026

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su +7 more

This paper proposes SpatialClaw, a training-free framework for spatial reasoning that enables open-ended, complex 3D/4D spatial reasoning.

View →

cs.ROcs.AIRecentJun 2, 2026

Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

Roohan Ahmed Khan, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou

The paper introduces AgenticRL, a self-refining reinforcement learning framework that uses a multimodal GPT agent to automatically design, refine, and deploy reward functions for complex UAV navigatio…

View →

cs.CVcs.AIcs.CLRecentMay 28, 2026

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton +2 more

This paper introduces a new evaluation framework, SpatialUncertain, demonstrating that current Vision-Language Models (VLMs) are prone to overconfident and incorrect answers to spatial questions when…

View →

cs.CVcs.CLRecentMay 31, 2026

Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang

Reasmory introduces a structured programming framework that uses explicit 3D memory and a Domain-Specific Language (DSL) to reliably enhance Vision-Language Models' spatial reasoning capabilities, ach…

View →

cs.CVRecentJun 1, 2026

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

Wei Deng, Xianlin Zhang, Mengshi Qi

The paper proposes an agentic pipeline for spatial reasoning by introducing a dynamic cognitive map and Spatial Assertion Codes (SAC), achieving state-of-the-art performance on complex spatial tasks.

View →

cs.AIRecentMay 27, 2026

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen +1 more

This paper introduces MCTS-Guided Group Relative Policy Optimization (M-GRPO) to enhance LLM spatial reasoning by improving the decomposition of complex tasks into optimal sub-tasks.

View →

cs.CVcs.AIRecentMay 28, 2026

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai +8 more

VISUALTHINK-VLA introduces a visual intermediate-reasoning framework that guides action prediction using compact visual evidence, achieving high accuracy and significantly low latency for real-time Vi…

View →

cs.CVcs.AIRecentMay 28, 2026

VLM3: Vision Language Models Are Native 3D Learners

Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more

The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…

View →

cs.CVcs.AIRecentMay 27, 2026

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…

View →

cs.CVcs.CLRecentMay 30, 2026

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more

The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…

View →

cs.CVcs.RORecentJun 2, 2026

Exploring Easy Boosts for Lidar Semantic Scene Completion

Tetiana Martyniuk, Jonathan Seele, Alexandre Boulch, Gilles Puy +2 more

The paper shows that simple, non-architectural enhancements, such as adding semantic pseudo-labels and visibility information, can significantly boost Lidar Semantic Scene Completion performance.

View →