VISUALTHINK-VLA introduces a visual intermediate-reasoning framework that guides action prediction with compact visual evidence, achieving high accuracy at sub-second latency for real-time Vision-Language-Action policies.
Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our guiding principle is to steer action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions used for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
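The reported speedup follows directly from the two BridgeData V2 step latencies quoted in the abstract. As a quick check, the sketch below recomputes the ratio; the `speedup` helper and the variable names are illustrative assumptions and are not taken from the paper.

```python
# Step latencies on BridgeData V2 as quoted in the abstract (seconds).
ecot_latency_s = 8.377         # ECoT baseline
visualthink_latency_s = 0.367  # VISUALTHINK-VLA


def speedup(baseline_s: float, new_s: float) -> float:
    """Return the ratio of the baseline latency to the new latency."""
    if baseline_s <= 0 or new_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / new_s


ratio = speedup(ecot_latency_s, visualthink_latency_s)
print(f"{ratio:.1f}x speedup")
```

The ratio is a per-step figure, so it describes the latency of a single control step and not end-to-end task completion time.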
VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing
VLA-Trace is a diagnostic framework that analyzes Vision-Language-Action (VLA) m…
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA introduces a unified embodied foundation model that extends vision-lang…
VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models
VLA-Pro is a plug-and-play framework that enhances cross-task generalization in…
BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterou…
BORA is an offline-to-online RL framework that enhances dexterous VLA models for…
Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reaso…
The paper proposes CSMR, a cognitive scheduling framework that allows a language…
Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial…
The paper evaluates the performance of Vision-Language Models (VLMs) in a collab…
Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Groun…
The paper proposes Visual Gradient Steering (VGS), a method that decomposes the…
Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Chal…
The paper proposes a unified framework that decouples long-video reasoning into…