The paper proposes VERA, a decoupled policy that uses an action-free video world model combined with an embodiment-specific Inverse Dynamics Model (IDM) to achieve generalizable, zero-shot robot control across different hardware.
Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as is and train an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without retraining the IDM, and the IDM can be trained independently with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we call the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.
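The decoupling described in the abstract can be illustrated with a toy sketch. This is not the VERA implementation: the abstract does not say how the Jacobian enters the IDM, so everything below is an assumption made for illustration. A planar 2-link arm stands in for the robot, `toy_video_planner` is a hypothetical stand-in that returns future fingertip positions instead of video frames, and `jacobian_idm` uses ordinary damped least-squares differential kinematics as a placeholder for a learned, Jacobian-based IDM.

```python
import math

# Hypothetical toy embodiment: a planar 2-link arm (not the paper's robots).
LINK_LENGTHS = (1.0, 1.0)


def forward_kinematics(q):
    """Fingertip (x, y) for joint angles q = (q1, q2)."""
    l1, l2 = LINK_LENGTHS
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return (x, y)


def jacobian(q):
    """2x2 Jacobian of the fingertip position with respect to the joints."""
    l1, l2 = LINK_LENGTHS
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return ((-l1 * s1 - l2 * s12, -l2 * s12),
            (l1 * c1 + l2 * c12, l2 * c12))


def toy_video_planner(current_xy, goal_xy, horizon=5):
    """Embodiment-agnostic stand-in for an action-free video model.

    Assumption: instead of frames, it returns a short sequence of future
    fingertip positions, linearly interpolated towards the goal.
    """
    return [
        tuple(c + (g - c) * k / horizon for c, g in zip(current_xy, goal_xy))
        for k in range(1, horizon + 1)
    ]


def jacobian_idm(q, target_xy, damping=0.05):
    """Embodiment-specific step: joint delta that moves the fingertip
    towards target_xy, via damped least squares
    dq = J^T (J J^T + damping^2 I)^-1 dx.
    """
    x, y = forward_kinematics(q)
    dx, dy = target_xy[0] - x, target_xy[1] - y
    J = jacobian(q)
    # Entries of the symmetric 2x2 matrix A = J J^T + damping^2 I.
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b
    # Solve A w = (dx, dy) in closed form, then map back with J^T.
    w0 = (d * dx - b * dy) / det
    w1 = (-b * dx + a * dy) / det
    return (J[0][0] * w0 + J[1][0] * w1, J[0][1] * w0 + J[1][1] * w1)


def run_closed_loop(q, goal_xy, replans=40):
    """Replan from the current state at every step and execute only the
    first planned step, which is what makes the loop closed."""
    for _ in range(replans):
        plan = toy_video_planner(forward_kinematics(q), goal_xy)
        dq = jacobian_idm(q, plan[0])
        q = (q[0] + dq[0], q[1] + dq[1])
    return q


if __name__ == "__main__":
    q_final = run_closed_loop((0.3, 0.8), (1.0, 1.0))
    print(forward_kinematics(q_final))
```

In this sketch only `forward_kinematics`, `jacobian`, and `jacobian_idm` depend on the arm, so changing the embodiment would leave `toy_video_planner` untouched, which mirrors the interchangeability the abstract claims for the planner and the IDM.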