StressDream steers video world model imaginations toward high-impact yet plausible outcomes, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes.
Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.
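To make the two-objective idea concrete, here is a minimal toy sketch of optimizing an initial noise vector under a semantic term and a plausibility term. It is not the authors' implementation, and everything in it is an assumption made for illustration: `toy_world_model` is a hypothetical stand-in for a diffusion-based WM, the squared distance to `target` stands in for the VLM-based semantic objective, and the norm-based regularizer (keeping the squared noise norm per dimension near 1, its expected value under a standard Gaussian) is only one possible way to discourage OOD noise. The weight `lam`, step size `lr`, and step count are likewise illustrative.

```python
import math
import random


def toy_world_model(noise, weights):
    """Hypothetical stand-in for a diffusion WM: maps initial noise to 'video' features."""
    return [math.tanh(sum(w * z for w, z in zip(row, noise))) for row in weights]


def semantic_loss_and_grad(noise, weights, target):
    """Stand-in for a VLM-based semantic objective: distance to target-event features."""
    video = toy_world_model(noise, weights)
    residual = [v - t for v, t in zip(video, target)]
    loss = 0.5 * sum(r * r for r in residual)
    # Chain rule through tanh: d tanh(a)/da = 1 - tanh(a)^2.
    scaled = [r * (1.0 - v * v) for r, v in zip(residual, video)]
    grad = [
        sum(weights[i][j] * scaled[i] for i in range(len(weights)))
        for j in range(len(noise))
    ]
    return loss, grad


def plausibility_loss_and_grad(noise):
    """Assumed regularizer: keep ||noise||^2 / d near 1, its mean under N(0, I)."""
    d = len(noise)
    ratio = sum(z * z for z in noise) / d
    loss = 0.5 * (ratio - 1.0) ** 2
    grad = [(ratio - 1.0) * 2.0 * z / d for z in noise]
    return loss, grad


def steer_noise(noise, weights, target, lam=10.0, lr=0.1, steps=500):
    """Gradient descent on semantic loss + lam * plausibility loss."""
    noise = list(noise)
    for _ in range(steps):
        _, g_sem = semantic_loss_and_grad(noise, weights, target)
        _, g_pl = plausibility_loss_and_grad(noise)
        noise = [z - lr * (gs + lam * gp) for z, gs, gp in zip(noise, g_sem, g_pl)]
    return noise


rng = random.Random(0)
d, k = 64, 8  # noise dimension and feature dimension (toy sizes)
weights = [[rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)] for _ in range(k)]
# Target features produced by some other Gaussian noise sample, so they are reachable.
target = toy_world_model([rng.gauss(0.0, 1.0) for _ in range(d)], weights)
init_noise = [rng.gauss(0.0, 1.0) for _ in range(d)]

steered = steer_noise(init_noise, weights, target)
initial_semantic, _ = semantic_loss_and_grad(init_noise, weights, target)
final_semantic, _ = semantic_loss_and_grad(steered, weights, target)
norm_ratio = sum(z * z for z in steered) / d
```

In this toy setting, `final_semantic` should fall below `initial_semantic` while `norm_ratio` stays close to 1, mirroring the trade-off the abstract describes: the semantic term pulls the imagination toward the target event, and the plausibility term keeps the noise in the region the model was trained on.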
The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
The paper introduces SAVE, a framework that uses on-policy feedback and the valu…
Turning Video Models into Generalist Robot Policies
The paper proposes VERA, a decoupled policy that uses an action-free video world…
X4Val: Learning Neural Surrogates for Variance-Reduced Policy Evaluation
This paper introduces X4Val, a framework for variance-reduced real-world metric…
Policy and World Modeling Co-Training for Language Agents
The paper proposes PaW, a co-training framework that uses standard RL rollouts t…
COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
COMAP introduces a novel co-evolutionary framework that simultaneously updates t…
Drift Q-Learning
DriftQL introduces a novel, efficient offline RL method that combines a drift-ba…
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
The paper introduces MiraBench, a new benchmark that evaluates the action-condit…
RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation
The paper introduces RoboTrustBench, a comprehensive benchmark that evaluates th…