The paper introduces WorldCoder-Bench, a comprehensive benchmark and evaluation protocol for testing LLMs' ability to autonomously generate complex, physically grounded, and interactive 3D web worlds.
Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds from natural language. Browser-native 3D, commonly built with Three.js, is a natural next frontier: generated programs must integrate assets, obey spatial and physical constraints, and keep user-facing controls synchronized with hidden runtime state. Existing web-generation benchmarks and evaluators, however, largely observe only pixels or DOM nodes, while the mechanics of a Three.js world unfold inside an opaque <canvas>. We introduce WorldCoder-Bench, a benchmark for autonomous, physically grounded 3D world synthesis. WorldCoder-Bench contains 2,026 expert-curated tasks across Simulation, Rendering, and Application scenarios, with optional .glb assets and hidden behavioral contracts. We further propose StateProbe, an execution-based protocol that probes generated programs in a sandboxed browser and verifies hidden, mutation-hardened contracts over runtime states and transitions. Beyond verification coverage, we report Return on Automation and Time Efficiency Multiplier to measure correctness-adjusted cost and time savings. Across nine frontier models, the best system reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures dominated by state-schema drift and broken interaction chains rather than missing scene elements. Utility metrics further show that cheap or fast models can still provide substantial value on easier domains. WorldCoder-Bench is available at https://anonymous.4open.science/r/WorldCoder-Bench/.
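To make the idea of verifying contracts over runtime states and transitions concrete, the following is a minimal sketch, not StateProbe itself. The abstract does not specify the protocol's API or the exact definition of verification coverage, so everything here is an assumption made for illustration: the names `Contract`, `state_holds`, `transition_holds`, and `verification_coverage`, the state keys `ball_y`, `paused`, and `score`, and the reading of coverage as the fraction of contracts a state trace satisfies. The trace stands in for snapshots that a sandboxed browser harness might record from a generated program.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# One snapshot of runtime state exposed by a generated program (assumed shape).
State = Dict[str, Any]


@dataclass
class Contract:
    """Hypothetical behavioral contract: a name plus a check over a state trace."""
    name: str
    check: Callable[[List[State]], bool]


def state_holds(name: str, predicate: Callable[[State], bool]) -> Contract:
    """Contract that passes only if the predicate holds in every snapshot."""
    return Contract(name, lambda trace: all(predicate(s) for s in trace))


def transition_holds(name: str, predicate: Callable[[State, State], bool]) -> Contract:
    """Contract over every consecutive (before, after) pair of snapshots."""
    def check(trace: List[State]) -> bool:
        # A transition contract needs at least one transition to observe.
        return len(trace) >= 2 and all(
            predicate(before, after) for before, after in zip(trace, trace[1:])
        )
    return Contract(name, check)


def verification_coverage(trace: List[State], contracts: List[Contract]) -> float:
    """Fraction of contracts satisfied (assumed definition, for illustration)."""
    if not contracts:
        return 0.0
    passed = 0
    for contract in contracts:
        try:
            if contract.check(trace):
                passed += 1
        except (KeyError, TypeError):
            # A missing or mistyped state key counts as a failed contract.
            pass
    return passed / len(contracts)


# Illustrative trace: a ball falling under gravity while the world is unpaused.
trace = [
    {"ball_y": 2.0, "paused": False},
    {"ball_y": 1.5, "paused": False},
    {"ball_y": 1.1, "paused": False},
]

contracts = [
    state_holds("ball stays above the floor", lambda s: s["ball_y"] >= 0.0),
    transition_holds("ball falls while unpaused", lambda a, b: b["ball_y"] < a["ball_y"]),
    state_holds("score is exposed", lambda s: s["score"] >= 0),  # key absent from trace
]

coverage = verification_coverage(trace, contracts)
```

In this sketch the third contract fails because the trace never exposes a `score` key, so two of the three contracts pass. That is a toy analogue of the state-schema drift the abstract names as a dominant failure mode: the scene elements can all be present while the state the contracts expect is not exposed under the expected name.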
Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners
The paper argues that current embodied planning benchmarks prioritize superficia…
Physically Viable World Models: A Case for Query-Conditioned Embodied AI
The paper argues that for embodied AI to be safe and effective, world models mus…
Scaling Agentic Capabilities via Grounded Interaction Synthesis
The paper introduces Grounded Agentic Interaction Synthesis (GAIS), a framework…
SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
The paper introduces SMH-Bench, a comprehensive benchmark built on a simulator t…
LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models
The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-sc…
BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs
The paper introduces BilliardPhys-Bench, a new benchmark that demonstrates that…
CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation
The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that t…
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
The paper introduces Cookie-Bench, a novel, autonomous, and reference-free evalu…