Lumos-Nexus is a training-efficient framework that enhances video generation quality by progressively bridging generation from a lightweight model to a high-fidelity generator in a shared latent space, without sacrificing reasoning capabilities.
Connector-based unified video models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, which limits achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that develops strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block, so that it learns to accept reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB), which progressively hands off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
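The abstract does not spell out how UPFB performs the handoff, so the following is only an illustrative sketch of one way a frequency-based handoff in a shared latent space could look, not the authors' method. It assumes that low spatial frequencies (coarse structure) are taken from the lightweight generator's latent and the remaining frequencies from the high-capacity generator's latent, with a cutoff that shrinks over the inference steps. The names `low_pass_mask`, `frequency_blend`, and `cutoff_schedule`, the linear schedule, and the latent shape are all hypothetical.

```python
import numpy as np

def low_pass_mask(shape, cutoff):
    """Boolean mask selecting spatial frequencies with radius < cutoff.

    Frequencies are in cycles per sample, so the largest possible radius
    is sqrt(0.5**2 + 0.5**2), roughly 0.707.
    """
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2) < cutoff

def frequency_blend(light_latent, heavy_latent, cutoff):
    """Assumed blending rule: low frequencies from the lightweight latent,
    everything else from the high-capacity latent."""
    mask = low_pass_mask(light_latent.shape[-2:], cutoff)
    light_f = np.fft.fft2(light_latent, axes=(-2, -1))
    heavy_f = np.fft.fft2(heavy_latent, axes=(-2, -1))
    mixed = np.where(mask, light_f, heavy_f)
    # The mask is symmetric in frequency, so the result is real up to
    # floating-point error; drop the negligible imaginary part.
    return np.fft.ifft2(mixed, axes=(-2, -1)).real

def cutoff_schedule(num_steps, start=0.75, end=0.0):
    """Hypothetical linear schedule: start above the maximum radius
    (all lightweight) and end at 0 (all high-capacity)."""
    return np.linspace(start, end, num_steps)

# Stand-in latents of shape (frames, channels, height, width).
rng = np.random.default_rng(0)
light = rng.standard_normal((4, 8, 16, 16))
heavy = rng.standard_normal((4, 8, 16, 16))

# One blended latent per inference step, moving from light to heavy.
blended_per_step = [frequency_blend(light, heavy, c) for c in cutoff_schedule(5)]
```

In this sketch the first step reproduces the lightweight latent and the last step reproduces the high-capacity latent; a real pipeline would interleave such blending with the denoising steps of each generator, which is omitted here.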
Turning Video Models into Generalist Robot Policies
This paper explores decoupling video generative models from embodiment-specific…
Linear Scaling Video VLMs for Long Video Understanding
This paper introduces StateKV, an inference-time method that adapts pretrained v…
Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Chal…
This paper addresses the challenge of understanding long-form egocentric videos…
Towards Consistent Video Geometry Estimation
ViGeo is a feed-forward foundation model designed to estimate spatially dense an…
Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Archon is a unified multimodal model designed for holistic digital human generat…
Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation
This research addresses the memory and error accumulation issues in autoregressi…
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
minWM is a full-stack open-source framework designed to transform existing bidir…
EarlyTom: Early Token Compression Completes Fast Video Understanding
Video-LLMs are hindered by inefficiency in processing visual tokens. EarlyTom pr…