World Models
Research papers, repositories, and articles about world models
Kling-Omni Technical Report
Kling-Omni is a unified system for generating and editing high-end video from text, images, and video context. Treat it as a reference design for next-gen multimodal world simulators.
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
Trains LLM agents to "imagine" future states and score plans, rather than only react step by step. The authors use a three-stage pipeline that injects forecasting ability, formats it, then hardens it with reinforcement learning. If you build agents that plan, this is a concrete recipe for giving them an internal world model. ([arxiv.org](https://arxiv.org/list/cs.AI/new))
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
minWM turns heavy video diffusion models into fast, interactive "video world" simulators. It provides a full pipeline from data to few-step generators that run close to real time. If you care about agents in simulated worlds, this is an end-to-end recipe you can actually clone and run.
Reinforcement World Model Learning for LLM-based Agents
The authors train a world model that predicts how text-based environments change when an agent acts, using rewards that close the gap between imagined and real next states. Agents using this learned model perform better on ALFWorld and τ² Bench than agents trained only on task success signals.
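As a rough illustration of a reward that shrinks the gap between imagined and real next states, here is a minimal sketch. The function name and the word-overlap (Jaccard) similarity are assumptions made for this example only; the paper's actual reward is not specified in this summary and may differ substantially.

```python
def imagined_state_reward(imagined: str, actual: str) -> float:
    """Hypothetical reward: Jaccard overlap of the word sets of the
    imagined and the observed next-state descriptions, in [0, 1].

    This is a stand-in similarity, not the reward used in the paper.
    """
    imagined_words = set(imagined.lower().split())
    actual_words = set(actual.lower().split())
    if not imagined_words and not actual_words:
        # Two empty descriptions are treated as a perfect match.
        return 1.0
    shared = imagined_words & actual_words
    combined = imagined_words | actual_words
    return len(shared) / len(combined)


# A prediction that matches the environment exactly scores highest;
# a prediction that gets one word wrong scores lower.
exact = imagined_state_reward("the drawer is open", "the drawer is open")
partial = imagined_state_reward("the drawer is open", "the drawer is closed")
```

A learned or embedding-based similarity would likely be a better fit for free-form text states; word overlap is used here only because it is easy to inspect.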
Reinforcement World Model Learning for LLM-based Agents
RWML trains agents to imagine next states and then align them with the states the environment actually produces, instead of just predicting the next token. That shift gives stronger gains on text-based environments than rewarding final task success alone.
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
WorldCanvas lets you script rich video scenes using text, object trajectories, and reference images instead of frame-by-frame editing. Use this as a template for building controllable world simulators and advanced video tools.
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.
Hallucination in World Models is Predictable and Preventable
Builds a large benchmark with ground-truth simulators to show where visual world models drift from reality. Identifies three failure modes and three simple signals that reliably flag them. If you deploy action-driven world models, you can use these signals as runtime tripwires. ([huggingface.co](https://huggingface.co/papers/2606.27326))
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
Hugging Face highlights WorldCanvas for its controllable video generation via text, paths, and reference images. Builders of simulation-style UIs should copy the interface ideas here.
OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation
OmniMem adds an external memory system to long-video generators so they can re-use past details instead of re-encoding full histories. It adapts which frames to keep, letting models generate longer, more coherent clips under fixed compute. If you’re chasing hour-scale video worlds, this is a template for managing context.
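To make the idea of keeping only some frames under a fixed budget concrete, here is a minimal sketch. The `FrameMemory` class, the relevance scores, and the lowest-score eviction rule are all assumptions for this example; they are not OmniMem's actual retrieval or adaptation mechanism.

```python
import heapq


class FrameMemory:
    """Hypothetical fixed-capacity memory that keeps the highest-scoring frames."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Min-heap of (score, frame_id): the weakest kept frame sits at index 0.
        self._heap: list[tuple[float, int]] = []

    def add(self, frame_id: int, score: float) -> None:
        """Offer a frame; it is kept only if it beats the weakest kept frame."""
        item = (score, frame_id)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            # Evict the current weakest frame and store the new one.
            heapq.heapreplace(self._heap, item)

    def kept_frames(self) -> list[int]:
        """Return the ids of the kept frames in temporal (id) order."""
        return sorted(frame_id for _, frame_id in self._heap)


# Offer four frames to a memory that can hold two.
memory = FrameMemory(capacity=2)
for frame_id, score in [(0, 0.1), (1, 0.9), (2, 0.5), (3, 0.2)]:
    memory.add(frame_id, score)
```

The point of the sketch is that memory cost stays constant no matter how long the video grows; how the scores are computed is where a real system would do its work.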
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Presents DrivePI, a 4D (3D + time) multimodal large model for autonomous driving that unifies perception, prediction, and planning. Instead of separate stacks, DrivePI treats driving as a holistic spatial-temporal understanding problem, ingesting sensor data and outputting both scene interpretations and future trajectories. It’s another sign that end-to-end or semi-end-to-end ‘driving MLLMs’ are becoming a serious research direction.