ArXiv Paper

OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

Lin Zhao, Yushu Wu, Yifan Gong +2

June 1, 2026

Summary

OmniMem adds an external memory system to long-video generators so they can reuse past details instead of re-encoding the full history. It adaptively selects which frames to keep, letting models generate longer, more coherent clips under a fixed compute budget. If you’re chasing hour-scale video worlds, this is a template for managing context.

Topics

video world-models running


Related Content

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

The authors build SpatialClaw, a code-driven agent that uses a stateful Python kernel plus vision tools to solve 3D and 4D spatial puzzles. It beats prior spatial agents across 20 benchmarks and six vision-language backbones, showing that the action interface design can unlock much stronger spatial reasoning.

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

ENPIRE wraps real robots in a closed-loop system where coding agents iteratively reset scenes, run policies, check results, and improve code. If you’re serious about autonomous robot labs, this is basically a blueprint.

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

This work builds thousands of synthetic "computers" with realistic files and calendars to simulate month-long knowledge work for AI agents. Each run spans 8+ hours and ~2,000 steps, yielding dense signals for training long-horizon productivity agents. If you are designing office copilots or agent training curricula, copy this setup to cheaply generate rich experience data. ([arxiv.org](https://arxiv.org/abs/2604.28181))

Kling-Omni Technical Report

Kling-Omni is a unified system for generating and editing high-end video from text, images, and video context. Treat it as a reference design for next-gen multimodal world simulators.