LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Summary
Presents LongVie 2, a world-model-style generator for ultra-long videos with explicit control signals. The model can condition on multimodal inputs and maintain temporal coherence over very long horizons, with a public project page for demos. This sits right at the frontier of ‘video world models’ that might eventually underpin simulation-heavy planning and agent training.
Related Content
huggingface/transformers
The standard library for state-of-the-art models in text, vision, audio, and combined formats. If you build with open models, you almost certainly depend on this already.
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
The authors build SpatialClaw, a code-driven agent that uses a stateful Python kernel plus vision tools to solve 3D and 4D spatial puzzles. It beats prior spatial agents across 20 benchmarks and six vision-language backbones, showing that the action interface design can unlock much stronger spatial reasoning.
STEP3-VL-10B Technical Report
STEP3-VL-10B is a 10B-parameter vision–language model that rivals much larger systems by combining unified pretraining with heavy post-training and parallel coordinated reasoning at run time. Use it as a strong open baseline for high-end multimodal tasks without giant hardware.
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
Wraps real robots in a closed-loop system where coding agents iteratively reset scenes, run policies, check results, and improve code. If you’re serious about autonomous robot labs, this is basically a blueprint.