Thinking with Images via Self-Calling Agent
Summary
The paper proposes Self-Calling Chain-of-Thought (sCoT), which reformulates multimodal CoT as a language-only CoT in which a main agent spawns parameter-sharing visual subagents to solve atomic subtasks. This architecture simplifies RL for visual reasoning and reports better HR-Bench 4K performance with ~75% fewer GPU hours than prior multimodal CoT approaches. ([arxiv.org](https://arxiv.org/abs/2512.08511))
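To make the self-calling idea concrete, here is a minimal toy sketch of the control flow. It is not the paper's implementation. It only illustrates one model (one set of weights) acting as both the main agent and its visual subagents, with the main agent's reasoning trace kept as text only. Every name here (`SelfCallingAgent`, `Subtask`, `toy_model`, `crop`) is hypothetical, the "image" is a nested list of integers, and the subtask plan is passed in by hand, whereas in sCoT the main agent would generate it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom), hypothetical format
Image = List[List[int]]           # toy grayscale image as nested lists
Model = Callable[[str, Optional[Image]], str]


@dataclass
class Subtask:
    instruction: str  # atomic question for the subagent
    box: Box          # image region the subagent is allowed to see


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image inside box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


class SelfCallingAgent:
    """Main agent and subagents share one model callable (shared parameters)."""

    def __init__(self, model: Model) -> None:
        self.model = model

    def run_subagent(self, image: Image, task: Subtask) -> str:
        # A subagent is the same model called with a fresh context:
        # it sees only the instruction and the cropped region.
        return self.model(task.instruction, crop(image, task.box))

    def solve(self, question: str, image: Image, plan: List[Subtask]):
        # The main agent's trace is plain text; subagent answers come back as strings.
        trace = [f"Question: {question}"]
        for task in plan:
            answer = self.run_subagent(image, task)
            trace.append(f"Subtask: {task.instruction} -> {answer}")
        # Final step is a language-only call: no image is passed.
        final = self.model("\n".join(trace), None)
        return final, trace


def toy_model(prompt: str, image: Optional[Image]) -> str:
    """Stand-in for a vision-language model (assumption, for illustration only)."""
    if image is not None:
        # "Visual" call: report the brightest pixel in the crop.
        return str(max(max(row) for row in image))
    # Language-only call: aggregate the numbers reported by the subagents.
    values = [int(line.rsplit("-> ", 1)[1])
              for line in prompt.splitlines() if "-> " in line]
    return str(max(values))
```

Passing a 4x4 image and two subtasks covering its left and right halves to `SelfCallingAgent(toy_model).solve(...)` produces a three-line text trace and a final answer derived from the subagent outputs alone, which is the property the summary describes as a language-only CoT.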
Related Content
huggingface/transformers
The standard library for state-of-the-art models in text, vision, audio, and multimodal settings. If you build with open models, you almost certainly depend on this already.
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
The authors build SpatialClaw, a code-driven agent that uses a stateful Python kernel plus vision tools to solve 3D and 4D spatial puzzles. It beats prior spatial agents across 20 benchmarks and six vision-language backbones, showing that the action interface design can unlock much stronger spatial reasoning.
STEP3-VL-10B Technical Report
STEP3-VL-10B is a 10B-parameter vision–language model that rivals much larger systems by combining unified pretraining with heavy post-training and parallel coordinated reasoning at run time. Use it as a strong open baseline for high-end multimodal tasks without giant hardware.
ENPIRE: Agentic Robot Policy Self-Improvement in the Real World
Wraps real robots in a closed-loop system where coding agents iteratively reset scenes, run policies, check results, and improve code. If you’re serious about autonomous robot labs, this is basically a blueprint.