Thinking with Images via Self-Calling Agent
Proposes Self-Calling Chain-of-Thought (sCoT), which reformulates multimodal CoT as a language-only CoT in which a main agent spawns parameter-sharing visual subagents to solve atomic subtasks. This architecture simplifies reinforcement learning (RL) for visual reasoning and improves HR-Bench 4K performance while using ~75% fewer GPU hours than prior multimodal CoT approaches. ([arxiv.org](https://arxiv.org/abs/2512.08511))
Wenxi Yang, Yuzhong Zhao
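
The self-calling pattern summarized above can be illustrated with a minimal sketch. This is not the paper's implementation: `toy_model`, `plan_halves`, `Subtask`, `crop`, and `self_calling_answer` are hypothetical names, the "image" is a small grid of integers, and the RL training is omitted entirely. The sketch only shows the control flow, assuming that the main agent reasons over text alone and delegates each atomic visual subtask to the same model (shared parameters) applied to an image crop.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom), hypothetical convention
Image = List[List[int]]           # toy stand-in for pixel data
Model = Callable[[str, Optional[Image]], str]


@dataclass
class Subtask:
    question: str   # atomic question for a subagent
    box: Box        # region of the image the subagent sees


def crop(image: Image, box: Box) -> Image:
    """Return the sub-grid covered by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def self_calling_answer(
    model: Model,
    image: Image,
    question: str,
    plan: Callable[[str], List[Subtask]],
) -> str:
    """Main agent: delegate visual subtasks, then reason over text only."""
    notes = []
    for task in plan(question):
        # Subagent call: the SAME model object, so parameters are shared.
        notes.append(model(task.question, crop(image, task.box)))
    # The main agent receives only the subagents' textual answers, no image.
    return model(question + " | notes: " + "; ".join(notes), None)


def toy_model(prompt: str, image: Optional[Image]) -> str:
    """Toy stand-in for a vision-language model (assumption, not a real API)."""
    if image is None:
        return prompt                               # text-only step: echo the prompt
    return str(sum(sum(row) for row in image))      # visual step: count marked cells


def plan_halves(question: str) -> List[Subtask]:
    """Hypothetical planner: split a 4x4 image into left and right halves."""
    return [
        Subtask("count marks in left half", (0, 0, 2, 4)),
        Subtask("count marks in right half", (2, 0, 4, 4)),
    ]


image = [
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
result = self_calling_answer(toy_model, image, "How many marks?", plan_halves)
print(result)
```

In this sketch the reasoning trace stays purely textual: image regions reach the model only inside the subagent calls, which is the property the summary credits with simplifying RL.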