Vision Language

Research papers, repositories, and articles on vision-language topics


Thinking with Images via Self-Calling Agent

Proposes Self-Calling Chain-of-Thought (sCoT), which reformulates multimodal chain-of-thought (CoT) as a language-only CoT: a main agent spawns parameter-sharing visual subagents, each of which solves one atomic subtask. This architecture simplifies reinforcement learning (RL) for visual reasoning and improves HR-Bench 4K performance while using ~75% fewer GPU hours than prior multimodal CoT approaches. ([arxiv.org](https://arxiv.org/abs/2512.08511))

Wenxi Yang, Yuzhong Zhao
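
The self-calling pattern summarized above can be sketched roughly as follows. This is only an illustrative toy, not the paper's implementation. The names `Subtask`, `crop`, `plan`, and `self_calling_answer` are hypothetical, the image is a plain nested list, and the planner is passed in as a separate function, whereas in the paper the main agent itself decides which subtasks to spawn. Passing the full image to the final call is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types for illustration only; none of these come from the paper's code.
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) crop region
Image = List[List[int]]          # toy image: rows of pixel values


@dataclass
class Subtask:
    question: str  # atomic question handed to a subagent
    region: Box    # part of the image the subagent should inspect


# A "model" is any callable mapping (image, prompt) -> text answer.
Model = Callable[[Image, str], str]


def crop(image: Image, region: Box) -> Image:
    """Return the sub-image covered by `region`."""
    left, top, right, bottom = region
    return [row[left:right] for row in image[top:bottom]]


def self_calling_answer(
    model: Model,
    image: Image,
    question: str,
    plan: Callable[[str], List[Subtask]],
) -> str:
    """Main agent: split the question, call the same model per subtask, aggregate as text."""
    notes = []
    for sub in plan(question):
        # The subagent shares parameters with the main agent: it is the same `model`.
        answer = model(crop(image, sub.region), sub.question)
        notes.append(f"{sub.question} -> {answer}")
    # The reasoning chain itself stays text-only: subagent results are plain strings.
    prompt = question + "\nSubagent findings:\n" + "\n".join(notes)
    return model(image, prompt)
```

Because every subagent is the same callable as the main agent, a single set of weights serves both roles, which is the parameter-sharing property the summary describes.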