arXiv Paper

MMhops-R1: Multimodal Multi-hop Reasoning

Tao Zhang, Ziqi Zhang, Zongyang Ma +7

December 16, 2025

Summary

Proposes MMhops-R1, a benchmark and model for multi-hop reasoning over visual and textual inputs. Tasks require chaining several intermediate inferences across images and text to reach a final answer, going beyond single-hop VQA. As LLMs improve at basic multimodal QA, the remaining reasoning gaps show up in multi-hop, chain-of-thought setups like these, which makes a dedicated resource valuable.

Topics

multimodal reasoning benchmarks


Related Content

huggingface/transformers

The standard library for state-of-the-art models in text, vision, audio, and combined formats. If you build with open models, you almost certainly depend on this already.

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

The authors build SpatialClaw, a code-driven agent that uses a stateful Python kernel plus vision tools to solve 3D and 4D spatial puzzles. It beats prior spatial agents across 20 benchmarks and six vision-language backbones, showing that the action interface design can unlock much stronger spatial reasoning.

STEP3-VL-10B Technical Report

STEP3-VL-10B is a 10B-parameter vision–language model that rivals much larger systems by combining unified pretraining with heavy post-training and parallel coordinated reasoning at run time. Use it as a strong open baseline for high-end multimodal tasks without giant hardware.

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Wraps real robots in a closed-loop system where coding agents iteratively reset scenes, run policies, check results, and improve code. If you’re serious about autonomous robot labs, this is basically a blueprint.