Benchmarks

Research papers, repositories, and articles about benchmarks

Showing 4 of 4 items

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

FACTS is a multi-part leaderboard that evaluates LLM factuality across image-based QA, closed-book QA, search-augmented QA, and document-grounded long-form responses, using automated judge models. It’s designed as a long-lived suite with public and private splits, giving a single factuality score while still exposing failure modes across modalities and tool-use settings. ([huggingface.co](https://huggingface.co/papers/2512.10791))

Aileen Cheng, Alon Jacovi

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

FACTS is positioned as a one-stop leaderboard for LLM factuality, aggregating automated-judge scores from multimodal, parametric, search-augmented, and document-grounded tasks. It’s a natural next target for model releases that want to claim they’re less hallucinatory in practice, not just on isolated QA datasets. ([huggingface.co](https://huggingface.co/papers/2512.10791))

Aileen Cheng, Alon Jacovi

MMhops-R1: Multimodal Multi-hop Reasoning

Proposes MMhops-R1, a benchmark and model for multi-hop reasoning across visual and textual inputs. Tasks require chaining several intermediate inferences—over images and text—to reach a final answer, going beyond simple single-hop VQA. As LLMs get better at basic multimodal QA, these kinds of chain-of-thought, multi-hop setups are where reasoning gaps now show up, so having a dedicated resource here is valuable.

Tao Zhang, Ziqi Zhang

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Defines V‑REX, a benchmark where models must answer chains of interdependent questions about images, designed to probe exploratory reasoning instead of one-shot recognition. Each question builds on the previous ones, encouraging models to form and refine internal hypotheses about a scene. It’s a nice stress test for multimodal models that claim to ‘reason’ rather than just match patterns.

Chenrui Fan, Yijun Liang