Benchmarks

Research papers, repositories, and articles about benchmarks

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

The InternVL 2.5 work pushes an open multimodal model to match or beat top proprietary systems on tough benchmarks. It digs into how model size, data curation, and smart test-time tricks together move the performance frontier.

Zhe Chen, Weiyun Wang

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

EvoArena builds a three-domain benchmark where agents must keep working as terminals, codebases, and user preferences change over time. The companion EvoMem memory system logs non-additive updates as patches, giving measurable gains on both step-level and chain-level success in evolving tasks.

Jundong Xu, Qingchuan Li
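
The difference between step-level and chain-level success mentioned in the EvoArena entry can be illustrated with a minimal sketch. The definitions here are assumptions for illustration, not taken from the paper: step-level success is the fraction of individual steps that pass, and chain-level success is the fraction of task chains in which every step passes. The function names and the sample data are hypothetical.

```python
from typing import List


def step_level_success(chains: List[List[bool]]) -> float:
    """Fraction of all individual steps that succeeded (assumed definition)."""
    steps = [ok for chain in chains for ok in chain]
    if not steps:
        raise ValueError("need at least one step")
    return sum(steps) / len(steps)


def chain_level_success(chains: List[List[bool]]) -> float:
    """Fraction of chains in which every step succeeded (assumed definition)."""
    if not chains or any(len(chain) == 0 for chain in chains):
        raise ValueError("need at least one chain, and no empty chains")
    return sum(all(chain) for chain in chains) / len(chains)


# Hypothetical outcomes: three task chains, one boolean per step.
example_chains = [
    [True, True, True],
    [True, False, True],
    [True, True],
]

print(step_level_success(example_chains))
print(chain_level_success(example_chains))
```

Under these assumed definitions a single failed step lowers the step-level number only slightly but zeroes out its whole chain, which is why the two metrics can diverge on long, evolving tasks.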

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS-Bench stress-tests data-analysis agents over thousands of turns built from real Kaggle notebooks. Even top models collapse as sessions grow, with steep drops in late-turn accuracy. If you ship analytic agents, benchmark on LongDS-Bench or something like it, not just short chat tasks.

Kewei Xu, Xiaoben Lu

ARC Prize 2025: Technical Report

This report reviews the 2025 ARC-AGI-2 competition and the best-performing program-synthesis and agentic approaches. It argues that refinement loops and contamination issues now dominate progress on abstract reasoning benchmarks.

François Chollet, Mike Knoop

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Shows that single-number leaderboards for agent benchmarks often fail to predict how agents behave in new settings. If you run evals, you should copy their predictive-validity approach, not just chase top scores.

Dhaval C. Patel, Kaoutar El Maghraoui
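
One simple way to check whether a leaderboard score predicts behavior in a new setting is a rank correlation between the two. The sketch below is an illustration only, not the paper's method: it computes Spearman's rank correlation in plain Python, and the function names and scores are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four agents: public leaderboard vs. a held-out setting.
leaderboard = [0.82, 0.78, 0.75, 0.60]
held_out = [0.40, 0.55, 0.52, 0.30]

print(spearman(leaderboard, held_out))
```

A correlation near 1 would suggest the leaderboard ordering carries over to the new setting; a low or negative value suggests the headline score says little about it.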

NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark

AgentPerf, which NVIDIA presents as the first benchmark for agentic AI infrastructure, shows NVIDIA’s Blackwell platform running many more agents per megawatt than older GPUs. It frames agent performance as a matter of energy efficiency and density, not just raw tokens per second.

NVIDIA Blog

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.

Fangzhi Xu, Hang Yan

AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems

AstroReason-Bench tests agents on realistic satellite scheduling and space-mission planning rather than toy puzzles. Current agentic LLM systems lag far behind hand-built solvers, giving a sharp reality check for "generalist" planning claims.

Weiyi Wang, Xinchi Chen

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

FACTS is a multi-part leaderboard that evaluates LLM factuality across image-based QA, closed-book (parametric) QA, search-augmented QA, and document-grounded long-form responses, using automated judge models. It is designed as a long-lived suite with public and private splits, giving a single factuality score while still exposing failure modes across modalities and tool-use settings. That makes it a natural target for model releases that want to claim fewer hallucinations in practice, not just on isolated QA datasets. ([huggingface.co](https://huggingface.co/papers/2512.10791))

Aileen Cheng, Alon Jacovi

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Builds a benchmark where tasks and environments keep changing, and evaluation checks whether an agent actually executed real workflows. Uses logs and structured assessments, not just final answers. If you are deploying agents into production operations, this is much closer to what you actually care about. ([huggingface.co](https://huggingface.co/papers/2604.28139))

Chenxin Li, Zhengyang Tang

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Argues most terminal-agent benchmarks are written like prompts, so they test instruction clarity, not real capability. Provides a playbook for building adversarial, hard, and readable tasks, plus a catalog of common reward-hacking failure modes. If you rely on agent benchmark scores, sanity-check your tasks against this checklist before bragging. ([papers.cool](https://papers.cool/arxiv/2604.28093))

Ivan Bercovich

ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

ThaiSafetyBench compiles nearly two thousand Thai prompts, many grounded in local culture, to probe model safety. The authors also release a classifier that matches GPT-4.1’s judgments, giving the community a reusable Thai safety watchdog.

Trapoom Ukarapol, Nut Chukamphaeng

MMhops-R1: Multimodal Multi-hop Reasoning

Proposes MMhops-R1, a benchmark and model for multi-hop reasoning across visual and textual inputs. Tasks require chaining several intermediate inferences over images and text to reach a final answer, going beyond single-hop VQA. As LLMs improve at basic multimodal QA, multi-hop setups like these are where reasoning gaps now show up, so a dedicated resource is valuable.

Tao Zhang, Ziqi Zhang

MTEB Leaderboard: From a slow demo to feature-rich leaderboard

HuggingFace’s team rebuilt the MTEB embedding leaderboard to be much faster and more navigable. You can now slice models by task, filter aggressively, and actually pick the right embedding model instead of chasing a single score.

HuggingFace Blog

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

The authors introduce a benchmark where multimodal models must judge mobile app UX directly from full UI screenshots. They also propose a baseline model that reasons over layout, text and visual cues, highlighting how current systems miss many usability issues humans spot instantly.

Ruichao Mao, Zhou Fang

tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation

tasksource standardizes how hundreds of NLP datasets map inputs and labels into a common schema. That makes it much easier to train and test multi-task models without hand-writing fragile preprocessing code for each dataset.

Damien Sileo

huggingface/OpenEnv

OpenEnv is an interface library for training and evaluating reinforcement-learning-style agents across many environments. It targets post-training, giving a cleaner way to plug modern models into classic RL-style tasks.

2,210

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Defines V-REX, a benchmark where models must answer chains of interdependent questions about images, designed to probe exploratory reasoning instead of one-shot recognition. Each question builds on the previous ones, encouraging models to form and refine internal hypotheses about a scene. It’s a nice stress test for multimodal models that claim to ‘reason’ rather than just match patterns.

Chenrui Fan, Yijun Liang