Benchmarks
Research papers, repositories, and articles about benchmarks
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
The InternVL 2.5 work pushes an open multimodal model to match or beat top proprietary systems on tough benchmarks. It digs into how model size, data curation, and smart test-time tricks together move the performance frontier.
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
EvoArena builds a three-domain benchmark where agents must keep working as terminals, codebases, and user preferences change over time. The companion EvoMem memory system logs non-additive updates as patches, giving measurable gains on both step-level and chain-level success in evolving tasks.
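To make "non-additive updates as patches" concrete, here is a minimal sketch of a memory that records each change as an old-to-new patch instead of appending a new note. It is an illustration only, not EvoMem's actual design: the `Patch` and `PatchMemory` names and their fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Patch:
    """One recorded change to a remembered fact (hypothetical structure)."""
    key: str
    old: Optional[str]  # previous value, None if the key was new
    new: Optional[str]  # new value, None if the fact was retracted
    step: int           # step at which the change was observed


@dataclass
class PatchMemory:
    """Keeps the current facts plus the history of patches that produced them."""
    facts: Dict[str, str] = field(default_factory=dict)
    history: List[Patch] = field(default_factory=list)

    def update(self, key: str, value: Optional[str], step: int) -> None:
        old = self.facts.get(key)
        if old == value:
            return  # nothing changed, so no patch is logged
        self.history.append(Patch(key, old, value, step))
        if value is None:
            self.facts.pop(key, None)  # retraction removes the fact
        else:
            self.facts[key] = value    # overwrite rather than append


mem = PatchMemory()
mem.update("python_version", "3.10", step=1)
mem.update("python_version", "3.12", step=5)  # environment changed
print(mem.facts)
print(mem.history[-1])
```

The point of the patch log is that an agent can see both the current state and what it replaced, which matters when an old assumption has silently become wrong.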
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
LongDS stress-tests data-analysis agents over thousands of turns built from real Kaggle notebooks. Even top models collapse as sessions grow, with large drops in late-turn accuracy. If you ship analytic agents, benchmark on LongDS or something like it, not just short chat tasks.
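If you want to check your own agent logs for the same late-turn decay, one simple approach is to bucket per-turn correctness by turn index. This is a generic sketch, not the LongDS scoring code; the `(turn_index, is_correct)` input format and the function name are assumptions.

```python
from collections import defaultdict


def accuracy_by_turn_bucket(results, bucket_size):
    """Group (turn_index, is_correct) pairs into fixed-size turn buckets.

    Returns {bucket_number: accuracy}, where bucket 0 covers turns
    0..bucket_size-1, bucket 1 covers the next bucket_size turns, and so on.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for turn, ok in results:
        bucket = turn // bucket_size
        totals[bucket] += 1
        hits[bucket] += int(ok)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Toy log: correct early in the session, wrong later on.
log = [(0, True), (1, True), (2, True), (3, False), (4, False), (5, False)]
print(accuracy_by_turn_bucket(log, bucket_size=3))
```

A single session-level average would hide this pattern; comparing the first and last buckets exposes it.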
ARC Prize 2025: Technical Report
This report reviews the 2025 ARC-AGI-2 competition and the best-performing program-synthesis and agentic approaches. It argues that refinement loops and contamination issues now dominate progress on abstract reasoning benchmarks.
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
Shows that single-number leaderboards for agent benchmarks often fail to predict how agents behave in new settings. If you run evals, you should copy their predictive-validity approach, not just chase top scores.
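The blurb does not describe the paper's procedure, so the following is only one common way to ask the same question: does the ranking on a benchmark predict the ranking in a new setting? The sketch computes a Spearman rank correlation in plain Python; the agent scores are made-up illustration values.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four agents: leaderboard vs. a held-out setting.
leaderboard = [0.81, 0.74, 0.69, 0.55]
held_out = [0.40, 0.52, 0.38, 0.30]
print(spearman(leaderboard, held_out))
```

A value near 1 means the leaderboard ordering carries over; a value near 0 means the top score tells you little about behavior elsewhere.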
NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark
AgentPerf, billed as the first infrastructure benchmark for agentic AI workloads, shows NVIDIA’s Blackwell platform running many more agents per megawatt than older GPUs. It frames agent performance as an energy and density game, not just raw tokens per second.
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
OdysseyArena tests agents on environments where they must discover hidden rules over hundreds of steps, not just follow given instructions. Even top models struggle, showing that long-run discovery and strategy remain weak points.
AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
AstroReason-Bench tests agents on realistic satellite scheduling and space-mission planning rather than toy puzzles. Current agentic LLM systems lag far behind hand-built solvers, giving a sharp reality check for "generalist" planning claims.
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
FACTS is a multi-part leaderboard that evaluates LLM factuality across image-based QA, closed-book (parametric) QA, search-augmented QA, and document-grounded long-form responses, scored by automated judge models. It is designed as a long-lived suite with public and private splits, giving a single factuality score while still exposing failure modes across modalities and tool-use settings. That makes it a natural target for model releases that want to claim fewer hallucinations in practice, not just on isolated QA datasets. ([huggingface.co](https://huggingface.co/papers/2512.10791))
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Builds a benchmark where tasks and environments keep changing, and evaluation checks whether an agent actually executed real workflows. Uses logs and structured assessments, not just final answers. If you are deploying agents into production operations, this is much closer to what you actually care about. ([huggingface.co](https://huggingface.co/papers/2604.28139))
What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
Argues most terminal-agent benchmarks are written like prompts, so they test instruction clarity, not real capability. Provides a playbook for building adversarial, hard, and readable tasks, plus a catalog of common reward-hacking failure modes. If you rely on agent benchmark scores, sanity-check your tasks against this checklist before bragging. ([papers.cool](https://papers.cool/arxiv/2604.28093))
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
ThaiSafetyBench compiles nearly two thousand Thai prompts, many grounded in local culture, to probe model safety. The authors also release a classifier that matches GPT-4.1’s judgments, giving the community a reusable Thai safety watchdog.
MMhops-R1: Multimodal Multi-hop Reasoning
Proposes MMhops-R1, a benchmark and model for multi-hop reasoning across visual and textual inputs. Tasks require chaining several intermediate inferences over images and text to reach a final answer, going beyond simple single-hop VQA. As LLMs get better at basic multimodal QA, these multi-hop setups are where reasoning gaps now show up, so a dedicated resource here is valuable.
MTEB Leaderboard: From a slow demo to feature-rich leaderboard
Hugging Face’s team rebuilt the MTEB embedding leaderboard to be much faster and more navigable. You can now slice models by task, filter aggressively, and actually pick the right embedding model instead of chasing a single score.
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach
The authors introduce a benchmark where multimodal models must judge mobile app UX directly from full UI screenshots. They also propose a baseline model that reasons over layout, text and visual cues, highlighting how current systems miss many usability issues humans spot instantly.
tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation
tasksource standardizes how hundreds of NLP datasets map inputs and labels into a common schema. That makes it much easier to train and test multi-task models without hand-writing fragile preprocessing code for each dataset.
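The idea of declaring per-dataset column mappings instead of writing per-dataset preprocessing code can be sketched in a few lines. This does not use or reproduce the tasksource API: the dataset names, column names, and the target field names `sentence1`, `sentence2`, and `labels` are all hypothetical.

```python
def to_common_schema(row, mapping):
    """Rename the columns of one example according to a {target: source} mapping.

    Columns not mentioned in the mapping are dropped.
    """
    return {target: row[source] for target, source in mapping.items()}


# Hypothetical annotations: one small declarative mapping per dataset.
MAPPINGS = {
    "hypothetical_nli": {
        "sentence1": "premise",
        "sentence2": "hypothesis",
        "labels": "label",
    },
    "hypothetical_sentiment": {
        "sentence1": "text",
        "labels": "sentiment",
    },
}

nli_row = {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": 0, "id": 17}
print(to_common_schema(nli_row, MAPPINGS["hypothetical_nli"]))
```

Adding a new dataset then means adding one mapping entry, while the training and evaluation loop keeps reading the same field names.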
huggingface/OpenEnv
OpenEnv is an interface library for training and evaluating reinforcement-learning-style agents across many environments. It targets post-training, giving a cleaner way to plug modern models into classic RL-style tasks.
V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
Defines V-REX, a benchmark where models must answer chains of interdependent questions about images, designed to probe exploratory reasoning instead of one-shot recognition. Each question builds on the previous ones, encouraging models to form and refine internal hypotheses about a scene. It’s a nice stress test for multimodal models that claim to ‘reason’ rather than just match patterns.