Evaluation
Research papers, repositories, and articles about evaluation
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
This report compares seven frontier language and vision models across many safety tests, from basic benchmarks to adversarial red-teaming. It finds GPT-5.2 clearly safest overall while others trade off safety across languages, modalities, and threat models.
Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments
Introduces GauntletBench, a web-based testbed with video editors, workflow tools, 3D apps, and more, focused on tough perception and reasoning tasks. Even the best agents hit only ~19% success while non-expert humans clear 80%+. If you think your agent is "human level," try it here. ([huggingface.co](https://huggingface.co/papers/2606.14397))
LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs
The authors separate two questions: can a model spit out training data, and how often does it actually do that in normal use. They build a framework that measures both worst-case extractability and everyday leakage. If you handle sensitive data, this is a blueprint for stress-testing your models instead of trusting vague privacy claims.
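The gap between the two questions can be seen in a toy calculation. The sketch below is not the paper's framework: the `Probe` record and both function names are hypothetical, and it assumes each prompt has already been labeled as adversarial or ordinary and as leaking or not.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    adversarial: bool  # True for extraction-style attack prompts, False for ordinary usage
    leaked: bool       # True if the response reproduced training data

def extractability(probes):
    """Worst case: did any adversarial probe extract training data?"""
    return any(p.leaked for p in probes if p.adversarial)

def propensity(probes):
    """Everyday leakage: fraction of ordinary prompts that leaked."""
    ordinary = [p for p in probes if not p.adversarial]
    if not ordinary:
        return 0.0
    return sum(p.leaked for p in ordinary) / len(ordinary)

# Made-up labels for illustration only.
probes = [
    Probe(adversarial=True, leaked=True),
    Probe(adversarial=True, leaked=False),
    Probe(adversarial=False, leaked=False),
    Probe(adversarial=False, leaked=False),
    Probe(adversarial=False, leaked=True),
    Probe(adversarial=False, leaked=False),
]
print(extractability(probes), propensity(probes))
```

A model can score high on the first number and low on the second, which is why reporting only one of them can mislead.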
In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks
Directly compares workflow graphs managed by external orchestrators to a single prompt that spells out the whole procedure. For travel, tech support, and claims flows, one big prompt beats complex agent tooling on quality and failure rates. If your product is more orchestration code than prompt, this paper says simplify before you scale. ([arxiv.org](https://arxiv.org/abs/2604.27891))
Over-Searching in Search-Augmented Large Language Models
This work shows that search-augmented models often call tools even when search hurts answers or wastes tokens. It introduces a cost-aware metric and mitigation tricks, so teams can dial back needless retrieval instead of just adding more context.
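As an illustration of what a cost-aware view of retrieval might look like, the sketch below computes two simple numbers from evaluation logs. Both functions and all record fields are hypothetical and are not taken from the paper; they only show the general shape of charging accuracy against token spend.

```python
def over_search_rate(records):
    """Share of search calls made on questions the model answers correctly without search."""
    searched = [r for r in records if r["searched"]]
    if not searched:
        return 0.0
    return sum(r["correct_without_search"] for r in searched) / len(searched)

def tokens_per_correct(records):
    """Total tokens spent divided by the number of correct answers (lower is better)."""
    total_tokens = sum(r["tokens"] for r in records)
    n_correct = sum(r["correct"] for r in records)
    return float("inf") if n_correct == 0 else total_tokens / n_correct

# Made-up log entries for illustration only.
records = [
    {"searched": True,  "correct_without_search": True,  "correct": True,  "tokens": 900},
    {"searched": True,  "correct_without_search": False, "correct": True,  "tokens": 1100},
    {"searched": False, "correct_without_search": True,  "correct": True,  "tokens": 200},
    {"searched": True,  "correct_without_search": True,  "correct": False, "tokens": 1000},
]
print(over_search_rate(records), tokens_per_correct(records))
```

The last record is the case the blurb warns about: the model searched, spent tokens, and ended up wrong on a question it could have answered unaided.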
Async Control: Stress-testing Asynchronous Control Measures for LLM Agents
Introduces a red-team/blue-team framework to evaluate how well asynchronous monitoring can catch sabotage attempts by LLM-based software agents that edit real codebases. The authors systematically stress-test different monitoring strategies, modeling the interaction as an adversarial game between attacking agents and defensive monitors. This matters because asynchronous oversight is far cheaper than real-time gating, but until now its effectiveness against misaligned coding agents has been poorly understood.
GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents
Builds a carefully matched benchmark where GUI agents and command-line agents solve identical desktop tasks under the same checks. Finds GUI agents fail on long, brittle interactions, while CLI agents are limited by missing skills, not raw intelligence. If you design computer-use stacks, this tells you where to invest next. ([huggingface.co](https://huggingface.co/papers/2606.24551))
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
Shows how to test agents so scores actually predict field performance, not just benchmark bragging rights. If you own an eval suite, you should copy this framework.
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
Shows that single-number leaderboards for agent benchmarks often fail to predict how agents behave in new settings. If you run evals, you should copy their predictive-validity approach, not just chase top scores.
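One simple way to check whether a benchmark predicts behavior elsewhere is to rank-correlate benchmark scores with scores from a held-out or field setting. The sketch below uses Spearman correlation as a stand-in; it is not the paper's procedure, and the score lists are made-up numbers.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores for four agents: leaderboard vs. a new deployment setting.
benchmark = [0.62, 0.58, 0.71, 0.40]
field = [0.30, 0.35, 0.50, 0.20]
print(spearman(benchmark, field))
```

A correlation near 1 means the leaderboard ordering carries over to the new setting; a low or negative value means the top score is not a reliable guide there.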
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Lays out a five-level roadmap for visual generation, from basic image mapping up to interactive world modeling for agents. Argues the next race is about structure, memory, and causality, not prettier pictures. If you work on vision models, benchmark against these levels, not just FID-style metrics. ([huggingface.co](https://huggingface.co/papers/2604.28185))
CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
CAR-bench builds an in-car assistant world with messy, ambiguous user requests and many tools. It measures not just if agents finish tasks, but whether they know when they’re out of their depth.
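Consistency under repeated trials is often scored by contrasting "solved at least once" with "solved every time". The sketch below shows that contrast; whether CAR-bench defines its metrics exactly this way is an assumption, and the task names and outcomes are invented.

```python
def solved_at_least_once(trials_by_task):
    """Fraction of tasks solved in at least one of the repeated trials."""
    return sum(any(t) for t in trials_by_task.values()) / len(trials_by_task)

def solved_every_time(trials_by_task):
    """Fraction of tasks solved in all repeated trials (a consistency score)."""
    return sum(all(t) for t in trials_by_task.values()) / len(trials_by_task)

# Hypothetical outcomes: three trials per task, True means the task succeeded.
trials_by_task = {
    "set_navigation": [True, True, True],
    "adjust_climate": [True, False, True],
    "ambiguous_request": [False, False, False],
    "unsupported_feature": [True, True, False],
}
print(solved_at_least_once(trials_by_task), solved_every_time(trials_by_task))
```

A large gap between the two numbers indicates an agent that can do the task but does not do it reliably.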
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
AuditDM trains an "auditor" model that hunts for cases where strong vision-language models disagree. Teams can reuse these hard examples to patch weaknesses without manual labeling.
Adaptation of Agentic AI
This large-scale study tracks how agent-like AI systems adapt over time and across tasks. If you're betting on agents, it gives structure and warnings for long-term deployment.
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Introduces NL2Repo-Bench, a benchmark where coding agents must generate or modify entire repositories from natural language specifications, rather than solving single-file LeetCode-style tasks. It evaluates long-horizon planning, tool use, and consistency across files and modules. This is a big step toward evaluating code agents in settings that look like real software projects instead of toy problems.
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
FACTS is a multi-part leaderboard that evaluates LLM factuality across image-based QA, closed-book QA, search-augmented QA, and document-grounded long-form responses, using automated judge models. It’s designed as a long-lived suite with public and private splits, giving a single factuality score while still exposing failure modes across modalities and tool-use settings. ([huggingface.co](https://huggingface.co/papers/2512.10791))
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
FACTS is positioned as a one-stop leaderboard for LLM factuality, aggregating automated-judge scores from multimodal, parametric, search-augmented, and document-grounded tasks. It’s a natural next target for model releases that want to claim they’re less hallucinatory in practice, not just on isolated QA datasets. ([huggingface.co](https://huggingface.co/papers/2512.10791))
Evaluating Gemini Robotics Policies in a Veo World Simulator
Google DeepMind uses a frontier video model (Veo) as a generative world model to evaluate robot manipulation policies across nominal, out-of-distribution, and safety-critical settings. They show Veo-based simulation can predict real-world policy rankings and failure modes via 1600+ physical trials, enabling scalable red-teaming and OOD robustness checks without exhaustive hardware experiments. ([huggingface.co](https://huggingface.co/papers/2512.10675))
DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference
DialDefer shows LLM judges treat identical claims differently depending on whether a human or "anonymous text" said them. The authors propose metrics and mitigation for this hidden bias toward agreeing with people.
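A bias of this kind can be quantified by judging each claim twice, once under each attribution, and comparing the verdicts. The sketch below is an illustration only: `deference_gap` and `flip_rate` are hypothetical names, not the paper's metrics, and the verdict pairs are made up.

```python
def agreement_rate(verdicts):
    """Fraction of verdicts in which the judge agreed with the claim."""
    return sum(verdicts) / len(verdicts)

def deference_gap(pairs):
    """Agreement rate under human attribution minus agreement rate under anonymous attribution."""
    human = [h for h, _ in pairs]
    anonymous = [a for _, a in pairs]
    return agreement_rate(human) - agreement_rate(anonymous)

def flip_rate(pairs):
    """Fraction of identical claims whose verdict changes with the attribution."""
    return sum(h != a for h, a in pairs) / len(pairs)

# Each pair: (judge agrees when a human said it, judge agrees when anonymous text said it).
pairs = [(True, True), (True, False), (True, False), (False, False)]
print(deference_gap(pairs), flip_rate(pairs))
```

An unbiased judge would show a gap and a flip rate near zero, since the claims are identical across the two framings.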
Evaluating Gemini Robotics Policies in a Veo World Simulator
Uses a fine-tuned Veo video model as a generative world simulator for robot policy evaluation, covering in-distribution tasks, OOD generalization axes, and physical/semantic safety tests. The key takeaway is that high-fidelity video models can stand in for many expensive real-world trials while still predicting policy rankings and vulnerabilities reliably. ([huggingface.co](https://huggingface.co/papers/2512.10675))
What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
Argues most terminal-agent benchmarks are written like prompts, so they test instruction clarity, not real capability. Provides a playbook for building adversarial, hard, and readable tasks, plus a catalog of common reward-hacking failure modes. If you rely on agent benchmark scores, sanity-check your tasks against this checklist before bragging. ([papers.cool](https://papers.cool/arxiv/2604.28093?utm_source=openai))
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Builds a benchmark where tasks and environments keep changing, and evaluation checks whether an agent actually executed real workflows. Uses logs and structured assessments, not just final answers. If you are deploying agents into production operations, this is much closer to what you actually care about. ([huggingface.co](https://huggingface.co/papers/2604.28139))
WildSci: Advancing Scientific Reasoning from In-the-Wild Literature
WildSci builds a large question set from real scientific papers across many fields, then uses reinforcement learning to sharpen models’ scientific reasoning. It moves science QA beyond toy benchmarks and gives labs a more realistic way to stress-test research assistants.
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Hugging Face surfaces AuditDM as a practical recipe for stress-testing multiple models at once. Use it to decide where smaller, cheaper models can safely replace bigger ones.
CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
Simulates a small coffee supply chain where agents run farms, roasters, and retailers over 90 days. Different models show very different communication styles and profit profiles. If you care about economic alignment and multi-agent markets, CoffeeBench is a ready-made sandbox. ([huggingface.co](https://huggingface.co/papers/2606.16613))
Crisis-Bench: Benchmarking Strategic Ambiguity and Reputation Management in Large Language Models
Crisis-Bench drops models into simulated multi-day corporate crises and scores them on stock-price outcomes and public sentiment. It exposes when models act like blunt truth-tellers versus savvy spokespeople, giving companies a way to test PR-style agent behavior before deployment.
Reverse Thinking Enhances Missing Information Detection in Large Language Models
Shows that guiding LLMs through a reverse-thinking framework—reasoning backward from required conditions—substantially improves their ability to detect when problem statements lack necessary information. Experiments on modified GSM8K-style datasets demonstrate large gains over standard CoT and ToT prompting, with theoretical bounds on recall and false positives under simple accuracy assumptions. ([arxiv.org](https://arxiv.org/abs/2512.10273))
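The idea of reasoning backward from required conditions can be pictured with a small symbolic analogy: start at the quantity being asked for, walk back through what it depends on, and flag anything that is neither given nor derivable. This is a toy sketch, not the paper's prompting framework; the `requires` map and the example problem are hypothetical.

```python
def missing_inputs(goal, requires, given):
    """Walk backward from `goal` and return needed quantities that are not given."""
    missing, stack, seen = set(), [goal], set()
    while stack:
        quantity = stack.pop()
        if quantity in seen or quantity in given:
            continue
        seen.add(quantity)
        deps = requires.get(quantity)
        if deps:
            stack.extend(deps)     # derivable: check its own prerequisites
        else:
            missing.add(quantity)  # not given and not derivable from anything
    return missing

# Hypothetical word problem: total cost needs a unit price and a quantity,
# and the quantity is derived from boxes and items per box.
requires = {
    "total_cost": ["unit_price", "quantity"],
    "quantity": ["boxes", "items_per_box"],
}
given = {"boxes", "items_per_box"}
print(missing_inputs("total_cost", requires, given))
```

Here the problem statement never supplies a unit price, so the backward walk reports it as missing instead of attempting an answer.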
Characterizing the Consistency of the Emergent Misalignment Persona
Fine-tunes an aligned model on narrow harmful tasks and studies how that misaligned "persona" behaves across many scenarios. Finds patterns in how self-reports, harmful actions, and domain choices line up or diverge. If you care about frontier safety, mine this for concrete tests instead of relying on vibes about "misalignment". ([arxiv.org](https://arxiv.org/abs/2604.28082))
MTEB Leaderboard: From a slow demo to feature-rich leaderboard
Hugging Face’s team rebuilt the MTEB embedding leaderboard to be much faster and more navigable. You can now slice models by task, filter aggressively, and actually pick the right embedding model instead of chasing a single score.
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data
Introduces MedInsightBench, a benchmark for "analytics agents" that must reason over multimodal medical data (tables, images, and reports) to extract multi-step clinical insights rather than just answer single questions. The tasks force agents to chain together retrieval, interpretation, and aggregation across data sources, closer to what real analytics workflows look like in hospitals. This is important if you care about LLM agents that move beyond toy QA into realistic decision support.
MOA: Multi-Objective Alignment for Role-Playing Agents
MOA is an RL framework that jointly optimizes multiple fine-grained rubrics for role-playing agents, such as persona consistency, domain knowledge, and dialogue quality, using multi-objective alignment and thought-augmented rollouts. An 8B model trained with MOA can match or surpass GPT-4o and Claude on PersonaGym and RoleMRC, suggesting smaller models can be pushed far with better objective design. ([huggingface.co](https://huggingface.co/papers/2512.09756))
Multimodal Large Language Models as Image Classifiers
Tests multimodal large language models used as plain image classifiers. Shows where they beat or trail classic vision networks.