Back to AI Lab
Benchmarking
Research papers, repositories, and articles about benchmarking
Showing 2 of 2 items
LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Defines a benchmark that scores how models actually write answers when given retrieved documents. Helps teams compare RAG setups on answer quality, not just retrieval hit rates.
Koki Itai, Shunichi Hasegawa
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
MiSI-Bench introduces "Microscopic Spatial Intelligence"—the ability to reason about invisible molecular 3D structures—and builds a massive VLM benchmark spanning 163k QA pairs over 4k molecules. Current VLMs lag well behind humans on many tasks, but a tuned 7B model can exceed human performance on some spatial transformations, highlighting both the promise and the need for domain knowledge in scientific AGI.
Zongzhao Li, Xiangzhe Kong