Eureka: Evaluating and understanding large foundation models

V Balachandran, J Chen, N Joshi, B Nushi… - arXiv preprint arXiv…, 2024 - arxiv.org
Rigorous and reproducible evaluation is critical for assessing the state of the art and for
guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due …

Holmes ⌕ A Benchmark to Assess the Linguistic Competence of Language Models

A Waldis, Y Perlitz, L Choshen, Y Hou… - Transactions of the …, 2024 - direct.mit.edu
We introduce Holmes, a new benchmark designed to assess language models' (LMs')
linguistic competence—their unconscious understanding of linguistic phenomena …

GameArena: Evaluating LLM Reasoning through Live Computer Games

L Hu, Q Li, A **e, N Jiang, I Stoica, H **… - arxiv preprint arxiv …, 2024 - arxiv.org
Evaluating the reasoning abilities of large language models (LLMs) is challenging. Existing
benchmarks often depend on static datasets, which are vulnerable to data contamination …