From generation to judgment: Opportunities and challenges of LLM-as-a-Judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv:…, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …

Long-form factuality in large language models

J Wei, C Yang, X Song, Y Lu, N Hu, J Huang… - arXiv preprint arXiv:…, 2024 - arxiv.org
Large language models (LLMs) often generate content that contains factual errors when
responding to fact-seeking prompts on open-ended topics. To benchmark a model's long …

Justice or prejudice? Quantifying biases in LLM-as-a-Judge

J Ye, Y Wang, Y Huang, D Chen, Q Zhang… - arXiv preprint arXiv:…, 2024 - arxiv.org
LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks
and has served as a supervised reward in model training. However, despite its excellence in …

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

J Jung, F Brahman, Y Choi - arXiv preprint arXiv:2407.18370, 2024 - arxiv.org
We present a principled approach to provide LLM-based evaluation with a rigorous
guarantee of human agreement. We first propose that a reliable evaluation method should …

Robin: A Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark

A Roger, P Humane, DZ Kaplan, K Gupta… - arXiv preprint arXiv:…, 2025 - arxiv.org
The proliferation of Vision-Language Models (VLMs) in the past several years calls for
rigorous and comprehensive evaluation methods and benchmarks. This work analyzes …

Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation

M Wysocka, O Wysocki, M Delmas, V Mutel… - Journal of Biomedical …, 2024 - Elsevier
Objective: The paper introduces a framework for the evaluation of the encoding of factual
scientific knowledge, designed to streamline the manual evaluation process typically …