Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

AS Thakur, K Choudhary, VS Ramayapally… - arXiv preprint arXiv …, 2024 - arxiv.org
Offering a promising solution to the scalability challenges associated with human evaluation,
the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large …

Evaluating task-oriented dialogue systems: A systematic review of measures, constructs and their operationalisations

A Braggaar, C Liebrecht, E van Miltenburg… - arXiv preprint arXiv …, 2023 - arxiv.org
This review gives an extensive overview of evaluation methods for task-oriented dialogue
systems, paying special attention to practical applications of dialogue systems, for example …

A Survey on LLM-as-a-Judge

J Gu, X Jiang, Z Shi, H Tan, X Zhai, C Xu, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …

What makes a good story and how can we measure it? A comprehensive survey of story evaluation

D Yang, Q Jin - arXiv preprint arXiv:2408.14622, 2024 - arxiv.org
With the development of artificial intelligence, particularly the success of Large Language
Models (LLMs), the quantity and quality of automatically generated stories have significantly …

DHP Benchmark: Are LLMs Good NLG Evaluators?

Y Wang, J Yuan, YN Chuang, Z Wang, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language
Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain …

Improving context-aware preference modeling for language models

S Pitis, Z Xiao, NL Roux, A Sordoni - arXiv preprint arXiv:2407.14916, 2024 - arxiv.org
While finetuning language models from pairwise preferences has proven remarkably
effective, the underspecified nature of natural language presents critical challenges. Direct …

RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Q Zhang, Y Wang, T Yu, Y Jiang, C Wu, L Li… - arXiv preprint arXiv …, 2024 - arxiv.org
With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective
alternative to human evaluation for assessing the text generation quality in a wide range of …

Outcome-Refining Process Supervision for Code Generation

Z Yu, W Gu, Y Wang, Z Zeng, J Wang, W Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models have demonstrated remarkable capabilities in code generation, yet
they often struggle with complex programming tasks that require deep algorithmic …

Decision Information Meets Large Language Models: The Future of Explainable Operations Research

Y Zhang, Q Kang, WY Yu, H Gong, X Fu, X Han… - arXiv preprint arXiv …, 2025 - arxiv.org
Operations Research (OR) is vital for decision-making in many industries. While recent OR
methods have seen significant improvements in automation and efficiency through …

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability

X Hu, M Gao, L Lin, Z Yu, X Wan - arXiv preprint arXiv:2502.12052, 2025 - arxiv.org
In NLG meta-evaluation, evaluation metrics are typically assessed based on their
consistency with humans. However, we identify some limitations in traditional NLG meta …