From generation to judgment: Opportunities and challenges of LLM-as-a-Judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
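
The LLM-as-a-judge paradigm the survey covers reduces, at its simplest, to prompting a model to grade another model's output. Below is a minimal sketch of that setup; it is illustrative only, and `call_llm` is a hypothetical stand-in for whatever chat-completion client one actually uses.

```python
import re

# Minimal LLM-as-a-judge scoring sketch (illustrative only; not any single
# paper's exact protocol).
JUDGE_PROMPT = """You are an impartial evaluator.
Rate the response to the instruction on a 1-5 helpfulness scale.
Instruction: {instruction}
Response: {response}
Reply with only the integer rating."""

def call_llm(prompt: str) -> str:
    # Hypothetical helper: plug in an actual LLM client here.
    raise NotImplementedError

def judge(instruction: str, response: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(instruction=instruction,
                                         response=response))
    match = re.search(r"[1-5]", reply)  # parse the first valid rating digit
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())
```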

Prometheus 2: An open source language model specialized in evaluating other language models

S Kim, J Suk, S Longpre, BY Lin, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …

MLLM-as-a-Judge: Assessing multimodal LLM-as-a-Judge with vision-language benchmark

D Chen, R Chen, S Zhang, Y Wang, Y Liu… - … on Machine Learning, 2024 - openreview.net
Multimodal Large Language Models (MLLMs) have gained significant attention recently,
showing remarkable potential in artificial general intelligence. However, assessing the utility …

Aligning with human judgement: The role of pairwise preference in large language model evaluators

Y Liu, H Zhou, Z Guo, E Shareghi, I Vulić… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated promising capabilities as automatic
evaluators in assessing the quality of generated natural language. However, LLMs still …
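
Pairwise preference means the judge compares two responses rather than scoring one in isolation. A common complication is position bias, so evaluations are often run in both orders. The sketch below illustrates that general setup (not the paper's specific method); `call_llm` is again a hypothetical stand-in.

```python
# Pairwise-preference judging sketch: query both orderings and treat
# disagreement as a tie, a simple control for position bias.
PAIR_PROMPT = """Which response better answers the instruction? Reply A or B.
Instruction: {instruction}
Response A: {a}
Response B: {b}"""

def call_llm(prompt: str) -> str:
    # Hypothetical helper: plug in an actual LLM client here.
    raise NotImplementedError

def pairwise_prefer(instruction: str, resp1: str, resp2: str) -> str:
    first = call_llm(PAIR_PROMPT.format(instruction=instruction,
                                        a=resp1, b=resp2)).strip()
    swapped = call_llm(PAIR_PROMPT.format(instruction=instruction,
                                          a=resp2, b=resp1)).strip()
    if first.startswith("A") and swapped.startswith("B"):
        return "resp1"  # preferred in both orderings
    if first.startswith("B") and swapped.startswith("A"):
        return "resp2"
    return "tie"  # order-dependent verdicts suggest position bias
```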

Evaluating task-oriented dialogue systems: A systematic review of measures, constructs and their operationalisations

A Braggaar, C Liebrecht, E van Miltenburg… - arXiv preprint arXiv …, 2023 - arxiv.org
This review gives an extensive overview of evaluation methods for task-oriented dialogue
systems, paying special attention to practical applications of dialogue systems, for example …

DELLA-Merging: Reducing interference in model merging through magnitude-based sampling

PT Deep, R Bhardwaj, S Poria - arXiv preprint arXiv:2406.11617, 2024 - arxiv.org
With the proliferation of domain-specific models, model merging has emerged as a set of
techniques that combine the capabilities of multiple models into one that can multitask …
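
The title's "magnitude-based sampling" refers to keeping each fine-tuned model's delta parameters stochastically, with higher-magnitude entries more likely to survive, before combining the models. The sketch below captures that general idea only loosely; the ranking and rescaling details of the actual DELLA procedure differ, and all function names here are illustrative.

```python
import numpy as np

# Rough sketch of magnitude-based delta sampling before merging (loosely in
# the spirit of DELLA, not the paper's exact scheme). A "delta" is a
# fine-tuned model's weights minus the shared base model's weights.
def sample_delta(delta: np.ndarray, keep_frac: float = 0.4,
                 rng: np.random.Generator | None = None) -> np.ndarray:
    rng = rng if rng is not None else np.random.default_rng(0)
    mags = np.abs(delta)
    # Larger-magnitude entries get proportionally higher keep probabilities.
    probs = np.clip(keep_frac * mags / (mags.mean() + 1e-12), 0.0, 1.0)
    mask = rng.random(delta.shape) < probs
    # Divide survivors by their keep probability so the sampled delta is
    # unbiased in expectation (inverse-probability rescaling).
    return np.where(mask, delta / np.maximum(probs, 1e-12), 0.0)

def merge(base: np.ndarray, deltas: list[np.ndarray]) -> np.ndarray:
    # Average the sparsified deltas and add them back onto the base weights.
    return base + np.mean([sample_delta(d) for d in deltas], axis=0)
```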

Calibrating long-form generations from large language models

Y Huang, Y Liu, R Thirukovalluru, A Cohan… - arXiv preprint arXiv …, 2024 - arxiv.org
To enhance Large Language Models' (LLMs) reliability, calibration is essential: the model's
assessed confidence scores should align with the actual likelihood of its responses being …
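
A standard way to measure the confidence–accuracy alignment described here is Expected Calibration Error (ECE). The sketch below is a generic ECE implementation, not this paper's long-form method; the example arrays are made-up placeholders.

```python
import numpy as np

# Expected Calibration Error sketch: bin responses by the model's stated
# confidence and compare each bin's mean confidence to its accuracy.
def ece(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(conf), 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Last bin is closed on the right so confidence 1.0 is included.
        in_bin = (conf >= lo) & ((conf < hi) if i < n_bins - 1 else (conf <= hi))
        if in_bin.sum() == 0:
            continue
        gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
        err += (in_bin.sum() / total) * gap
    return err

# Ten responses claimed at 0.8 confidence, eight of them correct: stated
# confidence matches observed accuracy, so ECE is ~0.
print(ece(np.full(10, 0.8), np.array([1] * 8 + [0] * 2)))
```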

Learning to refine with fine-grained natural language feedback

M Wadhwa, X Zhao, JJ Li, G Durrett - arXiv preprint arXiv:2407.02397, 2024 - arxiv.org
Recent work has explored the capability of large language models (LLMs) to identify and
correct errors in LLM-generated responses. These refinement approaches frequently …
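
The refinement setting the snippet describes is typically a critique-then-revise loop: one pass elicits error feedback, another revises the draft against it. The sketch below shows that generic loop, not this paper's specific fine-grained method; `call_llm` is a hypothetical stand-in.

```python
# Generic critique-then-revise loop (illustrative of the refinement setting,
# not the paper's method).
def call_llm(prompt: str) -> str:
    # Hypothetical helper: plug in an actual LLM client here.
    raise NotImplementedError

def refine(task: str, draft: str, max_rounds: int = 2) -> str:
    for _ in range(max_rounds):
        feedback = call_llm(
            f"List specific errors in this response to the task.\n"
            f"Task: {task}\nResponse: {draft}\n"
            f"If there are none, reply exactly: NO ERRORS")
        if "NO ERRORS" in feedback:
            break  # nothing left to fix
        draft = call_llm(
            f"Revise the response to fix the listed errors.\n"
            f"Task: {task}\nResponse: {draft}\nErrors: {feedback}")
    return draft
```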

DiaHalu: A dialogue-level hallucination evaluation benchmark for large language models

K Chen, Q Chen, J Zhou, Y He, L He - arXiv preprint arXiv:2403.00896, 2024 - arxiv.org
Although large language models (LLMs) have achieved significant success in recent years, the
hallucination issue remains a challenge, and numerous benchmarks have been proposed to detect the …

DHP Benchmark: Are LLMs Good NLG Evaluators?

Y Wang, J Yuan, YN Chuang, Z Wang, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language
Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain …
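
One common way to quantify how good an LLM is as an NLG evaluator is to rank-correlate its scores with human ratings on the same outputs. The sketch below shows that generic check (it is not the DHP protocol itself), and the score arrays are made-up placeholders.

```python
from scipy.stats import spearmanr

# Rank correlation between LLM-assigned scores and human ratings on the
# same set of generated outputs; values below are illustrative placeholders.
llm_scores = [4, 3, 5, 2, 4, 1]
human_scores = [5, 3, 4, 2, 4, 1]

rho, pval = spearmanr(llm_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```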