From generation to judgment: Opportunities and challenges of LLM-as-a-judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …

Leveraging large language models for NLG evaluation: Advances and challenges

Z Li, X Xu, T Shen, C Xu, JC Gu, Y Lai… - Proceedings of the …, 2024 - aclanthology.org
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …

Meta-rewarding language models: Self-improving alignment with LLM-as-a-meta-judge

T Wu, W Yuan, O Golovneva, J Xu, Y Tian, J Jiao… - arXiv preprint arXiv …, 2024 - rivista.ai
Large Language Models (LLMs) are rapidly surpassing human knowledge in
many domains. While improving these models traditionally relies on costly human data …

Foundational autoraters: Taming large language models for better automatic evaluation

T Vu, K Krishna, S Alzubi, C Tar, M Faruqui… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) advance, it becomes more challenging to reliably
evaluate their output due to the high costs of human evaluation. To make progress towards …

AudioBench: A universal benchmark for audio large language models

B Wang, X Zou, G Lin, S Sun, Z Liu, W Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large
Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among …

Recommendation with generative models

Y Deldjoo, Z He, J McAuley, A Korikov… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …

Cheating automatic LLM benchmarks: Null models achieve high win rates

X Zheng, T Pang, C Du, Q Liu, J Jiang, M Lin - arXiv preprint arXiv …, 2024 - arxiv.org
Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench,
have become popular for evaluating language models due to their cost-effectiveness and …

Self-generated critiques boost reward modeling for language models

Y Yu, Z Chen, A Zhang, L Tan, C Zhu, RY Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward modeling is crucial for aligning large language models (LLMs) with human
preferences, especially in reinforcement learning from human feedback (RLHF). However …

CopyBench: Measuring literal and non-literal reproduction of copyright-protected text in language model generation

T Chen, A Asai, N Mireshghallah, S Min… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating the degree of reproduction of copyright-protected content by language models
(LMs) is of significant interest to the AI and legal communities. Although both literal and non …

Learning to refine with fine-grained natural language feedback

M Wadhwa, X Zhao, JJ Li, G Durrett - arXiv preprint arXiv:2407.02397, 2024 - arxiv.org
Recent work has explored the capability of large language models (LLMs) to identify and
correct errors in LLM-generated responses. These refinement approaches frequently …