From generation to judgment: Opportunities and challenges of LLM-as-a-judge
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
Leveraging large language models for NLG evaluation: Advances and challenges
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …
Meta-rewarding language models: Self-improving alignment with LLM-as-a-meta-judge
Large Language Models (LLMs) are rapidly surpassing human knowledge in
many domains. While improving these models traditionally relies on costly human data …
Foundational autoraters: Taming large language models for better automatic evaluation
As large language models (LLMs) advance, it becomes more challenging to reliably
evaluate their output due to the high costs of human evaluation. To make progress towards …
Audiobench: A universal benchmark for audio large language models
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large
Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among …
Recommendation with generative models
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …
Cheating automatic LLM benchmarks: Null models achieve high win rates
Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench,
have become popular for evaluating language models due to their cost-effectiveness and …
Self-generated critiques boost reward modeling for language models
Reward modeling is crucial for aligning large language models (LLMs) with human
preferences, especially in reinforcement learning from human feedback (RLHF). However …
CopyBench: Measuring literal and non-literal reproduction of copyright-protected text in language model generation
Evaluating the degree of reproduction of copyright-protected content by language models
(LMs) is of significant interest to the AI and legal communities. Although both literal and non …
Learning to refine with fine-grained natural language feedback
Recent work has explored the capability of large language models (LLMs) to identify and
correct errors in LLM-generated responses. These refinement approaches frequently …