Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

AS Thakur, K Choudhary, VS Ramayapally… - arXiv preprint arXiv …, 2024 - arxiv.org
Offering a promising solution to the scalability challenges associated with human evaluation,
the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large …

Evaluating task-oriented dialogue systems: A systematic review of measures, constructs and their operationalisations

A Braggaar, C Liebrecht, E van Miltenburg… - arXiv preprint arXiv …, 2023 - arxiv.org
This review gives an extensive overview of evaluation methods for task-oriented dialogue
systems, paying special attention to practical applications of dialogue systems, for example …

A Survey on LLM-as-a-Judge

J Gu, X Jiang, Z Shi, H Tan, X Zhai, C Xu, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …

What makes a good story and how can we measure it? A comprehensive survey of story evaluation

D Yang, Q Jin - arXiv preprint arXiv:2408.14622, 2024 - arxiv.org
With the development of artificial intelligence, particularly the success of Large Language
Models (LLMs), the quantity and quality of automatically generated stories have significantly …

DHP Benchmark: Are LLMs Good NLG Evaluators?

Y Wang, J Yuan, YN Chuang, Z Wang, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language
Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain …

Improving context-aware preference modeling for language models

S Pitis, Z Xiao, NL Roux, A Sordoni - arXiv preprint arXiv:2407.14916, 2024 - arxiv.org
While finetuning language models from pairwise preferences has proven remarkably
effective, the underspecified nature of natural language presents critical challenges. Direct …

RevisEval: Improving LLM-as-a-Judge via Response-Adapted References

Q Zhang, Y Wang, T Yu, Y Jiang, C Wu, L Li… - arXiv preprint arXiv …, 2024 - arxiv.org
With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective
alternative to human evaluation for assessing the text generation quality in a wide range of …

Outcome-Refining Process Supervision for Code Generation

Z Yu, W Gu, Y Wang, Z Zeng, J Wang, W Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models have demonstrated remarkable capabilities in code generation, yet
they often struggle with complex programming tasks that require deep algorithmic …

Decision Information Meets Large Language Models: The Future of Explainable Operations Research

Y Zhang, Q Kang, WY Yu, H Gong, X Fu, X Han… - arXiv preprint arXiv …, 2025 - arxiv.org
Operations Research (OR) is vital for decision-making in many industries. While recent OR
methods have seen significant improvements in automation and efficiency through …

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability

X Hu, M Gao, L Lin, Z Yu, X Wan - arXiv preprint arXiv:2502.12052, 2025 - arxiv.org
In NLG meta-evaluation, evaluation metrics are typically assessed based on their
consistency with humans. However, we identify some limitations in traditional NLG meta …