Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Offering a promising solution to the scalability challenges associated with human evaluation,
the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large …
Evaluating task-oriented dialogue systems: A systematic review of measures, constructs and their operationalisations
This review gives an extensive overview of evaluation methods for task-oriented dialogue
systems, paying special attention to practical applications of dialogue systems, for example …
A Survey on LLM-as-a-Judge
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
With the development of artificial intelligence, particularly the success of Large Language
Models (LLMs), the quantity and quality of automatically generated stories have significantly …
DHP Benchmark: Are LLMs Good NLG Evaluators?
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language
Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain …
Improving Context-Aware Preference Modeling for Language Models
While finetuning language models from pairwise preferences has proven remarkably
effective, the underspecified nature of natural language presents critical challenges. Direct …
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective
alternative to human evaluation for assessing the text generation quality in a wide range of …
Outcome-Refining Process Supervision for Code Generation
Large Language Models have demonstrated remarkable capabilities in code generation, yet
they often struggle with complex programming tasks that require deep algorithmic …
Decision Information Meets Large Language Models: The Future of Explainable Operations Research
Y Zhang, Q Kang, WY Yu, H Gong, X Fu, X Han… - arXiv preprint arXiv…, 2025 - arxiv.org
Operations Research (OR) is vital for decision-making in many industries. While recent OR
methods have seen significant improvements in automation and efficiency through …
A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
X Hu, M Gao, L Lin, Z Yu, X Wan - arXiv preprint arXiv:2502.12052, 2025 - arxiv.org
In NLG meta-evaluation, evaluation metrics are typically assessed based on their
consistency with humans. However, we identify some limitations in traditional NLG meta …