Aligning with human judgement: The role of pairwise preference in large language model evaluators

Y Liu, H Zhou, Z Guo, E Shareghi, I Vulić… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated promising capabilities as automatic
evaluators in assessing the quality of generated natural language. However, LLMs still …

Hellobench: Evaluating long text generation capabilities of large language models

H Que, F Duan, L He, Y Mou, W Zhou, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities
in various tasks (e.g., long-context understanding), and many benchmarks have been …

Debateqa: Evaluating question answering on debatable knowledge

R Xu, X Qi, Z Qi, W Xu, Z Guo - arXiv preprint arXiv:2408.01419, 2024 - arxiv.org
The rise of large language models (LLMs) has enabled us to seek answers to inherently
debatable questions on LLM chatbots, necessitating a reliable way to evaluate their ability …

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios

X Wu, M Wang, Y Liu, X Shi, H Yan, X Lu, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
As Large Language Models (LLMs) continue to advance in natural language processing
(NLP), their ability to stably follow instructions in long-context inputs has become crucial for …

LCFO: Long context and long form output dataset and benchmarking

MR Costa-jussà, P Andrews, MC Meglioli… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents the Long Context and Form Output (LCFO) benchmark, a novel
evaluation framework for assessing gradual summarization and summary expansion …

FormalAlign: Automated Alignment Evaluation for Autoformalization

J Lu, Y Wan, Y Huang, J Xiong, Z Liu, Z Guo - arXiv preprint arXiv …, 2024 - arxiv.org
Autoformalization aims to convert informal mathematical proofs into machine-verifiable
formats, bridging the gap between natural and formal languages. However, ensuring …

LongRAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall

Z Qi, R Xu, Z Guo, C Wang, H Zhang, W Xu - arXiv preprint arXiv …, 2024 - arxiv.org
Retrieval-augmented generation (RAG) is a promising approach to address the limitations of
fixed knowledge in large language models (LLMs). However, current benchmarks for …

SurveyX: Academic Survey Automation via Large Language Models

X Liang, J Yang, Y Wang, C Tang, Z Zheng… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Models (LLMs) have demonstrated exceptional comprehension
capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for …

A Cognitive Writing Perspective for Constrained Long-Form Text Generation

K Wan, H Mu, R Hao, H Luo, T Gu, X Chen - arXiv preprint arXiv …, 2025 - arxiv.org
Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form
text that adheres to strict requirements in a single pass. This challenge is unsurprising, as …

Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments

I De la Iglesia, I Goenaga, J Ramirez-Romero… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating LLM-generated text has become a key challenge, especially in domain-specific
contexts like the medical field. This work introduces a novel evaluation methodology for LLM …