Benchmark data contamination of large language models: A survey

C Xu, S Guan, D Greene, M Kechadi - arXiv preprint arXiv:2406.04244, 2024 - arxiv.org
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and
Gemini has transformed the field of natural language processing. However, it has also …

An Empirical Analysis of Uncertainty in Large Language Model Evaluations

Q Xie, Q Li, Z Yu, Y Zhang, Y Zhang, L Yang - arXiv preprint arXiv …, 2025 - arxiv.org
As LLM-as-a-Judge emerges as a new paradigm for assessing large language models
(LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM …

Outcome-Refining Process Supervision for Code Generation

Z Yu, W Gu, Y Wang, Z Zeng, J Wang, W Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models have demonstrated remarkable capabilities in code generation, yet
they often struggle with complex programming tasks that require deep algorithmic …

SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text

R Ghosh, T Yao, L Chen, S Hasan, T Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Model (LLM) integrations into applications like Microsoft365 suite and
Google Workspace for creating/processing documents, emails, presentations, etc. has led to …