AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

Benchmark data contamination of large language models: A survey

C Xu, S Guan, D Greene, M Kechadi - arXiv preprint arXiv:2406.04244, 2024 - arxiv.org
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and
Gemini has transformed the field of natural language processing. However, it has also …

Benchmarking benchmark leakage in large language models

R Xu, Z Wang, RZ Fan, P Liu - arXiv preprint arXiv:2404.18824, 2024 - arxiv.org
Amid the expanding use of pre-training data, the phenomenon of benchmark dataset
leakage has become increasingly prominent, exacerbated by opaque training processes …

PromptBench: A unified library for evaluation of large language models

K Zhu, Q Zhao, H Chen, J Wang, X Xie - Journal of Machine Learning …, 2024 - jmlr.org
The evaluation of large language models (LLMs) is crucial to assess their performance and
mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to …

KIEval: A knowledge-grounded interactive evaluation framework for large language models

Z Yu, C Gao, W Yao, Y Wang, W Ye, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Automatic evaluation methods for large language models (LLMs) are hindered by data
contamination, leading to inflated assessments of their effectiveness. Existing strategies …

DARG: Dynamic evaluation of large language models via adaptive reasoning graph

Z Zhang, J Chen, D Yang - Advances in Neural Information …, 2025 - proceedings.neurips.cc
The current paradigm of evaluating Large Language Models (LLMs) through static
benchmarks comes with significant limitations, such as vulnerability to data contamination …

The landscape of emerging AI agent architectures for reasoning, planning, and tool calling: A survey

T Masterman, S Besen, M Sawtell, A Chao - arXiv preprint arXiv …, 2024 - arxiv.org
This survey paper examines the recent advancements in AI agent implementations, with a
focus on their ability to achieve complex goals that require enhanced reasoning, planning …

NPHardEval: Dynamic benchmark on reasoning ability of large language models via complexity classes

L Fan, W Hua, L Li, H Ling, Y Zhang - arXiv preprint arXiv:2312.14890, 2023 - arxiv.org
Complex reasoning ability is one of the most important features of current LLMs, which has
also been leveraged to play an integral role in complex decision-making tasks. Therefore …

GraphInstruct: Empowering large language models with graph understanding and reasoning capability

Z Luo, X Song, H Huang, J Lian, C Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating and enhancing the general capabilities of large language models (LLMs) has
been an important research topic. Graph is a common data structure in the real world, and …

Co-occurrence is not factual association in language models

X Zhang, M Li, J Wu - Advances in Neural Information …, 2025 - proceedings.neurips.cc
Pretrained language models can encode a large amount of knowledge and utilize it for
various reasoning tasks, yet they can still struggle to learn novel factual knowledge …