A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Abstract Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

Benchmark data contamination of large language models: A survey

C Xu, S Guan, D Greene, M Kechadi - arXiv preprint arXiv:2406.04244, 2024 - arxiv.org
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and
Gemini has transformed the field of natural language processing. However, it has also …

Chatbot Arena: An open platform for evaluating LLMs by human preference

WL Chiang, L Zheng, Y Sheng… - … on Machine Learning, 2024 - openreview.net
Large Language Models (LLMs) have unlocked new capabilities and applications; however,
evaluating the alignment with human preferences still poses significant challenges. To …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Don't make your LLM an evaluation benchmark cheater

K Zhou, Y Zhu, Z Chen, W Chen, WX Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence,
attaining remarkable improvement in model capacity. To assess the model performance, a …

LiveCodeBench: Holistic and contamination free evaluation of large language models for code

N Jain, K Han, A Gu, WD Li, F Yan, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) applied to code-related applications have emerged as a
prominent field, attracting significant interest from both academia and industry. However, as …

LLM Dataset Inference: Did you train on my dataset?

P Maini, H Jia, N Papernot… - Advances in Neural …, 2025 - proceedings.neurips.cc
The proliferation of large language models (LLMs) in the real world has come with a rise in
copyright cases against companies for training their models on unlicensed data from the …

Bridging language and items for retrieval and recommendation

Y Hou, J Li, Z He, A Yan, X Chen, J McAuley - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces BLaIR, a series of pretrained sentence embedding models
specialized for recommendation scenarios. BLaIR is trained to learn correlations between …

Task contamination: Language models may not be few-shot anymore

C Li, J Flanigan - Proceedings of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Large language models (LLMs) offer impressive performance in various zero-shot and few-shot
tasks. However, their success in zero-shot or few-shot settings may be affected by task …

Hallucination-free? Assessing the reliability of leading AI legal research tools

V Magesh, F Surani, M Dahl, M Suzgun… - arXiv preprint arXiv …, 2024 - arxiv.org
Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI).
Such tools are designed to assist with a wide range of core legal tasks, from search and …