A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

Qwen2.5-Coder technical report

B Hui, J Yang, Z Cui, J Yang, D Liu, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its
predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5 …

Magpie: Alignment data synthesis from scratch by prompting aligned LLMs with nothing

Z Xu, F Jiang, L Niu, Y Deng, R Poovendran… - arXiv preprint arXiv …, 2024 - arxiv.org
High-quality instruction data is critical for aligning large language models (LLMs). Although
some models, such as Llama-3-Instruct, have open weights, their alignment data remain …

FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI

E Glazer, E Erdil, T Besiroglu, D Chicharro… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging
mathematics problems crafted and vetted by expert mathematicians. The questions cover …

The good, the bad, and the greedy: Evaluation of LLMs should not ignore non-determinism

Y Song, G Wang, S Li, BY Lin - arXiv preprint arXiv:2407.10457, 2024 - arxiv.org
Current evaluations of large language models (LLMs) often overlook non-determinism,
typically focusing on a single output per example. This limits our understanding of LLM …

LLM-as-a-Judge & reward model: What they can and cannot do

G Son, H Ko, H Lee, Y Kim, S Hong - arXiv preprint arXiv:2409.11239, 2024 - arxiv.org
LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice
questions or human annotators for large language model (LLM) evaluation. Their efficacy …

LLM Stability: A detailed analysis with some surprises

B Atil, A Chittams, L Fu, F Ture, L Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
LLM (large language model) practitioners commonly notice that outputs can vary for the
same inputs, but we have been unable to find work that evaluates LLM stability as the main …

Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation

S Singh, A Romanou, C Fourrier, DI Adelani… - arXiv preprint arXiv …, 2024 - arxiv.org
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as
global benchmarks. These biases stem not only from language but also from the cultural …

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

TH Wu, G Biamby, J Quenum, R Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Multimodal Models (LMMs) have made significant strides in visual question-
answering for single images. Recent advancements like long-context LMMs have allowed …

Questionable practices in machine learning

G Leech, JJ Vazquez, N Kupper, M Yagudin… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating modern ML models is hard. The strong incentive for researchers and companies
to report a state-of-the-art result on some metric often leads to questionable research …