A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Abstract Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

Qwen2.5 technical report

A Yang, B Yang, B Zhang, B Hui, B Zheng, B Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce Qwen2.5, a comprehensive series of large language models
(LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has …

DeepSeek-V3 technical report

A Liu, B Feng, B Xue, B Wang, B Wu, C Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B
total parameters with 37B activated for each token. To achieve efficient inference and cost …

LiveBench: A challenging, contamination-free LLM benchmark

C White, S Dooley, M Roberts, A Pal, B Feuer… - arXiv preprint arXiv …, 2024 - arxiv.org
Test set contamination, wherein test data from a benchmark ends up in a newer model's
training set, is a well-documented obstacle for fair LLM evaluation and can quickly render …

MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark

X Yue, T Zheng, Y Ni, Y Wang, K Zhang, S Tong… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline
Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously …

FuseChat: Knowledge fusion of chat models

F Wan, L Zhong, Z Yang, R Chen, X Quan - arXiv preprint arXiv …, 2024 - arxiv.org
While training large language models (LLMs) from scratch can indeed lead to models with
distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in …

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

F Xu, Q Hao, Z Zong, J Wang, Y Zhang, J Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
Language has long been conceived as an essential tool for human reasoning. The
breakthrough of Large Language Models (LLMs) has sparked significant research interest in …

SciAssess: Benchmarking LLM proficiency in scientific literature analysis

H Cai, X Cai, J Chang, S Li, L Yao, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent breakthroughs in Large Language Models (LLMs) have revolutionized scientific
literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency …

Med42-v2: A suite of clinical LLMs

C Christophe, PK Kanithi, T Raha, S Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Med42-v2 introduces a suite of clinical large language models (LLMs) designed to address
the limitations of generic models in healthcare settings. These models are built on Llama3 …