A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations
MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …
The Llama 3 herd of models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …
Qwen2.5 technical report
In this report, we introduce Qwen2.5, a comprehensive series of large language models
(LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has …
DeepSeek-V3 technical report
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B
total parameters with 37B activated for each token. To achieve efficient inference and cost …
LiveBench: A challenging, contamination-free LLM benchmark
Test set contamination, wherein test data from a benchmark ends up in a newer model's
training set, is a well-documented obstacle for fair LLM evaluation and can quickly render …
MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline
Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously …
FuseChat: Knowledge fusion of chat models
While training large language models (LLMs) from scratch can indeed lead to models with
distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in …
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
Language has long been conceived as an essential tool for human reasoning. The
breakthrough of Large Language Models (LLMs) has sparked significant research interest in …
SciAssess: Benchmarking LLM proficiency in scientific literature analysis
Recent breakthroughs in Large Language Models (LLMs) have revolutionized scientific
literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency …
Med42-v2: A suite of clinical LLMs
Med42-v2 introduces a suite of clinical large language models (LLMs) designed to address
the limitations of generic models in healthcare settings. These models are built on Llama3 …