A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Abstract Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

Qwen2.5 technical report

A Yang, B Yang, B Zhang, B Hui, B Zheng, B Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce Qwen2.5, a comprehensive series of large language models
(LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has …

DeepSeek-V3 technical report

A Liu, B Feng, B Xue, B Wang, B Wu, C Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B
total parameters with 37B activated for each token. To achieve efficient inference and cost …

LiveBench: A challenging, contamination-free LLM benchmark

C White, S Dooley, M Roberts, A Pal, B Feuer… - arXiv preprint arXiv …, 2024 - arxiv.org
Test set contamination, wherein test data from a benchmark ends up in a newer model's
training set, is a well-documented obstacle for fair LLM evaluation and can quickly render …

MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark

X Yue, T Zheng, Y Ni, Y Wang, K Zhang, S Tong… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline
Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously …

FuseChat: Knowledge fusion of chat models

F Wan, L Zhong, Z Yang, R Chen, X Quan - arXiv preprint arXiv …, 2024 - arxiv.org
While training large language models (LLMs) from scratch can indeed lead to models with
distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in …

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

F Xu, Q Hao, Z Zong, J Wang, Y Zhang, J Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
Language has long been conceived as an essential tool for human reasoning. The
breakthrough of Large Language Models (LLMs) has sparked significant research interest in …

SciAssess: Benchmarking LLM proficiency in scientific literature analysis

H Cai, X Cai, J Chang, S Li, L Yao, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent breakthroughs in Large Language Models (LLMs) have revolutionized scientific
literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency …

Med42-v2: A suite of clinical LLMs

C Christophe, PK Kanithi, T Raha, S Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Med42-v2 introduces a suite of clinical large language models (LLMs) designed to address
the limitations of generic models in healthcare settings. These models are built on Llama3 …