A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

Qwen2.5-Coder technical report

B Hui, J Yang, Z Cui, J Yang, D Liu, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its
predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5 …

Magpie: Alignment data synthesis from scratch by prompting aligned LLMs with nothing

Z Xu, F Jiang, L Niu, Y Deng, R Poovendran… - arXiv preprint arXiv …, 2024 - arxiv.org
High-quality instruction data is critical for aligning large language models (LLMs). Although
some models, such as Llama-3-Instruct, have open weights, their alignment data remain …

FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI

E Glazer, E Erdil, T Besiroglu, D Chicharro… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging
mathematics problems crafted and vetted by expert mathematicians. The questions cover …

The good, the bad, and the greedy: Evaluation of LLMs should not ignore non-determinism

Y Song, G Wang, S Li, BY Lin - arXiv preprint arXiv:2407.10457, 2024 - arxiv.org
Current evaluations of large language models (LLMs) often overlook non-determinism,
typically focusing on a single output per example. This limits our understanding of LLM …

LLM-as-a-Judge & reward model: What they can and cannot do

G Son, H Ko, H Lee, Y Kim, S Hong - arXiv preprint arXiv:2409.11239, 2024 - arxiv.org
LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice
questions or human annotators for large language model (LLM) evaluation. Their efficacy …

LLM Stability: A detailed analysis with some surprises

B Atil, A Chittams, L Fu, F Ture, L Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
LLM (large language model) practitioners commonly notice that outputs can vary for the
same inputs, but we have been unable to find work that evaluates LLM stability as the main …

Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation

S Singh, A Romanou, C Fourrier, DI Adelani… - arXiv preprint arXiv …, 2024 - arxiv.org
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as
global benchmarks. These biases stem not only from language but also from the cultural …

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

TH Wu, G Biamby, J Quenum, R Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Multimodal Models (LMMs) have made significant strides in visual question-
answering for single images. Recent advancements like long-context LMMs have allowed …

Questionable practices in machine learning

G Leech, JJ Vazquez, N Kupper, M Yagudin… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating modern ML models is hard. The strong incentive for researchers and companies
to report a state-of-the-art result on some metric often leads to questionable research …