A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations
Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However …
Qwen2.5-Coder technical report
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5 …
Magpie: Alignment data synthesis from scratch by prompting aligned LLMs with nothing
High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain …
FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover …
The good, the bad, and the greedy: Evaluation of LLMs should not ignore non-determinism
Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM …
Llm-as-a-judge & reward model: What they can and cannot do
LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice questions or human annotators for large language model (LLM) evaluation. Their efficacy …
LLM Stability: A detailed analysis with some surprises
LLM (large language model) practitioners commonly notice that outputs can vary for the same inputs, but we have been unable to find work that evaluates LLM stability as the main …
Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. These biases stem not only from language but also from the cultural …
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed …
Questionable practices in machine learning
Evaluating modern ML models is hard. The strong incentive for researchers and companies to report a state-of-the-art result on some metric often leads to questionable research …