Aligning with human judgement: The role of pairwise preference in large language model evaluators
Large Language Models (LLMs) have demonstrated promising capabilities as automatic
evaluators in assessing the quality of generated natural language. However, LLMs still …
HelloBench: Evaluating long text generation capabilities of large language models
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities
in various tasks (e.g., long-context understanding), and many benchmarks have been …
DebateQA: Evaluating question answering on debatable knowledge
The rise of large language models (LLMs) has enabled us to seek answers to inherently
debatable questions on LLM chatbots, necessitating a reliable way to evaluate their ability …
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
As Large Language Models (LLMs) continue to advance in natural language processing
(NLP), their ability to stably follow instructions in long-context inputs has become crucial for …
LCFO: Long context and long form output dataset and benchmarking
This paper presents the Long Context and Form Output (LCFO) benchmark, a novel
evaluation framework for assessing gradual summarization and summary expansion …
FormalAlign: Automated Alignment Evaluation for Autoformalization
Autoformalization aims to convert informal mathematical proofs into machine-verifiable
formats, bridging the gap between natural and formal languages. However, ensuring …
LongRAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall
Retrieval-augmented generation (RAG) is a promising approach to address the limitations of
fixed knowledge in large language models (LLMs). However, current benchmarks for …
SurveyX: Academic Survey Automation via Large Language Models
X Liang, J Yang, Y Wang, C Tang, Z Zheng… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Models (LLMs) have demonstrated exceptional comprehension
capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for …
A Cognitive Writing Perspective for Constrained Long-Form Text Generation
Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form
text that adheres to strict requirements in a single pass. This challenge is unsurprising, as …
Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments
I De la Iglesia, I Goenaga, J Ramirez-Romero… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating LLM-generated text has become a key challenge, especially in domain-specific
contexts like the medical field. This work introduces a novel evaluation methodology for LLM …