From generation to judgment: Opportunities and challenges of LLM-as-a-judge
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
Prometheus 2: An open source language model specialized in evaluating other language models
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …
MLLM-as-a-Judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark
Multimodal Large Language Models (MLLMs) have gained significant attention recently,
showing remarkable potential in artificial general intelligence. However, assessing the utility …
Aligning with human judgement: The role of pairwise preference in large language model evaluators
Large Language Models (LLMs) have demonstrated promising capabilities as automatic
evaluators in assessing the quality of generated natural language. However, LLMs still …
Evaluating task-oriented dialogue systems: A systematic review of measures, constructs and their operationalisations
This review gives an extensive overview of evaluation methods for task-oriented dialogue
systems, paying special attention to practical applications of dialogue systems, for example …
DELLA-Merging: Reducing interference in model merging through magnitude-based sampling
With the proliferation of domain-specific models, model merging has emerged as a set of
techniques that combine the capabilities of multiple models into one that can multitask …
Calibrating long-form generations from large language models
To enhance Large Language Models' (LLMs) reliability, calibration is essential: the model's
assessed confidence scores should align with the actual likelihood of its responses being …
Learning to refine with fine-grained natural language feedback
Recent work has explored the capability of large language models (LLMs) to identify and
correct errors in LLM-generated responses. These refinement approaches frequently …
DiaHalu: A dialogue-level hallucination evaluation benchmark for large language models
Although large language models (LLMs) have achieved significant success in recent years, the
hallucination issue remains a challenge, and numerous benchmarks have been proposed to detect the …
DHP Benchmark: Are LLMs Good NLG Evaluators?
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language
Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain …