From generation to judgment: Opportunities and challenges of LLM-as-a-Judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
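
The LLM-as-a-judge paradigm the survey covers reduces, at its simplest, to prompting a model to grade another model's output. Below is a minimal sketch of that setup; it is illustrative only, and `call_llm` is a hypothetical stand-in for whatever chat-completion client one actually uses.

```python
import re

# Minimal LLM-as-a-judge scoring sketch (illustrative only; not any single
# paper's exact protocol).
JUDGE_PROMPT = """You are an impartial evaluator.
Rate the response to the instruction on a 1-5 helpfulness scale.
Instruction: {instruction}
Response: {response}
Reply with only the integer rating."""

def call_llm(prompt: str) -> str:
    # Hypothetical helper: plug in an actual LLM client here.
    raise NotImplementedError

def judge(instruction: str, response: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(instruction=instruction,
                                         response=response))
    match = re.search(r"[1-5]", reply)  # parse the first valid rating digit
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())
```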

Prometheus 2: An open source language model specialized in evaluating other language models

S Kim, J Suk, S Longpre, BY Lin, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …

MLLM-as-a-Judge: Assessing multimodal LLM-as-a-Judge with vision-language benchmark

D Chen, R Chen, S Zhang, Y Wang, Y Liu… - … on Machine Learning, 2024 - openreview.net
Multimodal Large Language Models (MLLMs) have gained significant attention recently,
showing remarkable potential in artificial general intelligence. However, assessing the utility …

Aligning with human judgement: The role of pairwise preference in large language model evaluators

Y Liu, H Zhou, Z Guo, E Shareghi, I Vulić… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated promising capabilities as automatic
evaluators in assessing the quality of generated natural language. However, LLMs still …
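
Pairwise preference means the judge compares two responses rather than scoring one in isolation. A common complication is position bias, so evaluations are often run in both orders. The sketch below illustrates that general setup (not the paper's specific method); `call_llm` is again a hypothetical stand-in.

```python
# Pairwise-preference judging sketch: query both orderings and treat
# disagreement as a tie, a simple control for position bias.
PAIR_PROMPT = """Which response better answers the instruction? Reply A or B.
Instruction: {instruction}
Response A: {a}
Response B: {b}"""

def call_llm(prompt: str) -> str:
    # Hypothetical helper: plug in an actual LLM client here.
    raise NotImplementedError

def pairwise_prefer(instruction: str, resp1: str, resp2: str) -> str:
    first = call_llm(PAIR_PROMPT.format(instruction=instruction,
                                        a=resp1, b=resp2)).strip()
    swapped = call_llm(PAIR_PROMPT.format(instruction=instruction,
                                          a=resp2, b=resp1)).strip()
    if first.startswith("A") and swapped.startswith("B"):
        return "resp1"  # preferred in both orderings
    if first.startswith("B") and swapped.startswith("A"):
        return "resp2"
    return "tie"  # order-dependent verdicts suggest position bias
```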

Evaluating task-oriented dialogue systems: A systematic review of measures, constructs and their operationalisations

A Braggaar, C Liebrecht, E van Miltenburg… - arXiv preprint arXiv …, 2023 - arxiv.org
This review gives an extensive overview of evaluation methods for task-oriented dialogue
systems, paying special attention to practical applications of dialogue systems, for example …

DELLA-Merging: Reducing interference in model merging through magnitude-based sampling

PT Deep, R Bhardwaj, S Poria - arXiv preprint arXiv:2406.11617, 2024 - arxiv.org
With the proliferation of domain-specific models, model merging has emerged as a set of
techniques that combine the capabilities of multiple models into one that can multitask …
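
The title's "magnitude-based sampling" refers to keeping each fine-tuned model's delta parameters stochastically, with higher-magnitude entries more likely to survive, before combining the models. The sketch below captures that general idea only loosely; the ranking and rescaling details of the actual DELLA procedure differ, and all function names here are illustrative.

```python
import numpy as np

# Rough sketch of magnitude-based delta sampling before merging (loosely in
# the spirit of DELLA, not the paper's exact scheme). A "delta" is a
# fine-tuned model's weights minus the shared base model's weights.
def sample_delta(delta: np.ndarray, keep_frac: float = 0.4,
                 rng: np.random.Generator | None = None) -> np.ndarray:
    rng = rng if rng is not None else np.random.default_rng(0)
    mags = np.abs(delta)
    # Larger-magnitude entries get proportionally higher keep probabilities.
    probs = np.clip(keep_frac * mags / (mags.mean() + 1e-12), 0.0, 1.0)
    mask = rng.random(delta.shape) < probs
    # Divide survivors by their keep probability so the sampled delta is
    # unbiased in expectation (inverse-probability rescaling).
    return np.where(mask, delta / np.maximum(probs, 1e-12), 0.0)

def merge(base: np.ndarray, deltas: list[np.ndarray]) -> np.ndarray:
    # Average the sparsified deltas and add them back onto the base weights.
    return base + np.mean([sample_delta(d) for d in deltas], axis=0)
```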

Calibrating long-form generations from large language models

Y Huang, Y Liu, R Thirukovalluru, A Cohan… - arXiv preprint arXiv …, 2024 - arxiv.org
To enhance Large Language Models' (LLMs) reliability, calibration is essential: the model's
assessed confidence scores should align with the actual likelihood of its responses being …
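
A standard way to measure the confidence–accuracy alignment described here is Expected Calibration Error (ECE). The sketch below is a generic ECE implementation, not this paper's long-form method; the example arrays are made-up placeholders.

```python
import numpy as np

# Expected Calibration Error sketch: bin responses by the model's stated
# confidence and compare each bin's mean confidence to its accuracy.
def ece(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(conf), 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Last bin is closed on the right so confidence 1.0 is included.
        in_bin = (conf >= lo) & ((conf < hi) if i < n_bins - 1 else (conf <= hi))
        if in_bin.sum() == 0:
            continue
        gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
        err += (in_bin.sum() / total) * gap
    return err

# Ten responses claimed at 0.8 confidence, eight of them correct: stated
# confidence matches observed accuracy, so ECE is ~0.
print(ece(np.full(10, 0.8), np.array([1] * 8 + [0] * 2)))
```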

Learning to refine with fine-grained natural language feedback

M Wadhwa, X Zhao, JJ Li, G Durrett - arXiv preprint arXiv:2407.02397, 2024 - arxiv.org
Recent work has explored the capability of large language models (LLMs) to identify and
correct errors in LLM-generated responses. These refinement approaches frequently …
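
The refinement setting the snippet describes is typically a critique-then-revise loop: one pass elicits error feedback, another revises the draft against it. The sketch below shows that generic loop, not this paper's specific fine-grained method; `call_llm` is a hypothetical stand-in.

```python
# Generic critique-then-revise loop (illustrative of the refinement setting,
# not the paper's method).
def call_llm(prompt: str) -> str:
    # Hypothetical helper: plug in an actual LLM client here.
    raise NotImplementedError

def refine(task: str, draft: str, max_rounds: int = 2) -> str:
    for _ in range(max_rounds):
        feedback = call_llm(
            f"List specific errors in this response to the task.\n"
            f"Task: {task}\nResponse: {draft}\n"
            f"If there are none, reply exactly: NO ERRORS")
        if "NO ERRORS" in feedback:
            break  # nothing left to fix
        draft = call_llm(
            f"Revise the response to fix the listed errors.\n"
            f"Task: {task}\nResponse: {draft}\nErrors: {feedback}")
    return draft
```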

DiaHalu: A dialogue-level hallucination evaluation benchmark for large language models

K Chen, Q Chen, J Zhou, Y He, L He - arXiv preprint arXiv:2403.00896, 2024 - arxiv.org
Although large language models (LLMs) have achieved significant success in recent years, the
hallucination issue remains a challenge, and numerous benchmarks have been proposed to detect the …

DHP Benchmark: Are LLMs Good NLG Evaluators?

Y Wang, J Yuan, YN Chuang, Z Wang, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language
Generation (NLG) tasks. However, the capabilities of LLMs in scoring NLG quality remain …
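
One common way to quantify how good an LLM is as an NLG evaluator is to rank-correlate its scores with human ratings on the same outputs. The sketch below shows that generic check (it is not the DHP protocol itself), and the score arrays are made-up placeholders.

```python
from scipy.stats import spearmanr

# Rank correlation between LLM-assigned scores and human ratings on the
# same set of generated outputs; values below are illustrative placeholders.
llm_scores = [4, 3, 5, 2, 4, 1]
human_scores = [5, 3, 4, 2, 4, 1]

rho, pval = spearmanr(llm_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```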