Google Scholar

From generation to judgment: Opportunities and challenges of LLM-as-a-judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org

Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …

Cited by 23 - All 3 versions

Long-form factuality in large language models

J Wei, C Yang, X Song, Y Lu, N Hu, J Huang… - arXiv preprint arXiv …, 2024 - arxiv.org

Large language models (LLMs) often generate content that contains factual errors when
responding to fact-seeking prompts on open-ended topics. To benchmark a model's long …

Cited by 47 - All 3 versions

Justice or prejudice? Quantifying biases in LLM-as-a-judge

J Ye, Y Wang, Y Huang, D Chen, Q Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks
and served as supervised rewards in model training. However, despite their excellence in …

Cited by 26 - All 4 versions

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

J Jung, F Brahman, Y Choi - arXiv preprint arXiv:2407.18370, 2024 - arxiv.org

We present a principled approach to provide LLM-based evaluation with a rigorous
guarantee of human agreement. We first propose that a reliable evaluation method should …

Cited by 3 - All 3 versions

Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark

A Roger, P Humane, DZ Kaplan, K Gupta… - arXiv preprint arXiv …, 2025 - arxiv.org

The proliferation of Vision-Language Models (VLMs) in the past several years calls for
rigorous and comprehensive evaluation methods and benchmarks. This work analyzes …

Cited by 1 - All 3 versions

Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation

M Wysocka, O Wysocki, M Delmas, V Mutel… - Journal of Biomedical …, 2024 - Elsevier

Objective: The paper introduces a framework for the evaluation of the encoding of factual
scientific knowledge, designed to streamline the manual evaluation process typically …

Cited by 2 - All 6 versions

Saved to My library

Benchmarking cognitive biases in large language models as evaluators, 2023

From generation to judgment: Opportunities and challenges of LLM-as-a-judge

Long-form factuality in large language models

Justice or prejudice? Quantifying biases in LLM-as-a-judge

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark

Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation