A framework for human evaluation of large language models in healthcare derived from literature review
With generative artificial intelligence (GenAI), particularly large language models (LLMs),
continuing to make inroads in healthcare, assessing LLMs with human evaluations is …
FActScore: Fine-grained atomic evaluation of factual precision in long form text generation
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces …
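At its core the metric is a supported-fraction: decompose each generation into atomic facts, verify each fact against a knowledge source, and average. A minimal sketch of that aggregation, where `extract_atomic_facts` and `is_supported` are hypothetical stand-ins for the paper's LLM-based decomposition and retrieval-backed verification steps:

```python
from typing import Callable, List

def factscore(
    generations: List[str],
    extract_atomic_facts: Callable[[str], List[str]],  # hypothetical: LLM-based fact splitter
    is_supported: Callable[[str], bool],               # hypothetical: checks a fact against a knowledge source
) -> float:
    """Average, over generations, of the fraction of atomic facts
    supported by the knowledge source (the supported-fraction idea)."""
    fractions = []
    for text in generations:
        facts = extract_atomic_facts(text)
        if not facts:
            continue  # nothing to verify in this generation
        supported = sum(1 for fact in facts if is_supported(fact))
        fractions.append(supported / len(facts))
    return sum(fractions) / len(fractions) if fractions else 0.0
```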
Fine-grained human feedback gives better rewards for language model training
Language models (LMs) often exhibit undesirable text generation behaviors,
including generating false, toxic, or irrelevant outputs. Reinforcement learning from human …
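The snippet gestures at the paper's central move: replace a single sequence-level reward with multiple reward signals attached to smaller text spans, one per error type. A toy sketch of that reward shape, assuming hypothetical per-error-type scoring functions rather than the paper's trained reward models:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FineGrainedReward:
    """Weighted sum of span-level rewards, with one scoring function
    per error type (e.g. factuality, relevance, toxicity)."""
    reward_fns: List[Callable[[str], float]]  # hypothetical per-error-type scorers
    weights: List[float]

    def score(self, spans: List[str]) -> float:
        # Reward every span under every head, instead of scoring
        # the whole sequence once at the end.
        return sum(
            w * fn(span)
            for span in spans
            for fn, w in zip(self.reward_fns, self.weights)
        )
```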
FLASK: Fine-grained language model evaluation based on alignment skill sets
Evaluation of Large Language Models (LLMs) is challenging because aligning to human
values requires the composition of multiple skills and the required set of skills varies …
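The decomposition idea lends itself to per-skill reporting: score each response on the skills that instance requires, then aggregate by skill rather than into one overall number. A small sketch, where `rate` is a hypothetical judge returning a rubric score:

```python
from collections import defaultdict
from typing import Callable, Dict, List

def skillwise_scores(
    instances: List[dict],            # each: {"response": str, "skills": List[str]}
    rate: Callable[[str, str], int],  # hypothetical judge: (response, skill) -> rubric score
) -> Dict[str, float]:
    """Mean rubric score per skill, so strengths and weaknesses are
    reported skill by skill rather than as one aggregate number."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for inst in instances:
        for skill in inst["skills"]:
            totals[skill] += rate(inst["response"], skill)
            counts[skill] += 1
    return {skill: totals[skill] / counts[skill] for skill in totals}
```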
Foundational autoraters: Taming large language models for better automatic evaluation
As large language models (LLMs) advance, it becomes more challenging to reliably
evaluate their output due to the high costs of human evaluation. To make progress towards …
EvalLM: Interactive evaluation of large language model prompts on user-defined criteria
By simply composing prompts, developers can prototype novel generative applications with
Large Language Models (LLMs). To refine prototypes into products, however, developers …
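One concrete piece of such a workflow is turning user-defined criteria into an evaluation prompt for an LLM judge. A hedged sketch of that composition step (the prompt wording and pairwise format are illustrative, not EvalLM's actual templates):

```python
from typing import Dict

def build_judge_prompt(
    task_input: str,
    output_a: str,
    output_b: str,
    criteria: Dict[str, str],  # user-defined: criterion name -> description
) -> str:
    """Compose a pairwise-comparison prompt from user-defined criteria."""
    rubric = "\n".join(f"- {name}: {desc}" for name, desc in criteria.items())
    return (
        "Compare the two outputs on each criterion below and state "
        "which output satisfies it better, with a brief justification.\n\n"
        f"Criteria:\n{rubric}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}"
    )
```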
BooookScore: A systematic exploration of book-length summarization in the era of LLMs
Summarizing book-length documents (> 100K tokens) that exceed the context window size
of large language models (LLMs) requires first breaking the input document into smaller …
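One workflow implied here is hierarchical merging: summarize each chunk, then repeatedly summarize groups of summaries until a single document-level summary remains. A minimal sketch, assuming a generic `summarize` callable and character-based chunking in place of token-aware splitting:

```python
from typing import Callable, List

def hierarchical_summary(
    text: str,
    summarize: Callable[[str], str],  # hypothetical: any LLM call mapping text -> shorter text
    chunk_size: int = 8000,           # characters; a real system would split on token counts
    fan_in: int = 4,                  # how many summaries to merge per step
) -> str:
    """Summarize chunks, then repeatedly summarize groups of summaries
    until one document-level summary remains."""
    chunks: List[str] = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    summaries = [summarize(chunk) for chunk in chunks]
    while len(summaries) > 1:
        groups = ["\n\n".join(summaries[i:i + fan_in])
                  for i in range(0, len(summaries), fan_in)]
        summaries = [summarize(group) for group in groups]
    return summaries[0] if summaries else ""
```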
ExpertQA: Expert-curated questions and attributed answers
As language models are adapted by a more sophisticated and diverse set of users, the
importance of guaranteeing that they provide factually correct information supported by …
Summary of a Haystack: A challenge to long-context LLMs and RAG systems
LLMs and RAG systems are now capable of handling millions of input tokens or more.
However, evaluating the output quality of such systems on long-context tasks remains …
ARES: An automated evaluation framework for retrieval-augmented generation systems
Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand
annotations for input queries, passages to retrieve, and responses to generate. We …
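ARES scores RAG systems along context relevance, answer faithfulness, and answer relevance using LM judges (calibrated against a small set of human labels, which this sketch omits). A schematic of the three judgments, with `judge` standing in for an actual classifier or LLM call:

```python
from typing import Callable, Dict, List

def rag_judgments(
    query: str,
    passages: List[str],
    answer: str,
    judge: Callable[[str], bool],  # hypothetical binary classifier or LLM call
) -> Dict[str, float]:
    """Score a single RAG example on the three ARES evaluation axes."""
    context = "\n".join(passages)
    relevant = [judge(f"Query: {query}\nPassage: {p}\nIs the passage relevant to the query?")
                for p in passages]
    return {
        # Fraction of retrieved passages relevant to the query.
        "context_relevance": sum(relevant) / len(relevant) if relevant else 0.0,
        # Is the answer grounded in the retrieved passages?
        "answer_faithfulness": float(judge(
            f"Passages:\n{context}\nAnswer: {answer}\nIs the answer supported by the passages?")),
        # Does the answer actually address the query?
        "answer_relevance": float(judge(
            f"Query: {query}\nAnswer: {answer}\nDoes the answer address the query?")),
    }
```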