A framework for human evaluation of large language models in healthcare derived from literature review

TYC Tam, S Sivarajkumar, S Kapoor, AV Stolyar… - npj Digital …, 2024 - nature.com
With generative artificial intelligence (GenAI), particularly large language models (LLMs),
continuing to make inroads in healthcare, assessing LLMs with human evaluations is …

FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

S Min, K Krishna, X Lyu, M Lewis, W Yih… - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the factuality of long-form text generated by large language models (LMs) is non-
trivial because (1) generations often contain a mixture of supported and unsupported pieces …
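
As a rough illustration of the paper's core metric: FActScore is the fraction of a generation's atomic facts that a knowledge source supports. The sketch below assumes the decomposition and verification steps have already been run; the `verified` list is a hypothetical stand-in for their output, not the paper's code.

```python
# Minimal sketch of FActScore-style precision, assuming the generation has
# already been decomposed into atomic facts and each fact checked against a
# knowledge source (both steps, and the `verified` list, are stand-ins here).

def factscore(verified: list[bool]) -> float:
    """Fraction of atomic facts supported by the knowledge source."""
    return sum(verified) / len(verified) if verified else 0.0

# e.g., 3 of 4 atomic facts supported -> 0.75
print(factscore([True, True, False, True]))
```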

Fine-grained human feedback gives better rewards for language model training

Z Wu, Y Hu, W Shi, N Dziri, A Suhr… - Advances in …, 2023 - proceedings.neurips.cc
Abstract Language models (LMs) often exhibit undesirable text generation behaviors,
including generating false, toxic, or irrelevant outputs. Reinforcement learning from human …
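
A hedged sketch of the fine-grained reward idea the title points at: rather than a single scalar reward per output, assign rewards to individual segments and error categories, then aggregate into a training signal. The category names and weights below are illustrative assumptions, not the paper's setup.

```python
# Sketch of aggregating fine-grained (per-segment, per-error-type) rewards
# into one scalar; the categories and weights are assumptions.

def aggregate_reward(segment_rewards: list[dict[str, float]],
                     weights: dict[str, float]) -> float:
    """Weighted sum of per-segment rewards across error categories."""
    return sum(weights[cat] * score
               for segment in segment_rewards
               for cat, score in segment.items())

rewards = [{"factuality": 1.0, "relevance": 0.5},
           {"factuality": 0.0, "relevance": 1.0}]
print(aggregate_reward(rewards, {"factuality": 0.6, "relevance": 0.4}))
```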

FLASK: Fine-grained language model evaluation based on alignment skill sets

S Ye, D Kim, S Kim, H Hwang, S Kim, Y Jo… - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluation of Large Language Models (LLMs) is challenging because aligning to human
values requires the composition of multiple skills and the required set of skills varies …

Foundational autoraters: Taming large language models for better automatic evaluation

T Vu, K Krishna, S Alzubi, C Tar, M Faruqui… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) advance, it becomes more challenging to reliably
evaluate their output due to the high costs of human evaluation. To make progress towards …

EvalLM: Interactive evaluation of large language model prompts on user-defined criteria

TS Kim, Y Lee, J Shin, YH Kim, J Kim - … of the CHI Conference on Human …, 2024 - dl.acm.org
By simply composing prompts, developers can prototype novel generative applications with
Large Language Models (LLMs). To refine prototypes into products, however, developers …

BooookScore: A systematic exploration of book-length summarization in the era of LLMs

Y Chang, K Lo, T Goyal, M Iyyer - arXiv preprint arXiv:2310.00785, 2023 - arxiv.org
Summarizing book-length documents (> 100K tokens) that exceed the context window size
of large language models (LLMs) requires first breaking the input document into smaller …
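
To make the chunking precondition concrete: the sketch below splits a book-length document into pieces that fit a model's context window, the first step the abstract describes. Whitespace "tokens" and the chunk size are simplifying assumptions; the paper's pipeline (hierarchically or incrementally merging chunk summaries) is more involved.

```python
# Sketch of splitting a book-length document into context-window-sized chunks
# before hierarchical summarization; whitespace tokenization is an assumption.

def chunk_document(text: str, max_tokens: int = 2048) -> list[str]:
    """Greedily pack whitespace-delimited tokens into fixed-size chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```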

ExpertQA: Expert-curated questions and attributed answers

C Malaviya, S Lee, S Chen, E Sieber, M Yatskar… - arXiv preprint arXiv …, 2023 - arxiv.org
As language models are adopted by a more sophisticated and diverse set of users, the
importance of guaranteeing that they provide factually correct information supported by …

Summary of a Haystack: A challenge to long-context LLMs and RAG systems

P Laban, AR Fabbri, C Xiong, CS Wu - arXiv preprint arXiv:2407.01370, 2024 - arxiv.org
LLMs and RAG systems are now capable of handling millions of input tokens or more.
However, evaluating the output quality of such systems on long-context tasks remains …

ARES: An automated evaluation framework for retrieval-augmented generation systems

J Saad-Falcon, O Khattab, C Potts… - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand
annotations for input queries, passages to retrieve, and responses to generate. We …
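
As a rough sketch of the shape of such a framework: ARES scores RAG outputs along context relevance, answer faithfulness, and answer relevance using lightweight LM judges. The `judge` function below is a hypothetical stub standing in for those trained judges, not the paper's implementation.

```python
# Sketch of ARES-style automated RAG scoring; `judge` is a hypothetical stub
# for the lightweight LM judges the framework trains.

from dataclasses import dataclass

@dataclass
class RAGExample:
    query: str
    passage: str
    answer: str

def judge(question: str, text: str) -> float:
    """Placeholder judge returning a score in [0, 1]; swap in a real model."""
    return 0.0

def score(ex: RAGExample) -> dict[str, float]:
    return {
        "context_relevance": judge(ex.query, ex.passage),
        "answer_faithfulness": judge(ex.passage, ex.answer),
        "answer_relevance": judge(ex.query, ex.answer),
    }
```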