A framework for human evaluation of large language models in healthcare derived from literature review
With generative artificial intelligence (GenAI), particularly large language models (LLMs),
continuing to make inroads in healthcare, assessing LLMs with human evaluations is …
FActScore: Fine-grained atomic evaluation of factual precision in long form text generation
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces …
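At its core the metric is a supported-fraction: decompose each generation into atomic facts, verify each fact against a knowledge source, and average. A minimal sketch of that aggregation, where `extract_atomic_facts` and `is_supported` are hypothetical stand-ins for the paper's LLM-based decomposition and retrieval-backed verification steps:

```python
from typing import Callable, List

def factscore(
    generations: List[str],
    extract_atomic_facts: Callable[[str], List[str]],  # hypothetical: LLM-based fact splitter
    is_supported: Callable[[str], bool],               # hypothetical: checks a fact against a knowledge source
) -> float:
    """Average, over generations, of the fraction of atomic facts
    supported by the knowledge source (the supported-fraction idea)."""
    fractions = []
    for text in generations:
        facts = extract_atomic_facts(text)
        if not facts:
            continue  # nothing to verify in this generation
        supported = sum(1 for fact in facts if is_supported(fact))
        fractions.append(supported / len(facts))
    return sum(fractions) / len(fractions) if fractions else 0.0
```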
Fine-grained human feedback gives better rewards for language model training
Language models (LMs) often exhibit undesirable text generation behaviors,
including generating false, toxic, or irrelevant outputs. Reinforcement learning from human …
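The snippet gestures at the paper's central move: replace a single sequence-level reward with multiple reward signals attached to smaller text spans, one per error type. A toy sketch of that reward shape, assuming hypothetical per-error-type scoring functions rather than the paper's trained reward models:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FineGrainedReward:
    """Weighted sum of span-level rewards, with one scoring function
    per error type (e.g. factuality, relevance, toxicity)."""
    reward_fns: List[Callable[[str], float]]  # hypothetical per-error-type scorers
    weights: List[float]

    def score(self, spans: List[str]) -> float:
        # Reward every span under every head, instead of scoring
        # the whole sequence once at the end.
        return sum(
            w * fn(span)
            for span in spans
            for fn, w in zip(self.reward_fns, self.weights)
        )
```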
FLASK: Fine-grained language model evaluation based on alignment skill sets
Evaluation of Large Language Models (LLMs) is challenging because aligning to human
values requires the composition of multiple skills and the required set of skills varies …
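The decomposition idea lends itself to per-skill reporting: score each response on the skills that instance requires, then aggregate by skill rather than into one overall number. A small sketch, where `rate` is a hypothetical judge returning a rubric score:

```python
from collections import defaultdict
from typing import Callable, Dict, List

def skillwise_scores(
    instances: List[dict],            # each: {"response": str, "skills": List[str]}
    rate: Callable[[str, str], int],  # hypothetical judge: (response, skill) -> rubric score
) -> Dict[str, float]:
    """Mean rubric score per skill, so strengths and weaknesses are
    reported skill by skill rather than as one aggregate number."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for inst in instances:
        for skill in inst["skills"]:
            totals[skill] += rate(inst["response"], skill)
            counts[skill] += 1
    return {skill: totals[skill] / counts[skill] for skill in totals}
```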
Foundational autoraters: Taming large language models for better automatic evaluation
As large language models (LLMs) advance, it becomes more challenging to reliably
evaluate their output due to the high costs of human evaluation. To make progress towards …
EvalLM: Interactive evaluation of large language model prompts on user-defined criteria
By simply composing prompts, developers can prototype novel generative applications with
Large Language Models (LLMs). To refine prototypes into products, however, developers …
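One concrete piece of such a workflow is turning user-defined criteria into an evaluation prompt for an LLM judge. A hedged sketch of that composition step (the prompt wording and pairwise format are illustrative, not EvalLM's actual templates):

```python
from typing import Dict

def build_judge_prompt(
    task_input: str,
    output_a: str,
    output_b: str,
    criteria: Dict[str, str],  # user-defined: criterion name -> description
) -> str:
    """Compose a pairwise-comparison prompt from user-defined criteria."""
    rubric = "\n".join(f"- {name}: {desc}" for name, desc in criteria.items())
    return (
        "Compare the two outputs on each criterion below and state "
        "which output satisfies it better, with a brief justification.\n\n"
        f"Criteria:\n{rubric}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}"
    )
```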
BooookScore: A systematic exploration of book-length summarization in the era of LLMs
Summarizing book-length documents (> 100K tokens) that exceed the context window size
of large language models (LLMs) requires first breaking the input document into smaller …
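One workflow implied here is hierarchical merging: summarize each chunk, then repeatedly summarize groups of summaries until a single document-level summary remains. A minimal sketch, assuming a generic `summarize` callable and character-based chunking in place of token-aware splitting:

```python
from typing import Callable, List

def hierarchical_summary(
    text: str,
    summarize: Callable[[str], str],  # hypothetical: any LLM call mapping text -> shorter text
    chunk_size: int = 8000,           # characters; a real system would split on token counts
    fan_in: int = 4,                  # how many summaries to merge per step
) -> str:
    """Summarize chunks, then repeatedly summarize groups of summaries
    until one document-level summary remains."""
    chunks: List[str] = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    summaries = [summarize(chunk) for chunk in chunks]
    while len(summaries) > 1:
        groups = ["\n\n".join(summaries[i:i + fan_in])
                  for i in range(0, len(summaries), fan_in)]
        summaries = [summarize(group) for group in groups]
    return summaries[0] if summaries else ""
```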
ExpertQA: Expert-curated questions and attributed answers
As language models are adapted by a more sophisticated and diverse set of users, the
importance of guaranteeing that they provide factually correct information supported by …
Summary of a Haystack: A challenge to long-context LLMs and RAG systems
LLMs and RAG systems are now capable of handling millions of input tokens or more.
However, evaluating the output quality of such systems on long-context tasks remains …
ARES: An automated evaluation framework for retrieval-augmented generation systems
Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand
annotations for input queries, passages to retrieve, and responses to generate. We …
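ARES scores RAG systems along context relevance, answer faithfulness, and answer relevance using LM judges (calibrated against a small set of human labels, which this sketch omits). A schematic of the three judgments, with `judge` standing in for an actual classifier or LLM call:

```python
from typing import Callable, Dict, List

def rag_judgments(
    query: str,
    passages: List[str],
    answer: str,
    judge: Callable[[str], bool],  # hypothetical binary classifier or LLM call
) -> Dict[str, float]:
    """Score a single RAG example on the three ARES evaluation axes."""
    context = "\n".join(passages)
    relevant = [judge(f"Query: {query}\nPassage: {p}\nIs the passage relevant to the query?")
                for p in passages]
    return {
        # Fraction of retrieved passages relevant to the query.
        "context_relevance": sum(relevant) / len(relevant) if relevant else 0.0,
        # Is the answer grounded in the retrieved passages?
        "answer_faithfulness": float(judge(
            f"Passages:\n{context}\nAnswer: {answer}\nIs the answer supported by the passages?")),
        # Does the answer actually address the query?
        "answer_relevance": float(judge(
            f"Query: {query}\nAnswer: {answer}\nDoes the answer address the query?")),
    }
```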