Foundational autoraters: Taming large language models for better automatic evaluation
As large language models (LLMs) advance, it becomes more challenging to reliably
evaluate their output due to the high costs of human evaluation. To make progress towards …
Summary of a haystack: A challenge to long-context LLMs and RAG systems
LLMs and RAG systems are now capable of handling millions of input tokens or more.
However, evaluating the output quality of such systems on long-context tasks remains …
VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation
Existing metrics for evaluating the factuality of long-form text, such as FACTSCORE (Min et
al., 2023) and SAFE (Wei et al., 2024), decompose an input text into "atomic claims" and …
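The decompose-then-verify pipeline named in this entry can be sketched in a few lines. The snippet below is only an illustrative outline, not the FACTSCORE, SAFE, or VERISCORE implementation; the extract_claims and verify_claim helpers are hypothetical stand-ins (e.g., LLM or retrieval calls supplied by the caller).

```python
# Illustrative sketch of a decompose-then-verify factuality score.
# extract_claims and verify_claim are hypothetical callables, not the papers' code.
from typing import Callable, List

def factuality_score(
    text: str,
    extract_claims: Callable[[str], List[str]],  # e.g., an LLM prompted to list atomic/verifiable claims
    verify_claim: Callable[[str], bool],         # e.g., an LLM or retrieval judge checking support
) -> float:
    """Return the fraction of extracted claims judged supported (1.0 if none are found)."""
    claims = extract_claims(text)
    if not claims:
        return 1.0
    supported = sum(1 for claim in claims if verify_claim(claim))
    return supported / len(claims)
```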
Beyond the chat: Executable and verifiable text-editing with LLMs
Conversational interfaces powered by Large Language Models (LLMs) have recently
become a popular way to obtain feedback during document editing. However, standard chat …
Learning to refine with fine-grained natural language feedback
Recent work has explored the capability of large language models (LLMs) to identify and
correct errors in LLM-generated responses. These refinement approaches frequently …
Molecular facts: Desiderata for decontextualization in LLM fact verification
Automatic factuality verification of large language model (LLM) generations is becoming
more and more widely used to combat hallucinations. A major point of tension in the …
Improving model factuality with fine-grained critique-based evaluator
Factuality evaluation aims to detect factual errors produced by language models (LMs) and
hence guide the development of more factual models. Towards this goal, we train a factuality …
STORYSUMM: Evaluating faithfulness in story summarization
Human evaluation has been the gold standard for checking faithfulness in abstractive
summarization. However, with a challenging source domain like narrative, multiple …
HALU-J: Critique-based hallucination judge
Large language models (LLMs) frequently generate non-factual content, known as
hallucinations. Existing retrieval-augmented-based hallucination detection approaches …
Claim verification in the age of large language models: A survey
The large and ever-increasing amount of data available on the Internet coupled with the
laborious task of manual claim and fact verification has sparked the interest in the …