Foundational autoraters: Taming large language models for better automatic evaluation

T Vu, K Krishna, S Alzubi, C Tar, M Faruqui… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) advance, it becomes more challenging to reliably
evaluate their output due to the high costs of human evaluation. To make progress towards …

Summary of a haystack: A challenge to long-context LLMs and RAG systems

P Laban, AR Fabbri, C Xiong, CS Wu - arXiv preprint arXiv:2407.01370, 2024 - arxiv.org
LLMs and RAG systems are now capable of handling millions of input tokens or more.
However, evaluating the output quality of such systems on long-context tasks remains …

VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation

Y Song, Y Kim, M Iyyer - arXiv preprint arXiv:2406.19276, 2024 - arxiv.org
Existing metrics for evaluating the factuality of long-form text, such as FACTSCORE (Min et
al., 2023) and SAFE (Wei et al., 2024), decompose an input text into "atomic claims" and …
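
(Illustrative aside, not taken from the paper: the "decompose, then verify" recipe shared by FACTSCORE, SAFE, and VERISCORE can be sketched in a few lines of Python. The helpers extract_claims and verify_claim below are hypothetical placeholders for an LLM-based claim extractor and an evidence-backed verifier, not any of these papers' actual implementations.)

from typing import List

def extract_claims(text: str) -> List[str]:
    # Hypothetical: return the verifiable "atomic claims" stated in `text`,
    # e.g. by prompting an LLM to rewrite each sentence as standalone facts.
    raise NotImplementedError("plug in an LLM-based claim extractor")

def verify_claim(claim: str) -> bool:
    # Hypothetical: check a single claim against retrieved evidence
    # (web search, a knowledge base, or a reference document).
    raise NotImplementedError("plug in an evidence-backed verifier")

def factuality_score(text: str) -> float:
    # Score = fraction of extracted atomic claims judged as supported.
    claims = extract_claims(text)
    if not claims:
        return 0.0
    return sum(verify_claim(c) for c in claims) / len(claims)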

Beyond the chat: Executable and verifiable text-editing with LLMs

P Laban, J Vig, M Hearst, C Xiong, CS Wu - Proceedings of the 37th …, 2024 - dl.acm.org
Conversational interfaces powered by Large Language Models (LLMs) have recently
become a popular way to obtain feedback during document editing. However, standard chat …

Learning to refine with fine-grained natural language feedback

M Wadhwa, X Zhao, JJ Li, G Durrett - arXiv preprint arXiv:2407.02397, 2024 - arxiv.org
Recent work has explored the capability of large language models (LLMs) to identify and
correct errors in LLM-generated responses. These refinement approaches frequently …

Molecular facts: Desiderata for decontextualization in LLM fact verification

A Gunjal, G Durrett - arXiv preprint arXiv:2406.20079, 2024 - arxiv.org
Automatic factuality verification of large language model (LLM) generations is becoming
more and more widely used to combat hallucinations. A major point of tension in the …

Improving model factuality with fine-grained critique-based evaluator

Y Xie, W Zhou, P Prakash, D Jin, Y Mao… - arXiv preprint arXiv …, 2024 - arxiv.org
Factuality evaluation aims to detect factual errors produced by language models (LMs) and
hence guide the development of more factual models. Towards this goal, we train a factuality …

Storysumm: Evaluating faithfulness in story summarization

M Subbiah, F Ladhak, A Mishra, G Adams… - arXiv preprint arXiv …, 2024 - arxiv.org
Human evaluation has been the gold standard for checking faithfulness in abstractive
summarization. However, with a challenging source domain like narrative, multiple …

Halu-J: Critique-based hallucination judge

B Wang, S Chern, E Chern, P Liu - arXiv preprint arXiv:2407.12943, 2024 - arxiv.org
Large language models (LLMs) frequently generate non-factual content, known as
hallucinations. Existing retrieval-augmented-based hallucination detection approaches …

Claim verification in the age of large language models: A survey

A Dmonte, R Oruche, M Zampieri, P Calyam… - arXiv preprint arXiv …, 2024 - arxiv.org
The large and ever-increasing amount of data available on the Internet coupled with the
laborious task of manual claim and fact verification has sparked the interest in the …