FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

S Min, K Krishna, X Lyu, M Lewis, W Yih… - arXiv preprint arXiv …, 2023 - arxiv.org
Evaluating the factuality of long-form text generated by large language models (LMs) is non-
trivial because (1) generations often contain a mixture of supported and unsupported pieces …

Evaluating correctness and faithfulness of instruction-following models for question answering

V Adlakha, P BehnamGhader, XH Lu… - Transactions of the …, 2024 - direct.mit.edu
Instruction-following models are attractive alternatives to fine-tuned approaches for question
answering (QA). By simply prepending relevant documents and an instruction to their input …

Large language model alignment: A survey

T Shen, R Jin, Y Huang, C Liu, W Dong, Z Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent years have witnessed remarkable progress made in large language models (LLMs).
Such advancements, while garnering significant attention, have concurrently elicited various …

Interpretable long-form legal question answering with retrieval-augmented large language models

A Louis, G van Dijck, G Spanakis - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Many individuals are likely to face a legal dispute at some point in their lives, but their lack of
understanding of how to navigate these complex issues often renders them vulnerable. The …

ExpertQA: Expert-curated questions and attributed answers

C Malaviya, S Lee, S Chen, E Sieber, M Yatskar… - arXiv preprint arXiv …, 2023 - arxiv.org
As language models are adopted by a more sophisticated and diverse set of users, the
importance of guaranteeing that they provide factually correct information supported by …

PRD: Peer rank and discussion improve large language model based evaluations

R Li, T Patel, X Du - arXiv preprint arXiv:2307.02762, 2023 - arxiv.org
Nowadays, the quality of responses generated by different modern large language models
(LLMs) is hard to evaluate and compare automatically. Recent studies suggest and …

Evaluating very long-term conversational memory of LLM agents

A Maharana, DH Lee, S Tulyakov, M Bansal… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing works on long-term open-domain dialogues focus on evaluating model responses
within contexts spanning no more than five chat sessions. Despite advancements in long …

The responsible foundation model development cheatsheet: A review of tools & resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

CRAG - Comprehensive RAG benchmark

X Yang, K Sun, H Xin, Y Sun, N Bhalla… - Advances in …, 2025 - proceedings.neurips.cc
Retrieval-Augmented Generation (RAG) has recently emerged as a promising
solution to alleviate Large Language Model (LLM)'s deficiency in lack of knowledge. Existing …

Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …