Reading subtext: Evaluating large language models on short story summarization with writers

M Subbiah, S Zhang, LB Chilton… - Transactions of the …, 2024 - direct.mit.edu
Abstract We evaluate recent Large Language Models (LLMs) on the challenging task of
summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled …

Delving into ChatGPT usage in academic writing through excess vocabulary

D Kobak, R González-Márquez, EÁ Horvát… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent large language models (LLMs) can generate and revise text with human-level
performance, and have been widely commercialized in systems like ChatGPT. These …

Multi-modal and multi-agent systems meet rationality: A survey

B Jiang, Y **e, X Wang, WJ Su, CJ Taylor… - ICML 2024 Workshop …, 2024 - openreview.net
Rationality is characterized by logical thinking and decision-making that align with evidence
and logical rules. This quality is essential for effective problem-solving, as it ensures that …

Instructing and prompting large language models for explainable cross-domain recommendations

A Petruzzelli, C Musto, L Laraspata, I Rinaldi… - Proceedings of the 18th …, 2024 - dl.acm.org
In this paper, we present a strategy to provide users with explainable cross-domain
recommendations (CDR) that exploits large language models (LLMs). Generally speaking …

Learning to refine with fine-grained natural language feedback

M Wadhwa, X Zhao, JJ Li, G Durrett - arXiv preprint arXiv:2407.02397, 2024 - arxiv.org
Recent work has explored the capability of large language models (LLMs) to identify and
correct errors in LLM-generated responses. These refinement approaches frequently …

Storysumm: Evaluating faithfulness in story summarization

M Subbiah, F Ladhak, A Mishra, G Adams… - arXiv preprint arXiv …, 2024 - arxiv.org
Human evaluation has been the gold standard for checking faithfulness in abstractive
summarization. However, with a challenging source domain like narrative, multiple …

FABLES: Evaluating faithfulness and content selection in book-length summarization

Y Kim, Y Chang, M Karpinska, A Garimella… - arXiv preprint arXiv …, 2024 - arxiv.org
While long-context large language models (LLMs) can technically summarize book-length
documents (> 100K tokens), the length and complexity of the documents have so far …

Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation

R Shimizu, T Wada, Y Wang, J Kruse, S O'Brien… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent research on explainable recommendation generally frames the task as a standard
text generation problem, and evaluates models simply based on the textual similarity …

Adacad: Adaptively decoding to balance conflicts between contextual and parametric knowledge

H Wang, A Prasad, E Stengel-Eskin… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge conflict arises from discrepancies between information in the context of a large
language model (LLM) and the knowledge stored in its parameters. This can hurt …

Do automatic factuality metrics measure factuality? A critical evaluation

S Ramprasad, BC Wallace - arXiv preprint arXiv:2411.16638, 2024 - arxiv.org
Modern LLMs can now produce highly readable abstractive summaries, to the point where
traditional automated metrics for evaluating summary quality, such as ROUGE, have …