Reading subtext: Evaluating large language models on short story summarization with writers
M Subbiah, S Zhang, LB Chilton… - Transactions of the …, 2024 - direct.mit.edu
Abstract We evaluate recent Large Language Models (LLMs) on the challenging task of
summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled …
Delving into ChatGPT usage in academic writing through excess vocabulary
Recent large language models (LLMs) can generate and revise text with human-level
performance, and have been widely commercialized in systems like ChatGPT. These …
Multi-modal and multi-agent systems meet rationality: A survey
Rationality is characterized by logical thinking and decision-making that align with evidence
and logical rules. This quality is essential for effective problem-solving, as it ensures that …
Instructing and prompting large language models for explainable cross-domain recommendations
In this paper, we present a strategy to provide users with explainable cross-domain
recommendations (CDR) that exploits large language models (LLMs). Generally speaking …
Learning to refine with fine-grained natural language feedback
Recent work has explored the capability of large language models (LLMs) to identify and
correct errors in LLM-generated responses. These refinement approaches frequently …
Storysumm: Evaluating faithfulness in story summarization
Human evaluation has been the gold standard for checking faithfulness in abstractive
summarization. However, with a challenging source domain like narrative, multiple …
FABLES: Evaluating faithfulness and content selection in book-length summarization
While long-context large language models (LLMs) can technically summarize book-length
documents (> 100K tokens), the length and complexity of the documents have so far …
Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation
Recent research on explainable recommendation generally frames the task as a standard
text generation problem, and evaluates models simply based on the textual similarity …
Adacad: Adaptively decoding to balance conflicts between contextual and parametric knowledge
Knowledge conflict arises from discrepancies between information in the context of a large
language model (LLM) and the knowledge stored in its parameters. This can hurt …
Do automatic factuality metrics measure factuality? A critical evaluation
S Ramprasad, BC Wallace - arXiv preprint arXiv:2411.16638, 2024 - arxiv.org
Modern LLMs can now produce highly readable abstractive summaries, to the point where
traditional automated metrics for evaluating summary quality, such as ROUGE, have …