Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

A survey of evaluation metrics used for NLG systems

AB Sai, AK Mohankumar, MM Khapra - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …

Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

K Krishna, Y Song, M Karpinska… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
The rise in malicious usage of large language models, such as fake content creation and
academic plagiarism, has motivated the development of approaches that identify AI …
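The defense described in this abstract amounts to checking a candidate text against a store of the model's own past generations rather than classifying the text in isolation. Below is a minimal sketch of that idea, assuming the `sentence-transformers` package with an illustrative `all-MiniLM-L6-v2` encoder and a hypothetical similarity threshold; it is not the paper's corpus-scale retrieval system.

```python
# A minimal sketch of a retrieval-based detection defense: keep embeddings of
# text the model has generated, and flag a candidate as machine-generated if
# its nearest stored generation is sufficiently similar. Model name and
# threshold are illustrative choices, not the paper's configuration.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# toy stand-in for a log of the model's past generations
generation_log = [
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "Quantum computers exploit superposition to explore many states at once.",
]
log_embs = encoder.encode(generation_log, normalize_embeddings=True)

def is_machine_generated(candidate: str, threshold: float = 0.75) -> bool:
    cand_emb = encoder.encode([candidate], normalize_embeddings=True)[0]
    sims = log_embs @ cand_emb          # cosine similarity (embeddings are normalized)
    return float(sims.max()) >= threshold

# a paraphrase of a logged generation should still be retrieved and flagged
print(is_machine_generated("Paris finished building the Eiffel Tower in 1889 for a World's Fair."))
```

Because retrieval matches against the semantics of a stored generation rather than surface statistics, a paraphrased copy remains close to its source in embedding space, which is the property the paper exploits.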

Confident adaptive language modeling

T Schuster, A Fisch, J Gupta… - Advances in Neural Information Processing Systems, 2022 - proceedings.neurips.cc
Recent advances in Transformer-based large language models (LLMs) have led to
significant performance improvements across many tasks. These gains come with a drastic …
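The efficiency gains referred to here come from per-token early exiting: decoding a token can stop at an intermediate decoder layer once the model is confident enough. The toy sketch below illustrates that control flow under simplifying assumptions (a placeholder list of per-layer hidden states and a shared output head); it is not the paper's T5-based implementation or its calibrated confidence measures.

```python
# Toy illustration of confidence-based early exiting: project the hidden state
# after each decoder layer through the output head, and stop once the top
# probability for the next token exceeds a threshold.
import torch

def early_exit_next_token(hidden_states_per_layer, output_head, threshold=0.9):
    """hidden_states_per_layer: list of (dim,) tensors, one per decoder layer."""
    for layer_idx, h in enumerate(hidden_states_per_layer):
        probs = torch.softmax(output_head(h), dim=-1)
        confidence, token_id = probs.max(dim=-1)
        if confidence >= threshold:                 # confident enough: skip the remaining layers
            return token_id.item(), layer_idx + 1
    return token_id.item(), len(hidden_states_per_layer)   # fall back to the full stack

# toy usage: 12 fake decoder layers, vocabulary of 100
vocab, dim = 100, 64
head = torch.nn.Linear(dim, vocab)
states = [torch.randn(dim) for _ in range(12)]
token, layers_used = early_exit_next_token(states, head)
print(f"emitted token {token} after {layers_used} layers")
```

Easy tokens exit early and cheap, hard tokens use the full depth, which is where the reported speedups come from.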

SimCLS: A simple framework for contrastive learning of abstractive summarization

Y Liu, P Liu - arXiv preprint arXiv:2106.01890, 2021 - arxiv.org
In this paper, we present a conceptually simple yet empirically powerful framework for
abstractive summarization, SimCLS, which can bridge the gap between the learning …
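SimCLS closes that gap by re-ranking beam-search candidates with a scorer trained at the candidate level. The sketch below is a schematic of that re-ranking idea under simplifying assumptions: random tensors stand in for RoBERTa embeddings, candidates are scored by cosine similarity to the source document, and the scorer is trained with a pairwise margin ranking loss over candidates pre-sorted by a reference-based metric such as ROUGE.

```python
# Schematic sketch of candidate-level contrastive re-ranking: the scorer should
# assign higher similarity scores to candidates with higher reference-based quality.
import torch
import torch.nn.functional as F

def candidate_scores(doc_emb: torch.Tensor, cand_embs: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the document embedding (dim,) and each candidate (n, dim)."""
    return F.cosine_similarity(doc_emb.unsqueeze(0), cand_embs, dim=-1)

def ranking_loss(scores: torch.Tensor, margin: float = 0.01) -> torch.Tensor:
    """Pairwise margin loss; `scores` are ordered from best to worst candidate."""
    loss = scores.new_zeros(())
    n = scores.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            # the better-ranked candidate i should outscore j by a rank-scaled margin
            loss = loss + F.relu(scores[j] - scores[i] + margin * (j - i))
    return loss

# toy usage: random embeddings standing in for encoder outputs
doc = torch.randn(768)
cands = torch.randn(4, 768, requires_grad=True)   # 4 beam-search candidates, pre-sorted by ROUGE
loss = ranking_loss(candidate_scores(doc, cands))
loss.backward()
```

At inference time the candidate with the highest score is selected, so the evaluation-time criterion and the training objective operate on the same candidate-level quantity.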

BERTScore: Evaluating text generation with BERT

T Zhang, V Kishore, F Wu, KQ Weinberger… - arXiv preprint arXiv …, 2019 - arxiv.org
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to
common metrics, BERTScore computes a similarity score for each token in the candidate …
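The token-level matching the abstract refers to can be written down compactly: contextual embeddings for both texts, pairwise cosine similarities, and greedy matching in each direction. The sketch below assumes a `bert-base-uncased` encoder from Hugging Face `transformers` and omits the paper's IDF weighting and baseline rescaling; the official `bert_score` package is the reference implementation.

```python
# Minimal BERTScore-style metric: greedy cosine-similarity matching between
# candidate and reference tokens, combined into an F1 score.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """L2-normalized contextual embeddings per token, special tokens dropped."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    hidden = hidden[1:-1]                            # drop [CLS] and [SEP]
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore_f1(candidate: str, reference: str) -> float:
    cand, ref = embed(candidate), embed(reference)
    sim = cand @ ref.T                               # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean()         # each candidate token -> best reference token
    recall = sim.max(dim=0).values.mean()            # each reference token -> best candidate token
    return (2 * precision * recall / (precision + recall)).item()

print(bertscore_f1("the cat sat on the mat", "a cat was sitting on the mat"))
```

Because matching happens in embedding space rather than over exact n-grams, paraphrases of the reference score well even when they share few surface tokens.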

Reformulating unsupervised style transfer as paraphrase generation

K Krishna, J Wieting, M Iyyer - arXiv preprint arXiv:2010.05700, 2020 - arxiv.org
Modern NLP defines the task of style transfer as modifying the style of a given sentence
without appreciably changing its semantics, which implies that the outputs of style transfer …

You only prompt once: On the capabilities of prompt learning on large language models to tackle toxic content

X He, S Zannettou, Y Shen… - 2024 IEEE Symposium on …, 2024 - ieeexplore.ieee.org
The spread of toxic content online is an important problem that has adverse effects on user
experience online and in our society at large. Motivated by the importance and impact of the …

Semantic similarity metrics for evaluating source code summarization

S Haque, Z Eberhart, A Bansal… - Proceedings of the 30th …, 2022 - dl.acm.org
Source code summarization involves creating brief descriptions of source code in natural
language. These descriptions are a key component of software documentation such as …

LongEval: Guidelines for human evaluation of faithfulness in long-form summarization

K Krishna, E Bransom, B Kuehl, M Iyyer… - arXiv preprint arXiv …, 2023 - arxiv.org
While human evaluation remains best practice for accurately judging the faithfulness of
automatically-generated summaries, few solutions exist to address the increased difficulty …