Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …
A survey of evaluation metrics used for NLG systems
In the last few years, a large number of automatic evaluation metrics have been proposed for
evaluating Natural Language Generation (NLG) systems. The rapid development and …
Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
The rise in malicious usage of large language models, such as fake content creation and
academic plagiarism, has motivated the development of approaches that identify AI …
Confident adaptive language modeling
Recent advances in Transformer-based large language models (LLMs) have led to
significant performance improvements across many tasks. These gains come with a drastic …
SimCLS: A simple framework for contrastive learning of abstractive summarization
In this paper, we present a conceptually simple while empirically powerful framework for
abstractive summarization, SimCLS, which can bridge the gap between the learning …
BERTScore: Evaluating text generation with BERT
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to
common metrics, BERTScore computes a similarity score for each token in the candidate …
Reformulating unsupervised style transfer as paraphrase generation
Modern NLP defines the task of style transfer as modifying the style of a given sentence
without appreciably changing its semantics, which implies that the outputs of style transfer …
You only prompt once: On the capabilities of prompt learning on large language models to tackle toxic content
The spread of toxic content online is an important problem that has adverse effects on user
experience online and in our society at large. Motivated by the importance and impact of the …
Semantic similarity metrics for evaluating source code summarization
Source code summarization involves creating brief descriptions of source code in natural
language. These descriptions are a key component of software documentation such as …
LongEval: Guidelines for human evaluation of faithfulness in long-form summarization
While human evaluation remains best practice for accurately judging the faithfulness of
automatically-generated summaries, few solutions exist to address the increased difficulty …