Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

A survey of active learning for natural language processing

Z Zhang, E Strubell, E Hovy - arXiv preprint arXiv:2210.10109, 2022 - arxiv.org
In this work, we provide a survey of active learning (AL) for its applications in natural
language processing (NLP). In addition to a fine-grained categorization of query strategies …

Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling

J Ruan, X Pu, M Gao, X Wan, Y Zhu - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Human evaluation is viewed as a reliable evaluation method for NLG which is expensive
and time-consuming. In order to save labor and costs, researchers usually perform human …

Label-efficient model selection for text generation

SA Tahan, A Gera, B Sznajder, L Choshen… - Proceedings of the …, 2024 - aclanthology.org
Model selection for a given target task can be costly, as it may entail extensive
annotation of the quality of outputs of different models. We introduce DiffUse, an efficient …

Label-efficient model selection for text generation

S Ashury-Tahan, A Gera, B Sznajder… - arXiv preprint arXiv …, 2024 - arxiv.org
Model selection for a given target task can be costly, as it may entail extensive annotation of
the quality of outputs of different models. We introduce DiffUse, an efficient method to make …

On the effectiveness of automated metrics for text generation systems

P Von Däniken, J Deriu, D Tuggener… - arXiv preprint arXiv …, 2022 - arxiv.org
A major challenge in the field of Text Generation is evaluation because we lack a sound
theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we …

Baby Bear: Seeking a Just Right Rating Scale for Scalar Annotations

X Han, F Yu, J Sedoc, B Van Durme - arXiv preprint arXiv:2408.09765, 2024 - arxiv.org
Our goal is a mechanism for efficiently assigning scalar ratings to each of a large set of
elements. For example, "what percent positive or negative is this product review?" When …

Exploring Language Structured Prediction in Resource-limited Scenarios

Z Zhang - 2023 - lti.cmu.edu
In natural language processing (NLP), many tasks involve structured prediction: predicting
structured outputs consisting of a group of interdependent variables. This allows extracting …