Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

S Gehrmann, E Clark, T Sellam - Journal of Artificial Intelligence Research, 2023 - jair.org
Evaluation practices in natural language generation (NLG) have many known flaws,
but improved evaluation approaches are rarely widely adopted. This issue has become …

Leakage and the reproducibility crisis in machine-learning-based science

S Kapoor, A Narayanan - Patterns, 2023 - cell.com
Machine-learning (ML) methods have gained prominence in the quantitative sciences.
However, there are many known methodological pitfalls, including data leakage, in ML …

A taxonomy and review of generalization research in NLP

D Hupkes, M Giulianelli, V Dankers, M Artetxe… - Nature Machine Intelligence, 2023 - nature.com
The ability to generalize well is one of the primary desiderata for models of natural language
processing (NLP), but what 'good generalization' entails and how it should be evaluated is …

Impact of pretraining term frequencies on few-shot reasoning

Y Razeghi, RL Logan IV, M Gardner… - arXiv preprint arXiv …, 2022 - arxiv.org
Pretrained Language Models (LMs) have demonstrated the ability to perform numerical
reasoning by extrapolating from a few examples in few-shot settings. However, the extent to …

SemEval-2020 Task 12: Multilingual offensive language identification in social media (OffensEval 2020)

M Zampieri, P Nakov, S Rosenthal, P Atanasova… - arXiv preprint arXiv …, 2020 - arxiv.org
We present the results and main findings of SemEval-2020 Task 12 on Multilingual
Offensive Language Identification in Social Media (OffensEval 2020). The task involves …

Show your work: Improved reporting of experimental results

J Dodge, S Gururangan, D Card, R Schwartz… - arXiv preprint arXiv …, 2019 - arxiv.org
Research in natural language processing proceeds, in part, by demonstrating that new
models achieve superior performance (e.g., accuracy) on held-out test data, compared to …

Probing toxic content in large pre-trained language models

N Ousidhoum, X Zhao, T Fang, Y Song… - Proceedings of the …, 2021 - aclanthology.org
Large pre-trained language models (PTLMs) have been shown to carry biases towards
different social groups, which leads to the reproduction of stereotypical and toxic content by …

Toxicity detection: Does context really matter?

J Pavlopoulos, J Sorensen, L Dixon, N Thain… - arXiv preprint arXiv …, 2020 - arxiv.org
Moderation is crucial to promoting healthy online discussions. Although several 'toxicity'
detection datasets and models have been published, most of them ignore the context of the …

MultiEURLEX: A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

I Chalkidis, M Fergadiotis, I Androutsopoulos - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal
documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 …

On the value of out-of-distribution testing: An example of Goodhart's law

D Teney, E Abbasnejad, K Kafle… - Advances in Neural Information Processing Systems, 2020 - proceedings.neurips.cc

Out-of-distribution (OOD) testing is increasingly popular for evaluating a machine
learning system's ability to generalize beyond the biases of a training set. OOD benchmarks …