Dynabench: Rethinking benchmarking in NLP

D Kiela, M Bartolo, Y Nie, D Kaushik, A Geiger… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …

Machine learning testing: Survey, landscapes and horizons

JM Zhang, M Harman, L Ma… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
This paper provides a comprehensive survey of techniques for testing machine learning
systems; Machine Learning Testing (ML testing) research. It covers 144 papers on testing …

An empirical study on robustness to spurious correlations using pre-trained language models

L Tu, G Lalwani, S Gella, H He - Transactions of the Association for …, 2020 - direct.mit.edu
Recent work has shown that pre-trained language models such as BERT improve
robustness to spurious correlations in the dataset. Intrigued by these results, we find that the …

Robustness Gym: Unifying the NLP evaluation landscape

K Goel, N Rajani, J Vig, S Tan, J Wu, S Zheng… - arXiv preprint arXiv …, 2021 - arxiv.org
Despite impressive performance on standard benchmarks, deep neural networks are often
brittle when deployed in real-world systems. Consequently, recent research has focused on …

Towards debiasing NLU models from unknown biases

PA Utama, NS Moosavi, I Gurevych - arXiv preprint arXiv:2009.12303, 2020 - arxiv.org
NLU models often exploit biases to achieve high dataset-specific performance without
properly learning the intended task. Recently proposed debiasing methods are shown to be …

A fine-grained comparison of pragmatic language understanding in humans and language models

J Hu, S Floyd, O Jouravlev, E Fedorenko… - arXiv preprint arXiv …, 2022 - arxiv.org
Pragmatics and non-literal language understanding are essential to human communication,
and present a long-standing challenge for artificial language models. We perform a fine …

Quality assurance strategies for machine learning applications in big data analytics: an overview

M Ogrizović, D Drašković, D Bojić - Journal of Big Data, 2024 - Springer
Machine learning (ML) models have gained significant attention in a variety of
applications, from computer vision to natural language processing, and are almost always …

DISCO: Distilling counterfactuals with large language models

Z Chen, Q Gao, A Bosselut, A Sabharwal… - arXiv preprint arXiv …, 2022 - arxiv.org
Models trained with counterfactually augmented data learn representations of the causal
structure of tasks, enabling robust generalization. However, high-quality counterfactual data …

Are natural language inference models IMPPRESsive? Learning IMPlicature and PRESupposition

P Jeretic, A Warstadt, S Bhooshan… - arXiv preprint arXiv …, 2020 - arxiv.org
Natural language inference (NLI) is an increasingly important task for natural language
understanding, which requires one to infer whether a sentence entails another. However …

Text-CRS: A generalized certified robustness framework against textual adversarial attacks

X Zhang, H Hong, Y Hong, P Huang… - … IEEE Symposium on …, 2024 - ieeexplore.ieee.org
The language models, especially the basic text classification models, have been shown to
be susceptible to textual adversarial attacks such as synonym substitution and word …