Are we learning yet? a meta review of evaluation failures across machine learning

T Liao, R Taori, ID Raji, L Schmidt - Thirty-fifth Conference on …, 2021 - openreview.net
Many subfields of machine learning share a common stumbling block: evaluation. Advances
in machine learning often evaporate under closer scrutiny or turn out to be less widely …

[HTML][HTML] Pre-trained transformers: an empirical comparison

S Casola, I Lauriola, A Lavelli - Machine Learning with Applications, 2022 - Elsevier
Pre-trained transformers have rapidly become very popular in the Natural Language
Processing (NLP) community, surpassing the previous state of the art in a wide variety of …

Xstest: A test suite for identifying exaggerated safety behaviours in large language models

P Röttger, HR Kirk, B Vidgen, G Attanasio… - arxiv preprint arxiv …, 2023 - arxiv.org
Without proper safeguards, large language models will readily follow malicious instructions
and generate toxic content. This risk motivates safety efforts such as red-teaming and large …

An introduction to deep learning in natural language processing: Models, techniques, and tools

I Lauriola, A Lavelli, F Aiolli - Neurocomputing, 2022 - Elsevier
Abstract Natural Language Processing (NLP) is a branch of artificial intelligence that
involves the design and implementation of systems and algorithms able to interact through …

Dynabench: Rethinking benchmarking in NLP

D Kiela, M Bartolo, Y Nie, D Kaushik, A Geiger… - arxiv preprint arxiv …, 2021 - arxiv.org
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …

Underspecification presents challenges for credibility in modern machine learning

A D'Amour, K Heller, D Moldovan, B Adlam… - Journal of Machine …, 2022 - jmlr.org
Machine learning (ML) systems often exhibit unexpectedly poor behavior when they are
deployed in real-world domains. We identify underspecification in ML pipelines as a key …

Wilds: A benchmark of in-the-wild distribution shifts

PW Koh, S Sagawa, H Marklund… - International …, 2021 - proceedings.mlr.press
Distribution shifts—where the training distribution differs from the test distribution—can
substantially degrade the accuracy of machine learning (ML) systems deployed in the wild …

Tandem mass spectrum prediction for small molecules using graph transformers

A Young, H Röst, B Wang - Nature Machine Intelligence, 2024 - nature.com
Tandem mass spectra capture fragmentation patterns that provide key structural information
about molecules. Although mass spectrometry is applied in many areas, the vast majority of …

HateCheck: Functional tests for hate speech detection models

P Röttger, B Vidgen, D Nguyen, Z Waseem… - arxiv preprint arxiv …, 2020 - arxiv.org
Detecting online hate is a difficult task that even state-of-the-art models struggle with.
Typically, hate speech detection models are evaluated by measuring their performance on …

Towards debiasing NLU models from unknown biases

PA Utama, NS Moosavi, I Gurevych - arxiv preprint arxiv:2009.12303, 2020 - arxiv.org
NLU models often exploit biases to achieve high dataset-specific performance without
properly learning the intended task. Recently proposed debiasing methods are shown to be …