The neuroconnectionist research programme

A Doerig, RP Sommers, K Seeliger… - Nature Reviews …, 2023 - nature.com
Artificial neural networks (ANNs) inspired by biology are beginning to be widely used to
model behavioural and neural data, an approach we call 'neuroconnectionism'. ANNs have …

Data and its (dis)contents: A survey of dataset development and use in machine learning research

A Paullada, ID Raji, EM Bender, E Denton, A Hanna - Patterns, 2021 - cell.com
In this work, we survey a breadth of literature that has revealed the limitations of
predominant practices for dataset collection and use in the field of machine learning. We …

The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

B Plank - arXiv preprint arXiv:2211.02570, 2022 - arxiv.org
Human variation in labeling is often considered noise. Annotation projects for machine
learning (ML) aim at minimizing human label variation, with the assumption to maximize …

AI and the everything in the whole wide world benchmark

ID Raji, EM Bender, A Paullada, E Denton… - arXiv preprint arXiv …, 2021 - arxiv.org
There is a tendency across different subfields in AI to valorize a small collection of influential
benchmarks. These benchmarks operate as stand-ins for a range of anointed common …

Reduced, reused and recycled: The life of a dataset in machine learning research

B Koch, E Denton, A Hanna, JG Foster - arXiv preprint arXiv:2112.01716, 2021 - arxiv.org
Benchmark datasets play a central role in the organization of machine learning research.
They coordinate researchers around shared research problems and serve as a measure of …

Benchmarks for automated commonsense reasoning: A survey

E Davis - ACM Computing Surveys, 2023 - dl.acm.org
More than one hundred benchmarks have been developed to test the commonsense
knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems …

Evaluation gaps in machine learning practice

B Hutchinson, N Rostamzadeh, C Greer… - Proceedings of the …, 2022 - dl.acm.org
Forming a reliable judgement of a machine learning (ML) model's appropriateness for an
application ecosystem is critical for its responsible use, and requires considering a broad …

Position: Key claims in LLM research have a long tail of footnotes

A Rogers, S Luccioni - Forty-first International Conference on …, 2024 - openreview.net
Much of the recent discourse within the ML community has been centered around Large
Language Models (LLMs), their functionality and potential--yet not only do we not have a …

Evaluation examples are not equally informative: How should that change NLP leaderboards?

P Rodriguez, J Barrow, AM Hoyle… - Proceedings of the …, 2021 - aclanthology.org
Leaderboards are widely used in NLP and push the field forward. While leaderboards are a
straightforward ranking of NLP models, this simplicity can mask nuances in evaluation items …

Underspecification in scene description-to-depiction tasks

B Hutchinson, J Baldridge, V Prabhakaran - arXiv preprint arXiv …, 2022 - arxiv.org
Questions regarding implicitness, ambiguity and underspecification are crucial for
understanding the task validity and ethical concerns of multimodal image+text systems, yet …