Performance evaluation in machine learning: the good, the bad, the ugly, and the way forward

P Flach - Proceedings of the AAAI conference on artificial …, 2019 - aaai.org
This paper gives an overview of some ways in which our understanding of performance
evaluation measures for machine-learned classifiers has improved over the last twenty …

Evaluation examples are not equally informative: How should that change NLP leaderboards?

P Rodriguez, J Barrow, AM Hoyle… - Proceedings of the …, 2021 - aclanthology.org
Leaderboards are widely used in NLP and push the field forward. While leaderboards are a
straightforward ranking of NLP models, this simplicity can mask nuances in evaluation items …

Comparing Bayesian models of annotation

S Paun, B Carpenter, J Chamberlain, D Hovy… - Transactions of the …, 2018 - direct.mit.edu
The analysis of crowdsourced annotations in natural language processing is concerned with
identifying (1) gold standard labels, (2) annotator accuracies and biases, and (3) item …

Item response theory in AI: Analysing machine learning classifiers at the instance level

F Martínez-Plumed, RBC Prudêncio, A Martínez-Usó… - Artificial intelligence, 2019 - Elsevier
AI systems are usually evaluated on a range of problem instances and compared to other AI
systems that use different strategies. These instances are rarely independent. Machine …

The quest for the reliability of machine learning models in binary classification on tabular data

VC Araujo Santos, L Cardoso, R Alves - Scientific Reports, 2023 - nature.com
In this paper we explore the reliability of contexts of machine learning (ML) models. There
are several evaluation procedures commonly used to validate a model (precision, F1 Score …

Content Modeling in Smart Learning Environments: A systematic literature review

A Jiménez-Macías, PJ Muñoz-Merino… - Journal of Universal …, 2024 - search.proquest.com
Educational content has become a key element for improving the quality and effectiveness
of teaching. Many studies have been conducted on user and knowledge modeling using …

Item response theory based ensemble in machine learning

Z Chen, H Ahn - International Journal of Automation and Computing, 2020 - Springer
In this article, we propose a novel probabilistic framework to improve the accuracy of a
weighted majority voting algorithm. In order to assign higher weights to the classifiers which …

Learning latent parameters without human response patterns: Item response theory with artificial crowds

JP Lalor, H Wu, H Yu - Proceedings of the Conference on Empirical …, 2019 - ncbi.nlm.nih.gov
Incorporating Item Response Theory (IRT) into NLP tasks can provide valuable
information about model performance and behavior. Traditionally, IRT models are learned …

Unveiling the robustness of machine learning families

R Fabra-Boluda, C Ferri… - Machine Learning …, 2024 - iopscience.iop.org
The evaluation of machine learning systems has typically been limited to performance
measures on clean and curated datasets, which may not accurately reflect their robustness …

Dual indicators to analyze AI benchmarks: Difficulty, discrimination, ability, and generality

F Martínez-Plumed… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
With the purpose of better analyzing the result of artificial intelligence (AI) benchmarks, we
present two indicators on the side of the AI problems, difficulty and discrimination, and two …