Performance evaluation in machine learning: the good, the bad, the ugly, and the way forward
P Flach - Proceedings of the AAAI conference on artificial …, 2019 - aaai.org
This paper gives an overview of some ways in which our understanding of performance
evaluation measures for machine-learned classifiers has improved over the last twenty …
Evaluation examples are not equally informative: How should that change NLP leaderboards?
Leaderboards are widely used in NLP and push the field forward. While leaderboards are a
straightforward ranking of NLP models, this simplicity can mask nuances in evaluation items …
Comparing Bayesian models of annotation
The analysis of crowdsourced annotations in natural language processing is concerned with
identifying (1) gold standard labels, (2) annotator accuracies and biases, and (3) item …
Item response theory in AI: Analysing machine learning classifiers at the instance level
AI systems are usually evaluated on a range of problem instances and compared to other AI
systems that use different strategies. These instances are rarely independent. Machine …
The quest for the reliability of machine learning models in binary classification on tabular data
In this paper we explore the reliability of contexts of machine learning (ML) models. There
are several evaluation procedures commonly used to validate a model (precision, F1 Score …
Content Modeling in Smart Learning Environments: A systematic literature review
Educational content has become a key element for improving the quality and effectiveness
of teaching. Many studies have been conducted on user and knowledge modeling using …
Item response theory based ensemble in machine learning
In this article, we propose a novel probabilistic framework to improve the accuracy of a
weighted majority voting algorithm. In order to assign higher weights to the classifiers which …
Learning latent parameters without human response patterns: Item response theory with artificial crowds
Incorporating Item Response Theory (IRT) into NLP tasks can provide valuable
information about model performance and behavior. Traditionally, IRT models are learned …
Unveiling the robustness of machine learning families
R Fabra-Boluda, C Ferri… - Machine Learning …, 2024 - iopscience.iop.org
The evaluation of machine learning systems has typically been limited to performance
measures on clean and curated datasets, which may not accurately reflect their robustness …
Dual indicators to analyze ai benchmarks: Difficulty, discrimination, ability, and generality
F Martinez-Plumed… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
With the purpose of better analyzing the result of artificial intelligence (AI) benchmarks, we
present two indicators on the side of the AI problems, difficulty and discrimination, and two …