Big data preprocessing: methods and prospects
The massive growth in the scale of data has been observed in recent years being a key
factor of the Big Data scenario. Big Data can be defined as high volume, velocity and variety …
Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study
Semi-supervised classification methods are suitable tools to tackle training sets with large
amounts of unlabeled data and a small quantity of labeled data. This problem has been …
Big data: New tricks for econometrics
HR Varian - Journal of economic perspectives, 2014 - aeaweb.org
Computers are now involved in many economic transactions and can capture data
associated with these transactions, which can then be manipulated and analyzed …
kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors classifier for big data
The k-Nearest Neighbors classifier is a simple yet effective widely renowned
method in data mining. The actual application of this model in the big data domain is not …
Web-scale k-means clustering
D Sculley - Proceedings of the 19th international conference on …, 2010 - dl.acm.org
We present two modifications to the popular k-means clustering algorithm to address the
extreme requirements for latency, scalability, and sparsity encountered in user-facing web …
On tackling explanation redundancy in decision trees
Decision trees (DTs) epitomize the ideal of interpretability of machine learning (ML) models.
The interpretability of decision trees motivates explainability approaches by so-called …
Advance and prospects of AdaBoost algorithm
C Ying, M Qi-Guang, L Jia-Chen, G Lin - Acta Automatica Sinica, 2013 - Elsevier
AdaBoost is one of the most excellent Boosting algorithms. It has a solid theoretical basis
and has made great success in practical applications. AdaBoost can boost a weak learning …
Data mining: practical machine learning tools and techniques with Java implementations
Witten and Frank's textbook was one of two books that I used for a data mining class in the
Fall of 2001. The book covers all major methods of data mining that produce a knowledge …
A benchmark study on time series clustering
This paper presents the first time series clustering benchmark utilizing all time series
datasets currently available in the University of California Riverside (UCR) archive—the …