[PDF][PDF] Data cleaning: Problems and current approaches
E Rahm, HH Do - IEEE Data Eng. Bull., 2000 - cs.brown.edu
We classify data quality problems that are addressed by data cleaning and provide an
overview of the main solution approaches. Data cleaning is especially required when …
Semantic integration research in the database community: A brief survey
Semantic integration has been a long-standing challenge for the database community. It has
received steady attention over the past two decades, and has now become a prominent area …
[BOOK][B] Data cleaning
This is an overview of the end-to-end data cleaning process. Data quality is one of the most
important problems in data management, since dirty data often leads to inaccurate data …
[BOOK][B] The data matching process
P Christen, P Christen - 2012 - Springer
This chapter provides an overview of the data matching process, and describes the five
major steps involved in this process: data pre-processing (cleaning and standardisation) …
A multidimensional approach for detecting irony in twitter
Irony is a pervasive aspect of many online texts, one made all the more difficult by the
absence of face-to-face contact and vocal intonation. As our media increasingly become …
Duplicate record detection: A survey
Often, in the real world, entities have two or more representations in databases. Duplicate
records do not share a common key and/or they contain errors that make duplicate matching …
[PDF][PDF] A Comparison of String Distance Metrics for Name-Matching Tasks.
Using an open-source, Java toolkit of name-matching methods, we experimentally compare
string distance metrics on the task of matching entity names. We investigate a number of …
[PDF][PDF] Efficient clustering of high-dimensional data sets with application to reference matching
Many important problems involve clustering large datasets. Although naive implementations
of clustering are computationally expensive, there are established efficient techniques for …
[PDF][PDF] Using the triangle inequality to accelerate k-means
C Elkan - Proceedings of the 20th international conference on …, 2003 - cdn.aaai.org
The k-means algorithm is by far the most widely used method for discovering clusters in data.
We show how to accelerate it dramatically, while still always computing exactly the same …
Data-Centric Systems and Applications
The rapid growth of the Web in the past two decades has made it the largest publicly
accessible data source in the world. Web mining aims to discover useful information or …