From frequency to meaning: Vector space models of semantics
Computers understand very little of the meaning of human language. This profoundly limits
our ability to give instructions to computers, the ability of computers to explain their actions to …
An overview of end-to-end entity resolution for big data
One of the most critical tasks for improving data quality and increasing the reliability of data
analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to …
Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets
The all-pairs-similarity-search (or similarity join) problem has been extensively studied for
text and a handful of other datatypes. However, surprisingly little progress has been made …
[BOOK] The data matching process
P Christen, P Christen - 2012 - Springer
This chapter provides an overview of the data matching process, and describes the five
major steps involved in this process: data pre-processing (cleaning and standardisation) …
[BOOK] Data cleaning
This is an overview of the end-to-end data cleaning process. Data quality is one of the most
important problems in data management, since dirty data often leads to inaccurate data …
Efficient k-nearest neighbor graph construction for generic similarity measures
K-Nearest Neighbor Graph (K-NNG) construction is an important operation with many web
related applications, including collaborative filtering, similarity search, and many others in …
Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS)
A Shrivastava, P Li - Advances in neural information …, 2014 - proceedings.neurips.cc
We present the first provably sublinear time hashing algorithm for approximate
Maximum Inner Product Search (MIPS). Searching with (un-normalized) inner product as …
Crowder: Crowdsourcing entity resolution
Entity resolution is central to data integration and data cleaning. Algorithmic approaches
have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a …
Efficient similarity joins for near-duplicate detection
With the increasing amount of data and the need to integrate data from multiple data
sources, one of the challenging issues is to identify near-duplicate records efficiently. In this …
Blocking and filtering techniques for entity resolution: A survey
Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that
correspond to the same real-world object. Due to its inherently quadratic complexity, a series …