From frequency to meaning: Vector space models of semantics

PD Turney, P Pantel - Journal of artificial intelligence research, 2010 - jair.org
Computers understand very little of the meaning of human language. This profoundly limits
our ability to give instructions to computers, the ability of computers to explain their actions to …

An overview of end-to-end entity resolution for big data

V Christophides, V Efthymiou, T Palpanas… - ACM Computing …, 2020 - dl.acm.org
One of the most critical tasks for improving data quality and increasing the reliability of data
analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to …

Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets

CCM Yeh, Y Zhu, L Ulanova, N Begum… - 2016 IEEE 16th …, 2016 - ieeexplore.ieee.org
The all-pairs-similarity-search (or similarity join) problem has been extensively studied for
text and a handful of other datatypes. However, surprisingly little progress has been made …

[BOOK] The data matching process

P Christen - 2012 - Springer
This chapter provides an overview of the data matching process, and describes the five
major steps involved in this process: data pre-processing (cleaning and standardisation) …

[BOOK] Data cleaning

IF Ilyas, X Chu - 2019 - books.google.com
This is an overview of the end-to-end data cleaning process. Data quality is one of the most
important problems in data management, since dirty data often leads to inaccurate data …

Efficient k-nearest neighbor graph construction for generic similarity measures

W Dong, C Moses, K Li - … of the 20th international conference on World …, 2011 - dl.acm.org
K-Nearest Neighbor Graph (K-NNG) construction is an important operation with many web
related applications, including collaborative filtering, similarity search, and many others in …

Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS)

A Shrivastava, P Li - Advances in neural information …, 2014 - proceedings.neurips.cc
We present the first provably sublinear time hashing algorithm for approximate
Maximum Inner Product Search (MIPS). Searching with (un-normalized) inner product as …

Crowder: Crowdsourcing entity resolution

J Wang, T Kraska, MJ Franklin, J Feng - arXiv preprint arXiv:1208.1927, 2012 - arxiv.org
Entity resolution is central to data integration and data cleaning. Algorithmic approaches
have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a …

Efficient similarity joins for near-duplicate detection

C Xiao, W Wang, X Lin, JX Yu, G Wang - ACM Transactions on Database …, 2011 - dl.acm.org
With the increasing amount of data and the need to integrate data from multiple data
sources, one of the challenging issues is to identify near-duplicate records efficiently. In this …

Blocking and filtering techniques for entity resolution: A survey

G Papadakis, D Skoutas, E Thanos… - ACM Computing Surveys …, 2020 - dl.acm.org
Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that
correspond to the same real-world object. Due to its inherently quadratic complexity, a series …