Blocking and filtering techniques for entity resolution: A survey

G Papadakis, D Skoutas, E Thanos… - ACM Computing Surveys …, 2020 - dl.acm.org
Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that
correspond to the same real-world object. Due to its inherently quadratic complexity, a series …

ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms

M Aumüller, E Bernhardsson, A Faithfull - Information Systems, 2020 - Elsevier
This paper describes ANN-Benchmarks, a tool for evaluating the performance of in-memory
approximate nearest neighbor algorithms. It provides a standard interface for measuring the …

A survey of blocking and filtering techniques for entity resolution

G Papadakis, D Skoutas, E Thanos… - arXiv preprint arXiv …, 2019 - arxiv.org
Efficiency techniques are an integral part of Entity Resolution, since its infancy. In this
survey, we organized the bulk of works in the field into Blocking, Filtering and hybrid …

MR-MVPP: A map-reduce-based approach for creating MVPP in data warehouses for big data applications

H Azgomi, MK Sohrabi - Information Sciences, 2021 - Elsevier
Materialized view selection (MVS) is the problem of selecting an appropriate set of views to
be materialized to speed up analytical query processing of data warehouses. Online …

Pigeonring: A principle for faster thresholded similarity search

J Qin, C Xiao - arXiv preprint arXiv:1804.01614, 2018 - arxiv.org
The pigeonhole principle states that if $n$ items are contained in $m$ boxes, then at least
one box has no more than $n/m$ items. It is utilized to solve many data management …

Tokenjoin: Efficient filtering for set similarity join with maximum-weighted bipartite matching

A Zeakis, D Skoutas, D Sacharidis… - Proceedings of the …, 2022 - research.tue.nl
Set similarity join is an important problem with many applications in data discovery, cleaning
and integration. To increase robustness, fuzzy set similarity join calculates the similarity of …

Fast locality-sensitive hashing frameworks for approximate near neighbor search

T Christiani - International Conference on Similarity Search and …, 2019 - Springer
The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a
general technique for constructing a data structure to answer approximate near neighbor …

Higher-order count sketch: Dimensionality reduction that retains efficient tensor operations

Y Shi, A Anandkumar - arXiv preprint arXiv:1901.11261, 2019 - arxiv.org
Sketching is a randomized dimensionality-reduction method that aims to preserve relevant
information in large-scale datasets. Count sketch is a simple popular sketch which uses a …

PPIS-JOIN: A novel privacy-preserving image similarity join method

C Zhang, F Xie, H Yu, J Zhang, L Zhu, Y Li - Neural Processing Letters, 2022 - Springer
Recently, massive multimedia data (especially images) is moved to the cloud environment
for analysis and retrieval, which makes data security issue become particularly significant …

Metricjoin: Leveraging metric properties for robust exact set similarity joins

M Widmoser, D Kocher, N Augsten… - 2023 IEEE 39th …, 2023 - ieeexplore.ieee.org
Given two collections of sets, the set similarity join reports all pairs of sets that are within a
given distance threshold. State-of-the-art solutions employ an inverted list index and several …