Blocking and filtering techniques for entity resolution: A survey
Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that
correspond to the same real-world object. Due to its inherently quadratic complexity, a series …
String similarity search and join: a survey
String similarity search and join are two important operations in data cleaning and
integration, which extend traditional exact search and exact join operations in databases by …
Fuzzy keyword search over encrypted data in cloud computing
As Cloud Computing becomes prevalent, more and more sensitive information is being
centralized into the cloud. For the protection of data privacy, sensitive data usually have to …
Josie: Overlap set similarity search for finding joinable tables in data lakes
We present a new solution for finding joinable tables in massive data lakes: given a table
and one join column, find tables that can be joined with the given table on the largest …
Efficient similarity joins for near-duplicate detection
With the increasing amount of data and the need to integrate data from multiple data
sources, one of the challenging issues is to identify near-duplicate records efficiently. In this …
Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning
Dataset discovery from data lakes is essential in many real application scenarios. In this
paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes …
Can we beat the prefix filtering? An adaptive framework for similarity join and search
As two important operations in data cleaning, similarity join and similarity search have
attracted much attention recently. Existing methods to support similarity join usually adopt a …
V-SMART-Join: A scalable MapReduce framework for all-pair similarity joins of multisets and vectors
A Metwally, C Faloutsos - arXiv preprint arXiv:1204.6077, 2012 - arxiv.org
This work proposes V-SMART-Join, a scalable MapReduce-based framework for
discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets …
Deep entity matching: Challenges and opportunities
Entity matching refers to the task of determining whether two different representations refer
to the same real-world entity. It continues to be a prevalent problem for many organizations …