Blocking and filtering techniques for entity resolution: A survey

G Papadakis, D Skoutas, E Thanos… - ACM Computing Surveys …, 2020 - dl.acm.org
Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that
correspond to the same real-world object. Due to its inherently quadratic complexity, a series …

String similarity search and join: a survey

M Yu, G Li, D Deng, J Feng - Frontiers of Computer Science, 2016 - Springer
String similarity search and join are two important operations in data cleaning and
integration, which extend traditional exact search and exact join operations in databases by …

An important aspect of big data: data usability

李建中, 刘显敏 - 计算机研究与发展, 2013 - cs.sjtu.edu.cn

Fuzzy keyword search over encrypted data in cloud computing

J Li, Q Wang, C Wang, N Cao, K Ren… - 2010 Proceedings IEEE …, 2010 - ieeexplore.ieee.org
As Cloud Computing becomes prevalent, more and more sensitive information is being
centralized into the cloud. For the protection of data privacy, sensitive data usually have to …

Josie: Overlap set similarity search for finding joinable tables in data lakes

E Zhu, D Deng, F Nargesian, RJ Miller - Proceedings of the 2019 …, 2019 - dl.acm.org
We present a new solution for finding joinable tables in massive data lakes: given a table
and one join column, find tables that can be joined with the given table on the largest …

Efficient similarity joins for near-duplicate detection

C Xiao, W Wang, X Lin, JX Yu, G Wang - ACM Transactions on Database …, 2011 - dl.acm.org
With the increasing amount of data and the need to integrate data from multiple data
sources, one of the challenging issues is to identify near-duplicate records efficiently. In this …

Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning

G Fan, J Wang, Y Li, D Zhang, R Miller - arXiv preprint arXiv:2210.01922, 2022 - arxiv.org
Dataset discovery from data lakes is essential in many real application scenarios. In this
paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes …

Can we beat the prefix filtering? An adaptive framework for similarity join and search

J Wang, G Li, J Feng - Proceedings of the 2012 ACM SIGMOD …, 2012 - dl.acm.org
As two important operations in data cleaning, similarity join and similarity search have
attracted much attention recently. Existing methods to support similarity join usually adopt a …

V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors

A Metwally, C Faloutsos - arXiv preprint arXiv:1204.6077, 2012 - arxiv.org
This work proposes V-SMART-Join, a scalable MapReduce-based framework for
discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets …

Deep entity matching: Challenges and opportunities

Y Li, J Li, Y Suhara, J Wang, W Hirota… - Journal of Data and …, 2021 - dl.acm.org
Entity matching refers to the task of determining whether two different representations refer
to the same real-world entity. It continues to be a prevalent problem for many organizations …