Blocking and filtering techniques for entity resolution: A survey

G Papadakis, D Skoutas, E Thanos… - ACM Computing Surveys …, 2020 - dl.acm.org
Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that
correspond to the same real-world object. Due to its inherently quadratic complexity, a series …

Deep entity matching: Challenges and opportunities

Y Li, J Li, Y Suhara, J Wang, W Hirota… - Journal of Data and …, 2021 - dl.acm.org
Entity matching refers to the task of determining whether two different representations refer
to the same real-world entity. It continues to be a prevalent problem for many organizations …

Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning

G Fan, J Wang, Y Li, D Zhang, R Miller - arXiv preprint arXiv:2210.01922, 2022 - arxiv.org
Dataset discovery from data lakes is essential in many real application scenarios. In this
paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes …

Efficient joinable table discovery in data lakes: A high-dimensional similarity-based approach

Y Dong, K Takeoka, C Xiao… - 2021 IEEE 37th …, 2021 - ieeexplore.ieee.org
Finding joinable tables in data lakes is a key procedure in many applications such as data
integration, data augmentation, data analysis, and data market. Traditional approaches that …

Deep learning approaches for similarity computation: A survey

P Yang, H Wang, J Yang, Z Qian… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The requirement for appropriate ways to measure the similarity between data objects is a
common but vital task in various domains, such as data mining, machine learning and so on …

Machop: an end-to-end generalized entity matching framework

J Wang, Y Li, W Hirota, E Kandogan - Proceedings of the Fifth …, 2022 - dl.acm.org
Real-world applications frequently seek to solve a general form of the Entity Matching (EM)
problem to find associated entities. Such scenarios include matching jobs to candidates in …

OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories

C Koutras, J Zhang, X Qin, C Lei, V Ioannidis… - arXiv preprint arXiv …, 2024 - arxiv.org
How can we discover join relationships among columns of tabular data in a data repository?
Can this be done effectively when metadata is missing? Traditional column matching works …

CrowdMed-II: a blockchain-based framework for efficient consent management in health data sharing

C Hu, C Li, G Zhang, Z Lei, M Shah, Y Zhang, C Xing… - World Wide Web, 2022 - Springer
The healthcare industry faces serious problems with health data. Firstly, health data is
fragmented and its quality needs to be improved. Data fragmentation means that it is difficult …

A transformation-based framework for KNN set similarity search

Y Zhang, J Wu, J Wang, C Xing - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Set similarity search is a fundamental operation in a variety of applications. While many
previous studies focus on threshold based set similarity search and join, few efforts have …

Boosting approximate dictionary-based entity extraction with synonyms

J Wang, C Lin, M Li, C Zaniolo - Information Sciences, 2020 - Elsevier
Dictionary-based entity extraction is an important task in many data analysis applications,
such as academic search, document classification, and code auto-debugging. To improve …