String similarity search and join: a survey

M Yu, G Li, D Deng, J Feng - Frontiers of Computer Science, 2016 - Springer
String similarity search and join are two important operations in data cleaning and
integration, which extend traditional exact search and exact join operations in databases by …

Hierarchical classification of protein folds using a novel ensemble classifier

C Lin, Y Zou, J Qin, X Liu, Y Jiang, C Ke, Q Zou - PLoS ONE, 2013 - journals.plos.org
The analysis of biological information from protein sequences is important for the study of
cellular functions and interactions, and protein fold recognition plays a key role in the …

MassJoin: A MapReduce-based method for scalable string similarity joins

D Deng, G Li, S Hao, J Wang… - 2014 IEEE 30th …, 2014 - ieeexplore.ieee.org
String similarity join is an essential operation in data integration. The era of big data calls for
scalable algorithms to support large-scale string similarity joins. In this paper, we study …

Human-in-the-loop data integration

G Li - Proceedings of the VLDB Endowment, 2017 - dl.acm.org
Data integration aims to integrate data in different sources and provide users with a unified
view. However, data integration cannot be completely addressed by purely automated …

Efficient approximate entity matching using Jaro-Winkler distance

Y Wang, J Qin, W Wang - International conference on web information …, 2017 - Springer
Jaro-Winkler distance is a measurement to measure the similarity between two strings.
Since Jaro-Winkler distance performs well in matching personal and entity names, it is …

Discovering similarity inclusion dependencies

Y Kaminsky, EHM Pena, F Naumann - … of the ACM on Management of …, 2023 - dl.acm.org
Inclusion dependencies (INDs) are a well-known type of data dependency, specifying that
the values of one column are contained in those of another column. INDs can be used for …

Efficient graph similarity search over large graph databases

W Zheng, L Zou, X Lian, D Wang… - IEEE Transactions on …, 2014 - ieeexplore.ieee.org
Since many graph data are often noisy and incomplete in real applications, it has become
increasingly important to retrieve graphs in the graph database that approximately match the …

Top-k similarity join in heterogeneous information networks

Y Xiong, Y Zhu, PS Yu - IEEE Transactions on Knowledge …, 2014 - ieeexplore.ieee.org
As a newly emerging network model, heterogeneous information networks (HINs) have
received growing attention. Many data mining tasks have been explored in HINs, including …

Fast subtrajectory similarity search in road networks under weighted edit distance constraints

S Koide, C Xiao, Y Ishikawa - arXiv preprint arXiv:2006.05564, 2020 - arxiv.org
In this paper, we address a similarity search problem for spatial trajectories in road networks.
In particular, we focus on the subtrajectory similarity search problem, which involves finding …

A pivotal prefix based filtering algorithm for string similarity search

D Deng, G Li, J Feng - Proceedings of the 2014 ACM SIGMOD …, 2014 - dl.acm.org
We study the string similarity search problem with edit-distance constraints, which, given a
set of data strings and a query string, finds the similar strings to the query. Existing …