ProbMinHash – a class of locality-sensitive hash algorithms for the (probability) Jaccard similarity

O Ertl - IEEE Transactions on Knowledge and Data …, 2020 - ieeexplore.ieee.org
The probability Jaccard similarity was recently proposed as a natural generalization of the
Jaccard similarity to measure the proximity of sets whose elements are associated with …

Bidirectionally densifying LSH sketches with empty bins

P Jia, P Wang, J Zhao, S Zhang, Y Qi, M Hu… - Proceedings of the …, 2021 - dl.acm.org
As an efficient tool for approximate similarity computation and search, Locality Sensitive
Hashing (LSH) has been widely used in many research areas including databases, data …

SetSketch: Filling the gap between MinHash and HyperLogLog

O Ertl - arXiv preprint arXiv:2101.00314, 2021 - arxiv.org
MinHash and HyperLogLog are sketching algorithms that have become indispensable for
set summaries in big data applications. While HyperLogLog allows counting different …

A Compact and Accurate Sketch for Estimating a Large Range of Set Difference Cardinalities

P Jia, P Wang, R Li, J Zhao, J Feng… - 2024 IEEE 40th …, 2024 - ieeexplore.ieee.org
Computing set difference cardinalities is a critical task in database optimization, network
management, and anomaly detection. Due to the limited computational and memory …

Streaming algorithms for estimating high set similarities in loglog space

Y Qi, P Wang, Y Zhang, Q Zhai, C Wang… - … on Knowledge and …, 2020 - ieeexplore.ieee.org
Estimating set similarity and detecting highly similar sets are fundamental problems in areas
such as databases and machine learning. MinHash is a well-known technique for …

BinDash 2.0: new MinHash scheme allows ultra-fast and accurate genome search and comparisons

J Zhao, XF Zhao, J Pierre-Both, KT Konstantinidis - bioRxiv, 2024 - biorxiv.org
Motivation: Comparing large number of genomes in term of their genomic distance is
becoming more and more challenging because there is an increasing number of microbial …

Toward optimal fingerprint indexing for large scale genomics

C Agret, B Cazaux, A Limasset - bioRxiv, 2021 - biorxiv.org
Motivation To keep up with the scale of genomic databases, several methods rely on local
sensitive hashing methods to efficiently find potential matches within large genome …

Generating overlap estimations between high-volume digital data sets based on multiple sketch vector similarity estimators

A Rao, T Mai, M Kapilevich - US Patent 11,449,523, 2022 - Google Patents
The present disclosure relates to systems, methods, and non-transitory computer-readable
media that estimate the overlap between sets of data samples. In particular, in one or more …

Generating overlap estimations between high-volume digital data sets based on multiple sketch vector similarity estimators

A Rao, T Mai, M Kapilevich - US Patent 11,720,592, 2023 - Google Patents
The present disclosure relates to systems, methods, and non-transitory computer-readable
media that estimate the overlap between sets of data samples. In particular, in one or more …