ProbMinHash–a class of locality-sensitive hash algorithms for the (probability) Jaccard similarity
O Ertl - IEEE Transactions on Knowledge and Data …, 2020 - ieeexplore.ieee.org
The probability Jaccard similarity was recently proposed as a natural generalization of the
Jaccard similarity to measure the proximity of sets whose elements are associated with …
Bidirectionally densifying LSH sketches with empty bins
As an efficient tool for approximate similarity computation and search, Locality Sensitive
Hashing (LSH) has been widely used in many research areas including databases, data …
SetSketch: Filling the gap between MinHash and HyperLogLog
O Ertl - arXiv preprint arXiv:2101.00314, 2021 - arxiv.org
MinHash and HyperLogLog are sketching algorithms that have become indispensable for
set summaries in big data applications. While HyperLogLog allows counting different …
A Compact and Accurate Sketch for Estimating a Large Range of Set Difference Cardinalities
Computing set difference cardinalities is a critical task in database optimization, network
management, and anomaly detection. Due to the limited computational and memory …
Streaming algorithms for estimating high set similarities in loglog space
Estimating set similarity and detecting highly similar sets are fundamental problems in areas
such as databases and machine learning. MinHash is a well-known technique for …
BinDash 2.0: new MinHash scheme allows ultra-fast and accurate genome search and comparisons
Motivation: Comparing large number of genomes in term of their genomic distance is
becoming more and more challenging because there is an increasing number of microbial …
Toward optimal fingerprint indexing for large scale genomics
Motivation To keep up with the scale of genomic databases, several methods rely on local
sensitive hashing methods to efficiently find potential matches within large genome …
Generating overlap estimations between high-volume digital data sets based on multiple sketch vector similarity estimators
A Rao, T Mai, M Kapilevich - US Patent 11,449,523, 2022 - Google Patents
The present disclosure relates to systems, methods, and non-transitory computer-readable
media that estimate the overlap between sets of data samples. In particular, in one or more …
Generating overlap estimations between high-volume digital data sets based on multiple sketch vector similarity estimators
A Rao, T Mai, M Kapilevich - US Patent 11,720,592, 2023 - Google Patents
The present disclosure relates to systems, methods, and non-transitory computer-readable
media that estimate the overlap between sets of data samples. In particular, in one or more …