One-pass diversified sampling with application to terabyte-scale genomic sequence streams
A popular approach to reduce the size of a massive dataset is to apply efficient online
sampling to the stream of data as it is read or generated. Online sampling routines are …
Fast Comparative Analysis of Merge Trees Using Locality Sensitive Hashing
Scalar field comparison is a fundamental task in scientific visualization. In topological data
analysis, we compare topological descriptors of scalar fields—such as persistence diagrams …
Unconventional application of k-means for distributed approximate similarity search
Similarity search based on a distance function in metric spaces is a fundamental problem for
many applications. Queries for similar objects lead to the well-known machine learning task …
Efficient Parallel Output-Sensitive Edit Distance
We study efficient parallel algorithms for output-sensitive edit distance, achieving
asymptotically better cost bounds than the standard Θ(nm) dynamic programming algorithm …
Authenticating q-Gram-Based Similarity Search Results for Outsourced String Databases
L Yang, H Ye, X Liu, Y Mao, J Zhang - Mathematics, 2023 - mdpi.com
Approximate string searches have been widely applied in many fields, such as
bioinformatics, text retrieval, search engines, and location-based services (LBS). However …
Improved Space-Efficient Approximate Nearest Neighbor Search Using Function Inversion
S McCauley - arXiv preprint arXiv:2407.02468, 2024 - arxiv.org
Approximate nearest neighbor search (ANN) data structures have widespread applications
in machine learning, computational biology, and text processing. The goal of ANN is to …
Toward Efficient Similarity Search under Edit Distance on Hybrid Architectures
Edit distance is the most widely used method to quantify similarity between two strings. We
investigate the problem of similarity search under edit distance. Given a collection of …
Index structures for fast similarity search for symbol strings
DA Rachkovskij - Cybernetics and Systems Analysis, 2019 - Springer
This article surveys index structures for fast similarity search for objects represented by
symbol strings. Index structures both for exact and approximate searches by edit distance …
Diversified RACE sampling on data streams applied to metagenomic sequence analysis
The rise of whole-genome shotgun sequencing (WGS) has enabled numerous
breakthroughs in large-scale comparative genomics research. However, the size of genomic …
Locality-sensitive bucketing functions for the edit distance
Background: Many bioinformatics applications involve bucketing a set of sequences where
each sequence is allowed to be assigned into multiple buckets. To achieve both high …