One-pass diversified sampling with application to terabyte-scale genomic sequence streams

B Coleman, B Geordie, L Chou… - International …, 2022 - proceedings.mlr.press
A popular approach to reduce the size of a massive dataset is to apply efficient online
sampling to the stream of data as it is read or generated. Online sampling routines are …

Fast Comparative Analysis of Merge Trees Using Locality Sensitive Hashing

W Lyu, R Sridharamurthy, JM Phillips… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Scalar field comparison is a fundamental task in scientific visualization. In topological data
analysis, we compare topological descriptors of scalar fields—such as persistence diagrams …

Unconventional application of k-means for distributed approximate similarity search

F Ortega, MJ Algar, IM de Diego, JM Moguerza - Information Sciences, 2023 - Elsevier
Similarity search based on a distance function in metric spaces is a fundamental problem for
many applications. Queries for similar objects lead to the well-known machine learning task …

Efficient Parallel Output-Sensitive Edit Distance

X Ding, X Dong, Y Gu, Y Liu, Y Sun - … of the 2024 ACM Workshop on …, 2024 - dl.acm.org
We study efficient parallel algorithms for output-sensitive edit distance, achieving
asymptotically better cost bounds than the standard Θ(nm) dynamic programming algorithm …

Authenticating q-Gram-Based Similarity Search Results for Outsourced String Databases

L Yang, H Ye, X Liu, Y Mao, J Zhang - Mathematics, 2023 - mdpi.com
Approximate string searches have been widely applied in many fields, such as
bioinformatics, text retrieval, search engines, and location-based services (LBS). However …

Improved Space-Efficient Approximate Nearest Neighbor Search Using Function Inversion

S McCauley - arXiv preprint arXiv:2407.02468, 2024 - arxiv.org
Approximate nearest neighbor search (ANN) data structures have widespread applications
in machine learning, computational biology, and text processing. The goal of ANN is to …

Toward Efficient Similarity Search under Edit Distance on Hybrid Architectures

M Khalid, MM Yousaf, MU Sadiq - Information, 2022 - mdpi.com
Edit distance is the most widely used method to quantify similarity between two strings. We
investigate the problem of similarity search under edit distance. Given a collection of …

Index structures for fast similarity search for symbol strings

DA Rachkovskij - Cybernetics and Systems Analysis, 2019 - Springer
This article surveys index structures for fast similarity search for objects represented by
symbol strings. Index structures both for exact and approximate searches by edit distance …

Diversified RACE sampling on data streams applied to metagenomic sequence analysis

B Coleman, B Geordie, L Chou, RAL Elworth… - bioRxiv, 2019 - biorxiv.org
The rise of whole-genome shotgun sequencing (WGS) has enabled numerous
breakthroughs in large-scale comparative genomics research. However, the size of genomic …

Locality-sensitive bucketing functions for the edit distance

K Chen, M Shao - Algorithms for Molecular Biology, 2023 - Springer
Background: Many bioinformatics applications involve bucketing a set of sequences where
each sequence is allowed to be assigned into multiple buckets. To achieve both high …