Data structures based on k-mers for querying large collections of sequencing data sets

C Marchet, C Boucher, SJ Puglisi, P Medvedev… - Genome …, 2021 - genome.cshlp.org
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the
European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached …

Data Structures to Represent a Set of k-long DNA Sequences

R Chikhi, J Holub, P Medvedev - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
The analysis of biological sequencing data has been one of the biggest applications of
string algorithms. The approaches used in many such applications are based on the …

Mash Screen: high-throughput sequence containment estimation for genome discovery

BD Ondov, GJ Starrett, A Sappington, A Kostic, S Koren… - Genome biology, 2019 - Springer
The MinHash algorithm has proven effective for rapidly estimating the resemblance of two
genomes or metagenomes. However, this method cannot reliably estimate the containment …

GraphAligner: rapid and versatile sequence-to-graph alignment

M Rautiainen, T Marschall - Genome biology, 2020 - Springer
Genome graphs can represent genetic variation and sequence uncertainty. Aligning
sequences to genome graphs is key to many applications, including error correction …

Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT

A Cracco, AI Tomescu - Genome Research, 2023 - genome.cshlp.org
Compacted de Bruijn graphs are one of the most fundamental data structures in
computational genomics. Colored compacted de Bruijn graphs are a variant built on a …

Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs

G Holley, P Melsted - Genome biology, 2020 - Springer
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based
assemblers reduce the complexity by compacting paths into single vertices, but this is …

A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events

M Jaillard, L Lima, M Tournoud, P Mahé… - PLoS …, 2018 - journals.plos.org
Genome-wide association study (GWAS) methods applied to bacterial genomes have
shown promising results for genetic marker discovery or detailed assessment of marker …

Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

J Khan, M Kokot, S Deorowicz, R Patro - Genome biology, 2022 - Springer
The de Bruijn graph is a key data structure in modern computational genomics, and
construction of its compacted variant resides upstream of many genomic analyses. As the …

A space and time-efficient index for the compacted colored de Bruijn graph

F Almodaresi, H Sarkar, A Srivastava, R Patro - Bioinformatics, 2018 - academic.oup.com
Motivation: Indexing reference sequences for search, both individual genomes and
collections of genomes, is an important building block for many sequence analysis tasks …

REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

C Marchet, Z Iqbal, D Gautheret, M Salson… - …, 2020 - academic.oup.com
Motivation: In this work we present REINDEER, a novel computational method that performs
indexing of sequences and records their abundances across a collection of datasets. To the …