Sketching algorithms for genomic data analysis and querying in a secure enclave
Genome-wide association studies (GWAS), especially on rare diseases, may necessitate
exchange of sensitive genomic data between multiple institutions. Since genomic data …
FQSqueezer: k-mer-based compression of sequencing data
S Deorowicz - Scientific reports, 2020 - nature.com
The amount of data produced by modern sequencing instruments that needs to be stored is
huge. Therefore it is not surprising that a lot of work has been done in the field of specialized …
Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression
Motivation Advanced high-throughput sequencing technologies have produced massive
amount of reads data, and algorithms have been specially designed to contract the size of …
Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression
Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been
widely adopted for the de novo assembly of genomic short reads. This work studies another …
PgRC: pseudogenome-based read compressor
TM Kowalski, S Grabowski - Bioinformatics, 2020 - academic.oup.com
Motivation The amount of sequencing data from high-throughput sequencing technologies
grows at a pace exceeding the one predicted by Moore's law. One of the basic requirements …
FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model
D Lee, G Song - Bioinformatics, 2022 - academic.oup.com
Motivation Over the past decades, vast amounts of genome sequencing data have been
produced, requiring an enormous level of storage capacity. The time and resources needed …
A novel approach to T-cell receptor beta chain (TCRB) repertoire encoding using lossless string compression
Motivation T-cell receptor beta chain (TCRB) repertoires are crucial for understanding
immune responses. However, their high diversity and complexity present significant …
Transformation of FASTA files into feature vectors for unsupervised compression of short reads databases
T Tang, J Li - Journal of bioinformatics and computational biology, 2021 - World Scientific
FASTA data sets of short reads are usually generated in tens or hundreds for a biomedical
study. However, current compression of these data sets is carried out one-by-one without …
Tackling the challenges of FASTQ referential compression
The exponential growth of genomic data has recently motivated the development of
compression algorithms to tackle the storage capacity limitations in bioinformatics centers …
Genomic compression with read alignment at the decoder
We propose a new compression scheme for genomic data given as sequence fragments
called reads. The scheme uses a reference genome at the decoder side only, freeing the …