Handling Massive N-Gram Datasets Efficiently
Two fundamental problems concern the handling of large n-gram language models:
indexing, that is, compressing the n-grams and associated satellite values without …
Show some love to your n-grams: A bit of progress and stronger n-gram language modeling baselines
In recent years neural language models (LMs) have set state-of-the-art performance for
several benchmarking datasets. While the reasons for their success and their computational …
Automatic understanding of unwritten languages
O Adams - 2017 - minerva-access.unimelb.edu.au
Many of the world's languages are falling out of use without a written record and minimal
linguistic documentation. Language documentation is a slow process and there are an …
Koala: An index for quantifying overlaps with pre-training corpora
In very recent years more attention has been placed on probing the role of pre-training data
in Large Language Models (LLMs) downstream behaviour. Despite the importance, there is …
A framework for space-efficient variable-order Markov models
Motivation: Markov models with contexts of variable length are widely used in bioinformatics
for representing sets of sequences with similar biological properties. When models contain …
Compressed nonparametric language modelling
Hierarchical Pitman-Yor Process priors are compelling for learning language
models, outperforming point-estimate based methods. However, these models remain …
Succinct data structures for NLP-at-scale
Succinct data structures involve the use of novel data structures, compression technologies,
and other mechanisms to allow data to be stored in extremely small memory or disk …
Space-Efficient Algorithms for Strings and Prefix-Sortable Graphs.
J Alanko - 2020 - helda.helsinki.fi
Space-efficient data structures are an active field of research that has found many
applications in combinatorial pattern matching and bioinformatics. The idea is to build data …
Space and Time-Efficient Data Structures for Massive Datasets
GE Pibiri - 2019 - tesidottorato.depositolegale.it
This thesis concerns the design of compressed data structures for the efficient storage of
massive datasets of integer sequences and short strings. The studied problems arise in …
GPU-accelerated k-mer counting
P Jylhä-Ollila - 2020 - core.ac.uk
A common task in bioinformatics algorithms is k-mer counting [MK11, KDD17]. Given a string
S, the problem is to count the frequency of each unique substring of length k in S. K-mer …
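
The k-mer counting problem stated in the last entry (count the frequency of each unique substring of length k in a string S) can be illustrated with a minimal CPU sketch. This is only a naive baseline for the definition, not the GPU-accelerated method of the cited thesis; the function name `count_kmers` and the example string are hypothetical choices made for illustration.

```python
from collections import Counter


def count_kmers(s: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of s.

    Hypothetical helper for illustration: a naive sliding window,
    not the GPU approach described in the cited work.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k over s; if len(s) < k the range is
    # empty and the result is an empty Counter.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


# Example on a short, made-up DNA string.
counts = count_kmers("ACGTACGT", 3)
for kmer, freq in sorted(counts.items()):
    print(kmer, freq)
```

A hash-based counter like this keeps one entry per distinct k-mer, so its memory grows with the number of unique k-mers, which is the cost that compressed or accelerated approaches aim to reduce.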