Handling Massive N-Gram Datasets Efficiently

GE Pibiri, R Venturini - ACM Transactions on Information Systems (TOIS), 2019 - dl.acm.org
Two fundamental problems concern the handling of large n-gram language models:
indexing, that is, compressing the n-grams and associated satellite values without …

Show some love to your n-grams: A bit of progress and stronger n-gram language modeling baselines

E Shareghi, D Gerz, I Vulic - 2019 - repository.cam.ac.uk
In recent years neural language models (LMs) have set state-of-the-art performance for
several benchmarking datasets. While the reasons for their success and their computational …

Automatic understanding of unwritten languages

O Adams - 2017 - minerva-access.unimelb.edu.au
Many of the world's languages are falling out of use without a written record and with minimal
linguistic documentation. Language documentation is a slow process and there are an …

Koala: An index for quantifying overlaps with pre-training corpora

TT Vu, X He, G Haffari, E Shareghi - arXiv preprint arXiv:2303.14770, 2023 - arxiv.org
In very recent years more attention has been placed on probing the role of pre-training data
in Large Language Models (LLMs) downstream behaviour. Despite the importance, there is …

A framework for space-efficient variable-order Markov models

F Cunial, J Alanko, D Belazzougui - Bioinformatics, 2019 - academic.oup.com
Motivation: Markov models with contexts of variable length are widely used in bioinformatics
for representing sets of sequences with similar biological properties. When models contain …

Compressed nonparametric language modelling

E Shareghi, G Haffari, T Cohn - International Joint Conference …, 2017 - research.monash.edu
Hierarchical Pitman-Yor Process priors are compelling for learning language
models, outperforming point-estimate based methods. However, these models remain …

Succinct data structures for NLP-at-scale

M Petri, T Cohn - Proceedings of COLING 2016, the 26th …, 2016 - aclanthology.org
Succinct data structures involve the use of novel data structures, compression technologies,
and other mechanisms to allow data to be stored in extremely small memory or disk …

Space-Efficient Algorithms for Strings and Prefix-Sortable Graphs.

J Alanko - 2020 - helda.helsinki.fi
Space-efficient data structures are an active field of research that has found many
applications in combinatorial pattern matching and bioinformatics. The idea is to build data …

Space and Time-Efficient Data Structures for Massive Datasets

GE Pibiri - 2019 - tesidottorato.depositolegale.it
This thesis concerns the design of compressed data structures for the efficient storage of
massive datasets of integer sequences and short strings. The studied problems arise in …

GPU-accelerated k-mer counting

P Jylhä-Ollila - 2020 - core.ac.uk
A common task in bioinformatics algorithms is k-mer counting [MK11, KDD17]. Given a string
S, the problem is to count the frequency of each unique substring of length k in S. K-mer …