Rpair: rescaling RePair with rsync
Data compression is a powerful tool for managing massive but repetitive datasets, especially
schemes such as grammar-based compression that support computation over the data …
Theoretical Analysis of Byte-Pair Encoding
L Kozma, J Voderholzer - arXiv preprint arXiv:2411.08671, 2024 - arxiv.org
Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in
grammar-based text compression. It is employed in a variety of language processing tasks …
Practical random access to SLP-compressed texts
Grammar-based compression is a popular and powerful approach to compressing repetitive
texts but until recently its relatively poor time-space trade-offs during real-life construction …
Fast and space-efficient construction of AVL grammars from the LZ77 parsing
Grammar compression is, next to Lempel–Ziv (LZ77) and run-length Burrows–Wheeler
transform (RLBWT), one of the most flexible approaches to representing and processing …
Grammar boosting: A new technique for proving lower bounds for computation over compressed data
R De, D Kempa - Proceedings of the 2024 Annual ACM-SIAM …, 2024 - SIAM
Computation over compressed data is a new paradigm in the design of algorithms and data
structures that can reduce space usage and speed up computation by orders of magnitude …
Grammar compression by induced suffix sorting
A grammar compression algorithm, called GCIS, is introduced in this work. GCIS is based on
the induced suffix sorting algorithm SAIS, presented by Nong et al. in 2009. The proposed …
The smallest grammar problem revisited
H Bannai, M Hirayama, D Hucke… - IEEE Transactions …, 2020 - ieeexplore.ieee.org
In a seminal paper, Charikar et al. derive upper and lower bounds on the approximation
ratios for several grammar-based compressors, but in all cases there is a gap between the …
R-enum: Enumeration of characteristic substrings in BWT-runs bounded space
T Nishimoto, Y Tabei - arXiv preprint arXiv:2004.01493, 2020 - arxiv.org
Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and
minimal absent words) in a given string has been an important research topic because there …
A new algorithm for compression of partially commutative alphabets
In this paper, we will address the problem of how to compress sequences defined in partially
commutative alphabets. The use of partial order between symbols may result in far more …
Constructing the CDAWG CFG using LCP-intervals
It is known that a context-free grammar (CFG) that produces a single string can be derived
from the compact directed acyclic word graph (CDAWG) for the same string. In this work, we …