RLZ Blockwise Access Techniques
- RLZ blockwise access is a compression approach that partitions data into fixed-size blocks and encodes each block against a sampled dictionary, enabling efficient random access.
- It employs static metadata and RLZ phrase encoding to decode independent blocks, balancing compression ratio with fast retrieval.
- Empirical evaluations show that RLZ blockwise systems scale effectively for massive, repetitive corpora, optimizing both storage and access time.
RLZ blockwise access refers to techniques for enabling efficient random access in relative Lempel–Ziv (RLZ) compressed representations. The RLZ paradigm achieves high compression on large repetitive corpora by factorizing the input relative to a dictionary sampled from the corpus, and then encoding blocks, documents, or differentials as sequences of static-encoded factors. By partitioning the compressed data into blocks and storing dedicated metadata structures (such as offsets or phrase samples), RLZ permits fast retrieval of arbitrary segments or entries without decompressing the entire archive. This methodology is fundamental in scalable archiving, fast web retrieval, and succinct data structure construction.
1. Dictionary Construction and Block Partitioning
RLZ begins by extracting a global dictionary $D$ from the input corpus. A sampling period and a target dictionary size (much smaller than the corpus) are selected, and $D$ is formed by concatenating fixed-length samples taken from the corpus at equally spaced positions, so that the number of samples times the sample length equals the target dictionary size. The dictionary is typically a small fraction of the corpus (Petri et al., 2016, Hoobin et al., 2011). It is usually stored uncompressed in memory, so its overhead is the ratio of dictionary size to corpus size (e.g., a $256$ MB dictionary for a $100$ GB corpus yields roughly $0.25\%$ overhead).
The input corpus is partitioned into fixed-size blocks of $B$ bytes, each serving as an independent unit of factorization and random-access decoding. Partitioning granularity (and its trade-offs) varies with the data type; for web collections, a document can serve as the block (Hoobin et al., 2011).
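A minimal sketch of the sampling and partitioning steps described above. The function names, the `dict_size` and `sample_len` parameters, and the toy corpus are assumptions made for illustration, not the interface of any published implementation.

```python
def build_dictionary(corpus: bytes, dict_size: int, sample_len: int) -> bytes:
    """Concatenate equally spaced, fixed-length samples of the corpus."""
    if dict_size >= len(corpus):
        return bytes(corpus)                   # nothing to sample
    n_samples = max(1, dict_size // sample_len)
    period = len(corpus) // n_samples          # spacing between sample starts
    samples = [corpus[k * period : k * period + sample_len]
               for k in range(n_samples)]
    return b"".join(samples)


def partition(corpus: bytes, block_size: int) -> list[bytes]:
    """Split the corpus into fixed-size blocks (the last one may be shorter)."""
    return [corpus[i : i + block_size]
            for i in range(0, len(corpus), block_size)]


corpus = b"abcdefgh" * 8                       # 64 bytes of toy data
D = build_dictionary(corpus, dict_size=16, sample_len=4)
blocks = partition(corpus, block_size=24)
```

On this toy input the dictionary consists of four 4-byte samples taken 16 bytes apart, and the corpus splits into three blocks whose concatenation is the original data.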
2. RLZ Block Encoding and Static Metadata
Each block is parsed from left to right into RLZ phrases: at each position in the block, the algorithm finds the longest prefix of the remaining text that matches a substring of the dictionary $D$; if none exists, the byte is emitted as a literal. Maximal matches are encoded as (offset, length) pairs, giving the match position in $D$ and the match length. Indexing the substrings of $D$ (e.g., via a suffix array) permits efficient phrase lookup for each emitted factor (Hoobin et al., 2011).
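The greedy parse can be sketched as follows. This is a deliberately naive version that searches the dictionary with `bytes.find` instead of a suffix array, so it illustrates the output of the factorization rather than its efficient implementation; the `("copy", ...)` and `("lit", ...)` tuple shapes are assumptions chosen for readability.

```python
def factorize(block: bytes, D: bytes) -> list[tuple]:
    """Greedy left-to-right RLZ parse of `block` against dictionary `D`.

    Returns a list of ("copy", offset, length) and ("lit", byte) items.
    """
    factors = []
    i = 0
    while i < len(block):
        best_off, best_len = -1, 0
        # Grow the match one byte at a time while it still occurs in D.
        while i + best_len < len(block):
            off = D.find(block[i : i + best_len + 1])
            if off < 0:
                break
            best_off, best_len = off, best_len + 1
        if best_len == 0:
            factors.append(("lit", block[i]))   # byte absent from D
            i += 1
        else:
            factors.append(("copy", best_off, best_len))
            i += best_len
    return factors


parse = factorize(b"cdabxab", b"abcdabcd")
# parse == [("copy", 2, 4), ("lit", 120), ("copy", 0, 2)]
```

A production encoder would replace the inner loop with a suffix-array search over $D$, which finds the same longest match without rescanning the dictionary for every candidate length.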
Factors are encoded into two (or three) byte-aligned streams:
- Offsets: 32-bit unsigned (RLZ-UV) or packed to $\lceil \log_2 |D| \rceil$ bits (RLZ-PV).
- Lengths: Variable byte code (vbyte); 1 byte for small lengths, expanding as needed.
- Literals: Optionally stored in a separate stream or mixed with factor representations.
Per block, the stream(s) are concatenated, often prefixed with a factor count, forming the on-disk layout. Blockwise RLZ thus enables each block to be decompressed independently given the dictionary $D$ and the block metadata.
An index array $I$ stores the on-disk offset of each compressed block, supporting $O(1)$ lookup of block $i$.
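As an illustration of the layout just described, the sketch below encodes a block in an RLZ-UV-like style (32-bit offsets, vbyte lengths) and builds the block index. The exact byte layout is an assumption made for this example: a 32-bit factor count, then the offset stream, then the length stream, with a literal represented as a factor of length 0 whose offset field holds the byte value. The vbyte continuation-bit convention also varies between implementations.

```python
import struct


def vbyte_encode(n: int) -> bytes:
    """7 payload bits per byte; the high bit is set on all but the last byte."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_block(factors) -> bytes:
    """Lay out one block as: factor count, offset stream, length stream.

    `factors` is a list of (offset, length) pairs; (byte, 0) marks a literal
    (a convention assumed for this sketch).
    """
    offsets = b"".join(struct.pack("<I", off) for off, _ in factors)
    lengths = b"".join(vbyte_encode(length) for _, length in factors)
    return struct.pack("<I", len(factors)) + offsets + lengths


def build_index(encoded_blocks) -> list[int]:
    """I[i] = start of block i; a final sentinel marks the end of the file."""
    index, pos = [], 0
    for blob in encoded_blocks:
        index.append(pos)
        pos += len(blob)
    index.append(pos)
    return index


blobs = [encode_block([(2, 4), (120, 0), (0, 2)]), encode_block([(0, 300)])]
I = build_index(blobs)
```

The sentinel entry lets a reader compute the compressed size of block `i` as `I[i + 1] - I[i]` without a special case for the last block.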
3. Blockwise Random-Access Decoding Algorithms
For direct extraction of the byte range $[a, b]$, one computes the block indices $i_0 = \lfloor a / B \rfloor$ and $i_1 = \lfloor b / B \rfloor$. Each affected block $i$ is decoded in isolation via the following operations:
- Use $I[i]$ to locate and read the compressed block bytes.
- Parse the block header and the factor/literal streams.
- Sequentially reconstruct the block in a buffer, applying:
  - For a factor (offset, length): copy the referenced substring of $D$ into the output buffer.
  - For a literal: emit the byte directly.
- Extract the relevant segment from the output buffer for edge blocks; intermediate blocks are appended whole.
This procedure is captured in the following pseudocode:
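```python
def decode_range(a, b):
    """Return bytes a..b (inclusive) of the original text.

    Assumes B (block size), I (block offset index with one sentinel entry past
    the last block), D (dictionary), disk_read and decode_block are in scope.
    """
    i0, i1 = a // B, b // B                  # first and last block touched
    out_buf = bytearray()
    for i in range(i0, i1 + 1):
        pos, next_pos = I[i], I[i + 1]       # compressed extent of block i
        comp_data = disk_read(pos, next_pos - pos)
        block_buf = decode_block(comp_data, D)
        start = (a % B) if i == i0 else 0
        end = (b % B) if i == i1 else len(block_buf) - 1
        out_buf += block_buf[start:end + 1]
    return out_buf
```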
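The `decode_block` helper is left undefined in the pseudocode. One possible sketch is shown below, under an assumed block layout that the source does not specify: a little-endian 32-bit factor count, followed by that many 32-bit offsets, followed by vbyte-coded lengths, where a length of 0 marks a literal whose byte value is stored in the offset field.

```python
import struct


def decode_block(comp: bytes, D: bytes) -> bytearray:
    """Rebuild one block from: u32 factor count, u32 offsets, vbyte lengths."""
    (count,) = struct.unpack_from("<I", comp, 0)
    offsets = struct.unpack_from(f"<{count}I", comp, 4)
    pos = 4 + 4 * count                      # start of the length stream
    out = bytearray()
    for off in offsets:
        # Decode one vbyte length (7 bits per byte, high bit = more follows).
        length, shift = 0, 0
        while True:
            byte = comp[pos]
            pos += 1
            length |= (byte & 0x7F) << shift
            shift += 7
            if byte < 0x80:
                break
        if length == 0:
            out.append(off)                  # literal: offset field is the byte
        else:
            out += D[off : off + length]     # copy a substring of the dictionary
    return out


# Hand-built block: factors (2, 4), literal 'x' (120), (0, 2)
comp = struct.pack("<I3I", 3, 2, 120, 0) + bytes([4, 0, 2])
block = decode_block(comp, b"abcdabcd")
```

Because every copy reads only from $D$ and never from previously decoded output, the decoder needs no state from neighbouring blocks, which is what makes per-block random access possible.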
In the context of differential suffix array blockwise access, RLZ parsing of (the differential array of a suffix array ) is stored as sequences of RLZ phrases, and blocks are sampled at every phrases. Efficient predecessor/successor data structures enable identifying the block containing any index , and block-local decoding reconstructs the relevant entries. A sample table stores block boundary positions, yielding storage overhead for total RLZ phrases (Dinklage et al., 23 Jul 2025).
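The sampling idea can be illustrated with a simplified sketch in which the differential array is kept uncompressed and only the boundary sampling and predecessor lookup are shown. In the scheme described above the differences would themselves be stored as RLZ phrases, and `boundaries` would hold the start position of every sampled phrase; here it is an arbitrary sorted list of positions supplied by the caller. All names are assumptions made for illustration.

```python
from bisect import bisect_right


def build(sa, boundaries):
    """Differential array plus absolute SA values sampled at `boundaries`.

    `boundaries` must be sorted and start at 0.
    """
    delta = [sa[0]] + [sa[i] - sa[i - 1] for i in range(1, len(sa))]
    sample_val = [sa[p] for p in boundaries]
    return delta, sample_val


def access(i, delta, boundaries, sample_val):
    """Recover SA[i] from the nearest preceding sample and local differences."""
    k = bisect_right(boundaries, i) - 1      # predecessor query
    value = sample_val[k]
    for j in range(boundaries[k] + 1, i + 1):
        value += delta[j]                    # block-local prefix sum
    return value


sa = [6, 5, 3, 1, 0, 4, 2]                   # suffix array of "banana$"
boundaries = [0, 3, 5]
delta, sample_val = build(sa, boundaries)
recovered = [access(i, delta, boundaries, sample_val) for i in range(len(sa))]
```

The work per query is bounded by the distance to the preceding sample, which is why a denser sampling trades extra space for faster access.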
4. Theoretical Time–Space Trade-offs
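The access time to retrieve a random block or entry comprises seek latency ($T_{\text{seek}}$), transfer cost (which depends on the block size and compression ratio), and decoding time. The total time for random block access is:
$$T_{\text{access}} = T_{\text{seek}} + \frac{\rho B}{\mathrm{BW}} + m \cdot c_{\text{dec}}$$
where:
- $\rho$ is the compression ratio (compressed block size / $B$),
- $\mathrm{BW}$ is the transfer bandwidth of the storage device,
- $m$ is the number of factors per block (roughly $B$ divided by the average factor length, which increases with dictionary size),
- $c_{\text{dec}}$ is the per-factor decode cost.
Larger dictionaries both improve compression and reduce the number of factors $m$, reducing both $\rho$ and decoding cost.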
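The cost model can be explored numerically with a short sketch. The parameter names and all device figures below are hypothetical values chosen only to show the shape of the model; they are not measurements from the cited papers.

```python
def access_time(seek_s, bandwidth_bytes_per_s, block_bytes, ratio,
                n_factors, per_factor_s):
    """Seek latency + transfer of the compressed block + per-factor decoding."""
    transfer_s = ratio * block_bytes / bandwidth_bytes_per_s
    decode_s = n_factors * per_factor_s
    return seek_s + transfer_s + decode_s


# Hypothetical workload: 64 KiB blocks compressed to 10%, 2000 factors/block.
common = dict(block_bytes=64 * 1024, ratio=0.10,
              n_factors=2000, per_factor_s=20e-9)
decode_s = common["n_factors"] * common["per_factor_s"]

# Hypothetical devices (seek latency in seconds, bandwidth in bytes/second).
hdd = access_time(seek_s=10e-3, bandwidth_bytes_per_s=100e6, **common)
ssd = access_time(seek_s=0.1e-3, bandwidth_bytes_per_s=500e6, **common)

hdd_decode_share = decode_s / hdd
ssd_decode_share = decode_s / ssd
```

Under these assumed figures, decoding accounts for a much larger share of the total on the low-latency device than on the high-latency one, which is the qualitative effect the model is meant to capture.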
For RLZ-compressed suffix arrays, blockwise access time is
where is the phrase sampling step (block size in phrases) and bounds the longest phrase length. The requisite extra storage per sampled block is words. The overall space is:
(Dinklage et al., 23 Jul 2025).
5. Comparative Evaluation and Empirical Performance
Extensive evaluations have demonstrated that RLZ blockwise access achieves both high compression and rapid random-access. For a 426 GB web crawl (GOV2), a dictionary as small as ($0.5$ GB) achieved compression to $10$– of the original, with random-access throughput of documents/sec and sequential access of docs/sec using RLZ-UV encoding. Adaptive GZIP-block systems with 1 MB blocks achieved only $600$ docs/sec random-access, while LZMA-blocking provided similar compression but much slower access ($5$ docs/sec) (Hoobin et al., 2011).
For archive block sizes in , the Pareto-optimal point in practice is ; larger dictionaries () provide diminishing marginal returns beyond of corpus size for web-scale data. For HDDs ( ms, MB/s), compression is the dominant driver of access speed, while SSDs ( ms, MB/s) benefit more from optimization of block and decoding overheads (Petri et al., 2016).
A table of RLZ blockwise access in compressed suffix arrays illustrates the trade-off:
| Input | RLZ phrases | Bytes/int (RLZ) | μs/access (RLZ) |
|---|---|---|---|
| einstein | 2.6 M | 3.1 | 0.50 |
| sars2 | 8.3 M | 2.9 | 0.39 |
| dewiki | 12.5 M | 3.2 | 0.24 |
| chr19 | 0.82 M | 3.5 | 0.25 |
| english | 68.0 M | 3.2 | 0.23 |
RLZ-block occupies roughly $3$ bytes/int ($2.9$–$3.5$ in the table above) and achieves sub-microsecond random access. LZ-End block compression may reduce space to $2.5$–$3$ B/int, but with an access latency of $0.5$–$1.0$ μs (Dinklage et al., 23 Jul 2025).
6. Design Principles and Application Contexts
Design guidelines for RLZ blockwise archives depend on storage medium, access pattern, and memory availability:
- HDD archives: Prioritize compression (large , e.g., $256$–$1024$ MB) and select block size KB; RLZ-ZZ yields optimal space-speed. For random small reads, maximizing compression rate by increasing dictionary size is optimal.
- SSD archives: Reduce dictionary size ( MB), keep –$64$ KB, and use RLZ-PV for highest throughput, as decoding time rather than transfer is often the bottleneck.
- Memory constraints: Reduce if necessary but restrict KB to prevent decoding cost from dominating.
- Mostly-sequential scans: Block size becomes less relevant; increasing (e.g., $256$–$512$ KB) may minimize index overhead.
For compressed suffix arrays, RLZ parsing of differentials with blockwise sampling (e.g., block size or $16$) offers a middle ground between compressed and uncompressed representations in both space and access time (Dinklage et al., 23 Jul 2025). Blockwise RLZ supports efficient locate queries and pattern enumeration at scale, especially on texts with high BWT-run redundancy, such as genomic or large web collections.
7. Relationship to Alternatives and Trade-Offs
Blockwise RLZ should be contrasted with both block-oriented adaptive compressors (e.g., zlib, LZ4) and compressed suffix array representations (e.g., LZ-End).
- Adaptive block compressors: RLZ offers lower random-access latency than adaptive compressors together with comparable or better compression, largely because RLZ blocks are self-contained and immediately decodable (stateless).
- LZ-End/r-index: LZ-End may yield fewer phrases than RLZ on the same data () and hence provides superior compression rates, but at the expense of higher worst-case access time (unbounded phrase length). RLZ blockwise sampling ensures phrase length can be bounded and per-query scan is limited to (Dinklage et al., 23 Jul 2025).
- Uncompressed SA: Plain SA achieves access but with bits space; RLZ blockwise methods shrink space to bits (with for repetitive texts) and sub-microsecond access, balancing succinctness and performance.
Experimental results corroborate these trade-offs across large-scale text collections (Petri et al., 2016, Hoobin et al., 2011, Dinklage et al., 23 Jul 2025).
This suggests RLZ blockwise access architectures are particularly well suited for massive, highly repetitive datasets requiring a mix of sequential scan and rapid “true” random access, a property exploited in modern web indexers and genomics.