RLZ Blockwise Access Techniques

Updated 24 March 2026

RLZ blockwise access is a compression method that partitions data into fixed-size blocks using a sampled dictionary for efficient random access.
It employs static metadata and RLZ phrase encoding to decode independent blocks, balancing compression ratio with fast retrieval.
Empirical evaluations show that RLZ blockwise systems scale effectively for massive, repetitive corpora, optimizing both storage and access time.

RLZ blockwise access refers to techniques for enabling efficient random access in relative Lempel–Ziv (RLZ) compressed representations. The RLZ paradigm achieves high compression on large repetitive corpora by factorizing the input relative to a dictionary sampled from the corpus, and then encoding blocks, documents, or differentials as sequences of static-encoded factors. By partitioning the compressed data into blocks and storing dedicated metadata structures (such as offsets or phrase samples), RLZ permits fast retrieval of arbitrary segments or entries without decompressing the entire archive. This methodology is fundamental in scalable archiving, fast web retrieval, and succinct data structure construction.

1. Dictionary Construction and Block Partitioning

RLZ begins by extracting a global dictionary from the input corpus $C$ of size $|C| = N$ bytes. A sample length $s$ and a target dictionary size $m$ (with $m \ll N$ ) are selected. The dictionary $D$ is formed by taking $S = \lfloor m/s \rfloor$ fixed-length samples of $s$ bytes each from $C$ at equally spaced positions. For example, let $i_j = \lfloor (j-1)\cdot N/S \rfloor + 1$ for $j = 1, \ldots, S$ . Then

$D = C[i_1..i_1+s-1] \ \Vert\ C[i_2..i_2+s-1] \ \Vert\ \ldots \ \Vert\ C[i_S..i_S+s-1]$

with $|D| = m = s\cdot S \approx \alpha N$ , where typically $\alpha \in [0.001, 0.01]$ (i.e., $0.1\%$ – $1\%$ of the corpus) (Petri et al., 2016, Hoobin et al., 2011). The dictionary is typically stored uncompressed in memory, with a relative overhead of $|D|/N$ (e.g., a $256$ MB dictionary for a $100$ GB corpus yields about $0.25\%$ overhead).
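The sampling step can be sketched in a few lines of Python. This is a minimal illustration rather than the construction used in the cited papers: the function name `build_dictionary` and its parameters are hypothetical, positions are zero-based, and the last sample is clamped so that it never runs past the end of the corpus.

```python
def build_dictionary(corpus: bytes, s: int, m: int) -> bytes:
    """Concatenate floor(m / s) samples of s bytes taken at evenly spaced positions.

    Hypothetical helper: s is the sample length, m the target dictionary size.
    """
    n = len(corpus)
    num_samples = m // s
    if num_samples == 0 or n < s:
        return b""
    stride = n / num_samples              # distance between consecutive sample starts
    parts = []
    for j in range(num_samples):
        start = min(int(j * stride), n - s)   # clamp so the sample fits in the corpus
        parts.append(corpus[start:start + s])
    return b"".join(parts)


# Toy usage: five 4-byte samples from a 100-byte corpus.
toy_corpus = bytes(range(100))
toy_dict = build_dictionary(toy_corpus, s=4, m=20)
```

With real data, `s` and `m` would be chosen so that `m` is roughly 0.1% to 1% of the corpus size, as described above.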

The input corpus is partitioned into fixed-size blocks of $B$ bytes ( $B\in[16\,\mathrm{KB}, 256\,\mathrm{KB}]$ ), each serving as an independent unit of factorization and random-access decoding. Partitioning granularity (and its trade-offs) varies with the data type; for web collections, a document can serve as the block (Hoobin et al., 2011).

2. RLZ Block Encoding and Static Metadata

Each block is parsed from left to right into RLZ phrases: at position $p$ in the block, the algorithm finds the longest prefix of the remaining text that matches a substring of $D$ ; if no match exists, the byte is emitted as a literal. Maximal matches are encoded as pairs $(o, \ell)$ , with $o$ as the offset in $D$ and $\ell$ as the match length. Efficient substring indexing in $D$ (e.g., via a suffix array) permits $O(\ell + \log m)$ phrase lookup per emitted factor (Hoobin et al., 2011).
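The greedy parse can be illustrated with a deliberately naive sketch. It uses `bytes.find` in place of a suffix array, so it does not achieve the $O(\ell + \log m)$ lookup mentioned above; the function names and the tuple representation of factors are assumptions made for this example only.

```python
def longest_match(dictionary: bytes, block: bytes, p: int):
    """Return (offset, length) of the longest prefix of block[p:] found in dictionary."""
    best_off, best_len = -1, 0
    length = 1
    while p + length <= len(block):
        # A longer prefix cannot occur before the leftmost occurrence of a
        # shorter one, so the search can resume from the previous hit.
        off = dictionary.find(block[p:p + length], max(best_off, 0))
        if off == -1:
            break
        best_off, best_len = off, length
        length += 1
    return best_off, best_len


def factorize_block(block: bytes, dictionary: bytes):
    """Greedy left-to-right RLZ parse into ("copy", o, l) and ("lit", byte) tuples."""
    factors = []
    p = 0
    while p < len(block):
        off, length = longest_match(dictionary, block, p)
        if length == 0:
            factors.append(("lit", block[p]))   # byte absent from the dictionary
            p += 1
        else:
            factors.append(("copy", off, length))
            p += length
    return factors


toy_factors = factorize_block(b"cadabraXabra", b"abracadabra")
```

A production parser would replace `longest_match` with a suffix-array or similar index over $D$; the control flow of `factorize_block` would stay the same.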

Factors are encoded into two (or three) byte-aligned streams:

Offsets: 32-bit unsigned (RLZ-UV) or packed to $\lceil\log_2 m\rceil$ bits, where $m = |D|$ (RLZ-PV).
Lengths: Variable byte code (vbyte); 1 byte for small lengths, expanding as needed.
Literals: Optionally stored in a separate stream or mixed with factor representations.

Per block, the streams are concatenated, often prefixed with a factor count, to form the on-disk layout. Blockwise RLZ thus enables each block to be decompressed independently given $D$ and the block metadata.
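A possible serialization in the spirit of the 32-bit-offset, vbyte-length variant is sketched below. Several details are assumptions, not taken from the source: the little-endian byte order, the 4-byte factor-count header, the particular vbyte convention (7 data bits per byte, high bit set on all but the last byte), and the representation of a literal byte `c` as the pair `(c, 0)`.

```python
import struct


def vbyte_encode(n: int) -> bytes:
    """Variable-byte code: 7 data bits per byte, high bit marks a continuation."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_block(factors) -> bytes:
    """Serialize (offset, length) pairs as: count | 32-bit offsets | vbyte lengths.

    Assumed convention: a literal byte c is passed in as the pair (c, 0).
    """
    header = struct.pack("<I", len(factors))
    offsets = b"".join(struct.pack("<I", o) for o, _ in factors)
    lengths = b"".join(vbyte_encode(l) for _, l in factors)
    return header + offsets + lengths


toy_block = encode_block([(4, 7), (88, 0), (0, 4)])
```

Keeping offsets and lengths in separate streams groups values with similar statistics together, which is what makes a second-stage coder on either stream worthwhile.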

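An index array $I[0..k]$ , where $k = \lceil N/B \rceil$ is the number of blocks, stores the on-disk offset of each compressed block, with $I[k]$ marking the end of the last block; this supports $O(1)$ access to block $i$ .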
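Such an index is simply a running sum of compressed block sizes. The sketch below assumes one extra sentinel entry marking the end of the last block, so that the extent of block $i$ is given by two adjacent entries; `build_index` and `block_extent` are hypothetical names.

```python
from itertools import accumulate


def build_index(compressed_sizes):
    """Prefix sums of block sizes: index[i] is the on-disk start of block i."""
    return [0] + list(accumulate(compressed_sizes))


def block_extent(index, i):
    """Return (start, length) of compressed block i in O(1)."""
    return index[i], index[i + 1] - index[i]


toy_index = build_index([10, 7, 12])
```

For an archive with 64 KB blocks, the index holds one integer per block, which is negligible next to the dictionary.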

3. Blockwise Random-Access Decoding Algorithms

For direct extraction of range $C[a..b]$ , one computes the block indices $i_0 = \lfloor a/B \rfloor$ and $i_1 = \lfloor b/B \rfloor$ . Each affected block is decoded in isolation via the following operations:

Use $I[i]$ to locate and read the compressed block bytes.
Parse the block header to locate the factor and literal streams.
Sequentially reconstruct the block in a buffer, applying:
- For factor $(o, \ell)$ : copy $D[o..o+\ell-1]$ into the output buffer.
- For literal: emit the byte directly.
Extract the relevant segment from the output buffer for edge blocks; intermediate blocks are concatenable.

This procedure is captured in the following pseudocode:

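def decode_range(a, b):
    # B (block size), I (block index), D (dictionary), disk_read and
    # decode_block are assumed to be provided by the surrounding archive code.
    i0, i1 = a // B, b // B                  # first and last block touched by C[a..b]
    out_buf = bytearray()
    for i in range(i0, i1 + 1):
        pos, next_pos = I[i], I[i + 1]       # on-disk extent of compressed block i
        comp_data = disk_read(pos, next_pos - pos)
        block_buf = decode_block(comp_data, D)
        start = (a % B) if i == i0 else 0    # trim the first block on the left
        end = (b % B) if i == i1 else len(block_buf) - 1   # trim the last block on the right
        out_buf += block_buf[start:end + 1]
    return out_buf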

(Petri et al., 2016)
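The `decode_block` step used in the pseudocode could look as follows. The on-disk layout is an assumption made for this sketch, not the format of the cited systems: a 4-byte little-endian factor count, then one 32-bit offset per factor, then one vbyte length per factor, with a length of zero meaning that the offset field holds a literal byte.

```python
import struct


def decode_block(comp: bytes, dictionary: bytes) -> bytes:
    """Rebuild one block from its compressed bytes and the shared dictionary."""
    (count,) = struct.unpack_from("<I", comp, 0)
    offsets = struct.unpack_from(f"<{count}I", comp, 4)
    pos = 4 + 4 * count                      # start of the vbyte length stream
    out = bytearray()
    for off in offsets:
        # Decode one vbyte length: 7 data bits per byte, high bit = continuation.
        length, shift = 0, 0
        while True:
            byte = comp[pos]
            pos += 1
            length |= (byte & 0x7F) << shift
            shift += 7
            if byte < 0x80:
                break
        if length == 0:
            out.append(off)                  # literal byte stored in the offset field
        else:
            out += dictionary[off:off + length]   # copy factor from the dictionary
    return bytes(out)


toy_comp = struct.pack("<I", 3) + struct.pack("<3I", 4, 88, 0) + bytes([7, 0, 4])
toy_plain = decode_block(toy_comp, b"abracadabra")
```

Decoding touches only the compressed block and the dictionary, which is why blocks can be fetched and decoded in any order.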

In the context of blockwise access to differential suffix arrays, the RLZ parsing of $A^d$ (the differential array of a suffix array $A$ ) is stored as a sequence of RLZ phrases, and a block boundary is sampled every $a$ phrases. Efficient predecessor/successor data structures identify the block containing any index $i$ , and block-local decoding reconstructs the relevant entries. A sample table $SCP$ stores block boundary positions, yielding $O(z_r/a)$ storage overhead for $z_r$ total RLZ phrases (Dinklage et al., 2025).
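The idea of sampling every $a$ phrases can be sketched as below. This is a simplified illustration and not the data structure of the cited work: the class name, the phrase representation `(src, length)` pointing into a reference list of differences, and the stored per-block base value are all assumptions. The sketch also re-adds every differential it scans, so its per-query cost is higher than the bound quoted in the source.

```python
from bisect import bisect_right


class SampledDiffArray:
    """Access A[i] from an RLZ-parsed differential array with block sampling."""

    def __init__(self, reference, phrases, a):
        self.reference, self.phrases, self.a = reference, phrases, a
        self.block_start = []   # sample table: first index covered by each block
        self.block_base = []    # value of A just before each block (0 for the first)
        pos, value = 0, 0
        for k, (src, length) in enumerate(phrases):
            if k % a == 0:      # a new block begins every a phrases
                self.block_start.append(pos)
                self.block_base.append(value)
            value += sum(reference[src:src + length])
            pos += length
        self.n = pos

    def access(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        b = bisect_right(self.block_start, i) - 1      # predecessor query on samples
        pos, value = self.block_start[b], self.block_base[b]
        for src, length in self.phrases[b * self.a:(b + 1) * self.a]:
            take = min(length, i - pos + 1)            # differentials up to index i
            value += sum(self.reference[src:src + take])
            pos += take
            if pos > i:
                return value
        raise IndexError(i)


# A = [3, 5, 7, 9, 4, 6, 8, 1], so A^d = [3, 2, 2, 2, -5, 2, 2, -7].
toy_sa = SampledDiffArray([2, 2, 2, -5, 3, -7], [(4, 1), (0, 4), (0, 2), (5, 1)], a=2)
toy_values = [toy_sa.access(i) for i in range(8)]
```

Smaller values of `a` shorten the scan inside `access` at the cost of more sample entries, which is the trade-off the sample table controls.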

4. Theoretical Time–Space Trade-offs

The access time to retrieve a random block or entry comprises seek latency ( $T_{\mathrm{seek}}$ ), transfer cost (which depends on the size and compression ratio), and decoding time. The total time for random block access is:

$T_{\mathrm{access}} = T_\mathrm{seek} + \frac{cB}{R} + F\cdot t_\mathrm{decode}$

where:

$c$ is the compression ratio (compressed block size / $B$ ),
$R$ is the transfer rate of the storage device,
$F$ is the number of factors per block ( $F \approx B/L_{\mathrm{avg}}$ with average factor length $L_{\mathrm{avg}}$ increasing with dictionary size),
$t_{\mathrm{decode}}$ is the per-factor decode cost.

Larger dictionaries both improve compression and reduce the number of factors $F$ , reducing both $c$ and decoding cost.
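The cost model is easy to evaluate numerically. In the sketch below the seek times and transfer rates are the HDD and SSD figures quoted in Section 5 of this article, while the compression ratio, average factor length, and per-factor decode cost are made-up illustrative values, and 1 MB is taken to be $10^6$ bytes.

```python
def access_time_ms(t_seek_ms, rate_mb_s, block_bytes, c, avg_factor_len, t_decode_ns):
    """T_access = T_seek + c*B/R + F*t_decode, returned in milliseconds."""
    transfer_ms = (c * block_bytes) / (rate_mb_s * 1e6) * 1e3
    num_factors = block_bytes / avg_factor_len        # F ~ B / L_avg
    decode_ms = num_factors * t_decode_ns * 1e-6
    return t_seek_ms + transfer_ms + decode_ms


# Hypothetical block parameters: 64 KB blocks, c = 0.1, L_avg = 32 bytes, 5 ns per factor.
block_params = dict(block_bytes=64 * 1024, c=0.1, avg_factor_len=32, t_decode_ns=5)
hdd_ms = access_time_ms(t_seek_ms=8.5, rate_mb_s=150, **block_params)
ssd_ms = access_time_ms(t_seek_ms=0.2, rate_mb_s=1000, **block_params)
```

Varying `c` and `avg_factor_len` together, as a larger dictionary would, shows how the transfer and decode terms shrink while the seek term stays fixed.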

For RLZ-compressed suffix arrays, blockwise access time is

$T_{\mathrm{acc}}(i) = O(\log\log(n/z_r) + a + h)$

where $a$ is the phrase sampling step (block size in phrases) and $h$ bounds the longest phrase length. The samples require $O(z_r/a)$ extra words of storage in total. The overall space in bits, with $|R|$ denoting the length of the RLZ reference (not the transfer rate $R$ above), is:

$|R| \log n + z_r \log (n/z_r) + (z_r/a) \log n$

(Dinklage et al., 2025).
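The three terms of the space expression can be evaluated directly to see how the sampling step affects the total. The function below is a plain transcription of the formula with all constants taken as one; its name and parameter names are hypothetical, and the input values are toy numbers rather than measurements.

```python
from math import log2


def rlz_sa_space_bits(n, z_r, ref_len, a):
    """|R| log n + z_r log(n / z_r) + (z_r / a) log n, in bits (constants ignored)."""
    reference_bits = ref_len * log2(n)
    phrase_bits = z_r * log2(n / z_r)
    sample_bits = (z_r / a) * log2(n)
    return reference_bits + phrase_bits + sample_bits


# Toy sizes chosen as powers of two so that the logarithms are exact.
space_a4 = rlz_sa_space_bits(n=2**20, z_r=2**10, ref_len=2**8, a=4)
space_a16 = rlz_sa_space_bits(n=2**20, z_r=2**10, ref_len=2**8, a=16)
```

Increasing `a` shrinks only the sample term, while the access-time bound above grows linearly in `a`.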

5. Comparative Evaluation and Empirical Performance

Extensive evaluations have demonstrated that RLZ blockwise access achieves both high compression and rapid random access. For a 426 GB web crawl (GOV2), a dictionary as small as $0.1\%$ of the corpus ($0.5$ GB) achieved compression to $10$–$11\%$ of the original size, with random-access throughput of $\approx 11\,000$ documents/sec and sequential access of $16\,000$ docs/sec using RLZ-UV encoding. Adaptive GZIP-block systems with 1 MB blocks achieved only $600$ docs/sec in random access, while LZMA-blocking provided similar compression but much slower access ($5$ docs/sec) (Hoobin et al., 2011).

For archive block sizes in $[16\,\mathrm{KB}, 256\,\mathrm{KB}]$ , the Pareto-optimal point in practice is $B\approx64\,\mathrm{KB}$ ; larger dictionaries ( $\geq 256\,\mathrm{MB}$ ) provide diminishing marginal returns beyond $1\%$ of corpus size for web-scale data. For HDDs ( $T_{\mathrm{seek}}\approx8.5$ ms, $R\approx150$ MB/s), compression is the dominant driver of access speed, while SSDs ( $T_{\mathrm{seek}}\approx0.2$ ms, $R\approx1000$ MB/s) benefit more from optimization of block and decoding overheads (Petri et al., 2016).

A table of RLZ blockwise access in compressed suffix arrays illustrates the trade-off:

Input	$z_r$ (phrases)	Space (bytes/int, RLZ)	Access time ($\mu$s/access, RLZ)
einstein	2.6 M	3.1	0.50
sars2	8.3 M	2.9	0.39
dewiki	12.5 M	3.2	0.24
chr19	0.82 M	3.5	0.25
english	68.0 M	3.2	0.23

RLZ-block occupies $\approx 3$ bytes/int and achieves sub-microsecond random access. LZ-End block compression may reduce space to $2.5$–$3$ B/int, but with an access latency of $0.5$–$1.0$ μs (Dinklage et al., 2025).

6. Design Principles and Application Contexts

Design guidelines for RLZ blockwise archives depend on storage medium, access pattern, and memory availability:

HDD archives: Prioritize compression (large $D$ , e.g., $256$–$1024$ MB) and select block size $B \sim 64$ KB; RLZ-ZZ yields the best space-speed trade-off. For small random reads, maximizing compression by increasing the dictionary size is the most effective choice.
SSD archives: Reduce dictionary size ( $\sim256$ MB), keep $B=32$–$64$ KB, and use RLZ-PV for highest throughput, as decoding time rather than transfer is often the bottleneck.
Memory constraints: Reduce $D$ if necessary but restrict $B<128$ KB to prevent decoding cost from dominating.
Mostly-sequential scans: Block size becomes less relevant; increasing $B$ (e.g., $256$–$512$ KB) may minimize index overhead.

For compressed suffix arrays, RLZ parsing of differentials with blockwise sampling (e.g., block size $a=4$ or $16$) offers a middle ground between compressed and uncompressed representations in both space and access time (Dinklage et al., 2025). Blockwise RLZ supports efficient locate queries and pattern enumeration at scale, especially on texts with high BWT-run redundancy, such as genomic collections or large web collections.

7. Relationship to Alternatives and Trade-Offs

Blockwise RLZ should be contrasted with both block-oriented adaptive compressors (e.g., zlib, LZ4) and compressed suffix array representations (e.g., LZ-End).

Adaptive block compressors: RLZ achieves lower random-access latency than adaptive compressors and comparable or better compression, largely because RLZ blocks are self-contained and immediately decodable (stateless).
LZ-End/r-index: LZ-End may yield fewer phrases than RLZ on the same data ( $z_e < z_r$ ) and hence better compression, but at the expense of a higher worst-case access time because phrase length is unbounded. RLZ blockwise sampling allows the phrase length $h$ to be bounded, limiting the per-query scan to $\approx a+h$ (Dinklage et al., 2025).
Uncompressed SA: Plain SA achieves $O(1)$ access but with $n \lceil \log n \rceil$ bits space; RLZ blockwise methods shrink space to $O(z_r \log (n/z_r))$ bits (with $z_r \ll n$ for repetitive texts) and sub-microsecond access, balancing succinctness and performance.

Experimental results corroborate these trade-offs across large-scale text collections (Petri et al., 2016, Hoobin et al., 2011, Dinklage et al., 2025).

This suggests RLZ blockwise access architectures are particularly well suited for massive, highly repetitive datasets requiring a mix of sequential scan and rapid “true” random access, a property exploited in modern web indexers and genomics.

References

Petri et al. (2016). Access Time Tradeoffs in Archive Compression.

Hoobin et al. (2011). Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections.

Dinklage et al. (2025). RLZ-r and LZ-End-r: Enhancing Move-r.
