
RLZ Blockwise Access Techniques

Updated 24 March 2026
  • RLZ blockwise access is a compression method that partitions data into fixed-size blocks using a sampled dictionary for efficient random access.
  • It employs static metadata and RLZ phrase encoding to decode independent blocks, balancing compression ratio with fast retrieval.
  • Empirical evaluations show that RLZ blockwise systems scale effectively for massive, repetitive corpora, optimizing both storage and access time.

RLZ blockwise access refers to techniques for enabling efficient random access in relative Lempel–Ziv (RLZ) compressed representations. The RLZ paradigm achieves high compression on large repetitive corpora by factorizing the input relative to a dictionary sampled from the corpus, and then encoding blocks, documents, or differentials as sequences of static-encoded factors. By partitioning the compressed data into blocks and storing dedicated metadata structures (such as offsets or phrase samples), RLZ permits fast retrieval of arbitrary segments or entries without decompressing the entire archive. This methodology is fundamental in scalable archiving, fast web retrieval, and succinct data structure construction.

1. Dictionary Construction and Block Partitioning

RLZ begins by extracting a global dictionary from the input corpus $C$ of size $|C| = N$ bytes. A sample length $s$ and a target dictionary size $m$ (with $m \ll N$) are selected. The dictionary $D$ is formed by taking $S = \lfloor m/s \rfloor$ fixed-length samples of size $s$ from $C$ at equally spaced positions $i_j = \lfloor (j-1)\cdot N/S \rfloor + 1$ for $j = 1, \ldots, S$. Then

$$D = C[i_1..i_1+s-1] \ \Vert\ C[i_2..i_2+s-1] \ \Vert\ \ldots \ \Vert\ C[i_S..i_S+s-1]$$

with $|D| = m = s\cdot S \approx \alpha N$, where typically $\alpha \in [0.001, 0.01]$ (i.e., $0.1\%$–$1\%$ of the corpus) (Petri et al., 2016, Hoobin et al., 2011). The dictionary is typically stored uncompressed in memory, with an overhead of $|D|/|C|$ (e.g., a $256$ MB dictionary for a $100$ GB corpus yields about $0.25\%$ overhead).
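As a minimal sketch of the sampling scheme above, the following Python function concatenates evenly spaced samples into a dictionary. The name `build_dictionary` and its parameters (`m` for the target dictionary size in bytes, `s` for the sample length) are illustrative assumptions, and positions are 0-based rather than the 1-based $i_j$ of the text.

```python
def build_dictionary(corpus: bytes, m: int, s: int) -> bytes:
    """Concatenate S = m // s samples of length s taken at evenly spaced positions."""
    n = len(corpus)
    num_samples = m // s
    if num_samples == 0 or m > n:
        raise ValueError("need s <= m <= len(corpus)")
    parts = []
    for j in range(num_samples):
        # 0-based counterpart of i_j = floor((j-1) * N / S) + 1
        start = (j * n) // num_samples
        parts.append(corpus[start:start + s])
    return b"".join(parts)
```

Because $m \le N$, every sample fits inside the corpus, so the result has exactly $s \cdot S$ bytes.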

The input corpus is partitioned into fixed-size blocks of $B$ bytes ($B\in[16\,\mathrm{KB}, 256\,\mathrm{KB}]$), each serving as an independent unit of factorization and random-access decoding. Partitioning granularity (and its trade-offs) varies with the data type; for web collections, a document can serve as the block (Hoobin et al., 2011).

2. RLZ Block Encoding and Static Metadata

Each block is parsed from left to right into RLZ phrases: at position $p$ in the block, the algorithm finds the longest prefix of the remaining block content that matches a substring of $D$; if no match exists, the byte is emitted as a literal. Maximal matches are encoded as pairs $(o, \ell)$, with $o$ the offset in $D$ and $\ell$ the match length. Efficient substring indexing of $D$ (e.g., via a suffix array) permits $O(\ell + \log m)$ phrase lookup per emitted factor, where $m = |D|$ (Hoobin et al., 2011).
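The greedy parse can be sketched as follows. This is an illustration under stated assumptions, not the cited implementation: a naive `bytes.find` search stands in for the suffix-array lookup (so it does not achieve the stated time bound), and the tuple shapes `("copy", o, l)` and `("lit", byte)` are hypothetical.

```python
def longest_match(dictionary: bytes, block: bytes, p: int):
    """Return (offset, length) of the longest prefix of block[p:] found in dictionary."""
    best_off, best_len = 0, 0
    length = 1
    # If a prefix of length k occurs, every shorter prefix occurs too,
    # so we can grow the candidate one byte at a time.
    while p + length <= len(block):
        off = dictionary.find(block[p:p + length])
        if off < 0:
            break
        best_off, best_len = off, length
        length += 1
    return best_off, best_len


def factorize(block: bytes, dictionary: bytes):
    """Greedy left-to-right RLZ parse of one block."""
    factors = []
    p = 0
    while p < len(block):
        off, length = longest_match(dictionary, block, p)
        if length == 0:
            factors.append(("lit", block[p]))   # byte absent from the dictionary
            p += 1
        else:
            factors.append(("copy", off, length))
            p += length
    return factors


print(factorize(b"cadabraXabr", b"abracadabra"))
# [('copy', 4, 7), ('lit', 88), ('copy', 0, 3)]
```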

Factors are encoded into two (or three) byte-aligned streams:

  • Offsets: 32-bit unsigned (RLZ-UV) or packed to $\lceil\log_2 |D|\rceil$ bits (RLZ-PV).
  • Lengths: Variable byte code (vbyte); 1 byte for small lengths, expanding as needed.
  • Literals: Optionally stored in a separate stream or mixed with factor representations.

Per block, the streams are concatenated, often prefixed with a factor count, to form the on-disk layout. Blockwise RLZ thus enables each block to be decompressed independently given $D$ and the block metadata.
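One possible byte layout in the spirit of the 32-bit-offset, vbyte-length configuration is sketched below. The exact layout is an assumption made for illustration: a vbyte factor count, then all offsets as little-endian 32-bit integers, then all lengths as vbytes, with literals folded into the factor streams as `(byte, 0)` pairs. The vbyte convention used here (continuation bit set on every byte except the last) is one common variant.

```python
import struct


def vbyte_encode(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)   # high bit marks "more bytes follow"
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_block(factors) -> bytes:
    """Serialize (offset, length) pairs; a literal is the pair (byte, 0)."""
    header = vbyte_encode(len(factors))
    offsets = b"".join(struct.pack("<I", off) for off, _ in factors)
    lengths = b"".join(vbyte_encode(length) for _, length in factors)
    return header + offsets + lengths


# Three factors: copy D[4..10], literal 'X' (88), copy D[0..2].
blob = encode_block([(4, 7), (88, 0), (0, 3)])
print(len(blob))  # 16 bytes: 1 (count) + 3*4 (offsets) + 3 (lengths)
```

Small lengths cost a single byte each, which is why the length stream is typically much smaller than the offset stream.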

An index array $I[0..n_b]$, where $n_b = \lceil N/B \rceil$ is the number of blocks, stores the on-disk offset of each compressed block plus one end-of-archive sentinel, supporting $O(1)$ access to block $i$.
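A small sketch of how such an index might be built from the compressed block sizes; the helper names are hypothetical. The extra final entry lets the length of block $i$ be computed as the difference of two adjacent entries.

```python
from itertools import accumulate


def build_block_index(compressed_sizes):
    """Return offsets I with I[i] = start of block i and I[-1] = total archive size."""
    return [0] + list(accumulate(compressed_sizes))


def block_extent(index, i):
    """Return (offset, length) of compressed block i using two array lookups."""
    return index[i], index[i + 1] - index[i]


index = build_block_index([10, 7, 12])
print(index)                   # [0, 10, 17, 29]
print(block_extent(index, 1))  # (10, 7)
```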

3. Blockwise Random-Access Decoding Algorithms

For direct extraction of a range $C[a..b]$, one computes the block indices $i_0 = \lfloor a/B \rfloor$ and $i_1 = \lfloor b/B \rfloor$. Each affected block is decoded in isolation via the following operations:

  1. Use $I[i]$ to locate and read the compressed block bytes.
  2. Parse the block header and the factor/literal streams.
  3. Sequentially reconstruct the block in a buffer, applying:
    • For factor $(o, \ell)$: copy $D[o..o+\ell-1]$ into the output buffer.
    • For literal: emit the byte directly.
  4. Extract the relevant segment from the output buffer for edge blocks; intermediate blocks are concatenated whole.

This procedure is captured in the following pseudocode:

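# B (block size), I (block offset index), D (dictionary), disk_read and
# decode_block are assumed to be defined elsewhere.
def decode_range(a, b):
    """Return corpus bytes C[a..b] (inclusive), decoding only the overlapping blocks."""
    i0, i1 = a // B, b // B                # first and last block touched by the range
    out_buf = bytearray()
    for i in range(i0, i1 + 1):
        pos, next_pos = I[i], I[i + 1]     # on-disk extent of compressed block i
        comp_data = disk_read(pos, next_pos - pos)
        block_buf = decode_block(comp_data, D)
        start = (a % B) if i == i0 else 0  # trim the first block on the left
        end = (b % B) if i == i1 else len(block_buf) - 1  # trim the last block on the right
        out_buf += block_buf[start:end + 1]
    return out_buf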
(Petri et al., 2016)
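The pseudocode leaves `decode_block` unspecified. One possible realization is sketched below, assuming a hypothetical block layout: a vbyte factor count, then little-endian 32-bit offsets, then vbyte lengths, with a literal stored as a factor of length 0 whose offset field holds the byte value. None of these layout details are confirmed by the cited work.

```python
import struct


def read_vbyte(buf: bytes, pos: int):
    """Decode one vbyte integer starting at pos; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:          # high bit clear: this was the last byte
            return value, pos
        shift += 7


def decode_block(comp_data: bytes, dictionary: bytes) -> bytes:
    """Rebuild one block from its factor streams and the shared dictionary."""
    count, pos = read_vbyte(comp_data, 0)
    offsets = struct.unpack_from(f"<{count}I", comp_data, pos)
    pos += 4 * count
    out = bytearray()
    for off in offsets:
        length, pos = read_vbyte(comp_data, pos)
        if length == 0:
            out.append(off)                       # literal byte
        else:
            out += dictionary[off:off + length]   # copy D[o..o+l-1]
    return bytes(out)


comp = b"\x03" + struct.pack("<3I", 4, 88, 0) + bytes([7, 0, 3])
print(decode_block(comp, b"abracadabra"))  # b'cadabraXabr'
```

Decoding needs only the block's own bytes and the dictionary, which is what makes each block independently accessible.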

In the context of differential suffix array blockwise access, the differential array $A^d$ of a suffix array $A$ is RLZ-parsed and stored as a sequence of RLZ phrases, and block boundaries are sampled every $a$ phrases. Efficient predecessor/successor data structures identify the block containing any index $i$, and block-local decoding reconstructs the relevant entries. A sample table $SCP$ stores block boundary positions, yielding $O(z_r/a)$ storage overhead for $z_r$ total RLZ phrases (Dinklage et al., 23 Jul 2025).
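The idea of sampling every $a$ phrases can be illustrated with a deliberately simplified sketch. It is not the data structure of the cited paper: a binary search (`bisect`) stands in for the predecessor structure, literals are omitted, phrase sums are recomputed naively, and all names are hypothetical. The array is assumed to be defined by $A[i] = \sum_{k \le i} A^d[k]$, with each phrase `(o, l)` copying `l` differences from a reference difference array.

```python
from bisect import bisect_right


def build_samples(ref_diffs, phrases, a):
    """For every a-th phrase, record its first index and the sum of all earlier differences."""
    starts, sums = [], []
    pos, running = 0, 0
    for k, (off, length) in enumerate(phrases):
        if k % a == 0:
            starts.append(pos)
            sums.append(running)
        running += sum(ref_diffs[off:off + length])
        pos += length
    return starts, sums


def access(i, ref_diffs, phrases, a, starts, sums):
    """Return A[i] by jumping to the sampled block containing i and scanning forward."""
    b = bisect_right(starts, i) - 1          # last sample starting at or before i
    pos, value = starts[b], sums[b]
    for off, length in phrases[b * a:]:
        take = min(length, i - pos + 1)      # differences still needed, up to index i
        value += sum(ref_diffs[off:off + take])
        pos += take
        if pos > i:
            return value
    raise IndexError("index beyond the encoded array")


ref_diffs = [4, 1, 3]                                  # toy reference differences
phrases = [(2, 1), (0, 2), (0, 2), (0, 1), (2, 1)]     # encodes A^d = [3,4,1,4,1,4,3]
starts, sums = build_samples(ref_diffs, phrases, a=2)
print([access(i, ref_diffs, phrases, 2, starts, sums) for i in range(7)])
# [3, 7, 8, 12, 13, 17, 20]
```

A smaller `a` stores more samples and shortens the scan; a larger `a` does the opposite.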

4. Theoretical Time–Space Trade-offs

The access time to retrieve a random block or entry comprises seek latency ($T_{\mathrm{seek}}$), transfer cost (which depends on the block size and compression ratio), and decoding time. The total time for random block access is:

$$T_{\mathrm{access}} = T_{\mathrm{seek}} + \frac{cB}{R} + F\cdot t_{\mathrm{decode}}$$

where:

  • $c$ is the compression ratio (compressed block size / $B$),
  • $R$ is the transfer rate of the storage device,
  • $F$ is the number of factors per block ($F \approx B/L_{\mathrm{avg}}$, with the average factor length $L_{\mathrm{avg}}$ increasing with dictionary size),
  • $t_{\mathrm{decode}}$ is the per-factor decode cost.

Larger dictionaries both improve compression and reduce the number of factors $F$, lowering both $c$ and the decoding cost.
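To make the model concrete, the sketch below evaluates it for an HDD-like and an SSD-like device. The seek and transfer figures are the HDD and SSD values this article attributes to Petri et al.; the compression ratio, average factor length, and per-factor decode cost are hypothetical placeholders, and 1 MB is taken as $10^6$ bytes.

```python
def access_time(t_seek, c, block_size, rate, avg_factor_len, t_decode):
    """T_access = T_seek + c*B/R + F*t_decode with F ~ B / L_avg (seconds, bytes)."""
    num_factors = block_size / avg_factor_len
    return t_seek + (c * block_size) / rate + num_factors * t_decode


B = 64 * 1024
# Hypothetical block characteristics shared by both devices.
common = dict(c=0.10, block_size=B, avg_factor_len=32, t_decode=10e-9)

hdd = access_time(t_seek=8.5e-3, rate=150e6, **common)
ssd = access_time(t_seek=0.2e-3, rate=1000e6, **common)
print(f"HDD: {hdd * 1e3:.3f} ms, SSD: {ssd * 1e3:.3f} ms")
# HDD: 8.564 ms, SSD: 0.227 ms
```

Under these assumed values the transfer term outweighs the decode term on the HDD-like device, while the decode term outweighs transfer on the SSD-like device.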

For RLZ-compressed suffix arrays, blockwise access time is

$$T_{\mathrm{acc}}(i) = O(\log\log(n/z_r) + a + h)$$

where $a$ is the phrase sampling step (block size in phrases) and $h$ bounds the longest phrase length. The extra storage required for the block samples is $O(z_r/a)$ words. The overall space is:

$$|R| \log n + z_r \log (n/z_r) + (z_r/a) \log n$$

(Dinklage et al., 23 Jul 2025).

5. Comparative Evaluation and Empirical Performance

Extensive evaluations have demonstrated that RLZ blockwise access achieves both high compression and rapid random access. For a 426 GB web crawl (GOV2), a dictionary as small as $0.1\%$ of the collection ($0.5$ GB) achieved compression to $10$–$11\%$ of the original size, with random-access throughput of $\approx 11\,000$ documents/sec and sequential access of $16\,000$ documents/sec using RLZ-UV encoding. Adaptive GZIP-block systems with 1 MB blocks achieved only $600$ documents/sec under random access, while LZMA blocking provided similar compression but much slower access ($5$ documents/sec) (Hoobin et al., 2011).

For archive block sizes in $[16\,\mathrm{KB}, 256\,\mathrm{KB}]$, the Pareto-optimal point in practice is $B\approx64\,\mathrm{KB}$; larger dictionaries ($\geq 256\,\mathrm{MB}$) provide diminishing marginal returns beyond $1\%$ of corpus size for web-scale data. For HDDs ($T_{\mathrm{seek}}\approx8.5$ ms, $R\approx150$ MB/s), compression is the dominant driver of access speed, while SSDs ($T_{\mathrm{seek}}\approx0.2$ ms, $R\approx1000$ MB/s) benefit more from optimization of block and decoding overheads (Petri et al., 2016).

A table of RLZ blockwise access in compressed suffix arrays illustrates the trade-off:

| Input | $z_r$ (phrases) | Bytes/int (RLZ) | $\mu$s/access (RLZ) |
|---|---|---|---|
| einstein | 2.6 M | 3.1 | 0.50 |
| sars2 | 8.3 M | 2.9 | 0.39 |
| dewiki | 12.5 M | 3.2 | 0.24 |
| chr19 | 0.82 M | 3.5 | 0.25 |
| english | 68.0 M | 3.2 | 0.23 |

RLZ-block occupies $\approx 3$ bytes/int and achieves sub-microsecond random access; LZ-End block compression may reduce space to $2.5$–$3$ B/int, but with access latency of $0.5$–$1.0$ μs (Dinklage et al., 23 Jul 2025).

6. Design Principles and Application Contexts

Design guidelines for RLZ blockwise archives depend on storage medium, access pattern, and memory availability:

  • HDD archives: Prioritize compression (large $D$, e.g., $256$–$1024$ MB) and select block size $B \sim 64$ KB; RLZ-ZZ yields the best space-speed trade-off. For random small reads, maximizing compression by increasing the dictionary size is the best strategy.
  • SSD archives: Reduce dictionary size ($\sim256$ MB), keep $B=32$–$64$ KB, and use RLZ-PV for highest throughput, as decoding time rather than transfer is often the bottleneck.
  • Memory constraints: Reduce $D$ if necessary but restrict $B<128$ KB to prevent decoding cost from dominating.
  • Mostly-sequential scans: Block size becomes less relevant; increasing $B$ (e.g., $256$–$512$ KB) may minimize index overhead.

For compressed suffix arrays, RLZ parsing of differentials with blockwise sampling (e.g., block size $a=4$ or $16$) offers a middle ground between compressed and uncompressed representations in both space and access time (Dinklage et al., 23 Jul 2025). Blockwise RLZ supports efficient locate queries and pattern enumeration at scale, especially on texts with high BWT-run redundancy, such as genomic or large web collections.

7. Relationship to Alternatives and Trade-Offs

Blockwise RLZ should be contrasted with both block-oriented adaptive compressors (e.g., zlib, LZ4) and compressed suffix array representations (e.g., LZ-End).

  • Adaptive block compressors: RLZ offers lower random-access latency than adaptive compressors and comparable or better compression, largely because RLZ blocks are self-contained and immediately decodable given the dictionary (stateless).
  • LZ-End/r-index: LZ-End may yield fewer phrases than RLZ on the same data ($z_e < z_r$) and hence provides superior compression rates, but at the expense of higher worst-case access time (unbounded phrase length). RLZ blockwise sampling ensures phrase length $h$ can be bounded and the per-query scan is limited to $\approx a+h$ (Dinklage et al., 23 Jul 2025).
  • Uncompressed SA: Plain SA achieves $O(1)$ access but requires $n \lceil \log n \rceil$ bits of space; RLZ blockwise methods shrink space to $O(z_r \log (n/z_r))$ bits (with $z_r \ll n$ for repetitive texts) while retaining sub-microsecond access, balancing succinctness and performance.

Experimental results corroborate these trade-offs across large-scale text collections (Petri et al., 2016, Hoobin et al., 2011, Dinklage et al., 23 Jul 2025).

This suggests RLZ blockwise access architectures are particularly well suited for massive, highly repetitive datasets requiring a mix of sequential scan and rapid “true” random access, a property exploited in modern web indexers and genomics.
