Random-Access Scanning Techniques
- Random-access scanning is a technique that retrieves arbitrary data fragments from large, compressed archives using efficient indexing and block-based decoding.
- It underpins diverse applications like web archiving, DNA storage, columnar analytics, and neural texture rendering by optimizing latency and I/O overhead.
- State-of-the-art methods balance compression efficiency with rapid, selective decoding through innovative code designs and structured index frameworks.
Random-access scanning denotes the capacity to efficiently retrieve arbitrary, small fragments from large datasets or encoded media without the necessity of full decompression or sequential access. The paradigm underpins numerous applications: web archiving, DNA molecular storage, NVMe-backed columnar analytics, and real-time texture decompression in graphics rendering. At its core, random-access scanning demands indexing structures, encoding layouts, and code designs that minimize both I/O and computational amplification for point queries, while maximizing compression/encoding efficiency. The following sections synthesize state-of-the-art approaches and theoretical frameworks for random-access scanning, as exemplified by compression in repetitive archives (Petri et al., 2016), DNA strand recovery (Boruchovsky et al., 21 Jan 2025, Gruica et al., 12 Nov 2024), adaptive columnar storage (Pace et al., 21 Apr 2025), and neural graphics texture decoding (Farhadzadeh et al., 6 May 2024).
1. Foundational Concepts and Definitions
Random-access scanning encapsulates the ability to recover an arbitrary small substring or data element (e.g., a 16 B text snippet or a single column value) from a compressed or encoded archive with low latency and minimal overhead (Petri et al., 2016, Pace et al., 21 Apr 2025). In the classic model, segmented archives employ block-wise compression accompanied by block indexes, facilitating location and selective decompression of only the portions intersecting the queried range. In molecular storage, random-access scanning equates to querying coverage depth—how many unordered samples (reads) are required to reconstruct a specific encoded information strand (Boruchovsky et al., 21 Jan 2025, Gruica et al., 12 Nov 2024).
Distinct fields adapt these principles to different modalities:
- In text and web archives, RLZ (relative Lempel-Ziv) compression leverages a semi-static dictionary and static integer coding to enable direct block retrieval via in-memory indexes.
- In DNA storage, code design determines the expected number of reads for any strand, with optimal structures achieving sublinear expectations.
- In columnar database storage, physical encoding layout—e.g., mini-block versus full-zip—directly impacts random access performance and RAM utilization.
- In texture compression, random-access decoding is realized via block-indexed latent grids and direct coordinate mapping, supporting real-time rendering.
2. Algorithmic Mechanisms for Random-Access Decoding
Archive Compression: RLZ Blockwise Access
The RLZ pipeline comprises dictionary construction (sampling s-byte substrings from the corpus C to build D), blockwise greedy factorization (partitioning C into B-byte blocks and LZ77-style matching against D with factor/literal emission), and efficient random-access decoding utilizing a byte-level block index (Petri et al., 2016). Random-access pseudocode is as follows:
```python
def RLZ_RandomAccess(start, k):
    firstBlock = start // B                    # B: block size in bytes
    lastBlock = (start + k - 1) // B
    output = []
    for b in range(firstBlock, lastBlock + 1):
        addr = BlockIndex[b]                   # in-memory offset of block b
        buf = decode_block(compressed[addr])   # decompress only this block
        overlap = range(max(start, b * B), min(start + k, (b + 1) * B))
        output.append(select(buf, overlap))    # keep bytes intersecting query
    return output
```
The model achieves millisecond-scale latencies, with access time approximately

$$t_{\text{access}} \approx t_{\text{seek}} + n_b\left(\frac{B\,R}{v} + t_d\right), \qquad n_b = \left\lfloor \frac{start+k-1}{B} \right\rfloor - \left\lfloor \frac{start}{B} \right\rfloor + 1,$$

where $n_b$ is the number of blocks intersecting the query, $B$ is the block size, $R$ is the compression rate, $v$ is the medium's transfer rate, and $t_d$ encapsulates decode overhead per block.
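The block-indexed access pattern can be sketched end to end in a few lines. The sketch below is a minimal, self-contained illustration that substitutes per-block zlib compression for RLZ; the index construction and overlap-trimming logic are the same in either case.

```python
import zlib

BLOCK = 16  # block size in bytes (tiny, for demonstration only)

def build_archive(corpus: bytes):
    """Compress the corpus block by block; record each block's byte offset."""
    blob, index = bytearray(), []
    for off in range(0, len(corpus), BLOCK):
        index.append(len(blob))                      # BlockIndex[b] -> offset
        blob += zlib.compress(corpus[off:off + BLOCK])
    return bytes(blob), index

def random_access(blob, index, start, k):
    """Return corpus[start:start+k], decoding only the intersecting blocks."""
    first, last = start // BLOCK, (start + k - 1) // BLOCK
    out = bytearray()
    for b in range(first, last + 1):
        end = index[b + 1] if b + 1 < len(index) else len(blob)
        out += zlib.decompress(blob[index[b]:end])   # decode one block
    lo = start - first * BLOCK                       # trim leading overlap
    return bytes(out[lo:lo + k])

corpus = b"the quick brown fox jumps over the lazy dog " * 8
blob, index = build_archive(corpus)
assert random_access(blob, index, 20, 9) == corpus[20:29]
```

Because block boundaries are fixed and the index is resident in memory, a point query touches at most $\lceil k/B \rceil + 1$ blocks regardless of archive size.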
DNA Storage: Linear Codes and Read-Span Analysis
Random-access in DNA storage is characterized via the expectation formula

$$\mathbb{E}[\tau_i(G)] = nH_n - \sum_{j=1}^{n-1} \frac{n}{n-j}\cdot\frac{\alpha_j(G,i)}{\binom{n}{j}},$$

where $G \in \mathbb{F}_q^{k \times n}$ is a generator matrix, $H_n$ is the $n$th harmonic number, and $\alpha_j(G,i)$ counts $j$-cardinality column subsets spanning the unit vector $e_i$ (Boruchovsky et al., 21 Jan 2025, Gruica et al., 12 Nov 2024). Optimized codes for small fixed dimensions attain maximum per-strand expectations strictly below the $n$ reads of the uncoded identity scheme, with generalizations using suitable column sequences in $\mathbb{F}_q^k$ yielding sublinear bounds for larger dimensions.
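The expectation can be sanity-checked on a toy code by brute-force counting of spanning subsets. The sketch below assumes the formula $nH_n - \sum_{j=1}^{n-1}\frac{n}{n-j}\,\alpha_j/\binom{n}{j}$ over $\mathbb{F}_2$; the $[3,2]$ single-parity example and all function names are illustrative, not taken from the cited papers.

```python
import itertools
import math
from fractions import Fraction

def in_span(cols, target):
    """Brute force: is target in the GF(2) span of the given columns?"""
    for r in range(len(cols) + 1):
        for combo in itertools.combinations(cols, r):
            v = tuple(sum(bits) % 2 for bits in zip(*combo)) if combo \
                else (0,) * len(target)
            if v == target:
                return True
    return False

def expected_reads(cols, target):
    """Exact coverage-depth expectation: n*H_n - sum_j n/(n-j)*alpha_j/C(n,j)."""
    n = len(cols)
    exp = n * sum(Fraction(1, t) for t in range(1, n + 1))   # n * H_n
    for j in range(1, n):
        alpha_j = sum(in_span(sub, target)                   # spanning j-subsets
                      for sub in itertools.combinations(cols, j))
        exp -= Fraction(n, n - j) * Fraction(alpha_j, math.comb(n, j))
    return exp

identity = [(1, 0), (0, 1)]
assert expected_reads(identity, (1, 0)) == 2   # uncoded: n reads on average
code = [(1, 0), (0, 1), (1, 1)]                # [3,2] single-parity code
print(expected_reads(code, (1, 0)))            # -> 2, below n = 3
```

For the identity matrix the formula collapses to exactly $n$ reads per strand; the parity column lowers the expectation because the subsets $\{c_2, c_3\}$ and $\{c_1\}$ both span $e_1$.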
Columnar Storage: Adaptive Encodings
Lance’s method alternates between full-zip and mini-block encodings based on value size (Pace et al., 21 Apr 2025). Full-zip transposes all control words and data buffers, constructing an auxiliary repetition index for direct address mapping; mini-block organizes compressed buffers into small, indexed chunks. Random access on NVMe is thereby distilled to one or two IOPs (row index search, buffer retrieval), producing near-device IOPS for wide types and minimal search cache RAM overhead.
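The full-zip address mapping can be illustrated with a toy per-row repetition index. This is a hypothetical sketch: the field layout and names are illustrative, not Lance's actual on-disk format.

```python
def encode_full_zip(values):
    """Zip variable-width values into one buffer; the repetition index
    maps row id -> (byte offset, length) for direct point addressing."""
    buf, rep_index = bytearray(), []
    for v in values:
        rep_index.append((len(buf), len(v)))
        buf += v
    return bytes(buf), rep_index

def read_row(buf, rep_index, row):
    # IOP 1: consult the repetition index (resident, or one small read);
    # IOP 2: a single ranged read of exactly the row's bytes.
    off, length = rep_index[row]
    return buf[off:off + length]

rows = [b"alpha", b"bo", b"carrot"]
buf, idx = encode_full_zip(rows)
assert read_row(buf, idx, 2) == b"carrot"
```

Because every row resolves to one `(offset, length)` pair, point reads never decode neighboring rows, which is why this layout favors wide values.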
Neural Graphics Texture Compression
Random-access in neural texture compression is realized by mapping texel coordinates and mip level to quantized latent grids $G_0, G_1$ via nearest/bilinear sampling, assembling the decoder input with a positional encoding $P$ and the normalized mip level, and emitting multi-channel reconstructions through a lightweight fully connected synthesizer (Farhadzadeh et al., 6 May 2024). GPU shader pseudocode:
```glsl
vec<4*c_g0> Y0 = gather_corners(G0, i0, j0, i1, j1);               // four corner fetches from grid 0
vec<c_g1>   Y1 = bilinear_interpolate(G1, i0, j0, i1, j1, tx, ty); // interpolated fetch from grid 1
vec         c_out = D_decoder(concat(Y0, Y1, mNorm, P));           // compact MLP synthesizer
return c_out;
```
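The grid-sampling step can be sketched in plain Python; `bilinear` below is an illustrative stand-in for the shader's `bilinear_interpolate`, assuming an H × W grid of per-texel latent feature vectors and normalized coordinates in $[0, 1]$.

```python
def bilinear(grid, u, v):
    """Sample an H x W grid of feature vectors at normalized (u, v)."""
    H, W = len(grid), len(grid[0])
    x, y = u * (W - 1), v * (H - 1)
    j0, i0 = min(int(x), W - 1), min(int(y), H - 1)
    j1, i1 = min(j0 + 1, W - 1), min(i0 + 1, H - 1)
    tx, ty = x - j0, y - i0

    def lerp(a, b, t):
        return [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]

    top = lerp(grid[i0][j0], grid[i0][j1], tx)   # blend along x (row i0)
    bot = lerp(grid[i1][j0], grid[i1][j1], tx)   # blend along x (row i1)
    return lerp(top, bot, ty)                    # blend along y

grid = [[[0.0], [1.0]],
        [[2.0], [3.0]]]                          # 2x2 grid, 1 latent channel
assert bilinear(grid, 0.5, 0.5) == [1.5]         # cell center
```

The fetch pattern is fully local (four grid entries per sample), which is what makes per-fragment random decoding viable in a shader.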
3. Theoretical Bounds and Comparative Metrics
Archive Compression
RLZ achieves superior compression ($R$ as low as 16.3–18% for large dictionaries) and block retrieval rates of up to roughly 1,900 blocks/s on SSD, exceeding adaptive blockwise schemes (GZip, xz, LZ4) at comparable block sizes (Petri et al., 2016). The key trade-off is between dictionary size, block size, and storage medium: on HDD, seek time dominates, so minimizing the number of blocks touched per query is paramount; on SSD, the interplay of decode amplification and transfer rate dominates.
| Scheme | Compression Rate (R) | SSD Blocks/s | HDD Latency |
|---|---|---|---|
| RLZ-ZZ (256kB) | 16.3% | ~1600 | Seek-bound |
| RLZ-PV (256kB) | 16.5% | ~1900 | Seek-bound |
| GZip (256kB) | 21.5% | ~1500 | Seek-bound |
| LZ4 (16kB) | 30% | ~2000 | Seek-bound |
DNA Codes
The asymptotic random-access expectation per strand for optimized codes is strictly lower than the $n$ reads of the uncoded identity scheme:
- Small fixed dimensions admit explicitly optimized generator matrices whose maximum per-strand expectation falls below $n$.
- Rate-$1/2$ codes in arbitrary dimension achieve maximum expectations strictly below $n$, with large-alphabet constructions yielding sublinear per-strand expectations (Boruchovsky et al., 21 Jan 2025, Gruica et al., 12 Nov 2024).
Columnar Databases
Lance 2.1 attains $330$–$860$k rows/s for random access (depending on type), matches Parquet's compression ratios (within 2%), and reduces the RAM index footprint by an order of magnitude (≤1.3 GB per $10^9$ rows) (Pace et al., 21 Apr 2025).
| Format | Scalar Rows/s | String Rows/s | RAM Footprint (1B rows) |
|---|---|---|---|
| Parquet (default) | 5.5k | 1.2k | ~20 GB |
| Parquet (optimized) | 350k | 260k | ~20 GB |
| Lance (mini-block) | 330k | 300k | ~1.3 GB |
| Lance (full-zip) | 860k | 860k | Zero (no cache) |
4. Code Design Frameworks and Index Structures
The structure of block indexes, repetition maps, and code matrices is central to random-access scan efficiency.
- RLZ BlockIndex: in-memory, maps block number to file offset, supports rapid lookup.
- DNA Codes: generator matrices engineered with sequences or balanced quasi-arcs, ensuring coverage depth is minimized for all strands. Analysis via projective geometry provides bounds and constructive methods.
- Columnar Encodings: per-row repetition indexes (full-zip), chunk headers and search cache indices (mini-block) reduce amplification and page cache footprint.
- Texture Compression: quantized latent grid layouts, coalesced corner fetches, and coordinate-mapping allow sub-millisecond viewport random decoding.
5. Practical Guidelines, Trade-Offs, and Performance Bottlenecks
Optimal random-access scan configuration is workload, medium, and data-type dependent.
- Archive compression: maximize dictionary size and block size for HDD; on SSD, moderate block size ($64$kB–$256$kB) balances transfer and decode overhead. Dictionary construction by uniform sampling, not by ad hoc placement, yields near-optimal average factor length (Petri et al., 2016).
- DNA storage: encode with generator matrices that unequally repeat systematic columns and low-weight parity supports, reducing the mean reads per strand (Boruchovsky et al., 21 Jan 2025); projective geometry aids in constructing such codes (Gruica et al., 12 Nov 2024).
- Columnar storage: select encoding scheme by value size threshold ($128$B in Lance) (Pace et al., 21 Apr 2025), minimize page size for random workloads, and prefer full-zip for wide types, mini-block for narrow scalars.
- Texture compression: architectural choices (channel stacking, stride mapping, per-fragment decoding via compact MLPs) ensure real-time random-access and multi-resolution support (Farhadzadeh et al., 6 May 2024).
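The columnar guideline above reduces to a one-line dispatch policy. A minimal sketch, assuming the $128$ B per-value cutoff reported for Lance; the function name is illustrative.

```python
FULL_ZIP_THRESHOLD = 128  # bytes; value-size cutoff per the Lance guideline

def choose_encoding(avg_value_bytes: int) -> str:
    """Wide values -> full-zip (direct per-row addressing);
    narrow scalars -> mini-block (chunked buffers, amortized headers)."""
    return "full-zip" if avg_value_bytes > FULL_ZIP_THRESHOLD else "mini-block"

assert choose_encoding(8) == "mini-block"    # e.g., int64 scalars
assert choose_encoding(512) == "full-zip"    # e.g., long strings, embeddings
```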
A plausible implication is that as storage latency and bandwidth approach theoretical maxima (e.g., on NVMe or in-memory GPUs), further gains stem from minimizing compute amplification and transforming structural layout, not from additional compressive encodings.
6. Future Directions and Theoretical Limits
Emerging directions include the extension of random-access scanning techniques to increasingly nested or high-dimensional modalities (e.g., hypergraphs in DNA storage, arbitrarily nested columns in databases, multi-mip-level neural compression). In DNA archiving, the existence of sufficiently large column sequences for all code dimensions over large alphabets guarantees scalability of optimized codes to massive data volumes (Boruchovsky et al., 21 Jan 2025). In columnar storage, minimizing search-cache size and coalescing IOPs presages systems tailored to composable workloads.
Systematic study of nonuniform code repetition (impractical in MDS codes but tractable in geometric constructions) informs the design of next-generation molecular barcodes, storage file layouts, and rendering pipelines aiming to couple compression gains with true point-access scalability. As multimodal archives proliferate, the principle of random-access scanning (localizing not just data but computation) will increasingly govern performance boundaries across analytic, archival, and multimedia domains.