Random-Access Scanning Techniques
- Random-access scanning is a technique that retrieves arbitrary data fragments from large, compressed archives using efficient indexing and block-based decoding.
- It underpins diverse applications like web archiving, DNA storage, columnar analytics, and neural texture rendering by optimizing latency and I/O overhead.
- State-of-the-art methods balance compression efficiency with rapid, selective decoding through innovative code designs and structured index frameworks.
Random-access scanning denotes the capacity to efficiently retrieve arbitrary, small fragments from large datasets or encoded media without the necessity of full decompression or sequential access. The paradigm underpins numerous applications: web archiving, DNA molecular storage, NVMe-backed columnar analytics, and real-time texture decompression in graphics rendering. At its core, random-access scanning demands indexing structures, encoding layouts, and code designs that minimize both I/O and computational amplification for point queries, while maximizing compression/encoding efficiency. The following sections synthesize state-of-the-art approaches and theoretical frameworks for random-access scanning, as exemplified by compression in repetitive archives (Petri et al., 2016), DNA strand recovery (Boruchovsky et al., 21 Jan 2025, Gruica et al., 12 Nov 2024), adaptive columnar storage (Pace et al., 21 Apr 2025), and neural graphics texture decoding (Farhadzadeh et al., 6 May 2024).
1. Foundational Concepts and Definitions
Random-access scanning encapsulates the ability to recover an arbitrary small substring or data element (e.g., a 16 B text snippet or a single column value) from a compressed or encoded archive with low latency and minimal overhead (Petri et al., 2016, Pace et al., 21 Apr 2025). In the classic model, segmented archives employ block-wise compression accompanied by block indexes, facilitating location and selective decompression of only the portions intersecting the queried range. In molecular storage, random-access scanning equates to querying coverage depth—how many unordered samples (reads) are required to reconstruct a specific encoded information strand (Boruchovsky et al., 21 Jan 2025, Gruica et al., 12 Nov 2024).
Distinct fields adapt these principles to different modalities:
- In text and web archives, RLZ (relative Lempel-Ziv) compression leverages a semi-static dictionary and static integer coding to enable direct block retrieval via in-memory indexes.
- In DNA storage, code design determines the expected number of reads for any strand, with optimal structures achieving sublinear expectations.
- In columnar database storage, physical encoding layout—e.g., mini-block versus full-zip—directly impacts random access performance and RAM utilization.
- In texture compression, random-access decoding is realized via block-indexed latent grids and direct coordinate mapping, supporting real-time rendering.
2. Algorithmic Mechanisms for Random-Access Decoding
Archive Compression: RLZ Blockwise Access
The RLZ pipeline comprises dictionary construction (sampling s-byte substrings from the corpus C to build D), blockwise greedy factorization (partitioning C into B-byte blocks and LZ77-style matching against D with factor/literal emission), and efficient random-access decoding utilizing a byte-level block index (Petri et al., 2016). Random-access pseudocode is as follows:
```python
def RLZ_RandomAccess(start, k):
    firstBlock = start // B                    # B: block size in bytes
    lastBlock = (start + k - 1) // B
    output = []
    for b in range(firstBlock, lastBlock + 1):
        addr = BlockIndex[b]                   # in-memory offset of block b
        buf = decode_block(compressed[addr])   # decompress only this block
        overlap = range(max(start, b * B), min(start + k, (b + 1) * B))
        output.append(select(buf, overlap))    # keep bytes intersecting query
    return output
```
The model achieves millisecond-scale latencies, with access time approximately

$$t_{\text{access}} \approx t_{\text{seek}} + n_b\left(\frac{B\,R}{v} + t_d\right), \qquad n_b = \left\lfloor \frac{start+k-1}{B} \right\rfloor - \left\lfloor \frac{start}{B} \right\rfloor + 1,$$

where $n_b$ is the number of blocks intersecting the query, $B$ is the block size, $R$ is the compression rate, $v$ is the medium's transfer rate, and $t_d$ encapsulates decode overhead per block.
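The block-indexed access pattern can be sketched end to end in a few lines. The sketch below is a minimal, self-contained illustration that substitutes per-block zlib compression for RLZ; the index construction and overlap-trimming logic are the same in either case.

```python
import zlib

BLOCK = 16  # block size in bytes (tiny, for demonstration only)

def build_archive(corpus: bytes):
    """Compress the corpus block by block; record each block's byte offset."""
    blob, index = bytearray(), []
    for off in range(0, len(corpus), BLOCK):
        index.append(len(blob))                      # BlockIndex[b] -> offset
        blob += zlib.compress(corpus[off:off + BLOCK])
    return bytes(blob), index

def random_access(blob, index, start, k):
    """Return corpus[start:start+k], decoding only the intersecting blocks."""
    first, last = start // BLOCK, (start + k - 1) // BLOCK
    out = bytearray()
    for b in range(first, last + 1):
        end = index[b + 1] if b + 1 < len(index) else len(blob)
        out += zlib.decompress(blob[index[b]:end])   # decode one block
    lo = start - first * BLOCK                       # trim leading overlap
    return bytes(out[lo:lo + k])

corpus = b"the quick brown fox jumps over the lazy dog " * 8
blob, index = build_archive(corpus)
assert random_access(blob, index, 20, 9) == corpus[20:29]
```

Because block boundaries are fixed and the index is resident in memory, a point query touches at most $\lceil k/B \rceil + 1$ blocks regardless of archive size.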
DNA Storage: Linear Codes and Read-Span Analysis
Random-access in DNA storage is characterized via the expectation formula

$$\mathbb{E}[\tau_i(G)] = nH_n - \sum_{j=1}^{n-1} \frac{n}{n-j}\cdot\frac{\alpha_j(G,i)}{\binom{n}{j}},$$

where $G \in \mathbb{F}_q^{k \times n}$ is a generator matrix, $H_n$ is the $n$th harmonic number, and $\alpha_j(G,i)$ counts $j$-cardinality column subsets spanning the unit vector $e_i$ (Boruchovsky et al., 21 Jan 2025, Gruica et al., 12 Nov 2024). Optimized codes for small fixed dimensions attain maximum per-strand expectations strictly below the $n$ reads of the uncoded identity scheme, with generalizations using suitable column sequences in $\mathbb{F}_q^k$ yielding sublinear bounds for larger dimensions.
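The expectation can be sanity-checked on a toy code by brute-force counting of spanning subsets. The sketch below assumes the formula $nH_n - \sum_{j=1}^{n-1}\frac{n}{n-j}\,\alpha_j/\binom{n}{j}$ over $\mathbb{F}_2$; the $[3,2]$ single-parity example and all function names are illustrative, not taken from the cited papers.

```python
import itertools
import math
from fractions import Fraction

def in_span(cols, target):
    """Brute force: is target in the GF(2) span of the given columns?"""
    for r in range(len(cols) + 1):
        for combo in itertools.combinations(cols, r):
            v = tuple(sum(bits) % 2 for bits in zip(*combo)) if combo \
                else (0,) * len(target)
            if v == target:
                return True
    return False

def expected_reads(cols, target):
    """Exact coverage-depth expectation: n*H_n - sum_j n/(n-j)*alpha_j/C(n,j)."""
    n = len(cols)
    exp = n * sum(Fraction(1, t) for t in range(1, n + 1))   # n * H_n
    for j in range(1, n):
        alpha_j = sum(in_span(sub, target)                   # spanning j-subsets
                      for sub in itertools.combinations(cols, j))
        exp -= Fraction(n, n - j) * Fraction(alpha_j, math.comb(n, j))
    return exp

identity = [(1, 0), (0, 1)]
assert expected_reads(identity, (1, 0)) == 2   # uncoded: n reads on average
code = [(1, 0), (0, 1), (1, 1)]                # [3,2] single-parity code
print(expected_reads(code, (1, 0)))            # -> 2, below n = 3
```

For the identity matrix the formula collapses to exactly $n$ reads per strand; the parity column lowers the expectation because the subsets $\{c_2, c_3\}$ and $\{c_1\}$ both span $e_1$.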
Columnar Storage: Adaptive Encodings
Lance’s method alternates between full-zip and mini-block encodings based on value size (Pace et al., 21 Apr 2025). Full-zip transposes all control words and data buffers, constructing an auxiliary repetition index for direct address mapping; mini-block organizes compressed buffers into small, indexed chunks. Random access on NVMe is thereby distilled to one or two IOPs (row index search, buffer retrieval), producing near-device IOPS for wide types and minimal search cache RAM overhead.
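The full-zip address mapping can be illustrated with a toy per-row repetition index. This is a hypothetical sketch: the field layout and names are illustrative, not Lance's actual on-disk format.

```python
def encode_full_zip(values):
    """Zip variable-width values into one buffer; the repetition index
    maps row id -> (byte offset, length) for direct point addressing."""
    buf, rep_index = bytearray(), []
    for v in values:
        rep_index.append((len(buf), len(v)))
        buf += v
    return bytes(buf), rep_index

def read_row(buf, rep_index, row):
    # IOP 1: consult the repetition index (resident, or one small read);
    # IOP 2: a single ranged read of exactly the row's bytes.
    off, length = rep_index[row]
    return buf[off:off + length]

rows = [b"alpha", b"bo", b"carrot"]
buf, idx = encode_full_zip(rows)
assert read_row(buf, idx, 2) == b"carrot"
```

Because every row resolves to one `(offset, length)` pair, point reads never decode neighboring rows, which is why this layout favors wide values.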
Neural Graphics Texture Compression
Random-access in neural texture compression is realized by mapping texel coordinates and mip level to quantized latent grids $G_0, G_1$ via nearest/bilinear sampling, assembling the decoder input with a positional encoding $P$ and the normalized mip level, and emitting multi-channel reconstructions through a lightweight fully connected synthesizer (Farhadzadeh et al., 6 May 2024). GPU shader pseudocode:
```glsl
vec<4*c_g0> Y0 = gather_corners(G0, i0, j0, i1, j1);               // four corner fetches from grid 0
vec<c_g1>   Y1 = bilinear_interpolate(G1, i0, j0, i1, j1, tx, ty); // interpolated fetch from grid 1
vec         c_out = D_decoder(concat(Y0, Y1, mNorm, P));           // compact MLP synthesizer
return c_out;
```
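The grid-sampling step can be sketched in plain Python; `bilinear` below is an illustrative stand-in for the shader's `bilinear_interpolate`, assuming an H × W grid of per-texel latent feature vectors and normalized coordinates in $[0, 1]$.

```python
def bilinear(grid, u, v):
    """Sample an H x W grid of feature vectors at normalized (u, v)."""
    H, W = len(grid), len(grid[0])
    x, y = u * (W - 1), v * (H - 1)
    j0, i0 = min(int(x), W - 1), min(int(y), H - 1)
    j1, i1 = min(j0 + 1, W - 1), min(i0 + 1, H - 1)
    tx, ty = x - j0, y - i0

    def lerp(a, b, t):
        return [(1 - t) * ai + t * bi for ai, bi in zip(a, b)]

    top = lerp(grid[i0][j0], grid[i0][j1], tx)   # blend along x (row i0)
    bot = lerp(grid[i1][j0], grid[i1][j1], tx)   # blend along x (row i1)
    return lerp(top, bot, ty)                    # blend along y

grid = [[[0.0], [1.0]],
        [[2.0], [3.0]]]                          # 2x2 grid, 1 latent channel
assert bilinear(grid, 0.5, 0.5) == [1.5]         # cell center
```

The fetch pattern is fully local (four grid entries per sample), which is what makes per-fragment random decoding viable in a shader.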
3. Theoretical Bounds and Comparative Metrics
Archive Compression
RLZ achieves superior compression ($R$ as low as 16.3–18% for large dictionaries) and block retrieval rates of up to roughly 1,900 blocks/s on SSD, exceeding adaptive blockwise schemes (GZip, xz, LZ4) at comparable block sizes (Petri et al., 2016). The key trade-off is between dictionary size, block size, and storage medium: on HDD, seek time dominates, so minimizing the number of blocks touched per query is paramount; on SSD, the interplay of decode amplification and transfer rate dominates.
| Scheme | Compression Rate (R) | SSD Blocks/s | HDD Latency |
|---|---|---|---|
| RLZ-ZZ (256kB) | 16.3% | ~1600 | Seek-bound |
| RLZ-PV (256kB) | 16.5% | ~1900 | Seek-bound |
| GZip (256kB) | 21.5% | ~1500 | Seek-bound |
| LZ4 (16kB) | 30% | ~2000 | Seek-bound |
DNA Codes
The asymptotic random-access expectation per strand for optimized codes is strictly lower than the $n$ reads of the uncoded identity scheme:
- Small fixed dimensions admit explicitly optimized generator matrices whose maximum per-strand expectation falls below $n$.
- Rate-$1/2$ codes in arbitrary dimension achieve maximum expectations strictly below $n$, with large-alphabet constructions yielding sublinear per-strand expectations (Boruchovsky et al., 21 Jan 2025, Gruica et al., 12 Nov 2024).
Columnar Databases
Lance 2.1 attains $330$–$860$k rows/s for random access (depending on type), matches Parquet's compression ratios (within 2%), and reduces the RAM index footprint by an order of magnitude (≤1.3 GB per $10^9$ rows) (Pace et al., 21 Apr 2025).
| Format | Scalar Rows/s | String Rows/s | RAM Footprint (1B rows) |
|---|---|---|---|
| Parquet (default) | 5.5k | 1.2k | ~20 GB |
| Parquet (optimized) | 350k | 260k | ~20 GB |
| Lance (mini-block) | 330k | 300k | ~1.3 GB |
| Lance (full-zip) | 860k | 860k | Zero (no cache) |
4. Code Design Frameworks and Index Structures
The structure of block indexes, repetition maps, and code matrices is central to random-access scan efficiency.
- RLZ BlockIndex: in-memory, maps block number to file offset, supports rapid lookup.
- DNA Codes: generator matrices engineered with sequences or balanced quasi-arcs, ensuring coverage depth is minimized for all strands. Analysis via projective geometry provides bounds and constructive methods.
- Columnar Encodings: per-row repetition indexes (full-zip), chunk headers and search cache indices (mini-block) reduce amplification and page cache footprint.
- Texture Compression: quantized latent grid layouts, coalesced corner fetches, and coordinate-mapping allow sub-millisecond viewport random decoding.
5. Practical Guidelines, Trade-Offs, and Performance Bottlenecks
Optimal random-access scan configuration is workload, medium, and data-type dependent.
- Archive compression: maximize dictionary size and block size for HDD; on SSD, moderate block size ($64$kB–$256$kB) balances transfer and decode overhead. Dictionary construction by uniform sampling, not by ad hoc placement, yields near-optimal average factor length (Petri et al., 2016).
- DNA storage: encode with generator matrices that unequally repeat systematic columns and low-weight parity supports, reducing the mean reads per strand (Boruchovsky et al., 21 Jan 2025); projective geometry aids in constructing such codes (Gruica et al., 12 Nov 2024).
- Columnar storage: select encoding scheme by value size threshold ($128$B in Lance) (Pace et al., 21 Apr 2025), minimize page size for random workloads, and prefer full-zip for wide types, mini-block for narrow scalars.
- Texture compression: architectural choices (channel stacking, stride mapping, per-fragment decoding via compact MLPs) ensure real-time random-access and multi-resolution support (Farhadzadeh et al., 6 May 2024).
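The columnar guideline above reduces to a one-line dispatch policy. A minimal sketch, assuming the $128$ B per-value cutoff reported for Lance; the function name is illustrative.

```python
FULL_ZIP_THRESHOLD = 128  # bytes; value-size cutoff per the Lance guideline

def choose_encoding(avg_value_bytes: int) -> str:
    """Wide values -> full-zip (direct per-row addressing);
    narrow scalars -> mini-block (chunked buffers, amortized headers)."""
    return "full-zip" if avg_value_bytes > FULL_ZIP_THRESHOLD else "mini-block"

assert choose_encoding(8) == "mini-block"    # e.g., int64 scalars
assert choose_encoding(512) == "full-zip"    # e.g., long strings, embeddings
```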
A plausible implication is that as storage latency and bandwidth approach theoretical maxima (e.g., on NVMe or in-memory GPUs), further gains stem from minimizing compute amplification and transforming structural layout, not from additional compressive encodings.
6. Future Directions and Theoretical Limits
Emerging directions include the extension of random-access scanning techniques to increasingly nested or high-dimensional modalities (e.g., hypergraphs in DNA storage, arbitrarily nested columns in databases, multi-mip-level neural compression). In DNA archiving, the existence of sufficiently large column sequences for all code dimensions over large alphabets guarantees scalability of optimized codes to massive data volumes (Boruchovsky et al., 21 Jan 2025). In columnar storage, minimizing search-cache size and coalescing IOPs presages systems tailored to composable workloads.
Systematic study of nonuniform code repetition (impractical in MDS codes but tractable in geometric constructions) informs the design of next-generation molecular barcodes, storage file layouts, and rendering pipelines aiming to couple compression gains with true point-access scalability. As multimodal archives proliferate, the principle of random-access scanning (localizing not just data but computation) will increasingly govern performance boundaries across analytic, archival, and multimedia domains.