OpenZL: A Graph-Based Model for Compression (2510.03203v1)
Abstract: Research in general-purpose lossless compression over the last decade has largely found improvements in compression ratio that come at great cost to resource utilization and processing throughput. However, most production workloads require high throughput and low resource utilization, so most research systems have seen little adoption. Instead, real world improvements in compression are increasingly often realized by building application-specific compressors which can exploit knowledge about the structure and semantics of the data being compressed. These systems easily outperform even the best generic compressors, but application-specific compression schemes are not without drawbacks. They are inherently limited in applicability and are difficult to maintain and deploy. We show that these challenges can be overcome with a new way of thinking about compression. We propose the ``graph model'' of compression, a new theoretical framework for representing compression as a directed acyclic graph of modular codecs. This motivates OpenZL, an implementation of this model that compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder. OpenZL's design enables rapid development of tailored compressors with minimal code, its universal decoder eliminates deployment lag, and its investment in a well-vetted standard component library minimizes security risks. Experimental results demonstrate that OpenZL achieves superior compression ratios and speeds compared to state-of-the-art general-purpose compressors on a variety of real-world datasets. Internal deployments at Meta have also shown consistent improvements in size and/or speed, with development timelines reduced from months to days. OpenZL thus represents an advance in practical, scalable, and maintainable data compression for modern data-intensive applications.
Explain it Like I'm 14
What is this paper about?
This paper introduces a new way to think about and build lossless data compressors called OpenZL. Instead of one giant, complicated program, OpenZL treats compression like a flowchart (a directed acyclic graph, or DAG) made of small, plug‑and‑play building blocks called “codecs.” Each compressed file carries a description of the exact flowchart used to compress it, so a single “universal decoder” can always figure out how to decompress it. The goal is to make fast, secure, and easy‑to‑maintain compressors that can be tailored to different kinds of data (like tables, images, audio, or model weights) without years of specialized work.
What questions are they trying to answer?
In simplified form, the paper asks:
- Can we represent compression as a graph of small steps so it’s easier to build custom compressors?
- Can a single decoder safely and quickly decompress any file if the file includes its own instructions?
- Does this approach compress real‑world data better and faster than popular, general‑purpose tools?
- Can teams build and update custom compressors quickly, safely, and at scale?
How does their approach work?
Think of compression like organizing a messy room:
- You start by sorting things (parsing).
- You group related things together (grouping).
- You reshape items to make them stack better (transforming).
- You pack them tightly into boxes (compressing).
OpenZL follows this idea with four common stages.
The graph model (the flowchart of codecs)
- A “codec” is just a small function that takes some data in and puts some data out (for example, it might turn text numbers like "123" into actual integers, or find repeated chunks).
- Codecs become nodes in a graph; arrows show how data flows from one step to the next.
- Because the graph has no loops (it’s a DAG), there’s a clear order to run the steps for compression and a reverse order for decompression.
- Since each step knows exactly what kind of data it expects and produces, later steps can use clever tricks that only work on that kind of data.
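To make the execution order concrete, here is a minimal sketch, in Python, of a purely linear compression graph built from two toy codecs. The names and structure are invented for illustration and are not OpenZL's API: compression applies codecs in feed-forward (topological) order, and decompression replays the same graph in reverse.

```python
from itertools import groupby

# Illustrative sketch only: the names and structure are invented, not OpenZL's API.
# Each codec is an (encode, decode) pair; the "graph" here is a simple linear chain.
codecs = {
    "delta": (lambda xs: [xs[0]] + [b - a for a, b in zip(xs, xs[1:])],
              lambda ds: [sum(ds[:i + 1]) for i in range(len(ds))]),
    "rle":   (lambda xs: [(x, len(list(g))) for x, g in groupby(xs)],
              lambda ps: [x for x, n in ps for _ in range(n)]),
}

graph = ["delta", "rle"]                      # a tiny, purely linear compression graph

def compress(data, graph):
    for name in graph:                        # feed-forward (topological) order
        encode, _ = codecs[name]
        data = encode(data)
    return data

def decompress(data, graph):
    for name in reversed(graph):              # reverse order for decoding
        _, decode = codecs[name]
        data = decode(data)
    return data

nums = [10, 10, 10, 11, 12, 12, 12, 12]
assert decompress(compress(nums, graph), graph) == nums
```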
A simple example is “tokenize”:
- If your data is a list like [alice, bob, bob, eve, alice, bob, alice], tokenization splits it into:
- A list of unique words: [alice, bob, eve]
- A list of positions (indices) pointing to those words: [0, 1, 1, 2, 0, 1, 0]
- Now you can compress words and index numbers with techniques that suit each type best.
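A rough sketch of that round trip in plain Python (illustrative only; the function names are not OpenZL's API):

```python
# Tokenize: factor a sequence into a unique-token alphabet plus an index stream.
def tokenize(values):
    alphabet, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(alphabet)
            alphabet.append(v)
        indices.append(seen[v])
    return alphabet, indices          # two streams, each compressed with its own technique

def detokenize(alphabet, indices):
    return [alphabet[i] for i in indices]

names = ["alice", "bob", "bob", "eve", "alice", "bob", "alice"]
alphabet, indices = tokenize(names)
assert alphabet == ["alice", "bob", "eve"]
assert indices == [0, 1, 1, 2, 0, 1, 0]
assert detokenize(alphabet, indices) == names
```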
Self‑describing wire format and a universal decoder
- Each compressed file includes the graph (the flowchart) describing how it was compressed.
- The decoder doesn’t need to know your app’s special rules in advance; it just follows the instructions in the file.
- This means you can change or improve your compressor over time without breaking old readers or waiting for every device to update—very helpful for mobile and IoT.
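Conceptually, a self-describing frame just bundles the recipe with the data. The sketch below is illustrative only (the real OpenZL wire format is a compact binary encoding, not JSON): the writer records the graph in a header, and a generic reader recovers it without any outside knowledge.

```python
import json

# Illustrative only: pack a description of the graph next to the compressed payload.
def make_frame(graph_names, payload_bytes):
    header = json.dumps({"graph": graph_names}).encode()
    return len(header).to_bytes(4, "big") + header + payload_bytes

def read_frame(frame):
    header_len = int.from_bytes(frame[:4], "big")
    header = json.loads(frame[4:4 + header_len])
    payload = frame[4 + header_len:]
    return header["graph"], payload   # a universal decoder replays this graph in reverse
```

Because the frame names its own graph, an older reader can decode data produced by a newer, improved compressor, as long as the referenced codecs exist in its standard component library.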
The typical four‑stage setup
Most OpenZL compressors look like this:
- Parse: Split the input into logical pieces. Example: a CSV file is split into one stream per column plus a stream for commas and newlines.
- Group: Combine streams that are related so patterns across them can be used.
- Transform: Apply smart, reversible changes that make patterns clearer. Examples: convert ASCII "42" to the number 42; replace each number with its difference from the previous one (delta coding) if the list is sorted; detect repeated chunks.
- Compress: Use general‑purpose compression on the cleaned‑up streams (like LZ77/Zstandard for repeats, or Huffman/ANS for probability‑based coding).
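Here is a toy end-to-end pass over one sorted CSV column, roughly following those four stages. It is a sketch only: zlib stands in for the backend, whereas OpenZL would use components like zstd or field_lz.

```python
import struct, zlib

csv_column = "100,101,103,106,110,115,121,128"

# Parse: split the text into one value per row.
fields = csv_column.split(",")

# Transform 1: ASCII digits -> actual integers.
ints = [int(f) for f in fields]

# Transform 2: delta-code the (sorted) integers so small values dominate.
deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]

# Compress: serialize as fixed-width integers and hand off to a generic backend.
payload = struct.pack(f"<{len(deltas)}i", *deltas)
compressed = zlib.compress(payload)

# Decompression reverses each step: zlib -> unpack -> undo delta.
recovered = list(struct.unpack(f"<{len(deltas)}i", zlib.decompress(compressed)))
undeltaed = [sum(recovered[:i + 1]) for i in range(len(recovered))]
assert undeltaed == ints
```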
Training and automation tools
- OpenZL includes tools that can learn good graph designs from sample data:
- Clustering groups related streams automatically.
- ACE (Automatic Compression Explorer) tries different backends to pick what compresses your data best.
- Because the decoder is universal, you can use different graphs for different data or different goals (smaller size vs. faster speed) at runtime.
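As a loose analogy for what such training does (this is not the real ACE, just a brute-force stand-in using Python's standard-library compressors), one can try several backends on sample data and keep the best performer; real training would also weigh compression and decompression speed.

```python
import bz2, lzma, zlib

# Brute-force backend exploration on a sample corpus: smallest output wins.
backends = {
    "zlib-6": lambda d: zlib.compress(d, 6),
    "zlib-9": lambda d: zlib.compress(d, 9),
    "bz2":    bz2.compress,
    "lzma":   lzma.compress,
}

def pick_backend(sample: bytes):
    sizes = {name: len(fn(sample)) for name, fn in backends.items()}
    return min(sizes, key=sizes.get), sizes

sample = b"timestamp,value\n" * 1000
best, sizes = pick_backend(sample)   # e.g. choose whichever backend shrank the sample most
```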
Security and maintainability
- Small, well‑tested codecs make it easier to secure each piece.
- OpenZL ships a standard library of vetted components, so you can assemble compressors with stronger safety guarantees.
- Updating the compressor doesn’t require updating the decoder everywhere, reducing rollout headaches.
What did they find?
Based on the abstract and the reported deployments:
- OpenZL compressed various real‑world datasets better (smaller files) and faster than leading general‑purpose compressors.
- Inside Meta, teams saw steady wins in size and/or speed on different applications.
- Building custom compressors got much quicker—development time dropped from months to days.
- The universal decoder and self‑describing format removed rollout delays, making it easier to ship updates even when some devices lag behind.
Why this matters:
- Many machine‑learning compressors get great ratios but are far too slow (kilobytes per second and often need GPUs).
- Most production systems need high speed and low resource use.
- Domain‑specific compressors are powerful but usually hard to build and maintain—OpenZL aims to keep the power while removing the pain.
What’s the impact?
- Faster and smaller: Apps, services, and data pipelines can move and store less data or move the same data faster—saving time and money.
- Easier customization: Teams working with tabular data (CSV/Parquet), tree-structured data (Thrift/Protobuf), audio, graphics, and AI model weights can build compressors tailored to their formats without massive effort.
- Safer and simpler deployment: One decoder handles all variants, so improving compressors doesn’t stall while you update every reader.
- Scales across many use cases: Because graphs are small and share components, supporting lots of different data types becomes manageable.
- Encourages learning and iteration: You can train compressors offline on sample data, then deploy smarter graphs broadly while keeping the on‑disk format stable.
In short, OpenZL turns compression into a flexible “recipe” made of small steps. That lets you use the structure of your data to get both better compression and higher speed, while keeping things secure and easy to roll out in the real world.
Knowledge Gaps
The paper introduces the graph model and sketches OpenZL’s design and benefits, but leaves several issues unresolved that future work could address. The following specific gaps and open questions are identified:
- Absence of a formal typing system: message sets are informally “non-empty subsets of bitstrings,” but there is no rigorous type system or static typing rules to enforce edge constraints, ensure sound composition, and catch incompatibilities at build time.
- Composition correctness and decidability: no proofs or algorithms are provided to guarantee that independently authored codecs compose into a valid, lossless graph (e.g., type-checking, invariants, and completeness/soundness of a graph validator).
- Dynamicity semantics are underspecified: function-graph expansion lacks formal guarantees of termination, acyclicity, and confluence (i.e., whether different expansion orders yield the same resolved graph); provide a formal semantics and proofs or constraints to ensure uniqueness of the resolved graph.
- Resolved graph recording and overhead: the paper asserts decoding uses the resolved graph, but does not specify how expansions are captured in the wire format, what metadata is recorded, or quantify the size/latency overhead of embedding the graph per frame.
- Universal decoder specification is missing: there is no detailed description of the graph serialization format, codec identification/parameterization scheme, versioning, and forward/backward compatibility rules required for a truly universal decoder across releases.
- Unknown/third‑party codecs handling: how does the universal decoder behave when a graph references codecs outside the standard library (e.g., plugin discovery, trust model, sandboxing, fallback, and error signaling)?
- Resource and safety bounds for decoding: the universal decoder’s worst‑case memory, time, and stack/heap usage under adversarial graphs (e.g., huge fan‑out, deep chains) are not analyzed; define enforceable limits and admission control.
- Security/threat model is not articulated: beyond “security-hardened components,” there is no formal threat analysis (memory safety, integer overflows, decompression bombs, algorithmic complexity attacks, graph-crafted DoS), nor coverage of isolation/sandboxing between codecs.
- Verification and fuzzing strategy: no evidence or methodology is provided for systematic fuzzing, property-based testing, or formal verification of codecs and graph compositions (including decoder robustness to corrupted inputs/graphs).
- Error handling semantics: behavior on malformed graphs, unknown codecs, version mismatches, or corrupted intermediate streams is unspecified; define predictable error codes, recovery strategies, and partial data salvage guarantees.
- Streaming and random access are not addressed: how graphs support streaming compression/decompression, partial decoding (e.g., a single column/field), seeking, and index construction remains unspecified.
- Graph scheduling and parallelism: the runtime strategy for topological scheduling, concurrency, buffer lifetime management, and pipeline fusion is unspecified; provide a graph executor design and performance model (CPU/GPU, SIMD, NUMA).
- Memory management and intermediate buffers: no analysis of intermediate stream sizes, buffer reuse, spilling, and backpressure; quantify peak memory footprint and optimization techniques (e.g., liveness analysis).
- Small-file behavior and header overhead: the size penalty of the self-describing graph for tiny inputs and the thresholds/fallbacks (e.g., raw blocks) to avoid regressions are not evaluated.
- Benchmarking transparency: claimed superiority vs. state-of-the-art lacks disclosed datasets, workloads, hardware, parameters, and full results; publish a reproducible benchmark suite covering text, binaries, tabular, images, audio, models, and pathological cases.
- Overhead of dispatch/parse recording: the parsing stage “records instructions” per byte/field, but the space/time overhead and compression effectiveness of these instructions are not quantified; propose compact encodings and grammar-based alternatives.
- SDDL specification is incomplete: the Simple Data Description Language’s expressiveness, performance, security (e.g., sandboxing parsers), schema evolution support, and integration with type checking are not described; formalize SDDL and evaluate.
- Grouping stage methodology: correlation metrics, search space, and criteria for stream concatenation/interleaving are unspecified; detail the clustering trainer algorithm, objectives (ratio/speed), computational cost, and generalization.
- ACE (Automatic Compression Explorer) is underdefined: the search strategy, transforms/catalog, objective function, constraints (throughput/latency/memory), avoidance of overfitting, and reproducibility of ACE-selected graphs need specification and evaluation.
- Handling dataset drift: no mechanism is described for retraining or adapting trained graphs as data distributions evolve; propose monitoring, online adaptation, and safe rollout strategies.
- The “field_lz” backend is not characterized: precise conditions where struct-aware LZ outperforms byte-oriented LZ, its parameterization, numeric width handling, alignment effects, and hardware considerations are not evaluated.
- DAG-only limitation: compressors needing feedback loops across stages (e.g., certain adaptive models) may be difficult to express without embedding complexity inside single nodes; clarify expressivity limits and recommended patterns.
- Entropy/modeling theory connection: no formal analysis relates the graph decomposition to Shannon bounds (e.g., factorization of mutual information, how transforms lower entropy); provide theoretical guidance and metrics for graph design.
- Privacy and schema leakage: self-describing graphs may reveal data schema/semantics; outline options for minimizing metadata leakage (e.g., encryption of graph, opaque codec IDs) and the impact on universality.
- Governance and component lifecycle: policies for component library curation, versioning, deprecation, certification, and third‑party contributions are not defined; propose governance and compatibility contracts.
- Interoperability and migration: the effort and patterns to integrate OpenZL with existing formats (Parquet/Protobuf/PNG), storage systems, and pipelines are not documented; provide migration guides and performance/ROI analyses.
- Constraint-aware tradeoff selection: while OpenZL supports speed/ratio tradeoffs, there is no decision framework that, given constraints (CPU, latency, memory), selects graphs automatically; develop cost models and solvers.
- Dictionary/training synergy: how graph-based training interacts with classic dictionary training (sharing across frames, caching, dedup of graph metadata) is not specified; propose combined approaches and cache management strategies.
- Robustness to corrupted data: define guarantees for partial recovery, metadata integrity checks (e.g., checksums/signatures for graphs), and end-to-end integrity verification.
- Licensing and deployment footprint: the universal decoder’s size, dependency profile, and license constraints for mobile/IoT are not discussed; quantify footprint and provide minimal builds.
Practical Applications
Immediate Applications
The following applications can be deployed now using the OpenZL framework, its universal decoder, self-describing wire format, and standard component library. Each item notes relevant sectors, potential tools/products/workflows, and key assumptions or dependencies.
- Data lake and warehouse optimization — Use OpenZL’s parse–group–transform–compress pipeline to compress CSV/Parquet columnar data and framing separately, improving both ratio and throughput. Sectors: software, analytics, finance, retail. Tools/workflows: SDDL format specs, dispatch codecs, clustering trainer, ACE for backend selection. Assumptions: representative training corpus; universal decoder shipped to all query/ETL nodes.
- ETL/stream processing pipelines — Integrate OpenZL as a stage in Spark/Flume/Flink/Beam to reduce shuffle and storage costs by compressing semantically grouped streams (e.g., IDs, timestamps, metrics). Sectors: software, ad-tech, IoT, energy. Tools/workflows: graph registry per dataset, A/B testing of graphs, telemetry-driven retraining. Assumptions: minimal glue code; decoder availability on all downstream consumers.
- Mobile app assets and updates — Compress app bundles (JSON/Protobuf assets, text catalogs, on-device ML tensors) with tailored graphs; deploy a single universal decoder binary across app versions to eliminate reader–writer lag. Sectors: software, consumer apps. Tools/workflows: mobile SDK for universal decoder, asset build-step plugins, field_lz for numeric tensors. Assumptions: decoder footprint fits mobile constraints; compress-before-encrypt when needed.
- IoT telemetry and logs — Marshal typed sensor buffers directly (bypassing costly byte serialization), apply delta/predictor transforms, and entropy/LZ backends for efficient uplink/storage. Sectors: robotics, energy, manufacturing, smart cities. Tools/workflows: firmware module for universal decoder, typed buffer API, clustering trainer to group correlated signals. Assumptions: constrained memory/CPU budgets on edge; consistent schemas.
- ML artifacts and checkpoints — Compress PyTorch tensors and model weight arrays via vector parsing and field_lz, cutting checkpoint time and disk footprint without touching training loops. Sectors: AI/ML, research, finance risk modeling. Tools/workflows: framework plugin (PyTorch save/load hook), ACE-selected backends per tensor type. Assumptions: lossless requirements; decoder linked in training/serving environments.
- Audio and time-series files — Apply domain transforms (e.g., delta for sorted integers/samples) and appropriate entropy/LZ coding to WAV, telemetry, and metrics series. Sectors: media, observability/SRE, healthcare devices. Tools/workflows: time-series ingestion compression stage; transform libraries. Assumptions: stable sampling formats; compression pre-encryption.
- Tree-structured APIs and logs — Parse Thrift/Protobuf by unique path, group correlated fields, then compress, reducing service-to-service payload size and log storage. Sectors: software, fintech, govtech. Tools/workflows: SDDL for message schemas, dispatch codecs, path-based grouping. Assumptions: schema discipline; decoder shipped with microservices.
- Domain-specific research datasets — Rapidly prototype compressors for genomics, graphics meshes, and scientific arrays by composing standard codecs and minimal domain transforms. Sectors: healthcare/genomics, media/graphics, academia. Tools/workflows: component library, ACE exploration, small custom codecs where needed. Assumptions: clear format descriptions; representative samples for training.
- Data-sharing across teams/orgs — Ship self-describing frames with serialized graphs to ensure any partner with the universal decoder can read data without wire-format negotiations. Sectors: public sector, healthcare, academia, consortia. Tools/workflows: graph catalogs, compatibility badges, add-only codec policies. Assumptions: decoder adoption; governance over allowed components.
- Cost and energy reduction in storage/networking — Lower bytes stored/transferred in CDN, backups, and replication, improving sustainability metrics. Sectors: cloud, telecom, media platforms. Tools/workflows: backup/restore plugins; continuous compression optimization via offline ACE retraining. Assumptions: operational monitoring to prevent regressions; stable decoder rollout.
- Developer observability — Compress large volume logs and traces by separating framing/control from values, improving ingestion throughput and retention windows. Sectors: software/SRE, enterprise IT. Tools/workflows: logging agent integration; standard graphs for common log formats. Assumptions: safe handling of untrusted input; fuzzing of codecs.
- Financial analytics payloads — Losslessly compress large numeric tables (risk, pricing, market data) using field_lz and delta transforms to cut cache, disk, and wire usage. Sectors: finance, insurance. Tools/workflows: integration with Pandas/Arrow/Parquet pipelines; schema-based grouping. Assumptions: accuracy-critical workflows; audit trails retained.
- Education/LMS telemetry and content — Compress activity logs and structured content packages (assessments, rubrics, metadata) for cheaper storage and faster sync for low-bandwidth users. Sectors: education. Tools/workflows: LMS plugin, universal decoder in web/app clients. Assumptions: client-side decoder bundling; privacy-sensitive data handled pre-compression.
- Governance/security hardening — Reduce attack surface by composing only vetted standard codecs; use self-describing frames to audit decompression paths. Sectors: all; policy/compliance. Tools/workflows: approved component lists, CI fuzzing harnesses, SBOM for graphs. Assumptions: disciplined component vetting; adherence to lossless guarantees.
- Reproducible academic artifacts — Publish datasets and results in self-describing frames so others can decode with the universal decoder without complex environment setup. Sectors: academia. Tools/workflows: repository templates, DOI-linked graph specs. Assumptions: stable open-source decoder; long-term archival policies.
Long-Term Applications
These applications are promising but require further research, scaling, ecosystem adoption, or standardization before broad deployment.
- Adaptive per-block graph selection at runtime — Function graphs that switch transforms/backends on the fly to track input non-stationarity without hurting throughput. Sectors: streaming analytics, robotics. Tools/workflows: on-line ACE, safe heuristics for expansion. Dependencies: robust selection policies; performance predictability.
- Industry standards for self-describing compression — Formalize OpenZL-like wire format and universal decoder as open standards to ensure cross-vendor interoperability. Sectors: healthcare (EHR), finance (regulatory reporting), public sector. Tools/workflows: standards bodies, conformance test suites. Dependencies: multi-party governance; backward-compatible evolution.
- Hardware acceleration of the universal decoder — FPGA/ASIC implementations of common codecs and graph execution to deliver datacenter-scale throughput with better energy efficiency. Sectors: cloud, HPC. Tools/workflows: kernel/driver support, offload APIs. Dependencies: stable component set; hardware-friendly codec designs.
- Codec marketplace and governance — Curated ecosystem where third-party codecs and graphs undergo security/performance certification, enabling plug-and-play specialization. Sectors: software, data platforms. Tools/workflows: registry, signing/attestation, sandboxing. Dependencies: trust model; reproducible benchmarking.
- Privacy-preserving compression — Co-design transforms with differential privacy or secure enclaves so compression gains do not leak sensitive information. Sectors: healthcare, finance, govtech. Tools/workflows: DP-aware transforms; policy tooling. Dependencies: formal privacy guarantees; performance tradeoff analysis.
- Native database integration — Columnar stores (Parquet/ORC/Arrow-native) and OLAP engines adopting graph-based compression natively for on-disk pages and network blocks. Sectors: analytics, BI. Tools/workflows: storage engine plugins, statistics-aware grouping. Dependencies: engine redesign; careful interaction with indexing/encoding.
- Real-time protocol integration — gRPC/QUIC/WebTransport stacks embedding universal decoder to negotiate graph-based compression per route/service. Sectors: microservices, edge. Tools/workflows: transport extensions, capability negotiation. Dependencies: client ubiquity; latency guarantees.
- Cross-modal AI dataset compression — Unified graph pipelines for text, audio, vision, and tabular features in multi-modal training corpora. Sectors: AI/ML, media. Tools/workflows: multi-stream marshalling APIs, ACE extensions for heterogeneous data. Dependencies: rich component library; large-scale training.
- Formal verification of codecs/graphs — Machine-checked proofs of losslessness, safety, and resource bounds for security-critical deployments. Sectors: aerospace, automotive, defense, finance. Tools/workflows: verification toolchains, spec languages for message sets. Dependencies: invest in formal methods; verified components.
- Compliance/certification paths — Auditable, certifiable compression for regulated environments (HIPAA, PCI DSS, SOX), including retention and eDiscovery workflows. Sectors: healthcare, finance, enterprise. Tools/workflows: audit trails in frames, compliance profiles. Dependencies: regulator engagement; policy mappings.
- Content-addressable storage synergy — Combine graph signatures with chunking/dedup for efficient CAS in backup and artifact registries. Sectors: DevOps, cloud storage. Tools/workflows: CAS integration, graph hashing. Dependencies: stable hashing of graphs and outputs; dedup safety.
- Edge OTA update efficiency — Universal decoder ubiquity on embedded devices enabling frequent, safe compressor evolution without reader lag, reducing OTA payloads. Sectors: automotive, consumer electronics. Tools/workflows: decoder distribution channels, safe rollout strategies. Dependencies: device storage/CPU constraints; field reliability.
- Green computing policy levers — Organizational mandates to use format-aware compression in data pipelines; carbon accounting tied to compression efficacy. Sectors: policy, enterprise IT. Tools/workflows: sustainability dashboards, optimizer services. Dependencies: measurement standards; cultural adoption.
- “Auto-compress” managed service — Cloud service that ingests sample corpora, runs offline exploration (ACE), and returns an optimized graph with ongoing retraining. Sectors: cloud, SaaS. Tools/workflows: MLOps-style training loops, graph lifecycle management. Dependencies: data access/security; service economics.
- Serialization co-design — Shift from byte-only serialization to typed multi-stream marshalling designed for compression-first workflows. Sectors: software platforms. Tools/workflows: SDKs replacing JSON-only paths; schema compilers to multi-stream buffers. Dependencies: developer adoption; tooling maturity.
Each long-term application presumes continued development of component libraries, training tooling, ecosystem conventions (graph registries, compatibility policies), and widespread availability of the universal decoder across clients and services.
Glossary
- Automatic Compression Explorer (ACE): An OpenZL tool that automatically determines the best backend compression graph for given inputs. "OpenZL provides the Automatic Compression Explorer (ACE) (described in \cref{subsec:training}), which determines the best backend graph for a given input."
- ANS (Asymmetric Numeral Systems): A modern family of entropy coders that achieves compression comparable to arithmetic coding with higher speed. "The ANS family offers arithmetic-like compression at higher speed~\cite{duda2014}, with Zstandard's FSE~\cite{FSE} as a table-driven tANS example."
- arithmetic coding: An entropy coding method that encodes sequences into fractional intervals, approaching the entropy limit. "Huffman remains widely used~\cite{huff}; arithmetic coding approaches the entropy limit with different latency/state trade-offs~\cite{WITTEN87}."
- backpropagation order: The reverse execution order over a computational graph used here to define a deterministic decompression schedule. "this ensures that a well-defined compression graph always admits a valid feed-forward computation order (for compression) and a valid backpropagation order (for decompression)."
- beta-reduction: A lambda-calculus operation used here by analogy to describe expanding function graphs during runtime. "Readers familiar with lambda calculus may draw a vague parallel between function graph expansion and beta-reduction."
- bit-packing: Compactly encoding fixed-size integers into minimal bits, often paired with RLE. "Parquet~\cite{PARQUET} composes per-column encodings (dictionary, delta, RLE/bit-packing) with a backend codec."
- Blosc: A high-performance meta-compressor that composes blocking, shuffle/bitshuffle transforms, and a backend codec. "Blosc~\cite{BLOSC} composes blocking, a shuffle/bitshuffle transform (to decorrelate bytes/bits and surface runs), and then a configurable backend codec (LZ4, Zstd, etc.), executed in a multithreaded pipeline to maximize memory locality and throughput."
- Brotli: A general-purpose compressor that improves on gzip by using richer context models. "Brotli~\cite{rfc7932} improves on gzip by using context models for literals and offsets."
- Burrows–Wheeler Transform (BWT): A reversible transform that rearranges data to cluster similar contexts and expose runs for downstream coding. "For example, the Burrows--Wheeler Transform (BWT)~\cite{burrows1994bwt} clusters similar contexts so nearby symbols look alike."
- clustering graph: An OpenZL component and trainer that groups correlated parsed streams to improve compression. "OpenZL provides a clustering graph and corresponding trainer that handles this stage."
- codec: A pair of functions (encoder and decoder) mapping between input and output message sets, ideally invertible for lossless compression. "A codec is a tuple of functions."
- computational graph: A directed acyclic graph where nodes are functions and edges capture data dependencies; used to model compression. "Informally, a compression graph is a computational graph~\cite{COLLINS18, DYER16} where the nodes represent codecs and edges represent input and output sets\footnote{Technically, this is a reversed computational graph, since in typical depictions the feed-forward direction merges multiple inputs to produce the output, whereas a compression graph generates multiple outputs from the input.}"
- compression graph: A specialized computational graph whose nodes are codecs and edges connect output message sets to input message sets. "A compression graph is a computational graph where each node is labelled with a codec , and edges are doubly-labelled with both an output from the source and an input to the target such that ."
- context mixing: Combining multiple predictors to improve symbol probability estimates, often via boosting-like techniques. "context mixing (e.g., PAQ~\cite{mahoney2013paq}, cmix~\cite{CMIX}) blends multiple predictors using ideas related to boosting."
- Directed acyclic graph (DAG): A graph with directed edges and no cycles, enabling deterministic execution orders for compression/decompression. "A computational graph is a directed, acyclic, graph (DAG) where the nodes are functions and the edges represent function arguments (and data dependencies)."
- delta coding: Transforming a sequence by storing differences between successive values to expose structure for modeling. "Other examples include delta coding (replacing absolute values by differences) and move-to-front (MTF) coding~\cite{MTF}, which reorders symbols adaptively to expose locality for entropy coding."
- DEFLATE/gzip: A classic LZ+Huffman compressor encoding literals/lengths and distances in a single Huffman-coded stream. "DEFLATE/gzip~\cite{rfc1951, rfc1952} encodes literal/length and distance symbols, intertwined in a single Huffman-coded stream, using (static or dynamic) Huffman trees."
- dictionary substitution: Replacing frequent substrings with indices into a dictionary to reduce redundancy. "Another common strategy is dictionary substitution, where frequent substrings are factored into a table and the input is rewritten as indices into that table."
- dispatch codecs: Standard OpenZL components that route bytes to typed output streams per a parsing function, recording dispatch instructions. "This parse is typically achieved by providing a parsing function to OpenZL which can be plugged into one of the standard dispatch codecs."
- entropy coders: Algorithms that map symbols to bitstreams based on symbol probabilities (e.g., Huffman, arithmetic, ANS). "Entropy coders map symbols to bitstreams given a probability distribution."
- field_lz codec: An OpenZL LZ backend for struct and numeric streams that compresses typed fields more efficiently than bytewise LZ. "In addition to the zstd LZ backend, OpenZL provides the field_lz codec and builtin backend graph."
- FSE (Finite-State Entropy): Zstandard’s table-driven ANS implementation offering fast entropy coding. "with Zstandard's FSE~\cite{FSE} as a table-driven tANS example."
- function graph: A runtime selector function that returns a compression graph based on input, enabling dynamic graph expansion. "Denote by the set of compression graphs. A function graph is a function ."
- Huffman (coding): A prefix-free entropy coder using variable-length codes optimized for symbol frequencies. "Huffman remains widely used~\cite{huff}; arithmetic coding approaches the entropy limit with different latency/state trade-offs~\cite{WITTEN87}."
- LZ4: A very fast LZ-style compressor using lightweight tagged formats without full entropy coding. "For scenarios with strict throughput requirements, LZ4~\cite{lz4} and Snappy~\cite{snappy} apply LZ-style parsing with a lightweight tagged format (varint lengths/distances) rather than a full entropy-coding stage."
- LZ77: A foundational dictionary-based compression technique using backward references to repeated substrings. "By contrast, production workloads must strike a balance between compressed size and processing time. For this reason, almost all popular compressors on the market use a variant of LZ77~\cite{GUPTA17}"
- LZMA: A high-ratio compressor combining LZ parsing with a range coder and context models, typically slower. "LZMA (as used in xz) combines LZ parsing and a range coder with context models (it is not a ``pure LZ'' design, making it more powerful but also markedly slower)~\cite{LZMA}."
- match selection (LZ): The choice of which overlapping or tied matches to encode in LZ parsing, affecting compressibility. "Match selection is non-trivial---overlaps and ties exist---and these choices materially affect downstream compressibility (e.g., shorter offsets, fewer distinct symbols)."
- message set: A constrained set of bitstrings defining allowed messages for codecs, used to impose semantics. "A message set is a non-empty subset of the universe of bitstrings."
- move-to-front (MTF) coding: A transform that adaptively reorders symbols to expose locality for entropy coding (see the sketch at the end of this glossary). "Other examples include delta coding (replacing absolute values by differences) and move-to-front (MTF) coding~\cite{MTF}, which reorders symbols adaptively to expose locality for entropy coding."
- NNCP: A neural-network-based compressor emphasizing high ratios at the cost of throughput. "NNCP is a more recent approach in this vein~\cite{NNCP}."
- Paeth (filter): An image prediction filter used in PNG to produce residuals more amenable to compression. "PNG~\cite{rfc2083} applies a per-scanline prediction filter (None/Sub/Up/Average/Paeth) before DEFLATE, turning image structure into locally predictable residuals that the downstream coder compresses well; the filter choice is itself a stage decision recorded per row."
- Parquet: A columnar storage format composing per-column encodings (dictionary, delta, RLE/bit-packing) with a backend codec. "Parquet~\cite{PARQUET} composes per-column encodings (dictionary, delta, RLE/bit-packing) with a backend codec."
- PNG: An image format that applies per-scanline prediction filters prior to DEFLATE to improve compression. "PNG~\cite{rfc2083} applies a per-scanline prediction filter (None/Sub/Up/Average/Paeth) before DEFLATE, turning image structure into locally predictable residuals that the downstream coder compresses well; the filter choice is itself a stage decision recorded per row."
- PPM (Prediction by Partial Matching): A prediction-centric compressor conditioning on preceding contexts. "PPM conditions on preceding contexts~\cite{ppm}; DMC learns an adaptive automaton~\cite{dmc}; context mixing (e.g., PAQ~\cite{mahoney2013paq}, cmix~\cite{CMIX}) blends multiple predictors using ideas related to boosting."
- range coder: An entropy coder similar to arithmetic coding, used by LZMA for high compression ratios. "LZMA (as used in xz) combines LZ parsing and a range coder with context models (it is not a ``pure LZ'' design, making it more powerful but also markedly slower)~\cite{LZMA}."
- residual error signal: The modeling-stage output representing what remains to be encoded after prediction, ideally with low entropy. "That's why so much effort goes into the preceding \ref{traditional-stages-model} modelling stage, which reduces input into a residual error signal, for the following entropy stage to encode."
- resolved graph: A compression graph with all function graphs expanded, containing only regular codecs and no runtime dynamism. "A resolved graph is a compression graph that contains no function graphs."
- RLE (Run-Length Encoding): A transform that collapses consecutive identical symbols into length counts. "The run-length encoding (RLE)~\cite{RLE} collapses symbol runs."
- SDDL (Simple Data Description Language): A declarative format description that lets OpenZL auto-generate parsing logic. "Alternatively, the Simple Data Description Language (SDDL), described in \cref{subsec:sddl}, can implement the parsing function for you, given an SDDL description of the format."
- self-describing wire format: A serialized form that includes its own graph/config metadata so a universal decoder can decode without external instructions. "compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder."
- Shannon limit: The theoretical entropy bound giving the maximum possible compression ratio for a source distribution. "A more interesting bound is the Shannon limit~\cite{SHANNON}, a measure of uncertainty that gives the maximum possible compression ratio for a given source distribution."
- tANS (table-based ANS): A table-driven variant of ANS used for fast entropy coding, exemplified by Zstandard's FSE. "with Zstandard's FSE~\cite{FSE} as a table-driven tANS example."
- tokenize codec: A codec that factors a sequence into a unique token alphabet and an index stream. "In the graph model, compressors are graphs built from codec nodes. As a motivating example, consider the tokenize codec."
- topological sort: An ordering of DAG nodes that respects dependencies, ensuring valid compression and decompression schedules. "And since every DAG admits a topological sort, this ensures that a well-defined compression graph always admits a valid feed-forward computation order (for compression) and a valid backpropagation order (for decompression)."
- universal decoder: A single decompression engine capable of decoding any serialized OpenZL graph configuration. "compresses data into a self-describing wire format, any configuration of which can be decompressed by a universal decoder."
- varint (variable-length integer): A compact integer representation used for tagged formats to reduce overhead. "For scenarios with strict throughput requirements, LZ4~\cite{lz4} and Snappy~\cite{snappy} apply LZ-style parsing with a lightweight tagged format (varint lengths/distances) rather than a full entropy-coding stage."
- ZPAQ: An archiving system that stores a virtual-machine program specifying transforms and coding inside the compressed data. "ZPAQ goes further: the archive stores a virtual-machine program~\cite{mahoney2015zpaq} specifying contexts/transforms and the coder."
- Zstandard (Zstd): A high-throughput, general-purpose compressor combining LZ77 with modern entropy coding (FSE/Huffman). "Zstandard~\cite{rfc8878} factors tokens into four logical streams (literals, literal-lengths, match-lengths, offsets) and uses either Huffman or FSE depending on mode."
- wire format: The serialized, on-disk representation of data and its encoding configuration, often versioned. "Thus, without guarantees on library freshness, updating the wire format becomes untenable."
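For the move-to-front entry above, a minimal illustrative sketch (not tied to any OpenZL API) of how MTF keeps recently seen symbols at small indices:

```python
# Move-to-front: recently seen symbols get small indices, which an entropy
# coder can then exploit. Illustrative only; alphabet is assumed known.
def mtf_encode(data, alphabet):
    table, out = list(alphabet), []
    for symbol in data:
        i = table.index(symbol)
        out.append(i)
        table.insert(0, table.pop(i))   # move the symbol to the front
    return out

def mtf_decode(codes, alphabet):
    table, out = list(alphabet), []
    for i in codes:
        symbol = table[i]
        out.append(symbol)
        table.insert(0, table.pop(i))
    return out

alphabet = "abcdr"
assert mtf_decode(mtf_encode("abracadabra", alphabet), alphabet) == list("abracadabra")
```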