Extend LZ compression from byte streams to typed data

Determine a principled extension of LZ-based lossless compression, which is traditionally defined on byte streams, to typed data representations such as fixed-size struct(k) streams and numeric(w) streams. Characterize the design requirements for a generic LZ engine that operates natively on typed data rather than on raw bytes.

Background

The paper proposes a graph model of compression and an implementation, OpenZL, that treats data as typed message sets (e.g., bytes, string, struct(k), numeric(w)) and composes small codecs in directed acyclic graphs. Within this framework, LZ compression is currently well understood for byte streams, but generalizing copy-based matching to operate natively on typed data is nontrivial.

OpenZL includes a preliminary codec (FieldLZ) that performs LZ-style matching over struct streams by finding matches of whole records rather than bytes. This shows promise, but it also highlights that a comprehensive, generic LZ engine for typed data has not yet been established. The authors explicitly state that identifying the natural extension of LZ to typed data remains an open question.
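To make the idea of matching whole records concrete, here is a minimal Python sketch of a greedy LZ77-style coder whose alphabet is the k-byte record rather than the byte. It is an illustration under stated assumptions, not FieldLZ's actual algorithm or OpenZL's API: the function names (`split_records`, `record_lz_encode`, `record_lz_decode`), the token layout, and the single-candidate hash lookup are all hypothetical choices made for brevity.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical token layout (not OpenZL's format):
#   ("lit", record)            one literal k-byte record
#   ("match", offset, length)  copy `length` records from `offset` records back
Token = Union[Tuple[str, bytes], Tuple[str, int, int]]


def split_records(data: bytes, k: int) -> List[bytes]:
    """View a byte buffer as a struct(k) stream: a list of k-byte records."""
    if k <= 0 or len(data) % k != 0:
        raise ValueError("buffer length must be a positive multiple of k")
    return [data[i:i + k] for i in range(0, len(data), k)]


def record_lz_encode(records: List[bytes], min_match: int = 2) -> List[Token]:
    """Greedy LZ77-style parse where offsets and lengths count records."""
    last_seen: Dict[bytes, int] = {}  # record value -> most recent index
    tokens: List[Token] = []
    i, n = 0, len(records)
    while i < n:
        cand = last_seen.get(records[i])
        length = 0
        if cand is not None:
            # Extend the match record by record; it may overlap position i.
            while i + length < n and records[cand + length] == records[i + length]:
                length += 1
        if length >= min_match:
            tokens.append(("match", i - cand, length))
            step = length
        else:
            tokens.append(("lit", records[i]))
            step = 1
        # Index every record just consumed so later positions can match it.
        for j in range(i, i + step):
            last_seen[records[j]] = j
        i += step
    return tokens


def record_lz_decode(tokens: List[Token]) -> List[bytes]:
    """Invert record_lz_encode; copies proceed one record at a time."""
    out: List[bytes] = []
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, offset, length = tok
            start = len(out) - offset
            for j in range(length):
                out.append(out[start + j])  # safe for overlapping matches
    return out


if __name__ == "__main__":
    recs = split_records(b"AAAABBBBCCCC" * 3, k=4)
    toks = record_lz_encode(recs)
    assert record_lz_decode(toks) == recs
    print(toks)
```

Because matches can only start and end on record boundaries, this sketch never spends effort on byte-misaligned candidates, but it also cannot exploit partial matches within a record (for example, records that agree in all but one field). Handling that case, and deciding what a "match" should mean for numeric(w) streams, is part of what the open question leaves unresolved.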

References

While it's true that LZ on byte streams is essentially a solved problem, the natural extension to typed data still remains an open question. Indeed, our implementation of FieldLZ demonstrates that there is significant progress yet to be made towards a truly generic LZ engine.

— OpenZL: A Graph-Based Model for Compression (2510.03203 - Collet et al., 3 Oct 2025) in Subsection “Future Work” (Conclusions)
