ZIP Selection: Compression-Ratio Data Curation
- ZIP selection is a methodology that uses empirical compression metrics as a proxy for data quality, redundancy, and informativeness.
- It applies a multi-stage greedy algorithm to rank and select samples, promoting high information density and task-aligned data curation.
- Empirical results demonstrate that ZIP methods accelerate model convergence and outperform embedding-based techniques in various domains.
Compression-ratio–based data selection (referred to here as “ZIP selection”; Editor's term) encompasses a set of methodologies that employ empirical compression metrics—principally the compressed size or compression ratio achieved by standard lossless or lossy codecs—as a proxy for data quality, informativeness, and redundancy. Originating in universal compression theory and operationalized in large-scale dataset curation and scientific data storage, ZIP selection methods aim to maximize information diversity, minimize redundancy, or align training data with target tasks, all without reliance on model-based scoring or external supervision. ZIP-style data selection is contrasted with neural or embedding-based approaches by its simplicity, computational frugality, and close ties to fundamental information-theoretic measures.
1. Theoretical Foundations and the Entropy Law
The central premise of compression-ratio–based selection rests on the equivalence between optimal coding length and statistical modeling performance. Consider a lossless compressor operating on a sequence $x_{1:n}$; for an ideal coder, the output length approaches $-\log_2 p(x_{1:n})$ bits under the true generative distribution $p$. Consequently, for language modeling, the expected per-token code length coincides (up to the logarithm base) with the empirical cross-entropy loss, and the compression ratio reflects the degree of redundancy or consistency within the dataset.
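Concretely, for a coder whose implicit model is a distribution $q$ (notation introduced here for exposition), the expected per-symbol code length decomposes into the data's entropy plus the model's divergence from the true distribution $p$, a standard information-theoretic identity:

```latex
% Expected per-symbol code length of an ideal coder built on model q,
% applied to data drawn from the true distribution p:
\[
  \mathbb{E}_{x \sim p}\!\left[-\log_2 q(x)\right]
    \;=\; \underbrace{H(p)}_{\text{irreducible entropy}}
    \;+\; \underbrace{D_{\mathrm{KL}}(p \,\|\, q)}_{\text{model mismatch}}
\]
% Shorter attainable code length (equivalently, a better compression ratio)
% therefore certifies a better statistical model of the data, and vice versa.
```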
The "entropy law" posited in (Yin et al., 2024) formalizes this link: where is final model performance, is the compression ratio , is first-epoch training loss, is data consistency, and is average sample quality. In practice, with held constant, performance improvements are closely associated with lower (greater information density) and lower (higher consistency). The entropy law thus predicts that a carefully selected subset with a lower compression ratio yields better downstream model performance, a finding validated in both supervised and preference-based LLM fine-tuning.
2. Algorithmic Frameworks for ZIP Selection
ZIP selection algorithms exhibit a common pattern: they use measured compressed length as an objective function for subset selection. For lossless compressors such as gzip (DEFLATE), the ZIP method in (Yin et al., 2024) adopts a three-phase greedy algorithm to select a diverse, non-redundant, information-rich subset from a large unlabeled pool. The process can be summarized:
- Global Ranking: Rank all samples by their individual (singleton) compression ratio $R(\{x\})$. Select the $K_1$ samples with the lowest ratios (least compressible) as the candidate pool $\mathcal{D}_1$.
- Coarse Local Selection: For each candidate $x \in \mathcal{D}_1$, evaluate $R(\mathcal{S} \cup \{x\})$ for the current selected subset $\mathcal{S}$. Retain the $K_2$ candidates with the lowest values as $\mathcal{D}_2$.
- Fine-grained Greedy Selection: Iteratively select samples from $\mathcal{D}_2$ that yield the maximal reduction in $R(\mathcal{B} \cup \{x\})$, where $\mathcal{B}$ is the batch under selection.
The algorithm repeats until the sample budget is exhausted, each iteration requiring only a modest number of new compression operations. Diversity is enforced at every level: the global stage avoids highly generic or repetitive samples, while the local stages ensure new samples provide incremental information with minimal overlap with the growing subset (Yin et al., 2024).
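The sketch below mirrors this three-stage loop using gzip-compressed length as the only scoring signal; the pool-size parameters (`k1`, `k2`, `batch_size`), helper names, and update schedule are placeholder choices for illustration, not values from the reference implementation of (Yin et al., 2024).

```python
import gzip

def ratio(texts: list[str]) -> float:
    """Compression ratio (original bytes / compressed bytes) of a concatenated
    set of samples; lower values mean less redundancy among the samples."""
    blob = "\n".join(texts).encode("utf-8")
    return len(blob) / len(gzip.compress(blob))

def zip_select(pool: list[str], budget: int,
               k1: int = 1000, k2: int = 100, batch_size: int = 10) -> list[str]:
    """Three-stage greedy selection by compression ratio (illustrative sketch)."""
    selected: list[str] = []
    remaining = list(pool)
    while len(selected) < budget and remaining:
        # Stage 1 (global ranking): keep the k1 least-compressible candidates.
        d1 = sorted(remaining, key=lambda x: ratio([x]))[:k1]
        # Stage 2 (coarse local): keep the k2 candidates that add the least
        # redundancy relative to the already-selected subset.
        d2 = sorted(d1, key=lambda x: ratio(selected + [x]))[:k2]
        # Stage 3 (fine-grained greedy): grow a small batch, always adding the
        # candidate that keeps the batch's joint compression ratio lowest.
        batch: list[str] = []
        while len(batch) < batch_size and d2:
            best = min(d2, key=lambda x: ratio(batch + [x]))
            batch.append(best)
            d2.remove(best)
        selected.extend(batch)
        for x in batch:
            remaining.remove(x)
    return selected[:budget]
```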
3. Task Alignment via Compression: ZIP-FIT
For machine learning tasks such as domain adaptation or fine-tuning, ZIP-FIT (Obbad et al., 2024) extends compression-based selection by measuring structural alignment between candidate source examples and a small labeled target set. The core metric is the Normalized Compression Distance (NCD), $\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$, where $C(\cdot)$ is the gzip-compressed length and $xy$ denotes byte-wise concatenation. ZIP-FIT scores each source candidate $x$ by its compression-based alignment with the target samples $\{t_1, \dots, t_m\}$, i.e., by how low its NCD to the targets is on average; this score reflects the degree to which $x$ is structurally similar to the task distribution defined by the targets. ZIP-FIT selects the top-$k$ candidates by this alignment criterion and is shown to accelerate convergence in fine-tuning LLMs, outperforming embedding-based and n-gram–based data selection in both learning efficiency and selection speed (Obbad et al., 2024).
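A minimal sketch of gzip-based NCD and alignment scoring follows; the mean-similarity aggregation over target samples and all identifiers are illustrative assumptions rather than the exact scoring rule of (Obbad et al., 2024).

```python
import gzip

def c_len(data: bytes) -> int:
    """gzip-compressed length in bytes."""
    return len(gzip.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: smaller means more structural overlap."""
    cx, cy, cxy = c_len(x), c_len(y), c_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def alignment(candidate: str, targets: list[str]) -> float:
    """Mean compression-based similarity of a candidate to the target task data
    (aggregation by simple averaging is an assumption of this sketch)."""
    x = candidate.encode("utf-8")
    return sum(1.0 - ncd(x, t.encode("utf-8")) for t in targets) / len(targets)

def zipfit_select(pool: list[str], targets: list[str], k: int) -> list[str]:
    """Keep the top-k source candidates most aligned with the target set."""
    return sorted(pool, key=lambda s: alignment(s, targets), reverse=True)[:k]
```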
4. Compression-Ratio–Based Selection in Lossy and Scientific Data
ZIP methodologies are not limited to text. In high-performance computing (HPC), lossy compression algorithms such as SZ and ZFP exhibit data-dependent rate–distortion behavior. An automatic online selector (Tao et al., 2018) estimates, for each data field, the rate and distortion (e.g., bit-rate per value and peak signal-to-noise ratio, PSNR) using lightweight sampling operations sufficient to predict compressor performance with high accuracy. The algorithm proceeds in three stages:
- Stage I: Sampled residual or block transform analysis to estimate quantizer parameters.
- Stage II: Bit-rate and distortion estimation for both compressors, without executing full compression.
- Stage III: Select the compressor that minimizes bit-rate under an error (distortion) constraint, or maximizes PSNR under a bit-rate budget.
This approach achieves compressor-selection accuracy of 99%, yielding up to 70% improvement in compression ratio at fixed distortion, for under 7% overhead versus always using the best single compressor (Tao et al., 2018).
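Since the actual rate and PSNR predictors in Tao et al. (2018) are built from sampled quantization and block-transform statistics specific to SZ and ZFP, the sketch below treats the Stage I/II outputs as precomputed estimates and shows only the Stage III decision rule; the `Estimate` container, function names, and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    """Predicted performance of one compressor on one data field (Stage I/II output)."""
    name: str
    bit_rate: float   # predicted bits per value
    psnr: float       # predicted peak signal-to-noise ratio (dB)

def select_compressor(estimates: list[Estimate], mode: str = "fixed_psnr",
                      psnr_target: float = 80.0, bitrate_budget: float = 4.0) -> str:
    """Stage III decision rule: minimize rate under a distortion constraint,
    or maximize PSNR under a bit-rate budget."""
    if mode == "fixed_psnr":
        feasible = [e for e in estimates if e.psnr >= psnr_target]
        if feasible:
            return min(feasible, key=lambda e: e.bit_rate).name
        return max(estimates, key=lambda e: e.psnr).name  # fall back to best PSNR
    feasible = [e for e in estimates if e.bit_rate <= bitrate_budget]
    if feasible:
        return max(feasible, key=lambda e: e.psnr).name
    return min(estimates, key=lambda e: e.bit_rate).name  # fall back to lowest rate

# Hypothetical usage: per-field estimates would come from Stage I/II sampling.
field_estimates = [Estimate("SZ", bit_rate=2.1, psnr=84.3),
                   Estimate("ZFP", bit_rate=2.6, psnr=88.9)]
print(select_compressor(field_estimates))  # -> "SZ" under the PSNR-constrained mode
```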
5. Time-Universal Compression-Ratio–Based Selection
Compression-ratio–based data selection also occurs in the context of "time-universal" data compression (Ryabko, 2018). Given candidate compressors $\varphi_1, \dots, \varphi_m$, the time-universal method identifies the best-performing compressor for an entire sequence $x_1 x_2 \cdots x_n$, but pays only a small fraction $\delta$ of the total compression time as search overhead. The procedure is:
- Choose a prefix $x_1 \cdots x_r$ whose length $r$ is set by the admissible time overhead $\delta$.
- Run all $m$ compressors on this prefix; select the compressor $\varphi_{i^*}$ producing the shortest output.
- Encode the index $i^*$, followed by the full sequence compressed with $\varphi_{i^*}$.
Asymptotically, this strategy guarantees that the per-symbol code length matches that of the best candidate compressor, while the total running time exceeds a single compression pass only by an overhead controlled by $\delta$. The protocol is directly applicable with ZIP-family archivers (gzip, lzma, bzip2) operating as black boxes, with negligible header overhead and control over the time/optimality tradeoff via $\delta$ (Ryabko, 2018).
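A black-box realization of this protocol with Python's standard-library codecs (zlib, bz2, and lzma standing in for the ZIP-family archivers) is sketched below; the exact prefix-length rule is a simplification parameterized by the overhead fraction delta, and a single index byte plays the role of the negligible header.

```python
import bz2
import lzma
import zlib

# Candidate black-box compressors playing the role of the ZIP-family archivers.
COMPRESSORS = [("zlib", zlib.compress, zlib.decompress),
               ("bz2", bz2.compress, bz2.decompress),
               ("lzma", lzma.compress, lzma.decompress)]

def time_universal_compress(data: bytes, delta: float = 0.05) -> bytes:
    """Trial-compress a delta-fraction prefix with every candidate, then emit the
    winner's index (one header byte) plus the full sequence compressed with it."""
    prefix = data[: max(1, int(delta * len(data)))]
    best = min(range(len(COMPRESSORS)),
               key=lambda i: len(COMPRESSORS[i][1](prefix)))
    return bytes([best]) + COMPRESSORS[best][1](data)

def time_universal_decompress(blob: bytes) -> bytes:
    """Read the index byte, then decompress with the corresponding codec."""
    return COMPRESSORS[blob[0]][2](blob[1:])

# Round-trip check on a mixed redundant/structured payload.
payload = b"abcde" * 2000 + bytes(range(256)) * 40
assert time_universal_decompress(time_universal_compress(payload)) == payload
```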
6. Empirical Results and Applications
Compression-ratio–based selection methods exhibit consistent, often superior, empirical performance across multiple domains:
- In LLM supervised fine-tuning, ZIP yields absolute MT-Bench gains of 5–10% over strong baselines by selecting subsets with lower redundancy yet comparable average quality (Yin et al., 2024).
- In RLHF scenarios, ZIP achieves the highest benchmark scores by curating preference data with maximized information density.
- ZIP-FIT demonstrates 62.8–85.1% faster convergence to low cross-entropy loss and up to 211× faster selection compared to baselines in diverse code generation and autoformalization tasks (Obbad et al., 2024).
- In scientific lossy compression, online per-field selection between SZ and ZFP improves rate–distortion up to 70% over static assignments (Tao et al., 2018).
A notable empirical finding is that higher alignment (measured by NCD or similar compression-based distance) and lower compression ratio consistently yield both faster training convergence and higher evaluation scores, even when the data pool is of fixed or high mean quality (Yin et al., 2024, Obbad et al., 2024).
7. Limitations, Extensions, and Best Practices
ZIP selection assumes the average sample quality within the pool is roughly constant; if not, a preliminary filter using heuristic or model-based scorers is recommended (Yin et al., 2024). For very large pools, computational cost can be mitigated by reducing candidate pool sizes, parallelizing compression calls, or using approximate delta-compression. In data with high semantic redundancy but low lexical overlap, the compression ratio may underestimate redundancy, motivating combination with embedding-based filters.
Extensions include multi-stage selection (hierarchical or MDL-weighted), flexible use of alternative compressors (LZ4, Snappy, etc.), and adaptation to real-time or streaming settings. For scientific data, online selectors may be generalized to more than two candidate compressors or extended to fieldwise, blockwise, or hierarchical selection (Tao et al., 2018).
Conclusion
Compression-ratio–based data selection operationalizes information-theoretic principles for diverse goals—including minimization of redundancy, maximization of informativeness, and task alignment—by leveraging empirical compression statistics as the selection criterion. These methods offer provable performance bounds, low computational overhead, and robust applicability across text, code, and scientific datasets, establishing compression as a practical and theoretically sound approach to high-quality data curation and efficient model training (Ryabko, 2018, Yin et al., 2024, Obbad et al., 2024, Tao et al., 2018).