Corpus Filtering & Compression

Updated 16 August 2025
  • Corpus filtering and compression are techniques that reduce large text corpora by eliminating noise and retaining high-value data for efficient computation.
  • They employ methods such as rule-based filters, neural embedding similarity, and optimization-driven compression to improve data quality.
  • Applications in NLP, machine translation, and model pretraining are validated by metrics like improved BLEU scores and reduced dataset sizes.

Corpus filtering and compression refer to a constellation of algorithmic, statistical, and optimization techniques developed to reduce the size of large data corpora while selectively retaining, emphasizing, or transforming the most informative or relevant content for downstream computational tasks. These approaches are critical in NLP, information retrieval, foundation model pretraining, and other fields where high-volume, noisy, or redundant data threaten both efficiency and predictive performance. Methods include combinatorial and linear-programming optimization of dictionary representations, probabilistic scoring and thresholding, language-informed feature engineering, neural classifier-based filtering, and schema-specific and information-theoretic compression models. The choice and design of algorithms depend sharply on corpus characteristics (e.g., repetitiveness, language mix, encoding, domain), computational constraints, and application requirements.

1. Principles and Models of Corpus Filtering

Corpus filtering fundamentally treats the data set as comprising noisy, redundant, or low-relevance segments alongside high-value signals. Techniques for filtering include:

  • Multi-criteria Filtering: Use of rule-based conditions (e.g., length ratios, punctuation, language identification) and classifier-based scoring to eliminate suspect segments. For large web-mined corpora, aggregated filtering signals (e.g., scores from classifiers trained on clean references) are decisive (Wang et al., 2018).
  • Embedding Similarity Filtering: Application of pretrained neural embeddings (e.g., LaBSE, BERT) to compute similarity scores between data segments (sentence pairs, phrases), often using cosine similarity. Thresholding on these scores selects only high-quality data (Batheja et al., 2023), sometimes supported by fine-tuning to local corpus distributions; a minimal sketch follows below.
  • Quality Estimation (QE): Use of reference-free models (e.g., MonoTransQuest with XLM-R) for scoring the translation adequacy and signal quality of sentence pairs, achieving high correlation with human judgment and outperforming similarity-based approaches (Batheja et al., 2023).
  • Information Bottleneck Filtering: The IB principle guides the selection of compressed representations that maximize retained mutual information with desired outputs and minimize redundant information with the raw corpus (Zhu et al., 3 Jun 2024).

The balance between noise removal and informative content retention is achieved by adjusting model parameters (e.g., storage cost, threshold τ, regularization terms), optimizing criteria such as cross-entropy loss, and performing ablation analysis.
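
In practice, the embedding-similarity filtering described above reduces to scoring each candidate segment pair and keeping those whose score clears a tuned threshold τ. The following is a minimal sketch assuming precomputed sentence embeddings from a multilingual encoder such as LaBSE; the function names and the default threshold are illustrative choices, not part of any cited system.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n, d) embedding matrices."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_norm * b_norm, axis=1)

def filter_parallel_corpus(src_emb: np.ndarray,
                           tgt_emb: np.ndarray,
                           pairs: list[tuple[str, str]],
                           tau: float = 0.75) -> list[tuple[str, str]]:
    """Keep only sentence pairs whose embedding similarity is at least tau.

    src_emb / tgt_emb: (n, d) arrays of sentence embeddings (e.g., from LaBSE).
    tau: similarity threshold, tuned on downstream validation data in practice.
    """
    scores = cosine_similarity(src_emb, tgt_emb)
    return [pair for pair, score in zip(pairs, scores) if score >= tau]
```

The threshold is the main control knob; Section 5 returns to how it is calibrated empirically.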

2. Compression Algorithms and Formal Optimization

Corpus compression comprises a suite of algorithms supporting both generic and domain-specific transformation:

  • Deep Dictionary Learning: Dracula (Paskov et al., 2015) extends the Compressive Feature Learning paradigm by recursively compressing its own dictionary of n-grams. The corpus is represented as a directed acyclic graph, with binary linear program (BLP) or linear programming (LP) relaxation formulations for selecting substrings and pointer sets. The objective couples document reconstruction modules (min-cost flow/interval structure) with dictionary compression, and solution paths are well-behaved under changes in cost parameters.
  • Unicode-aware Compression: Tokenization of UTF-8 into codepoint-level tokens and adjustment of traditional byte-based compressors (LZW, PPM) to operate over tokens rather than bytes, incorporating Pólya tree priors to handle symbol frequency heterogeneity (Gleave et al., 2017); a toy codepoint-level sketch follows this list.
  • Component-based Modular Frameworks: tudocomp (Dinklage et al., 2017) enables pipeline composition of compression, coding, and transformation modules, facilitating rapid benchmarking and parameter sweeps. Lempel-Ziv variants (lcpcomp, LZ78U) are implemented for highly repetitive text, leveraging data structures like suffix trees and longest common prefix arrays.
  • Streaming Schemes: Corpus-compressed streaming schemes (CCSS) distinguish between setup (high-bandwidth) and transmission (low-bandwidth) phases, using schematic representations (e.g., Moore machines) to achieve compact coding and satisfy formal lower bounds (minimum code length $p(n,z) \geq \log(n+1) - 1$) (Alston, 2017).
  • XML-specific and General-purpose Compressors: Benchmarked using combined efficiency metrics (compression ratio and runtime), general-purpose compressors (GZIP, BZIP2) often outperform XML-specialized tools except on highly structured or domain-constrained files (Augeri et al., 10 Oct 2024).
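
To make the Unicode-aware entry above concrete, the sketch below implements a toy LZW variant over Unicode codepoints rather than raw bytes, so multi-byte UTF-8 characters are never split mid-character. It omits the Pólya tree priors and entropy coding of the cited work; seeding the dictionary from the observed codepoints is an assumption made here for brevity.

```python
def lzw_compress_codepoints(text: str) -> list[int]:
    """Toy LZW operating on codepoint-level tokens instead of bytes."""
    # Seed the dictionary with every distinct codepoint in the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    codes: list[int] = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)  # learn the new codepoint sequence
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes
```

A real compressor would also transmit the seed alphabet (or derive it from a shared prior) and entropy-code the integer output; the point here is only that the dictionary grows over codepoint sequences, not byte sequences.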

3. Feature Derivation and Corpus Representation

Feature extraction and representation compression are integral to downstream learning tasks:

  • Hierarchical Feature Vectors: Dracula derives bag-of-n-grams features, counting the usage of dictionary elements at multiple levels. Hierarchical diffusion (matrices $G$, $H$) propagates counts to substrings, serving as structured regularization in linear models and enabling invariance under unregularized learning (Paskov et al., 2015).
  • Representation Dimensionality Compression: CoRe applies recursive autoencoding, most effectively via singular value decomposition (SVD), to compress document embeddings from hundreds to single-digit dimensions, sometimes improving F1 scores through denoising effects (Škrlj et al., 2021). The recursive approach is representation-agnostic and enables adaptive compression steps; a minimal SVD sketch follows this list.
  • Unicode Block Feature Vectors: uniblock (Gao et al., 2019) summarizes sentences by probabilistic vectors over Unicode blocks, fitting Gaussian mixture models (GMMs) through variational inference. Log-likelihood scores enable thresholding and filtering in various NLP tasks (sentiment analysis, machine translation).
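
The following is a minimal, non-recursive sketch of the SVD-based compression step mentioned in the Representation Dimensionality Compression entry, assuming scikit-learn's TruncatedSVD; CoRe itself applies the reduction recursively and in a representation-agnostic way, which is omitted here.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def compress_embeddings(doc_emb: np.ndarray, target_dim: int = 8) -> np.ndarray:
    """Project (n_docs, d) document embeddings down to target_dim components.

    Discarding low-variance directions is also a plausible source of the
    denoising effect reported for aggressive compression.
    """
    svd = TruncatedSVD(n_components=target_dim, random_state=0)
    return svd.fit_transform(doc_emb)

# Illustrative usage: 1,000 documents with 768-dimensional embeddings -> 8 dims.
compressed = compress_embeddings(np.random.randn(1000, 768), target_dim=8)
```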

4. Application Domains and Empirical Evaluation

Corpus filtering and compression are deployed across diverse tasks:

  • Machine Translation: High-quality parallel corpora are curated using classifier-based, embedding-based, or QE-based filtering of noisy data (WMT, Paracrawl) (Wang et al., 2018, Zhang et al., 2020, Batheja et al., 2023, Batheja et al., 2023, Xu et al., 2020, Blin et al., 2022). BLEU score improvements (up to 2.7 points) are reported, and aggressive filtering can remove up to 40% of initial data while boosting translation accuracy in some directions.
  • Pretraining for Mathematical Reasoning: MathPile (Wang et al., 2023) applies rule-based noise filtering, language identification, thresholded cleaning, and scalable deduplication (MinHash LSH) to obtain a high-quality, math-centric corpus, compressing terabytes of noisy sources into a high-density, 9.5-billion-token dataset; a simplified deduplication sketch follows this list.
  • Retrieval-Augmented Generation: The information bottleneck approach filters retrieved passages to maximize mutual information with correct answers and minimize redundancy, yielding a compression rate down to 2.5% and substantial gains in correctness and conciseness (Zhu et al., 3 Jun 2024).
  • Text Classification: Recursive SVD-based CoRe demonstrates substantial reduction in computation and storage with little loss (sometimes an improvement) in classification accuracy over 17 diverse corpora (Škrlj et al., 2021).
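
The deduplication step cited for MathPile relies on MinHash LSH; the simplified sketch below computes MinHash signatures over character shingles and flags near-duplicate pairs by estimated Jaccard similarity, omitting the banded LSH index that makes the approach scale. Shingle size, signature length, and the 0.8 threshold are illustrative defaults, not values from the cited paper.

```python
import hashlib
from itertools import combinations

def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(doc_shingles: set[str], num_perm: int = 64) -> list[int]:
    """The minimum of each seeded hash approximates a random permutation."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_duplicate_pairs(docs: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Pairwise signature comparison; a banded LSH index replaces this at scale."""
    sigs = [minhash_signature(shingles(d)) for d in docs]
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if estimated_jaccard(sigs[i], sigs[j]) >= threshold]
```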

5. Challenges, Trade-offs, and Selection Criteria

Designing filters and choosing compression algorithms involve several nontrivial considerations:

  • Threshold Tuning: Thresholds set too stringently risk discarding relevant data, while overly lax settings admit more noise. Empirical validation on downstream tasks, cross-validation, and ablation studies are standard for calibration (Wang et al., 2018, Batheja et al., 2023); a sketch of such a sweep follows this list.
  • Computational Expense: Embedding computation and probabilistic scoring can be intensive for large corpora. Strategies to reduce redundancy (e.g., extracting only longest unique phrase pairs) and leveraging transfer learning (few-shot QE fine-tuning) make filtering scalable (Batheja et al., 2023).
  • Corpus Characteristics: Structure, repetitiveness, file type, entropy, and other properties significantly impact choice of compressor or filter design (Dinklage et al., 2017, Augeri et al., 10 Oct 2024). For XML, domain, tree depth, and schema presence predict compression efficiency.
  • XML-specific vs. General-Purpose Tools: XML-specific compressors (XMill, WBXML) may be preferable for small, structure-rich XML, but general-purpose compressors consistently rank high when both compression ratio and speed are evaluated in aggregate (Augeri et al., 10 Oct 2024).
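
Threshold tuning (first bullet above) is typically an empirical sweep: filter at each candidate τ, evaluate a downstream model, and pick the best trade-off between retained data and quality. The sketch below assumes a pre-scored corpus and a hypothetical evaluate_downstream callback standing in for an expensive task-specific evaluation, such as BLEU of a model trained on the filtered subset.

```python
from typing import Callable, Sequence

def sweep_threshold(scores: Sequence[float],
                    pairs: Sequence[tuple[str, str]],
                    evaluate_downstream: Callable[[list[tuple[str, str]]], float],
                    candidates: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return (best_tau, best_metric, retained_fraction) over candidate thresholds.

    evaluate_downstream is a placeholder for a task-specific metric; in
    reported pipelines this step dominates the cost of calibration.
    """
    best = None
    for tau in candidates:
        kept = [pair for pair, score in zip(pairs, scores) if score >= tau]
        if not kept:
            continue
        metric = evaluate_downstream(kept)
        if best is None or metric > best[1]:
            best = (tau, metric, len(kept) / len(pairs))
    return best
```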

6. Mathematical and Information-Theoretic Foundations

Methods are grounded in formal optimization, combinatorics, and information theory:

  • Linear Programming Relaxations: Dracula’s LP relaxation exploits interval structure for tractable min-cost flow subproblems and uses polyhedral cutting planes (Chvátal–Gomory constraints) for tightening feasible sets (Paskov et al., 2015); a schematic formulation follows this list.
  • Degeneracy Spectrum in Filtering: Statistical analysis of filter-induced degeneracy (number of inputs per output) shows cumulative distributions decaying as $\exp[-c(\ln d)^\alpha]$ for $\alpha > 1$, with maximum entropy achieved by shortest nontrivial filters (Baxter et al., 2019).
  • Efficiency Metrics: XML compression benchmarks use integrated metrics combining log compression ratio and log execution speed ($\mathrm{Jeff}_{\mathrm{prop}}$), facilitating quantitative comparison across domains and tools (Augeri et al., 10 Oct 2024).
  • Streaming and Kolmogorov Complexity: CCSS links description complexity to schematic coding, with lower-bound theorems constraining streaming code length and highlighting the distinction from generic, redundancy-based compression (Alston, 2017).
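
For orientation, the dictionary-selection problem behind the first bullet can be written schematically as a binary linear program whose relaxation replaces integrality with box constraints. The notation below is generic (c_s is the cost of storing substring s, p_{d,s} the cost of pointing document d at s) and elides Dracula's flow-based reconstruction constraints, so it should be read as a schematic rather than the exact formulation of Paskov et al.

```latex
\begin{aligned}
\min_{x,\,y} \quad & \sum_{s} c_s\, x_s \;+\; \sum_{d,s} p_{d,s}\, y_{d,s} \\
\text{s.t.} \quad & \text{every position of document } d \text{ is covered by some selected pointer } y_{d,s}, \\
& y_{d,s} \le x_s \quad \text{(a pointer may only reference a stored substring)}, \\
& x_s,\, y_{d,s} \in \{0,1\} \ \text{(BLP)} \;\;\longrightarrow\;\; x_s,\, y_{d,s} \in [0,1] \ \text{(LP relaxation)}.
\end{aligned}
```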

7. Future Directions and Broader Implications

Corpus filtering and compression are poised for continued methodological diversification:

  • Adaptive Filtering: Expanded use of learned, dynamic models to adjust filters based on corpus drift and evolving application requirements (reinforcement learning reward constructions, few-shot transfer) (Zhu et al., 3 Jun 2024, Batheja et al., 2023).
  • Automation in Compression: Enhanced frameworks for automated benchmarking, margin identification in representation dimension, and strategy patterns for seamless integration into NLP pipelines (Dinklage et al., 2017, Škrlj et al., 2021).
  • Generalization Across Domains: Extension of information bottleneck filtering to multimodal and conversational systems; exploration of non-linear autoencoding and ontology-enriched semantic compression.
  • Quality Assessment: Open availability of filtered corpora (MathPile, CJaFr-v3) and scripts highlights the drive toward reproducibility, high-quality benchmarking, and fine-grained contamination control (Wang et al., 2023, Blin et al., 2022).
  • Evaluation Metrics Evolution: Calls for exploration of decompression speed and memory usage, expansion of efficiency metrics to accommodate emergent formats and new application requirements (Augeri et al., 10 Oct 2024).

A plausible implication is that, as the scale of foundation models and NLP systems grows, corpus filtering and compression will shift from peripheral preprocessing steps to first-class algorithmic components, tightly integrated with learning, reasoning, and inference modules, driving model efficiency and reliability in data-intensive environments.
