CRAFT: Clustered Regression for Adaptive Filtering of Training data

Published 24 Apr 2026 in cs.CL and cs.AI | (2604.22693v1)

Abstract: Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8× speedup.


Authors (3)

Summary

The paper introduces CRAFT, which leverages clustered regression to efficiently select training subsets that closely match validation distributions.
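It employs a two-stage procedure that first matches the validation source distribution by allocating the selection budget proportionally across k-means clusters, which bounds the KL divergence between selected and validation sources, and then selects targets within each source cluster by minimizing a conditional expected distance.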
CRAFT reaches 43.34 BLEU, 2.13 points above TSDS (41.21) while selecting over 40 times faster; TAROT scores higher (45.61 BLEU) but takes 2.8 times longer to select. CRAFT is vectorization-agnostic, working with both dense and TF-IDF representations.

CRAFT: Clustered Regression for Adaptive Filtering of Training Data

Motivation and Problem Formulation

The rapid growth of parallel corpora for sequence-to-sequence tasks, especially neural machine translation (NMT), has highlighted the inefficiency of full fine-tuning on ever-expanding datasets. Many models can approach or surpass full-data performance by fine-tuning on a strategically selected subset, provided it sufficiently captures the relevant distributional properties present in the validation (target) set. The challenge lies in efficiently identifying a small, high-quality subset from candidate pools with tens of millions of examples that (1) matches the validation data distribution, (2) does so orders of magnitude faster than previous methods, and (3) is robust to different vectorization strategies.

CRAFT (Clustered Regression for Adaptive Filtering of Training Data) is introduced as a vectorization-agnostic data selection algorithm for training seq2seq models. It is founded on a principled decomposition of the joint source-target distribution $P(S,T)$ into a two-stage selection procedure that exploits the conditional structure inherent in parallel datasets—a structure not explicitly leveraged by prior selection methods.

Methodology

Factorization and Two-Stage Selection

CRAFT formalizes data selection as matching the empirical distribution induced by a validation set $V = \{(s_i, t_i)\}_{i=1}^M$ using a subset $T' \subset T$ of size $k \ll N$ from a very large candidate pool $T$ of $N$ pairs. The joint distribution $P(S,T)$ is decomposed as $P(S) \cdot P(T\mid S)$, guiding a two-stage procedure:

Source Marginal Matching: Validation source embeddings are clustered via $k$-means ($m_s$ clusters). Cluster occupancy proportions are used to allocate selection budget per source cluster, thereby minimizing KL divergence between the source distribution of the selected data and the validation set. It is analytically shown that this proportional allocation minimizes the discretized KL divergence over the clusters, with an explicit upper bound on the continuous KL divergence that vanishes as clusters become fine (smaller diameter).
Conditional Target Selection: Within each source cluster, validation target embeddings are separately clustered into $m_t$ clusters. For a candidate in a source cluster, the selection score is defined as the expected distance to validation target cluster centroids, weighted by the empirical conditional distribution $P(T\mid S)$ observed in the validation set. This strategy regularizes selection (preventing over-concentration) by scoring at the cluster centroid level, not raw embedding distance, yielding robustness to metric noise and vectorizer choice.
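The two stages above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the $k$-means centroids and the per-cluster conditional weights have already been computed from the validation set, uses Euclidean distance, and rounds the proportional budget with a largest-remainder rule. The function names (`allocate_budget`, `conditional_score`, `select`) and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(vec, centroids[j]))

def allocate_budget(val_src_labels, n_clusters, budget):
    """Stage 1: split the budget across source clusters in proportion to
    validation occupancy, using largest-remainder rounding (an assumption)."""
    counts = Counter(val_src_labels)
    total = len(val_src_labels)
    exact = [budget * counts.get(c, 0) / total for c in range(n_clusters)]
    alloc = [math.floor(x) for x in exact]
    leftover = budget - sum(alloc)
    by_remainder = sorted(range(n_clusters),
                          key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc

def conditional_score(tgt_vec, tgt_centroids, weights):
    """Stage 2 score: expected distance to the validation target centroids,
    weighted by the empirical conditional weights of this source cluster."""
    return sum(w * math.dist(tgt_vec, mu) for w, mu in zip(weights, tgt_centroids))

def select(candidates, src_centroids, tgt_centroids, tgt_weights, alloc):
    """Bucket candidates by nearest source centroid, then keep the
    lowest-scoring alloc[c] candidates in each bucket."""
    buckets = defaultdict(list)
    for i, (src_vec, tgt_vec) in enumerate(candidates):
        c = nearest(src_vec, src_centroids)
        score = conditional_score(tgt_vec, tgt_centroids[c], tgt_weights[c])
        buckets[c].append((score, i))
    chosen = []
    for c, scored in buckets.items():
        scored.sort()
        chosen.extend(i for _, i in scored[:alloc[c]])
    return sorted(chosen)

# Toy data (hypothetical): 2 source clusters, 75% of validation sources in cluster 0.
src_centroids = [(0.0, 0.0), (10.0, 10.0)]
tgt_centroids = [[(0.0,), (4.0,)], [(1.0,)]]   # target centroids per source cluster
tgt_weights = [[0.75, 0.25], [1.0]]            # conditional weights per source cluster
candidates = [
    ((0.1, 0.0), (2.0,)),
    ((0.0, 0.2), (9.0,)),
    ((0.3, 0.1), (1.0,)),
    ((9.9, 10.0), (1.5,)),
    ((10.2, 9.8), (6.0,)),
]
alloc = allocate_budget([0, 0, 0, 1], n_clusters=2, budget=3)
selected = select(candidates, src_centroids, tgt_centroids, tgt_weights, alloc)
print(alloc, selected)
```

In the toy run, cluster 0 receives two of the three budget slots and cluster 1 receives one; within each cluster the candidates whose targets sit closest (in expectation) to the validation target centroids are kept.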

This methodology is visualized through contrasted point distributions:

Figure 1: Distribution-matched: two samples drawn from the same joint distribution. The selected points (orange) cover the full spread of the validation points (blue).

CRAFT (not shown) concentrates selected points along the locus of the true conditional $P(T\mid S)$, leading to more precise and semantically aligned training data.

Vectorization-Agnostic Design

All operations in CRAFT—clustering, bucket assignment, centroid computations, and selection—depend only on distance calculations over vector representations. The framework is agnostic to the embedding method, admitting dense (semantic) LLM-based representations or high-dimensional sparse representations such as TF-IDF, with demonstrated robustness to representational expressivity.
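To make the distance-only dependence concrete, here is a small sketch in which the same cluster-assignment routine runs on dense tuples and on sparse TF-IDF-style dictionaries, with only the distance function swapped. The specific distances (Euclidean for dense, cosine distance for sparse) and all names here are illustrative assumptions; the source does not state which metric is used with each vectorizer.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two dense vectors (tuples or lists)."""
    return math.dist(a, b)

def sparse_cosine_distance(a, b):
    """Cosine distance between sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def assign(vectors, centroids, distance):
    """Assign each vector to its nearest centroid under the given distance."""
    return [
        min(range(len(centroids)), key=lambda j: distance(v, centroids[j]))
        for v in vectors
    ]

# Dense representation (hypothetical 2-d embeddings).
dense_vectors = [(0.0, 1.0), (5.0, 5.0)]
dense_centroids = [(0.0, 0.0), (6.0, 6.0)]

# Sparse representation (hypothetical TF-IDF weights).
sparse_vectors = [{"rain": 1.0, "today": 0.5}, {"bank": 2.0}]
sparse_centroids = [{"bank": 1.0, "loan": 1.0}, {"rain": 1.0}]

print(assign(dense_vectors, dense_centroids, euclidean))
print(assign(sparse_vectors, sparse_centroids, sparse_cosine_distance))
```

Because `assign` only ever calls `distance`, swapping the vectorizer changes the inputs and the metric but not the selection logic.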

Algorithmic Complexity

For $N$ candidates and $M$ validation examples, the vectorization step dominates runtime when LLM-based embeddings are used, since every example must be passed through the encoder once. TF-IDF vectorization significantly reduces compute, making selection practical for CPU-only environments, while selection itself (excluding vectorization) operates with linear or near-linear complexity.

Empirical Evaluation and Ablation

Experimental Protocol

The effectiveness of CRAFT is evaluated primarily on the English–Hindi translation task using the 33M-pair NLLB corpus, with mBART-50 fine-tuned via LoRA on selected 20K subsets. Three data selection baselines (DSIR, TSDS, TAROT) are compared, using standardized encoders and consistent selection pool sizes. Both selection effectiveness (BLEU, chrF, METEOR) and algorithmic efficiency (end-to-end selection runtime, including vectorization and selection) are assessed.

Key Results

Quality: On a 1M candidate pool, CRAFT (dense embeddings) achieves 43.34 BLEU, surpassing TSDS (41.21), with TAROT scoring highest (45.61). With TF-IDF representations, CRAFT still edges out TSDS (41.78 vs. 41.21 BLEU).
Efficiency: Selection time for CRAFT (dense) is 26.86 seconds, 2.8$\times$ faster than TAROT (75.6 s) and more than 40$\times$ faster than TSDS. With TF-IDF, CRAFT reaches sub-minute selection even on million-scale pools, with CPU-only operation for both vectorization and selection.
Ablation: Ablating either source/target separation or conditional scoring (reduced to distribution matching) causes BLEU to degrade to near-random selection. This empirically confirms the necessity of CRAFT’s individualized source clustering and conditional alignment objectives.
Scalability: Increasing the candidate pool from 1M to 33M yields only marginal BLEU improvement, indicating that CRAFT already isolates high-quality data from the smaller pool and avoids wasted compute on massive pools.
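The headline comparisons can be re-derived from the reported figures with a few lines of arithmetic. The snippet below uses only numbers quoted above; the TSDS lower bound is an inference that assumes the "over 40 times" factor is measured against the same 26.86-second CRAFT (dense) run, since no absolute TSDS time is given here.

```python
# Reported figures (BLEU and selection time in seconds).
craft_bleu, tsds_bleu, tarot_bleu = 43.34, 41.21, 45.61
craft_seconds, tarot_seconds = 26.86, 75.6

# BLEU differences.
bleu_gain_over_tsds = round(craft_bleu - tsds_bleu, 2)    # 2.13
bleu_gap_to_tarot = round(tarot_bleu - craft_bleu, 2)     # 2.27

# Speed comparison against TAROT.
speedup_vs_tarot = round(tarot_seconds / craft_seconds, 1)  # 2.8

# "Over 40x faster than TSDS" implies TSDS needs more than this many seconds
# (assumption: the factor refers to the 26.86 s dense run).
tsds_seconds_lower_bound = 40 * craft_seconds  # about 1074.4 s, roughly 18 minutes

print(bleu_gain_over_tsds, bleu_gap_to_tarot, speedup_vs_tarot, tsds_seconds_lower_bound)
```

Read together, CRAFT gives up about 2.3 BLEU relative to TAROT in exchange for a 2.8-fold faster selection step, and gains about 2.1 BLEU over TSDS while being far faster.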

Theoretical Implications

CRAFT provides a formal connection between cluster-based data selection and continuous KL divergence minimization between validation and selected source distributions. The cluster diameter-dependent residual further offers a handle for practitioners to control selection quality by adjusting the number of clusters. Notably, regularization-by-discretization at the cluster level mitigates issues of overfitting and aligns well with the use of proxy distances in high-dimensional embedding spaces.
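For intuition, the discretized part of this claim follows from two standard identities; what follows is a sketch of the reasoning under notation introduced here for illustration, not a reproduction of the paper's proof. Let $p_j$ be the fraction of validation sources falling in cluster $j$ and $q_j = k_j / k$ the fraction of the selection budget $k$ assigned to that cluster, where $k_j$ is the per-cluster budget. By Gibbs' inequality,

$$
\mathrm{KL}(p \,\|\, q) \;=\; \sum_{j=1}^{m_s} p_j \log \frac{p_j}{q_j} \;\ge\; 0,
\qquad \text{with equality if and only if } q_j = p_j \text{ for all } j,
$$

so the proportional choice $k_j = k\,p_j$ drives the discretized divergence to zero, up to the rounding of $k_j$ to integers. The two-stage structure itself mirrors the chain rule for KL divergence applied to the factorization $P(S,T) = P(S) \cdot P(T\mid S)$, writing $P_V$ for the validation distribution and $P_{\mathrm{sel}}$ for the distribution of the selected subset:

$$
\mathrm{KL}\big(P_V(S,T) \,\|\, P_{\mathrm{sel}}(S,T)\big)
\;=\;
\mathrm{KL}\big(P_V(S) \,\|\, P_{\mathrm{sel}}(S)\big)
\;+\;
\mathbb{E}_{s \sim P_V(S)}\Big[\mathrm{KL}\big(P_V(T \mid s) \,\|\, P_{\mathrm{sel}}(T \mid s)\big)\Big].
$$

The first term is what source marginal matching addresses; the second is the conditional mismatch that the within-cluster target scoring is designed to reduce.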

Furthermore, CRAFT relates to prior work (e.g., TSDS, DSIR, TAROT) by building stratified selection directly into the procedure and regularizing selection implicitly through clustering, rather than through an explicit diversity penalty or an optimal transport formulation.

Practical Significance and Future Directions

CRAFT’s flexibility regarding vectorization ensures practicality amidst diverse compute/resource settings. It offers a compelling trade-off between selection speed and fine-tuning efficacy, especially relevant for large-scale or latency-sensitive domains. The robust performance of CRAFT with simple TF-IDF encodings makes it appealing for bootstrapping rapid prototypes or large multilingual systems where GPU usage is constrained.

The explicit modeling of the conditional structure present in parallel datasets encourages adaptation of the CRAFT framework to other structured prediction tasks beyond translation, including cross-modal (e.g., vision-language) datasets.

A clear direction for future work is to extend the experimental benchmarks (other tasks, languages, and modalities) and to investigate adaptive cluster resolution (dynamically selecting $m_s$ and $m_t$ depending on validation set entropy) to balance performance against variance.

Conclusion

CRAFT is a theoretically grounded, highly efficient data selection method for large-scale seq2seq training. By decomposing the source-target joint distribution via stratified clustering and conditional alignment, it consistently achieves high translation quality with minimal selection latency and exhibits robustness across vectorization choices. The regularization effect of cluster-level selection and the empirical efficacy in large parallel corpora motivate CRAFT as a generic data selection solution for modern machine learning workflows (2604.22693).
