RegMix: Structured Mixing in ML

Updated 12 June 2026

RegMix is a family of machine learning methodologies that apply structured mixing—via data augmentation, adversarial regularization, or mixture regression—to enhance performance across varied tasks.
It employs nearest-neighbor mixing with learned policies and asymmetric KL divergence techniques to improve regression accuracy and adversarial robustness.
For language model pre-training, RegMix automates mixture selection using regression models, significantly reducing computation while outperforming traditional heuristics.

RegMix refers to a set of distinct methodologies in machine learning that leverage the principle of structured mixing (of data, regularization objectives, or training policies) to improve model performance in regression, adversarial robustness, or large-scale language modeling. Three disjoint lines of research adopting the RegMix moniker have emerged: (1) data mixing augmentation for regression, (2) adversarial mutual and generalization regularization for deep neural network robustness, and (3) automated data mixture selection for LLM pre-training via regression (Hwang et al., 2021, Liu et al., 6 Oct 2025, Liu et al., 2024). Each framework addresses a central challenge in its application domain using the unifying concept of learned or structured mixing.

1. RegMix for Data Augmentation in Regression

The original RegMix framework was motivated by the limitations of Mixup-style augmentation in regression contexts. While Mixup—linear interpolation of feature-label pairs—demonstrates robust improvements in classification, its direct application to regression introduces the risk of generating unrealistic off-manifold labels, particularly when mixing distant data points in the continuous output space. RegMix restricts mixing to proximal pairs and learns a per-example mixing policy to optimize regression performance.

Core Algorithm

Mixing Operation: For two examples $(x_i, y_i)$ and $(x_j, y_j)$, a mixed sample is constructed by sampling $\lambda \sim \mathrm{Beta}(\alpha,\alpha)$ and forming $\tilde x = \lambda x_i + (1-\lambda) x_j$, $\tilde y = \lambda y_i + (1-\lambda) y_j$.
Nearest-Neighbor Selection: For each $x_i$, a list of neighbors sorted by Euclidean distance is precomputed. Candidate values of $k$ for nearest-neighbor mixing are discretized (e.g., $N = \{0, 2^0, \dots, 2^7\}$).
Policy Learning: A controller network $\pi_\theta$ (MLP or RNN) outputs a distribution over $k$ values for each example. For a given policy $\pi_\theta$, the dataset is augmented, and a regression model is trained on the augmented data and evaluated on a held-out validation set. The controller is optimized via Proximal Policy Optimization, using a reward $R$ based on validation performance, so that policies leading to lower validation error are reinforced.
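
The mixing and neighbor-selection steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it uses one fixed `k` for every example instead of a learned per-example policy, and the function name `knn_mixup`, the default `alpha`, and the convention that `k = 0` means "no mixing" are assumptions made for the example.

```python
import numpy as np

def knn_mixup(X, y, k, alpha=2.0, rng=None):
    """Mix each example with one of its k nearest neighbors (Euclidean).

    Assumptions for this sketch: a single k is shared by all examples
    (no learned policy), y is 1-D, and k = 0 is read as "no mixing".
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    k = min(k, n - 1)
    if k <= 0:
        return X.copy(), y.copy()

    # Pairwise squared Euclidean distances, shape (n, n).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)               # never pick an example as its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors

    # Choose one neighbor per example uniformly among its k nearest.
    pick = rng.integers(0, k, size=n)
    j = neighbors[np.arange(n), pick]

    # Sample lambda ~ Beta(alpha, alpha) and interpolate features and labels.
    lam = rng.beta(alpha, alpha, size=n)
    X_mix = lam[:, None] * X + (1.0 - lam[:, None]) * X[j]
    y_mix = lam * y + (1.0 - lam) * y[j]
    return X_mix, y_mix
```

A call such as `knn_mixup(X, y, k=4)` returns one mixed sample per original example. In the full method the value of `k` would be drawn per example from the controller's output distribution rather than fixed.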

Empirical Results

On both real (NO₂, Bike, Product) and synthetic datasets, RegMix achieved lower root mean squared error (RMSE) than no-augmentation, traditional Mixup, AdaMixup, Manifold Mixup, and global kNN/distance-mixing baselines. The table below summarizes main results over five seeds:

Dataset	Vanilla	Mixup	AdaMixup	ManifoldMixup	Global kNN	Global Dist	RegMix
NO₂	0.5441	0.5401	0.5397	0.5381	0.5470	0.5415	0.5248
Bike	393.45	393.36	391.62	399.44	388.81	388.43	368.86
Product	1.4100	1.2310	1.2293	1.2894	1.2625	1.2500	1.1948
Synthetic	15.6838	13.8358	13.7602	14.0457	13.7587	14.0860	13.4935

Ablation studies indicate that small $k$ (few nearest neighbors) dominates in low-dimensional regression, whereas flexible, larger $k$ can be optimal in high-dimensional, multi-output contexts. MLP and LSTM controllers yield similar results. The Beta parameter $\alpha$ is critical: it controls how strongly $\lambda$ concentrates near 0.5 (a symmetric Beta does so when $\alpha > 1$), which reduces harmful extrapolation.
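
The role of the Beta parameter can be made concrete with the closed-form moments of a symmetric Beta distribution. This is a standard property of the distribution, not a result specific to RegMix:

```latex
\lambda \sim \mathrm{Beta}(\alpha,\alpha):\qquad
\mathbb{E}[\lambda] = \tfrac{1}{2},\qquad
\mathrm{Var}[\lambda] = \frac{\alpha^{2}}{(2\alpha)^{2}\,(2\alpha+1)} = \frac{1}{4\,(2\alpha+1)}.
```

For $\alpha = 1$ (the uniform case) the variance is $1/12$; for $\alpha = 4$ it falls to $1/36$, so mixing coefficients cluster more tightly around 0.5 as $\alpha$ grows.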

This suggests that local interpolation using nearest-neighbor-restricted Mixup is superior to global linear mixing in regression, particularly when training data is limited or the underlying function is non-linear (Hwang et al., 2021).

2. RegMix: Adversarial Mutual and Generalization Regularization

A subsequent incarnation of RegMix addresses the limitations of standard adversarial training, notably the use of mean squared error (MSE) or symmetric divergence penalties, which can dilute the effectiveness of distributional alignment under adversarial perturbation. Two main regularization strategies are proposed:

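Adversarial Mutual Regularization (AMR): Enforces alignment between the output distributions on the "final" and "initial" adversarial examples using a weighted, asymmetric Kullback-Leibler (KL) divergence of the form

$\mathcal{L}_{\mathrm{AMR}} = w_1\,\mathrm{KL}(p_{\mathrm{final}} \,\|\, p_{\mathrm{init}}) + w_2\,\mathrm{KL}(p_{\mathrm{init}} \,\|\, p_{\mathrm{final}})$

with unequal weights $w_1 \neq w_2$, where $p_{\mathrm{final}}$ and $p_{\mathrm{init}}$ denote the model's output distributions on the final and initial adversarial examples.

Adversarial Generalization Regularization (AGR): Adds a third term encouraging the adversarial output to mimic the clean sample output,

$\mathcal{L}_{\mathrm{AGR}} = \mathcal{L}_{\mathrm{AMR}} + w_3\,\mathrm{KL}(p_{\mathrm{clean}} \,\|\, p_{\mathrm{final}})$

with weight $w_3 > 0$, where $p_{\mathrm{clean}}$ is the output distribution on the clean sample.

The total loss for a mini-batch is

$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{reg}}$

where $\mathcal{L}_{\mathrm{CE}}$ is the adversarial training (classification) loss and $\mathcal{L}_{\mathrm{reg}}$ is either $\mathcal{L}_{\mathrm{AMR}}$ or $\mathcal{L}_{\mathrm{AGR}}$.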
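
To make the idea of a weighted, asymmetric KL regularizer concrete, here is a small NumPy sketch. It is an illustration under stated assumptions rather than the published loss: the function names, the weight names `w_fwd`, `w_bwd`, and `w_clean`, their default values, and the direction of each KL term are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """Batch mean of KL(p || q) for categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def amr_loss(p_final, p_init, w_fwd=1.0, w_bwd=0.5):
    """Two KL terms in opposite directions with unequal (assumed) weights."""
    return w_fwd * kl(p_final, p_init) + w_bwd * kl(p_init, p_final)

def agr_loss(p_final, p_init, p_clean, w_clean=1.0, **amr_weights):
    """AMR plus a third term pulling the adversarial output toward the clean one."""
    return amr_loss(p_final, p_init, **amr_weights) + w_clean * kl(p_clean, p_final)

# Toy logits standing in for model outputs on clean, initial-adversarial,
# and final-adversarial versions of the same mini-batch.
rng = np.random.default_rng(0)
clean_logits = rng.normal(size=(8, 10))
p_clean = softmax(clean_logits)
p_init = softmax(clean_logits + 0.3 * rng.normal(size=(8, 10)))
p_final = softmax(clean_logits + 0.6 * rng.normal(size=(8, 10)))

reg_amr = amr_loss(p_final, p_init)
reg_agr = agr_loss(p_final, p_init, p_clean)
```

Because the two weights differ, swapping the roles of `p_final` and `p_init` changes the value of `amr_loss`, which is the sense in which the penalty is asymmetric. In training, the chosen regularizer would be added to the usual classification loss.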

Rationale and Empirical Evaluation

AMR's bidirectional and weighted KL terms allow for controlled information flow, preventing overfitting to intermediate targets and reflecting the asymmetry in adversarial learning objectives. AGR further improves generalization by integrating clean-sample targets.

Experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet with ResNet-18 and WideResNet-34-10 under strong white-box attacks (PGD, C&W, AutoAttack) show robust accuracy gains. For instance, on CIFAR-10, FGSM-AGR yields 57.46% robust accuracy at the reference perturbation budget $\epsilon$, outperforming the FGSM-PGK baseline (56.08%). Gains persist at larger $\epsilon$ and for other architectures. Asymmetric (unequal) regularization weights are reported to be consistently optimal.

These findings indicate that asymmetric, role-aware regularization outperforms prior symmetric or MSE penalties for adversarial robustness (Liu et al., 6 Oct 2025).

3. RegMix: Data Mixture Regression for LLM Pre-training

The RegMix methodology in LLM pre-training reframes the search for an effective domain mixture as a regression task. Rather than assuming simple heuristics (e.g., Wikipedia-upweighting) or performing a large-scale mixture grid search, RegMix employs an automated three-phase framework:

Regression-Based Mixture Selection

Phase 1 (Proxy Training): Train many (512 in the reported experiments) small proxy transformers (1M parameters, 1B tokens each) on randomly sampled mixtures of 17 Pile domains. Each run logs the input mixture $\mathbf{w}$ and the validation loss $\ell$ on a target domain (typically Pile-CC/web).
Phase 2 (Performance Prediction): Fit a regression model $f(\mathbf{w}) \approx \ell$ (Ridge or LightGBM) to predict validation loss for unseen mixtures.
Phase 3 (Best Mixture Selection): Evaluate $f$ on a large set of simulated mixtures, select the top 100 by lowest predicted loss, and average them to form the final mixture for high-compute, large-model (1B–7B) training.
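
The three phases above can be sketched end to end with NumPy. This is a toy illustration under loud assumptions: the proxy "validation losses" are synthetic (generated from a hypothetical linear ground truth plus noise, not from trained models), the regressor is a closed-form ridge fit, and the Dirichlet sampling, the number of simulated mixtures, and all variable names are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_domains, n_proxy = 17, 512

# Phase 1 stand-in: random domain mixtures and a SYNTHETIC loss per mixture.
# Real proxy runs would train small models; here the loss is simulated.
W = rng.dirichlet(np.ones(n_domains), size=n_proxy)     # each row sums to 1
true_coef = 3.0 + rng.normal(size=n_domains)            # hypothetical per-domain effect
loss = W @ true_coef + 0.01 * rng.normal(size=n_proxy)  # simulated validation loss

# Phase 2: ridge regression in closed form. No intercept is needed because
# mixture weights sum to 1, so a constant offset is absorbed by the coefficients.
def fit_ridge(W, y, reg=1e-3):
    d = W.shape[1]
    return np.linalg.solve(W.T @ W + reg * np.eye(d), W.T @ y)

coef = fit_ridge(W, loss)

# Phase 3: score many simulated mixtures, keep the 100 with the lowest
# predicted loss, and average them into one final mixture.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
pred = candidates @ coef
top = candidates[np.argsort(pred)[:100]]
final_mixture = top.mean(axis=0)
```

Averaging the top candidates instead of taking the single best one makes the result less sensitive to prediction noise in any individual candidate. A non-linear regressor such as LightGBM could replace `fit_ridge` without changing the surrounding steps.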

Linear and non-linear regression (LightGBM) are both evaluated; LightGBM captures non-additive, cross-domain effects.

Empirical Assessment

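Regression models trained on 512 runs of 1M-parameter proxies, each using 1B tokens, can correctly rank the downstream performance of 64 unseen 1B-parameter models. The table below reports the rank correlation $\rho$ (in percent) between predicted and actual results; columns give model size / training tokens:

Method	1M params / 1B tokens	60M params / 1B tokens	1B params / 25B tokens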
Linear	90.1	89.3	88.0
LightGBM	98.5	98.6	97.1
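
For readers unfamiliar with rank correlation, the sketch below computes Spearman's coefficient between predicted and observed losses. Spearman's statistic is one common choice of rank correlation; whether the table uses exactly this statistic is an assumption here, and the loss values are made up for illustration.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for 1-D arrays without tied values."""
    # argsort of argsort converts values to 0-based ranks.
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    # Pearson correlation of the ranks is Spearman's coefficient.
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical predicted vs. observed validation losses for five mixtures.
predicted = np.array([2.91, 3.05, 2.88, 3.20, 3.11])
observed = np.array([2.50, 2.61, 2.47, 2.70, 2.80])

rho = spearman(predicted, observed)
```

A coefficient near 1 means the regression model orders mixtures almost exactly as real training does, which is what matters for picking the best mixture even if the absolute loss predictions are off.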

Zero-to-5-shot downstream task averages show that RegMix mixtures matched or slightly surpassed DoReMi, and both outperformed human-designed and “web-only” mixtures (RegMix/DoReMi avg. = 48.6%, Human = 46.6%). Notably, RegMix required only 10% of DoReMi's computational resources.

This result underscores that (1) domain mixture selection cannot be reduced to scaling laws or simple quality heuristics, (2) web domains (Pile-CC) are more correlated with downstream success than Wikipedia, and (3) regression-based mixture selection is computationally efficient and robust to scaling effects (Liu et al., 2024).

4. Comparative Summary of RegMix Methodologies

RegMix comprises distinct frameworks unified by the notion of structured mixing:

Application	Mixing Principle	Model Type	Learning Mechanism
Regression Aug.	Local linear interpolation	MLPs, regression models	Per-example policy, PPO
Adversarial Reg.	Role-weighted KL alignment	CNNs for vision (ResNet, etc.)	Bi-directional KL with unequal weights
LLM Pre-training	Data mixture as regression	Transformers	Regression (Ridge, LightGBM)

5. Limitations and Practical Guidance

RegMix for Regression: The policy search procedure is computationally intensive for large datasets; reliance on Euclidean neighbors may fail for adversarial or highly non-linear structure; controller hyperparameters require tuning. Post-hoc augmentation is supported in stable pipelines.
Adversarial RegMix: KL-divergence computations are costlier than MSE ($\ell_2$) penalties; the choice of regularization weights is dataset-dependent; a marginal drop in clean accuracy may be observed.
RegMix for LLMs: Proxy data must span the intended domain distribution; downstream variance is large among random mixtures; the learned domain weights cannot be explained analytically.

6. Prospective Developments

Incorporation of fast policy search (e.g., Fast/Faster AutoAugment) for small-data regimes in regression augmentation (Hwang et al., 2021).
Exploration of non-Euclidean or learned similarity metrics for mixing policies.
Extension to non-tabular, high-dimensional inputs (e.g., images, spatio-temporal data).
Adoption of advanced regression or Bayesian optimization for mixture prediction in LLMs, potentially leveraging non-parametric or meta-learning strategies.
For adversarial robustness, investigation into learned regularization weights, adaptation to non-classification modalities, and hybridization with ensemble techniques.

7. Significance Across Domains

RegMix, as a family of methodologies, addresses three core machine learning challenges (sample efficiency in regression, distributional robustness in adversarial settings, and data curation for LLM pre-training) via learned, context-structured mixing policies. The reported gains over prior heuristics and baselines, together with extensibility to broader domains and model families, make these frameworks strong options when model performance is sensitive to data or regularizer composition (Hwang et al., 2021, Liu et al., 6 Oct 2025, Liu et al., 2024).

References (3)

Hwang et al. (2021). RegMix: Data Mixing Augmentation for Regression.

Liu et al. (2025). RegMix: Adversarial Mutual and Generalization Regularization for Enhancing DNN Robustness.

Liu et al. (2024). RegMix: Data Mixture as Regression for Language Model Pre-training.
