C-Mixup: Regression Data Augmentation

Updated 3 April 2026

The paper demonstrates that C-Mixup reduces MSE by selectively mixing instances with similar labels, ensuring synthetic targets remain plausible.
C-Mixup employs a kernel function over the label space to guide the interpolation process, enhancing in-distribution and out-of-distribution performance.
Empirical results show improved few-shot generalization and robustness across various tasks, with extensions to hierarchical and multimodal regression scenarios.

C-Mixup is a data augmentation technique designed to improve generalization and robustness in supervised regression tasks by constructing synthetic training points via convex combinations of instances with semantically similar labels. The method extends the classical Mixup approach, which was originally optimized for classification, by selecting mixed pairs in a manner that respects the geometric structure of the label space. This controlled interpolation is shown to reduce the creation of implausible targets and yields measurable gains in mean-squared error, domain-invariance, and generalization for regression problems where the output space is continuous and often multimodal.

1. Motivation: Challenges in Regression Data Augmentation

Classical Mixup generates synthetic samples by interpolating pairs sampled uniformly from the training set, combining both inputs and their associated outputs through a randomly drawn mixing coefficient $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$: $\tilde{x} = \lambda x_i + (1-\lambda)x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda)y_j.$ This is effective for classification, where interpolating between discrete label vectors (e.g., one-hot encodings) produces valid soft labels, but it can fail in regression settings. If two samples with distant labels $y_i$ and $y_j$ are mixed, the resulting $\tilde{y}$ may not correspond to any plausible or meaningful value in the problem domain, which degrades model performance, especially under covariate or correlation shifts (Yao et al., 2022; Hwang et al., 2024).
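The interpolation above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the cited papers: the function name `vanilla_mixup`, the default `alpha=1.0`, and the assumption that `X` is a 2-D array with scalar labels `y` are all choices made here for the example.

```python
import numpy as np

def vanilla_mixup(X, y, alpha=1.0, rng=None):
    """Classical Mixup: pair every sample with a uniformly chosen partner.

    Assumes X has shape (n, d) and y has shape (n,). Illustrative only.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    partner = rng.integers(0, n, size=n)   # partner index j, uniform over the dataset
    lam = rng.beta(alpha, alpha, size=n)   # one mixing coefficient per pair
    X_mix = lam[:, None] * X + (1.0 - lam[:, None]) * X[partner]
    y_mix = lam * y + (1.0 - lam) * y[partner]
    return X_mix, y_mix, partner, lam

# Toy data with two well-separated label groups (around 0 and around 10).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(8, 3))
y_toy = np.array([0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0])
_, y_mixed, partners, _ = vanilla_mixup(X_toy, y_toy, rng=rng)
# Count pairs whose two labels come from different groups.
print("cross-group pairs:", int(np.sum(y_toy != y_toy[partners])))
```

Because partners are drawn uniformly, nothing prevents a sample with label 0 from being mixed with one labelled 10, which yields intermediate targets that no original sample carries.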

C-Mixup addresses this limitation by ensuring that only pairs with proximate labels are mixed, resulting in synthetic targets that remain semantically valid.

2. Algorithmic Formulation

For each anchor input-label pair $(x_i, y_i)$ in a dataset $D = \{(x_i, y_i)\}_{i=1}^n$, C-Mixup defines a probabilistic selection mechanism over candidate partners $(x_j, y_j)$ using a kernel function over the label space: $P((x_j, y_j) \mid (x_i, y_i)) = \frac{\exp\left(-d(y_i, y_j)/b^2\right)}{\sum_{k=1}^n \exp\left(-d(y_i, y_k)/b^2\right)}$ where $d(y_i, y_j)$ is typically the Euclidean distance and $b > 0$ is a bandwidth hyperparameter controlling kernel sharpness [Eq. (1), (Hwang et al., 2024)]. The partner $(x_j, y_j)$ is sampled according to $P(\cdot \mid (x_i, y_i))$, and the corresponding input and label are linearly interpolated: $\tilde{x} = \lambda x_i + (1-\lambda)x_j, \qquad \tilde{y} = \lambda y_i + (1-\lambda)y_j,$ with $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$.

The process results in a mixed dataset of pairs $(\tilde{x}, \tilde{y})$, which augments the original dataset for model training [Algorithm 3, (Hwang et al., 2024)].
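The sampling and mixing steps described in this section can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `c_mixup_probs` and `c_mixup`, the default values, and the choice to build the full $n \times n$ probability matrix are all made for this example. As in the formula, the normalising sum runs over all $k$, so an anchor may be paired with itself.

```python
import numpy as np

def c_mixup_probs(y, bandwidth):
    """Row i holds P(j | i), proportional to exp(-d(y_i, y_j) / bandwidth**2)."""
    y = np.asarray(y, dtype=float).reshape(len(y), -1)        # (n, label_dim)
    dist = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)  # Euclidean distance
    logits = -dist / bandwidth**2
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def c_mixup(X, y, bandwidth=1.0, alpha=2.0, rng=None):
    """Draw one label-similar partner per anchor and interpolate inputs and labels."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    probs = c_mixup_probs(y, bandwidth)
    # Sample partner j for each anchor i from the kernel-weighted distribution.
    partner = np.array([rng.choice(n, p=probs[i]) for i in range(n)])
    lam = rng.beta(alpha, alpha, size=n)
    lam_x = lam.reshape((n,) + (1,) * (X.ndim - 1))           # broadcast over input dims
    lam_y = lam.reshape((n,) + (1,) * (y.ndim - 1))           # broadcast over label dims
    X_mix = lam_x * X + (1.0 - lam_x) * X[partner]
    y_mix = lam_y * y + (1.0 - lam_y) * y[partner]
    return X_mix, y_mix, partner

# Toy usage: two label clusters; with a small bandwidth, partners stay in-cluster.
rng = np.random.default_rng(0)
y_toy = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
X_toy = rng.normal(size=(6, 4))
X_aug, y_aug, partners = c_mixup(X_toy, y_toy, bandwidth=0.5, rng=rng)
print(partners)
```

Materialising the full probability matrix costs $O(n^2)$ memory, so for large datasets one would likely compute partner probabilities per batch or per anchor instead; that choice is left open here.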

3. Hyperparameters and Sampling Behavior

C-Mixup's effectiveness depends on two main hyperparameters:

Bandwidth $b$: Governs the selectivity in label space. Small $b$ confines mixing to very close label-neighbors, approximating empirical risk minimization. Large $b$ broadens the kernel, making C-Mixup approach uniform Mixup behavior (Hwang et al., 2024).
Beta Parameter $\alpha$: Controls the symmetry of the linear interpolation. Large $\alpha$ yields mixing coefficients near 0.5; small $\alpha$ produces coefficients near 0 or 1, favoring nearly original samples (Hwang et al., 2024; Yao et al., 2022).

There is no explicit distance threshold; the kernel provides a soft weighting, and both parameters can be tuned via cross-validation or data-driven procedures.
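The qualitative effect of the two hyperparameters can be checked numerically. The sketch below is illustrative only: the helper names `partner_probs` and `lambda_spread`, the evenly spaced toy labels, and the specific bandwidth and Beta values are assumptions chosen for the demonstration, not values taken from the cited papers.

```python
import numpy as np

def partner_probs(y, bandwidth):
    """Row i is the partner distribution P(j | i) for scalar labels y."""
    y = np.asarray(y, dtype=float)
    dist = np.abs(y[:, None] - y[None, :])   # Euclidean distance for scalar labels
    weights = np.exp(-dist / bandwidth**2)
    return weights / weights.sum(axis=1, keepdims=True)

def lambda_spread(alpha, rng, size=10_000):
    """Average distance of lambda ~ Beta(alpha, alpha) from 0.5."""
    lam = rng.beta(alpha, alpha, size=size)
    return float(np.mean(np.abs(lam - 0.5)))

# Bandwidth: compare how much probability the middle sample gives to itself
# versus to the farthest label, for a narrow, moderate, and wide kernel.
y = np.linspace(0.0, 1.0, 11)                # toy labels with spacing 0.1
for b in (0.05, 0.3, 10.0):
    p = partner_probs(y, b)
    print(f"b={b}: P(self)={p[5, 5]:.3f}, P(farthest)={p[5, 0]:.3f}")

# Beta parameter: small alpha pushes lambda toward 0 or 1, large alpha toward 0.5.
rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha}: mean |lambda - 0.5| = {lambda_spread(alpha, rng):.3f}")
```

With a very narrow kernel almost all probability mass falls on the anchor itself, so the mixed sample is nearly the original one; with a very wide kernel the partner distribution is close to uniform, which matches the two limiting regimes described above.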

4. Theoretical Properties and Generalization Guarantees

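C-Mixup is theoretically justified in regression by minimizing the risk of generating synthetic points in low-density or semantically invalid regions of label space. Under a noisy single-index model, the approach provably yields strictly lower mean squared error (MSE) than both vanilla Mixup (uniform over instances) and Mixup with feature-similarity-based pairing: $\mathrm{MSE}(\theta^*_{\text{C-Mixup}}) < \min\left(\mathrm{MSE}(\theta^*_{\text{mixup}}),\ \mathrm{MSE}(\theta^*_{\text{feat}})\right),$ where $\theta^*_{\text{C-Mixup}}$ and $\theta^*_{\text{mixup}}$ are the parameters learned with C-Mixup and vanilla Mixup, and $\theta^*_{\text{feat}}$ is learned via feature-similarity Mixup [(Yao et al., 2022), Theorem 1].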

C-Mixup also yields tighter bounds for meta-regression (few-shot task generalization) and under covariate shift, outperforming both classical and feature-based Mixup across these settings [(Yao et al., 2022), Theorems 2–3].

Empirically, C-Mixup outperforms strong baselines (ERM, vanilla mixup, Manifold Mixup, etc.) with improvements of +6.56% in-distribution generalization, +4.76% few-shot generalization, and +5.82% out-of-distribution robustness across a spectrum of tasks (tabular, time-series, image, video, drug–target binding) (Yao et al., 2022). C-Mixup is also robust to moderate label noise and the choice of kernel width and Beta parameter.

5. Extensions and Applications Beyond Tabular Regression

C-Mixup has been adapted for hierarchical and domain-specific representations, such as in multidimensional music aesthetic evaluation. In this setting, C-Mixup operates on pooled feature vectors from multiple scales (e.g., segment and track-level) rather than raw inputs, using a kernel over these semantic representations to enforce "semantic consistency", that is, mixing only those examples that are close in the pooled feature space (Liu et al., 24 Nov 2025).

In hierarchical augmentation pipelines, C-Mixup is often applied after base-level augmentations (like waveform perturbations), particularly for structured regression targets (e.g., multidimensional audio quality scores). Empirical ablations show that injecting C-Mixup produces measurable increases in top-tier ranking metrics (e.g., +0.99 Top-Tier Accuracy over baseline on music evaluation benchmarks) (Liu et al., 24 Nov 2025).
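The feature-space variant described in this section can be sketched as follows. This is a rough illustration and not the pipeline of Liu et al.: the function name `feature_space_c_mixup`, the mean-plus-max pooling over segments, the array shapes, and the reuse of the same exponential kernel with bandwidth `bandwidth` are all assumptions introduced here.

```python
import numpy as np

def feature_space_c_mixup(segment_feats, labels, bandwidth=1.0, alpha=2.0, rng=None):
    """Mix pooled features and multidimensional labels of feature-similar examples.

    segment_feats: (n_tracks, n_segments, dim) segment-level embeddings.
    labels:        (n_tracks, n_scores) regression targets.
    """
    rng = np.random.default_rng() if rng is None else rng
    segment_feats = np.asarray(segment_feats, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # ASSUMPTION: hypothetical multi-scale pooling (mean and max over segments).
    pooled = np.concatenate(
        [segment_feats.mean(axis=1), segment_feats.max(axis=1)], axis=1
    )
    # Kernel over pooled features rather than over labels.
    dist = np.linalg.norm(pooled[:, None, :] - pooled[None, :, :], axis=-1)
    weights = np.exp(-dist / bandwidth**2)
    probs = weights / weights.sum(axis=1, keepdims=True)
    n = len(labels)
    partner = np.array([rng.choice(n, p=probs[i]) for i in range(n)])
    lam = rng.beta(alpha, alpha, size=(n, 1))   # shared by features and labels
    pooled_mix = lam * pooled + (1.0 - lam) * pooled[partner]
    labels_mix = lam * labels + (1.0 - lam) * labels[partner]
    return pooled_mix, labels_mix, partner

# Toy usage: 6 tracks, 4 segments each, 3-dim embeddings, 2 quality scores.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4, 3))
scores = rng.uniform(1.0, 5.0, size=(6, 2))
pooled_aug, scores_aug, partners = feature_space_c_mixup(feats, scores, rng=rng)
print(pooled_aug.shape, scores_aug.shape)
```

In a pipeline of the kind described above, base-level augmentations would run before the embeddings are computed, and this step would then operate on the resulting pooled vectors.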

6. Limitations and Sensitivity to Noisy Labels

A core limitation of C-Mixup is its inability to distinguish clean from noisy samples. The kernel-based selection considers only label proximity, not the reliability of labels. In noise-corrupted settings, this can lead to harmful mixing: performance degrades as the noise ratio increases, on par with baseline ERM and vanilla Mixup [Section 3, Fig. 2(a), (Hwang et al., 2024)]. Further, the optimal kernel width $b$ shifts with the noise level, necessitating adaptive tuning [Fig. 2(b), (Hwang et al., 2024)].

The RC-Mixup extension addresses this by embedding C-Mixup within a robust training framework: C-Mixup is applied only to a dynamically curated set of presumed-clean examples, as determined by robust model selection. Bandwidth $b$ is also periodically re-tuned throughout training [Section 4, Algorithm 1, (Hwang et al., 2024)].
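The idea of restricting C-Mixup to presumed-clean examples can be sketched as below. This is a loose illustration and not the RC-Mixup algorithm: the small-loss selection rule, the `keep_fraction` parameter, and the function name `c_mixup_on_clean_subset` are assumptions standing in for the robust training component, and the periodic re-tuning of the bandwidth is not shown.

```python
import numpy as np

def c_mixup_on_clean_subset(X, y, per_sample_loss, keep_fraction=0.8,
                            bandwidth=1.0, alpha=2.0, rng=None):
    """Apply label-similarity mixing only among samples presumed to be clean.

    Assumes X has shape (n, d), scalar labels y of shape (n,), and one
    current-model loss value per sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(y))))
    # ASSUMPTION: keep the samples with the smallest loss as "presumed clean".
    clean_idx = np.argsort(per_sample_loss)[:n_keep]
    y_clean = y[clean_idx]
    # Kernel over labels, restricted to the clean subset.
    dist = np.abs(y_clean[:, None] - y_clean[None, :])
    weights = np.exp(-dist / bandwidth**2)
    probs = weights / weights.sum(axis=1, keepdims=True)
    local = np.array([rng.choice(n_keep, p=probs[i]) for i in range(n_keep)])
    partner = clean_idx[local]                  # map back to dataset indices
    lam = rng.beta(alpha, alpha, size=n_keep)
    X_mix = lam[:, None] * X[clean_idx] + (1.0 - lam[:, None]) * X[partner]
    y_mix = lam * y_clean + (1.0 - lam) * y[partner]
    return X_mix, y_mix, clean_idx, partner

# Toy usage: the last label is corrupted and has a much larger loss.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 2))
y_toy = np.array([0.0, 0.1, 0.2, 0.3, 50.0])
loss_toy = np.array([0.1, 0.2, 0.15, 0.12, 9.0])
_, y_aug, kept, _ = c_mixup_on_clean_subset(X_toy, y_toy, loss_toy, rng=rng)
print(sorted(kept.tolist()), y_aug.max())
```

In an actual training loop the clean set would be recomputed as the model changes, which is what makes the curation dynamic; here the losses are simply passed in as an argument.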

C-Mixup complements and differs from other advanced mixup-type methods. For instance, Co-Mixup uses discrete optimization to maximize saliency guidance and supermodular diversity within a batch for classification (Kim et al., 2021), whereas C-Mixup's principal novelty lies in label-space locality for regression.

A plausible implication is that while C-Mixup chiefly addresses plausibility of synthetic labels in continuous output domains, methods such as Co-Mixup focus on spatial or semantic diversity in input construction and are primarily geared toward classification or detection scenarios.

C-Mixup has also been successfully integrated (via label-similarity sampling) with other augmentation pipelines like CutMix, PuzzleMix, and AutoMix, leading to further gains across diverse data modalities (Yao et al., 2022).

Key References:

"RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks" (Hwang et al., 2024)
"C-Mixup: Improving Generalization in Regression" (Yao et al., 2022)
"Multidimensional Music Aesthetic Evaluation via Semantically Consistent C-Mixup Augmentation" (Liu et al., 24 Nov 2025)
"Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity" (Kim et al., 2021)