Manifold Token Mixup for Text Classification

Updated 30 January 2026
  • Manifold Token Mixup is an interpolation-based augmentation technique that swaps token features to enhance deep text classifiers, especially in low-resource settings.
  • It operates by applying a binary mask at a randomly selected layer to partially swap hidden representations between two input samples.
  • Empirical evaluations show that MSMix variants, particularly MSMix-A and MSMix-B, improve few-shot classification accuracy by around 1.6 percentage points over the simBERT baseline.

Manifold Token Mixup (MSMix) is an interpolation-based data augmentation technique for deep neural network text classifiers, designed to address performance degradation under limited data regimes. MSMix operates by inducing partial swaps of token-level hidden feature representations between two samples at a randomly selected layer within a neural architecture. It augments the sample space not through linear blending but via feature-wise replacement, offering distinct theoretical and empirical benefits for robust text classification tasks.

1. Formal Mathematical Framework

Let $f: X \rightarrow Y$ be an $L$-layer text classifier, typified by models such as a 12-layer BERT. The hidden-state tensor at layer $\ell$ for input $x$ is denoted $h_\ell(x) \in \mathbb{R}^{I \times d}$, where $I$ is the token length and $d$ is the hidden dimension. Given two randomly sampled pairs $(x_i, y_i)$ and $(x_j, y_j)$, draw a mixing coefficient $\lambda \sim \operatorname{Beta}(\alpha, \alpha)$. Traditional manifold mixup would produce:

$$h_\ell^{\text{mix}} = \lambda\, h_\ell(x_i) + (1-\lambda)\, h_\ell(x_j)$$

and

$$y^{\text{mix}} = \lambda\, y_i + (1-\lambda)\, y_j.$$

MSMix instead constructs a binary mask $M \in \{0, 1\}^{I \times d}$ with exactly $p = \lfloor \lambda \cdot I \cdot d \rfloor$ zeros. MSMix's swap-mixup operation is:

$$h_\ell^{\text{mix}} = M \odot h_\ell(x_i) + (1 - M) \odot h_\ell(x_j)$$

where $\odot$ denotes element-wise multiplication. For each zero position in $M$, the corresponding feature value in $h_\ell(x_i)$ is replaced by that of $h_\ell(x_j)$; the remaining positions are preserved.
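
A minimal PyTorch sketch of this swap operation (our illustration, not the authors' code; `h_i` and `h_j` stand for $h_\ell(x_i)$ and $h_\ell(x_j)$):

```python
import torch

def swap_mixup(h_i: torch.Tensor, h_j: torch.Tensor, lam: float) -> torch.Tensor:
    """Swap a lambda-fraction of positions in h_i with values from h_j.

    h_i, h_j: hidden states of shape (I, d) for two samples.
    lam: mixing coefficient drawn from Beta(alpha, alpha).
    """
    I, d = h_i.shape
    p = int(lam * I * d)                      # number of swapped positions
    mask = torch.ones(I * d, dtype=h_i.dtype, device=h_i.device)
    zero_idx = torch.randperm(I * d, device=h_i.device)[:p]
    mask[zero_idx] = 0.0                      # exactly p zeros (random choice = MSMix-Base)
    mask = mask.view(I, d)
    return mask * h_i + (1.0 - mask) * h_j   # elementwise M*h_i + (1-M)*h_j
```

With a uniformly random mask this is exactly MSMix-Base; the structured variants described next change only how the zero positions are chosen.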

2. Dimension-Selection Strategies

MSMix implements three dimension-selection strategies to govern which features are swapped:

  • MSMix-Base (Random Selection): $p$ positions within the $I \times d$ tensor are selected uniformly at random.
  • MSMix-A (Correlated-Magnitude Selection): The element-wise product magnitude

$$C = \left| h_\ell(x_i) \odot h_\ell(x_j) \right| \in \mathbb{R}^{I \times d}$$

is computed and flattened. The top-$p$ highest values are chosen for replacement.

  • MSMix-B (Low-Importance to High-Importance Selection):
  1. Identify the index set $Q_i$ of the $q$ smallest $|h_\ell(x_i)|$ values, with $q \geq p$.
  2. Among these, select the $p$ indices with the largest $|h_\ell(x_j)|$ values; call this set $P$.
  3. Construct $M$ with zeros at indices in $P$, ones elsewhere.

These methods introduce structure into the mixup process, enhancing diversity of hidden representations relative to uniformly random swaps.
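
As a concrete illustration, the following sketch constructs the masks for the two structured strategies (function names are ours; shapes follow the definitions above):

```python
import torch

def mask_msmix_a(h_i: torch.Tensor, h_j: torch.Tensor, p: int) -> torch.Tensor:
    """MSMix-A: zeros at the p positions with largest |h_i * h_j|."""
    I, d = h_i.shape
    C = (h_i * h_j).abs().flatten()           # correlated magnitude C, flattened
    top = torch.topk(C, k=p).indices
    mask = torch.ones(I * d, device=h_i.device)
    mask[top] = 0.0
    return mask.view(I, d)

def mask_msmix_b(h_i: torch.Tensor, h_j: torch.Tensor, p: int, q: int) -> torch.Tensor:
    """MSMix-B: among the q smallest |h_i| positions, zero the p with largest |h_j|."""
    assert q >= p
    I, d = h_i.shape
    Q_i = torch.topk(h_i.abs().flatten(), k=q, largest=False).indices  # q smallest |h_i|
    within = torch.topk(h_j.abs().flatten()[Q_i], k=p).indices         # p largest |h_j| in Q_i
    P = Q_i[within]                                                    # final swap set P
    mask = torch.ones(I * d, device=h_i.device)
    mask[P] = 0.0
    return mask.view(I, d)
```

Either mask is then applied through the same $M \odot h_\ell(x_i) + (1-M) \odot h_\ell(x_j)$ combination as before.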

3. Step-by-Step MSMix Training Procedure

The MSMix algorithm proceeds as follows for each training iteration:

  1. Sample two examples $(x_i, y_i), (x_j, y_j)$ from dataset $D$.
  2. Draw a mixing coefficient $\lambda \sim \operatorname{Beta}(\alpha, \alpha)$.
  3. Select a mixing layer $\ell \in \{1, \dots, L\}$ at random (a random $\ell$ is empirically superior to a fixed one).
  4. Execute the forward pass for $x_i$ and $x_j$ up to layer $\ell$, yielding $h_\ell(x_i), h_\ell(x_j)$.
  5. Generate a mask $M \in \{0,1\}^{I \times d}$ with $p = \lfloor \lambda I d \rfloor$ zeros according to the desired strategy (Base/A/B).
  6. Form the mixed hidden tensor $h_\ell^{\text{mix}}$ as above.
  7. Continue the forward pass from layer $\ell + 1$ to $L$ with $h_\ell^{\text{mix}}$ to obtain the model output $\hat{y}$.
  8. Compose the mixed label $y^{\text{mix}} = \lambda y_i + (1-\lambda) y_j$.
  9. Calculate the cross-entropy loss $\operatorname{CE}(\hat{y}, y^{\text{mix}})$ and backpropagate through all layers.
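
The procedure maps directly onto any layered encoder. Below is a self-contained PyTorch sketch of steps 1 through 9, using a generic Transformer stack as a stand-in for the 12-layer BERT (class and function names are ours; `swap_mixup` is the sketch from Section 1):

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixableClassifier(nn.Module):
    """Toy stand-in for a BERT-style encoder: a stack of L Transformer layers."""
    def __init__(self, L: int = 12, d: int = 768, n_heads: int = 12, n_classes: int = 12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, n_heads, batch_first=True) for _ in range(L))
        self.head = nn.Linear(d, n_classes)

    def forward_from(self, h: torch.Tensor, start: int) -> torch.Tensor:
        """Run layers start..L-1 on hidden states h, then classify."""
        for layer in self.layers[start:]:
            h = layer(h)
        return self.head(h.mean(dim=1))       # mean-pool tokens -> logits

def msmix_step(model, emb_i, emb_j, y_i, y_j, alpha: float = 1.0):
    """One MSMix iteration for a pair of embedded inputs of shape (1, I, d);
    y_i, y_j are one-hot label vectors of shape (1, n_classes)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()   # step 2
    ell = random.randrange(len(model.layers))                      # step 3
    h_i, h_j = emb_i, emb_j
    for layer in model.layers[:ell + 1]:                           # step 4
        h_i, h_j = layer(h_i), layer(h_j)
    h_mix = swap_mixup(h_i[0], h_j[0], lam).unsqueeze(0)           # steps 5-6
    logits = model.forward_from(h_mix, ell + 1)                    # step 7
    y_mix = lam * y_i + (1 - lam) * y_j                            # step 8
    return F.cross_entropy(logits, y_mix)                          # step 9
```

For a pretrained BERT the same idea applies, but the forward pass must be split manually around the chosen encoder block rather than relying on a single `forward` call.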

4. Theoretical Analysis and Regularization Perspective

Each hidden dimension acts as a virtual feature, and the swapping mechanism introduces correlated noise, distinct from additive Gaussian noise or stochastic dropout techniques. The approach can be interpreted as an instantiation of Dropout/DropConnect at the level of semantic token features rather than indiscriminate neuron or weight deactivation. MSMix leverages manifold mixup theory (cf. Verma et al. 2019), whereby interpolations within hidden space flatten decision boundaries and combat overfitting. By swapping finite, localized subsets of dimensions, MSMix preserves the discrete character of textual features, which is crucial for NLP tasks.

This methodology is particularly effective in low-resource regimes: the creation of 'in-between' hidden states impedes memorization of anomalous samples, providing strong regularization and increased generalization.

5. Empirical Evaluation and Key Findings

MSMix was evaluated on three Chinese intent classification datasets: YiwiseIC (12 classes), SMP2017-ECDT (31 classes), and CrossWOZ-IC (intent classification derived from the CrossWOZ task-oriented dialogue corpus), each with an additional reduced-size few-shot split (denoted _FS below). Full-data test accuracy (%):

Model               YiwiseIC   SMP2017   CrossWOZ
simBERT             91.79      95.05     95.14
EDA                 90.28      93.40     95.12
Mixup-Transformer   92.50      94.45     95.38
TMix                92.61      95.35     95.63
MSMix-Base          94.09      95.36     95.64
MSMix-A             93.53      95.80     95.87
MSMix-B             94.30      95.35     95.75

Under small-sample (few-shot) scenarios, MSMix-A and MSMix-B deliver improvements of approximately 1.6 percentage points over the simBERT baseline (test accuracy, %):

Model               YiwiseIC_FS   SMP2017_FS   CrossWOZ_FS
simBERT             81.17         89.81        91.12
EDA                 79.73         90.85        91.77
Mixup-Transformer   80.04         90.85        90.65
TMix                81.40         90.10        92.00
MSMix-Base          81.98         90.40        92.00
MSMix-A             82.80         90.85        92.62
MSMix-B             82.54         91.45        92.55

MSMix-B yields the highest test accuracy on YiwiseIC (94.30%), and MSMix-A on SMP2017 (95.80%). All MSMix variants surpass mixup-at-output (Mixup-Transformer), TMix, and EDA methods.

6. Architectural and Implementation Details

Experiments employed simbert-base-chinese (a 12-layer BERT), with hidden dimension $d = 768$ and maximum token length $I = 128$. MSMix was compared against EDA ($\alpha = 0.1$, eight augmentations per sample), Mixup-Transformer, and TMix. The standard Adam optimizer with learning rate $\approx 2 \times 10^{-5}$ and batch size 32 was used, with three to five epochs of fine-tuning. The mixing coefficient $\lambda$ is drawn from the conventional Beta distribution with $\alpha = 1.0$, i.e., uniformly on $[0, 1]$.

Selecting the mixing layer $\ell$ at random, anew for each batch, was empirically favored over fixing a single mixing layer.
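
For concreteness, a sketch of the reported configuration (variable names are ours; `MixableClassifier` is the stand-in model from Section 3):

```python
import torch

model = MixableClassifier(L=12, d=768, n_classes=12)   # stand-in for simbert-base-chinese

ALPHA, BATCH_SIZE, LR = 1.0, 32, 2e-5
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

# Beta(1, 1) is the uniform distribution, so lambda is effectively drawn
# uniformly per batch; a fresh mixing layer is also sampled per batch.
lam = torch.distributions.Beta(ALPHA, ALPHA).sample().item()
```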

7. Significance and Implications

MSMix provides an efficient, implementation-light method for augmenting text classification datasets at the hidden-state level, relying on partial feature swaps rather than full-sample interpolation. This preserves local feature integrity, improves accuracy in both full-data and low-resource regimes, and establishes the viability of swap-based mixup methods for deep NLP models (Ye et al., 2023). A plausible implication is that MSMix strategies may generalize to other sequence-encoding architectures and tasks where discrete token-feature integrity is pivotal.
