
Dual Style Randomization (DSR)

Updated 19 November 2025
  • Dual Style Randomization (DSR) is a data augmentation module that simulates diverse foreground and global style perturbations to improve segmentation robustness across domains.
  • It employs two sequential modules—Foreground Style Randomization and Global Style Randomization—to generate synthetic training samples with varied style contrasts and image-level shifts.
  • Empirical results show that integrating DSR with hierarchical semantic learning yields significant mIoU improvements on ResNet-50 and ViT-B/16 backbones.

Dual Style Randomization (DSR) is a training-time data augmentation and style simulation module introduced to improve the robustness of cross-domain few-shot segmentation (CD-FSS) models to segmentation granularity gaps and inter-domain style shifts. In the context of the Hierarchical Semantic Learning (HSL) framework, DSR generates augmented inputs that simulate a diverse array of foreground–background style contrasts and global image variations, thereby enhancing the semantic discriminability of models faced with target domains exhibiting different data distributions and finer style contrasts than typical source domains (Sun et al., 15 Nov 2025).

1. Motivation and Problem Setting

Cross-domain few-shot segmentation models are often trained on datasets with coarse object boundaries, such as PASCAL-VOC, where foreground and background are separated by salient style differences. However, many practical target domains—such as medical imagery (e.g., skin lesions) or satellite images—present subtle, fine-grained style distinctions within a single object or between object and background. Standard models typically fail to generalize across such granularity gaps. DSR addresses this by generating synthetic variations along two axes:

  • By randomly perturbing only the foreground’s style (Foreground Style Randomization), DSR simulates variable foreground–background style gaps, thereby exposing the model to a broader distribution of granularity contrasts.
  • By perturbing the global style (Global Style Randomization), DSR introduces image-level domain shifts, mirroring the variations encountered when deployed on novel domains.

Encoders trained on DSR-augmented data learn to extract more robust and discriminative features, resulting in improved segmentation on domains with mismatched style and granularity.

2. DSR Architecture and Workflow

DSR consists of two sequential modules: Foreground Style Randomization (FSR) and Global Style Randomization (GSR).

A. Foreground Style Randomization (FSR):

  • The FSR module selectively perturbs the appearance of the foreground region in the input image. This is accomplished by mixing the amplitude spectra of the current foreground patch and a randomly sampled local patch from a superpixel region at the coarsest scale.
  • Mask smoothing is applied using MaxPool and AvgPool with kernel size $K$ to avoid hard mask edges.
  • Foreground extraction, local superpixel selection, Fourier transform (FFT)-based decomposition, amplitude fusion (weighted by $\omega \sim \mathcal{N}(0, \sigma_f^2)$), and inverse FFT are performed to reconstruct a style-perturbed foreground patch. The perturbed foreground is blended into the original image using the smoothed mask.
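The amplitude-mixing core of FSR can be sketched in NumPy as follows. This is a minimal illustration assuming single-channel patches of equal size; `smooth_mask` and `fsr_mix` are hypothetical names and simplifications, not the authors' implementation.

```python
import numpy as np

def smooth_mask(mask: np.ndarray, k: int = 9) -> np.ndarray:
    """Approximate AvgPool(MaxPool(M, K), K) with stride 1 and edge padding."""
    pad = k // 2
    h, w = mask.shape
    padded = np.pad(mask.astype(float), pad, mode="edge")
    pooled = np.empty((h, w), dtype=float)
    for i in range(h):                       # MaxPool, kernel k, stride 1
        for j in range(w):
            pooled[i, j] = padded[i:i + k, j:j + k].max()
    padded = np.pad(pooled, pad, mode="edge")
    out = np.empty((h, w), dtype=float)
    for i in range(h):                       # AvgPool, kernel k, stride 1
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def fsr_mix(fg_patch, local_patch, sigma_f=0.25, rng=None):
    """Mix amplitude spectra of the foreground patch and a local patch."""
    rng = rng or np.random.default_rng()
    F_fg = np.fft.fft2(fg_patch)
    F_loc = np.fft.fft2(local_patch)
    A_fg, P_fg = np.abs(F_fg), np.angle(F_fg)
    A_loc = np.abs(F_loc)
    omega = rng.normal(0.0, sigma_f)                 # omega ~ N(0, sigma_f^2)
    A_fus = omega * A_loc + (1 - omega) * A_fg       # amplitude fusion
    # Recombine the fused amplitude with the original foreground phase
    return np.fft.ifft2(A_fus * np.exp(1j * P_fg)).real
```

Because only the amplitude is mixed while the foreground's phase spectrum is kept, the patch's spatial layout is preserved and only its style statistics are perturbed.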

B. Global Style Randomization (GSR):

  • The resulting image from FSR is processed through a random convolution layer with kernel elements $\Theta \sim \mathcal{N}(0, \sigma_g^2)$, generating a globally randomized counterpart.
  • FFT decomposes both the previous and the randomized image. The global amplitude spectrum from the randomized image is combined with the phase spectrum from the FSR image. The final inverse FFT produces the globally style-randomized image $\hat{I}$, which is then consumed by the encoder.
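A minimal NumPy sketch of these GSR steps, approximating the random convolution layer by a circular convolution with a Gaussian-initialized kernel; the function and variable names are assumptions, not the authors' code.

```python
import numpy as np

def gsr(image, sigma_g=0.1, ksize=3, rng=None):
    """Randomize global style: random conv, then keep the original phase."""
    rng = rng or np.random.default_rng()
    h, w, _ = image.shape
    theta = rng.normal(0.0, sigma_g, size=(ksize, ksize))  # Theta ~ N(0, sigma_g^2)
    kernel = np.zeros((h, w))
    kernel[:ksize, :ksize] = theta
    K = np.fft.fft2(kernel)
    # Circular convolution of every channel with the same random kernel
    rand = np.fft.ifft2(np.fft.fft2(image, axes=(0, 1)) * K[..., None],
                        axes=(0, 1)).real
    F_img = np.fft.fft2(image, axes=(0, 1))
    F_rand = np.fft.fft2(rand, axes=(0, 1))
    # Amplitude from the randomized image, phase from the FSR output
    return np.fft.ifft2(np.abs(F_rand) * np.exp(1j * np.angle(F_img)),
                        axes=(0, 1)).real
```

Keeping the FSR image's phase spectrum preserves object structure, so the random convolution only shifts global style statistics rather than content.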

The overall DSR process is applied to each training image (support or query) in every episode:

for each training episode:
  for each support or query image I and mask M:
    # Foreground Style Randomization
    M' = AvgPool(MaxPool(M, K), K)
    I_fg = crop(I, bbox(M'))
    pick random superpixel R from coarsest M0_sp
    I_local = resize(crop(I, bbox(R)), I_fg.size)
    [A_fg, P_fg]    = FFT(I_fg)
    [A_loc, P_loc]  = FFT(I_local)
    ω ~ Normal(0, σ_f^2)
    A_fus = ω * A_loc + (1 − ω) * A_fg
    I_fg_tilde = IFFT(A_fus * exp(i P_fg))
    pad I_fg_tilde to H×W
    I_tilde = M' ⊙ I + (1 − M') ⊙ I_fg_tilde

    # Global Style Randomization
    Θ ~ Normal(0, σ_g^2)
    I_rand = RC(I_tilde; Θ)
    [A_t, P_t] = FFT(I_tilde)
    [A_rand, P_rand] = FFT(I_rand)
    I_hat = IFFT(A_rand * exp(i P_t))

    # feed I_hat into encoder
end
extract features → pass through HSM → prototypes → segmentation loss
back-propagate BCE loss

At inference, DSR is disabled; images pass directly to the encoder.
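This train-only gating can be sketched as a simple hook in the input pipeline; `dsr_stub` and `prepare_input` are hypothetical names, and the stub merely stands in for the full FSR + GSR pipeline.

```python
import numpy as np

def dsr_stub(image: np.ndarray) -> np.ndarray:
    """Placeholder augmentation: identity plus a small Gaussian perturbation."""
    rng = np.random.default_rng(0)
    return image + rng.normal(0.0, 0.01, size=image.shape)

def prepare_input(image: np.ndarray, training: bool) -> np.ndarray:
    """DSR runs on every support/query image in training; no-op at inference."""
    return dsr_stub(image) if training else image
```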

3. Mathematical Formulation

The operations of DSR can be summarized as follows, with $I \in \mathbb{R}^{3 \times H \times W}$ and $M \in \{0,1\}^{H \times W}$:

Foreground Style Randomization

$$
\begin{align*}
&[A^{fg}, P^{fg}] = \mathrm{FFT}(I^{fg}), \quad [A^{local}, P^{local}] = \mathrm{FFT}(I^{local}) \\
&\omega \sim \mathcal{N}(0, \sigma_f^2) \\
&A^{fusion} = \omega A^{local} + (1-\omega) A^{fg} \\
&\tilde{I}^{fg} = \mathrm{IFFT}(A^{fusion} \cdot e^{i P^{fg}}) \\
&\tilde{I} = M' \odot I + (1-M') \odot \tilde{I}^{fg}
\end{align*}
$$

Global Style Randomization

$$
\begin{align*}
&\tilde{I}^{rand} = RC(\tilde{I}; \Theta), \quad \Theta \sim \mathcal{N}(0, \sigma_g^2) \\
&[\tilde{A}, \tilde{P}] = \mathrm{FFT}(\tilde{I}), \quad [\tilde{A}^{rand}, \tilde{P}^{rand}] = \mathrm{FFT}(\tilde{I}^{rand}) \\
&\hat{I} = \mathrm{IFFT}(\tilde{A}^{rand} \cdot e^{i \tilde{P}})
\end{align*}
$$

4. Hyperparameters and Ablation Outcomes

Typical hyperparameters and empirical results are as follows:

| Hyperparameter | Typical Value | Context |
| --- | --- | --- |
| Pooling kernel $K$ | 9 | Mask smoothing |
| Superpixel scales $L$ | 4 scales, region counts $\{5^2, 10^2, 15^2, 20^2\}$ | FSR, HSM inputs |
| $\sigma_f$ (foreground) | 0.25 | FSR style mix |
| $\sigma_g$ (global; ResNet-50) | 0.1 | GSR (ResNet-50) |
| $\sigma_g$ (global; ViT) | 0.6 | GSR (ViT-B/16) |

No random drop probability is used; both FSR and GSR are applied at every training step.
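For reference, these settings could be collected in a single configuration object; this is a hypothetical layout, and the key names are not from the paper.

```python
# Hypothetical config collecting the reported DSR hyperparameters.
DSR_CONFIG = {
    "pool_kernel_K": 9,                                 # mask smoothing
    "superpixel_scales": [5**2, 10**2, 15**2, 20**2],   # region counts per scale
    "sigma_f": 0.25,                                    # FSR amplitude-mixing std
    "sigma_g": {"resnet50": 0.1, "vit_b16": 0.6},       # GSR std per backbone
}
```

Note that only $\sigma_g$ is backbone-dependent; the FSR settings are shared across architectures.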

Ablation results demonstrate the quantitative effectiveness of DSR:

  • On ResNet-50 backbone: baseline mIoU = 57.8%, with DSR = 60.4% (+2.6%)
  • On ViT-B/16 backbone: baseline mIoU = 62.2%, with DSR = 64.6% (+2.4%)

5. Integration with Hierarchical Semantic Learning

DSR operates as a pure augmentation and simulation module at the model’s input pipeline during training. The style-randomized images $\hat{I}$ are processed by a shared encoder and then forwarded to the Hierarchical Semantic Mining (HSM) module. HSM leverages the same multi-scale superpixel masks as FSR to guide the mining of intra-class consistency and inter-class distinction at various granularity levels. Exposure to DSR-augmented data compels the encoder to generate features that are robust to a continuum of foreground–background style differences and diverse global styles. At test time, DSR is disabled; the encoder and HSM modules extract and aggregate hierarchical semantic features from the unaltered inputs, while Prototype Confidence-modulated Thresholding (PCMT) mitigates ambiguity in the final segmentation via confidence-adaptive thresholding.

Empirical evidence indicates that DSR and HSM jointly deliver substantial improvements over either module alone, suggesting a synergistic relationship between style-randomized data and hierarchical semantic representation learning. The addition of PCMT yields further state-of-the-art performance in cross-domain few-shot segmentation (Sun et al., 15 Nov 2025).

6. Implementation Details and Practical Guidelines

The DSR module is applied at every training iteration, without a drop probability, to both support and query images. The recommended settings for FSR include a pooling kernel $K=9$, style mixing standard deviation $\sigma_f=0.25$, and four superpixel scales. For GSR, $\sigma_g$ should be tuned according to the encoder architecture (e.g., $\sigma_g=0.1$ for ResNet-50, $\sigma_g=0.6$ for ViT-B/16). At each iteration, the steps follow the supplied pseudocode sequence, ensuring reproducibility.

The DSR module’s modularity allows straightforward integration into segmentation model pipelines, with no modifications required for test-time inference. Removal of DSR at inference time does not compromise the benefits conferred by exposure to style diversity during learning.

7. Significance in Cross-domain Few-shot Segmentation

DSR addresses specific shortcomings of prior CD-FSS approaches that overfit to the style and granularity statistics of the source domain, thereby enabling models to generalize more effectively to target domains with unseen, subtle style gaps. By systematically randomizing both local (foreground) and global (image-wide) styles, DSR bridges the granularity gap, improving hierarchical semantic feature separability. Quantitative improvements, as measured by mean Intersection-over-Union (mIoU), substantiate its impact. In summary, DSR is a principled, reproducible augmentation procedure that materially advances the state of the art in domain-robust few-shot segmentation (Sun et al., 15 Nov 2025).
