
FBSDiff: Frequency-Guided I2I Translation

Updated 3 February 2026
  • FBSDiff is a frequency-domain I2I translation framework that uses DCT-based decomposition to isolate appearance, layout, and contour information.
  • It leverages plug-and-play frequency band substitution during diffusion steps to achieve fine-grained, text-guided image manipulations.
  • FBSDiff++ accelerates the process with localized control and adaptive masking, delivering superior efficiency and state-of-the-art editing quality.

FBSDiff is an algorithmic framework for text-driven image-to-image (I2I) translation based on plug-and-play frequency band substitution of latent diffusion features. It enables highly controllable manipulation of image synthesis using a frozen, pre-trained text-to-image (T2I) diffusion model by dynamically substituting only selected frequency bands from a reference image’s diffusion features into the generation trajectory. This approach allows operators to fine-tune the transfer of appearance, layout, and contour information, or interpolate control strength by modulating the bandwidth of the substituted frequencies. FBSDiff and its accelerated extension FBSDiff++ offer versatile, training-free pipelines for state-of-the-art text-conditioned I2I editing, supporting global or localized manipulation, arbitrary input resolutions, and both structure-holding and style-transferring operations (Gao et al., 2024, Gao et al., 27 Jan 2026).

1. Frequency-Band Decomposition and Motivation

In conventional T2I or I2I translation with diffusion models, semantic content, appearance, and structural detail are entangled in spatial-domain latent representations, limiting control over which attributes from a reference image are transferred. FBSDiff introduces a DCT-based frequency separation of latent features, enabling explicit control:

  • Low-frequency coefficients (top-left of DCT spectrum): encode appearance, color distributions, and broad layout statistics.
  • Mid-frequency coefficients: encode intermediate-scale layout or region arrangements.
  • High-frequency coefficients (bottom-right): encode edge information, contours, and fine detail.

By selectively substituting only a desired frequency band at each denoising step, FBSDiff achieves plug-and-play, fine-grained control of guiding factors—appearance, layout, or contour—without modifying model weights or performing any additional training or tuning.
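The band decomposition above can be sketched with SciPy's DCT routines. This is a minimal illustration, not the paper's implementation; the thresholds (10 and 40) and the 64×64 single-channel latent are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

latent = np.random.randn(64, 64)          # stand-in for one latent channel
coeffs = dctn(latent, norm="ortho")       # 2D DCT-II spectrum

# Index-sum grid: small x+y = low frequency (top-left of the spectrum),
# large x+y = high frequency (bottom-right)
x, y = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
idx_sum = x + y

low  = idctn(np.where(idx_sum <= 10, coeffs, 0.0), norm="ortho")                 # appearance/layout
mid  = idctn(np.where((idx_sum > 10) & (idx_sum <= 40), coeffs, 0.0), norm="ortho")  # region arrangement
high = idctn(np.where(idx_sum > 40, coeffs, 0.0), norm="ortho")                  # contours/fine detail

# The three bands partition the spectrum, so they sum back to the original.
assert np.allclose(low + mid + high, latent)
```

Because the DCT is orthonormal and linear, the three band-limited reconstructions partition the latent exactly, which is what makes substituting one band while keeping the others well defined.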

2. Mathematical and Algorithmic Formulation

Let $x$ denote the reference image with latent encoding $z_0 = E(x)$. During generation, the core Frequency Band Substitution (FBS) layer operates as follows:

$$\tilde z_t' = \mathrm{IDCT}\Bigl(\mathrm{DCT}(\hat z_t) \cdot \mathit{Mask}_* + \mathrm{DCT}(\tilde z_t) \cdot (1 - \mathit{Mask}_*)\Bigr)$$

where $\hat z_t$ and $\tilde z_t$ are the reference and generation latent features at timestep $t$, DCT and IDCT are channelwise 2D or cascaded 1D discrete cosine/inverse cosine transforms, and $\mathit{Mask}_*$ denotes a binary mask selecting a frequency band by index sum ($x+y$ for 2D). Mask types include:

  • Low-pass: $\mathit{Mask}_{lp}(x,y)=1$ if $x+y \leq th_{lp}$
  • High-pass: $\mathit{Mask}_{hp}(x,y)=1$ if $x+y > th_{hp}$
  • Mid-pass: $\mathit{Mask}_{mp}(x,y)=1$ if $th_{mp1} < x+y \leq th_{mp2}$
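A single FBS step with these masks can be sketched as follows. Here `ref_lat` plays the role of $\hat z_t$ and `gen_lat` of $\tilde z_t$; the threshold value is an illustrative choice, not the paper's setting.

```python
import numpy as np
from scipy.fft import dctn, idctn

def band_mask(h, w, kind, th1, th2=None):
    """Binary frequency mask selected by DCT index sum x+y."""
    x, y = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    s = x + y
    if kind == "lp":
        return (s <= th1).astype(float)
    if kind == "hp":
        return (s > th1).astype(float)
    return ((s > th1) & (s <= th2)).astype(float)   # mid-pass

def fbs(ref_lat, gen_lat, mask):
    """Substitute the masked band of the reference spectrum into the generation."""
    spec = dctn(ref_lat, norm="ortho") * mask + dctn(gen_lat, norm="ortho") * (1 - mask)
    return idctn(spec, norm="ortho")

ref_lat, gen_lat = np.random.randn(64, 64), np.random.randn(64, 64)
out = fbs(ref_lat, gen_lat, band_mask(64, 64, "lp", th1=12))

# Sanity check: with a full-spectrum mask the output is exactly the reference.
assert np.allclose(fbs(ref_lat, gen_lat, np.ones((64, 64))), ref_lat)
```

Widening or narrowing `th1` directly modulates how much reference information is injected, which is the bandwidth-based control-strength interpolation described above.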

In FBSDiff++, the static 2D-DCT is replaced with cascaded 1D-DCTs along width and height, with percentile-based thresholds (e.g., $pt_{lp}$) so that the method operates on arbitrary image shapes and aspect ratios.
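One plausible reading of the shape-agnostic variant is sketched below: cascaded 1D DCTs along width and height, with the low-pass cutoff expressed as a quantile of a normalized index sum rather than an absolute threshold. The normalization scheme here is an assumption for illustration; the paper's AdaFBS may differ in detail.

```python
import numpy as np
from scipy.fft import dct, idct

def dct2_cascaded(z):
    # Separable transform: 1D DCT along width, then along height
    return dct(dct(z, axis=-1, norm="ortho"), axis=-2, norm="ortho")

def idct2_cascaded(z):
    return idct(idct(z, axis=-2, norm="ortho"), axis=-1, norm="ortho")

def lowpass_mask(h, w, pt_lp):
    """Keep coefficients whose normalized index sum is below the pt_lp quantile."""
    x, y = np.meshgrid(np.arange(h) / h, np.arange(w) / w, indexing="ij")
    return ((x + y) <= np.quantile(x + y, pt_lp)).astype(float)

z = np.random.randn(48, 80)                       # non-square latent
mask = lowpass_mask(*z.shape, pt_lp=0.1)
z_low = idct2_cascaded(dct2_cascaded(z) * mask)   # low-frequency content only
```

Normalizing the indices by the latent's own height and width means a fixed percentile keeps a comparable fraction of the spectrum for any resolution or aspect ratio.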

The I2I pipeline consists of:

  1. Inversion: DDIM inversion of the reference image to noise
  2. Reconstruction: (FBSDiff only) reconstruction to cache the feature trajectory $\{\hat z_t\}$
  3. Text-guided sampling: generation from random noise under target text
  4. Dynamic FBS: at each step, substitute the chosen frequency band after denoising during the “calibration” phase

FBSDiff++ caches the inversion trajectory and entirely eliminates explicit reconstruction, providing significant acceleration.

3. Plug-and-Play Integration and Control Mechanisms

FBSDiff requires only standard UNet denoiser invocations (no model updates). During the initial $\lambda T$ sampling steps (calibration phase), FBS is repeatedly applied; text-only denoising resumes thereafter. The guiding factor—appearance, layout, or contour—is chosen via mask type; guiding intensity is adjusted by varying the mask thresholds (frequency band widths), yielding continuous interpolation from weak to strong attribute transfer.
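The calibration-phase schedule can be sketched as a sampling loop. The denoiser here is a placeholder for the frozen T2I UNet/DDIM update, and `ref_traj` mimics the cached reference trajectory $\{\hat z_t\}$; both are illustrative stand-ins, not the actual model.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
T, lam = 50, 0.5
shape = (64, 64)
ref_traj = [rng.standard_normal(shape) for _ in range(T)]    # cached reference latents

x, y = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
mask = ((x + y) <= 12).astype(float)                         # illustrative low-pass band

def denoise(z, t):
    return 0.98 * z            # placeholder for one frozen text-guided denoising step

z = rng.standard_normal(shape)
for t in range(T):
    z = denoise(z, t)
    if t < lam * T:            # calibration phase: substitute the chosen band
        spec = dctn(ref_traj[t], norm="ortho") * mask + dctn(z, norm="ortho") * (1 - mask)
        z = idctn(spec, norm="ortho")
# remaining steps run with text-only denoising, preserving editability
```

The single `if t < lam * T` gate is the entire integration surface: no weights are touched, which is why the mechanism is plug-and-play across frozen T2I backbones.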

FBSDiff++ extends this paradigm with:

  • AdaFBS: automatic percentile-based mask scaling for arbitrary $H \times W$
  • Localized control: region masks allow spatially restricted editing
  • Style-specific content creation: random spatial transformations (spatial transformation pool, or STP) “destroy” explicit structure in the reference prior to low-frequency substitution, enabling style transfer without layout copying
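A spatial transformation pool can be sketched with simple array operations: random flips, rotations, and shifts scramble the reference's explicit layout before its low-frequency band is substituted, so appearance transfers without the layout. The particular transforms in this pool are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def stp(ref):
    """Apply one randomly chosen layout-destroying spatial transform."""
    pool = [
        lambda a: np.flip(a, axis=-1),                            # horizontal flip
        lambda a: np.rot90(a, k=2, axes=(-2, -1)),                # 180-degree rotation
        lambda a: np.roll(a, shift=a.shape[-1] // 3, axis=-1),    # cyclic shift
    ]
    return pool[rng.integers(len(pool))](ref).copy()

ref = np.arange(16.0).reshape(4, 4)
scrambled = stp(ref)

# Global statistics (hence low-frequency appearance) survive; the spatial
# arrangement does not, so substitution copies style rather than structure.
assert np.isclose(scrambled.mean(), ref.mean())
```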

4. Empirical Performance and Comparative Evaluation

Extensive experiments on LAION-Mini and benchmark protocols demonstrate:

  • Qualitative versatility: By switching among band types, FBSDiff/++ can produce appearance-matching, layout-constrained, or contour-preserving edits, surpassing contemporaneous approaches such as Null-text Inversion, Plug-and-Play, Pix2Pix-zero, InstructPix2Pix, and StyleDiff in visual fidelity and attribute disentanglement.
  • Diversity: Multiple distinct outputs for the same reference/text are produced via stochastic resampling of initial noise, in contrast to inversion-locked methods.
  • Quantitative superiority: FBSDiff/++ achieves top-4 ranks on Structure Sim (DINO), LPIPS, AdaIN style loss, CLIP-similarity, and aesthetic scores under both derivative and style-transfer tasks. FBSDiff++ reaches 0.965 Structure Sim and leads in several trade-off metrics.
  • Efficiency: Architecture streamlining enables FBSDiff++ to achieve an 8.9× speedup over FBSDiff and 7–8× over other state-of-the-art methods, with a total processing time of 9.6 s (A100, SD v1.5, 50 steps).
  • User study validation: Among 70 raters and 20 models, FBSDiff++ attained the highest “Excellent/Optimal” rating proportion.
Method      Inversion   Sampling   Total Time
PT-Inv      42 s        30 s       72 s
StyleDiff   40 s        38 s       78 s
FBSDiff     77 s        8 s        85 s
FBSDiff++   3.5 s       6.1 s      9.6 s

5. Ablation, Limitations, and Future Research

Ablations indicate that:

  • Replacing stepwise dynamic substitution with a one-shot substitution at $t = \lambda T$ results in severe artifacts.
  • Full-spectrum substitution collapses text fidelity; selective (partial) band substitution is critical.
  • STP is necessary for structure-decoupled style transfer.

Identified limitations include:

  • Hyperparameter tuning (frequency thresholds, $\lambda$) is currently manual; automated controllers could improve usability.
  • FBSDiff applies global frequency masks, lacking spatially variant control.
  • In cases of strong semantic mismatch between reference and text, some blending artifacts can appear.
  • The current formulation is channelwise and DCT-based; extensions to other frequency representations or pixel-space diffusion are open research directions.

6. Practical Considerations and Usage Guidelines

FBSDiff/++ is implemented in Python/PyTorch leveraging HuggingFace diffusers (Stable Diffusion v1.5) and scipy for DCT computations. Key parameters:

  • Number of steps $T = 50$, CFG scale $\omega = 7.5$, calibration ratio $\lambda = 0.5$
  • Frequency thresholds or percentiles tuned for desired attribute transfer
  • Region masks and STP available for local edits and style-only generation, respectively
  • The method is resolution-agnostic in FBSDiff++, generalizing to arbitrary image sizes without retraining.

A typical pipeline involves encoding the source image, DDIM inversion, caching latent features, text-guided sampling with AdaFBS at each step, and decoding the output. Masking, style-only, or other functionalities are achieved by simple modular additions.

7. Significance and Impact

FBSDiff introduces a frequency-domain methodology for factor-controllable text-driven I2I by exploiting the spectral separation of appearance, layout, and contours in DCT space. Its training-free, model-agnostic, and highly efficient implementation establishes a new class of practical, tunable I2I pipelines for both research and application contexts, providing a new avenue for controllable generative image editing with strong empirical performance across tasks and benchmarks (Gao et al., 2024, Gao et al., 27 Jan 2026).
