FBSDiff: Frequency-Guided I2I Translation
- FBSDiff is a frequency-domain I2I translation framework that uses DCT-based decomposition to isolate appearance, layout, and contour information.
- It leverages plug-and-play frequency band substitution during diffusion steps to achieve fine-grained, text-guided image manipulations.
- FBSDiff++ accelerates the process with localized control and adaptive masking, delivering superior efficiency and state-of-the-art editing quality.
FBSDiff is an algorithmic framework for text-driven image-to-image (I2I) translation based on plug-and-play frequency band substitution of latent diffusion features. It enables highly controllable manipulation of image synthesis using a frozen, pre-trained text-to-image (T2I) diffusion model by dynamically substituting only selected frequency bands from a reference image’s diffusion features into the generation trajectory. This approach allows operators to fine-tune the transfer of appearance, layout, and contour information, or interpolate control strength by modulating the bandwidth of the substituted frequencies. FBSDiff and its accelerated extension FBSDiff++ offer versatile, training-free pipelines for state-of-the-art text-conditioned I2I editing, supporting global or localized manipulation, arbitrary input resolutions, and both structure-holding and style-transferring operations (Gao et al., 2024, Gao et al., 27 Jan 2026).
1. Frequency-Band Decomposition and Motivation
In conventional T2I or I2I translation with diffusion models, semantic content, appearance, and structural detail are entangled in spatial-domain latent representations, limiting control over which attributes from a reference image are transferred. FBSDiff introduces a DCT-based frequency separation of latent features, enabling explicit control:
- Low-frequency coefficients (top-left of DCT spectrum): encode appearance, color distributions, and broad layout statistics.
- Mid-frequency coefficients: encode intermediate-scale layout or region arrangements.
- High-frequency coefficients (bottom-right): encode edge information, contours, and fine detail.
By selectively substituting only a desired frequency band at each denoising step, FBSDiff achieves plug-and-play, fine-grained control of guiding factors—appearance, layout, or contour—without modifying model weights or performing any additional training or tuning.
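As a concrete illustration of the decomposition (the thresholds below are arbitrary, not the paper's settings), three index-sum masks partition the DCT spectrum, so the band components reconstruct the original signal exactly:

```python
import numpy as np
from scipy.fft import dctn, idctn

def band_masks(h, w, tau_low, tau_high):
    # Index-sum masks over the 2D DCT spectrum: the sum u + v
    # partitions coefficients into low/mid/high bands
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    s = u + v
    low = s <= tau_low
    high = s >= tau_high
    mid = ~low & ~high
    return low.astype(float), mid.astype(float), high.astype(float)

def decompose(x, tau_low, tau_high):
    # Split a 2D feature map into low-, mid-, and high-frequency parts
    X = dctn(x, norm="ortho")
    masks = band_masks(*x.shape, tau_low, tau_high)
    return tuple(idctn(X * m, norm="ortho") for m in masks)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))
lo, mi, hi = decompose(x, tau_low=20, tau_high=80)
# The masks partition the spectrum, so the components sum back to the input
assert np.allclose(lo + mi + hi, x)
```

Because the masks are complementary, widening one band necessarily narrows the others, which is what makes bandwidth a natural control knob for guiding strength.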
2. Mathematical and Algorithmic Formulation
Let $y$ denote the reference image with latent encoding $z^{\text{ref}}$. During generation, the core Frequency Band Substitution (FBS) layer operates as follows:

$$\tilde{z}_t^{\text{gen}} = \mathrm{IDCT}\!\left(\mathcal{M} \odot \mathrm{DCT}\!\left(z_t^{\text{ref}}\right) + (1 - \mathcal{M}) \odot \mathrm{DCT}\!\left(z_t^{\text{gen}}\right)\right),$$

where $z_t^{\text{ref}}$ and $z_t^{\text{gen}}$ are the reference and generation latent features at timestep $t$, DCT and IDCT are channelwise 2D or cascaded 1D discrete cosine/inverse cosine transforms, and $\mathcal{M}$ denotes a binary mask selecting a frequency band by index sum $u + v$ (for 2D). Mask types include:
- Low-pass: $\mathcal{M}(u, v) = 1$ if $u + v \le \tau_l$
- High-pass: $\mathcal{M}(u, v) = 1$ if $u + v \ge \tau_h$
- Mid-pass: $\mathcal{M}(u, v) = 1$ if $\tau_l < u + v < \tau_h$
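The FBS layer itself reduces to a few lines. Below is a minimal NumPy/SciPy sketch (function and variable names are illustrative, not from the reference implementation):

```python
import numpy as np
from scipy.fft import dctn, idctn

def fbs(z_gen, z_ref, mask):
    """Substitute the masked DCT band of z_ref into z_gen (channelwise)."""
    Zg = dctn(z_gen, norm="ortho", axes=(-2, -1))
    Zr = dctn(z_ref, norm="ortho", axes=(-2, -1))
    # Keep the unmasked band from the generation, take the masked band
    # from the reference, and transform back to the spatial domain
    return idctn(mask * Zr + (1 - mask) * Zg, norm="ortho", axes=(-2, -1))

# Example low-pass mask over index sum u + v (appearance transfer)
h = w = 16
u, v = np.arange(h)[:, None], np.arange(w)[None, :]
low_pass = ((u + v) <= 5).astype(float)
```

With a mask of all ones the output equals $z_t^{\text{ref}}$; with all zeros it equals $z_t^{\text{gen}}$, so varying the band width interpolates continuously between the two, which is the control-strength mechanism described above.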
In FBSDiff++, the static 2D-DCT is replaced with cascaded 1D-DCTs along width and height, with percentile-based thresholds (rather than fixed index cutoffs) ensuring the method operates on arbitrary image shapes and aspect ratios.
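Because the 2D DCT is separable, cascading orthonormal 1D transforms along width and then height reproduces the 2D transform exactly, which is why per-axis percentile thresholds can be applied independently to each dimension. A quick check with SciPy:

```python
import numpy as np
from scipy.fft import dct, dctn

x = np.random.default_rng(1).standard_normal((48, 32))

# Cascaded 1D DCTs: first along width (axis -1), then along height (axis -2)
cascaded = dct(dct(x, norm="ortho", axis=-1), norm="ortho", axis=-2)

# Matches the direct 2D transform
assert np.allclose(cascaded, dctn(x, norm="ortho"))
```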
The I2I pipeline consists of:
- Inversion: DDIM inversion of the reference image to its terminal noise latent
- Reconstruction: (FBSDiff only) a reconstruction pass to cache the reference feature trajectory
- Text-guided sampling: generation from random noise under the target text prompt
- Dynamic FBS: at each step during the "calibration" phase, substitute the chosen frequency band of the reference features into the generation latent after denoising
FBSDiff++ caches the inversion trajectory and entirely eliminates explicit reconstruction, providing significant acceleration.
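The control flow of the FBSDiff++-style pipeline (inversion caching, no reconstruction pass) can be sketched as follows. The denoiser and inversion steps here are toy linear stand-ins so the sketch stays runnable; a real pipeline would call a DDIM inverter and a text-conditioned UNet, and the hyperparameter values are placeholders:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Toy stand-ins (hypothetical): simple linear maps in place of real
# DDIM inversion and text-conditioned UNet denoising
def ddim_invert_step(z):   return 0.9 * z + 0.1
def denoise_step(z, text): return 0.9 * z  # text conditioning elided

def fbs_pipeline(z_ref, text, steps=10, calib_ratio=0.6, tau=8):
    # 1) Inversion: cache the reference trajectory (no explicit
    #    reconstruction pass, per FBSDiff++)
    traj, z = [], z_ref
    for _ in range(steps):
        z = ddim_invert_step(z)
        traj.append(z)
    # 2) Text-guided sampling from noise with dynamic FBS
    h, w = z_ref.shape
    u, v = np.arange(h)[:, None], np.arange(w)[None, :]
    mask = ((u + v) <= tau).astype(float)  # low-pass: appearance transfer
    z_gen = np.random.default_rng(0).standard_normal(z_ref.shape)
    for t in range(steps):
        z_gen = denoise_step(z_gen, text)
        if t < int(calib_ratio * steps):   # calibration phase only
            Zr = dctn(traj[steps - 1 - t], norm="ortho")
            Zg = dctn(z_gen, norm="ortho")
            z_gen = idctn(mask * Zr + (1 - mask) * Zg, norm="ortho")
    return z_gen
```

The key structural points are that the inversion trajectory is computed once and reused, and that FBS fires only during the early calibration steps before pure text-guided denoising takes over.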
3. Plug-and-Play Integration and Control Mechanisms
FBSDiff requires only standard UNet denoiser invocations (no model updates). During the initial sampling steps (calibration phase), FBS is repeatedly applied; text-only denoising resumes thereafter. The guiding factor—appearance, layout, or contour—is chosen via mask type; guiding intensity is adjusted by varying the mask thresholds (frequency band widths), yielding continuous interpolation from weak to strong attribute transfer.
FBSDiff++ extends this paradigm with:
- AdaFBS: automatic percentile-based mask scaling for arbitrary image resolutions and aspect ratios
- Localized control: region masks allow spatially restricted editing
- Style-specific content creation: random spatial transformations (spatial transformation pool, or STP) “destroy” explicit structure in the reference prior to low-frequency substitution, enabling style transfer without layout copying
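The STP idea can be sketched as sampling a random shape-preserving transform of the reference latent before substitution; the pool below (flips and a 180° rotation) is a hypothetical, much simpler stand-in for the transformations used in the paper:

```python
import numpy as np

def stp_sample(z_ref, rng):
    # Hypothetical spatial transformation pool (STP): a random transform
    # scrambles the reference layout so that only low-frequency appearance
    # statistics survive the subsequent band substitution
    ops = [
        lambda z: z,
        lambda z: np.flip(z, axis=-1),              # horizontal flip
        lambda z: np.flip(z, axis=-2),              # vertical flip
        lambda z: np.rot90(z, k=2, axes=(-2, -1)),  # 180-degree rotation
    ]
    return ops[rng.integers(len(ops))](z_ref)
```

Since every transform in the pool permutes pixels without changing their values, the global color and appearance statistics the low-frequency band encodes are preserved while the spatial layout is decorrelated.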
4. Empirical Performance and Comparative Evaluation
Extensive experiments on LAION-Mini and benchmark protocols demonstrate:
- Qualitative versatility: By switching among band types, FBSDiff/++ can produce appearance-matching, layout-constrained, or contour-preserving edits, surpassing contemporaneous approaches such as Null-text Inversion, Plug-and-Play, Pix2Pix-zero, InstructPix2Pix, and StyleDiff in visual fidelity and attribute disentanglement.
- Diversity: Multiple distinct outputs for the same reference/text are produced via stochastic resampling of initial noise, in contrast to inversion-locked methods.
- Quantitative superiority: FBSDiff/++ achieves top-4 ranks on Structure Sim (DINO), LPIPS, AdaIN style loss, CLIP-similarity, and aesthetic scores under both derivative and style-transfer tasks. FBSDiff++ reaches 0.965 Structure Sim and leads in several trade-off metrics.
- Efficiency: Pipeline streamlining enables FBSDiff++ to achieve an 8.9× speedup over FBSDiff and a 7–8× speedup over other state-of-the-art methods, with a total processing time of 9.6 s (A100, SD v1.5, 50 steps).
- User study validation: Among 70 raters and 20 models, FBSDiff++ attained the highest “Excellent/Optimal” rating proportion.
| Method | Inversion | Sampling | Total Time |
|---|---|---|---|
| PT-Inv | 42 s | 30 s | 72 s |
| StyleDiff | 40 s | 38 s | 78 s |
| FBSDiff | 77 s | 8 s | 85 s |
| FBSDiff++ | 3.5 s | 6.1 s | 9.6 s |
5. Ablation, Limitations, and Future Research
Ablations indicate that:
- Replacing stepwise dynamic substitution with a one-shot substitution at a single timestep results in severe artifacts.
- Full-spectrum substitution collapses text fidelity; selective (partial) band substitution is critical.
- STP is necessary for structure-decoupled style transfer.
Identified limitations include:
- Hyperparameter tuning (e.g., frequency thresholds) is currently manual; automated controllers could improve usability.
- FBSDiff applies global frequency masks, lacking spatially variant control.
- In cases of strong semantic mismatch between reference and text, some blending artifacts can appear.
- The current formulation is channelwise and DCT-based; extensions to other frequency representations or pixel-space diffusion are open research directions.
6. Practical Considerations and Usage Guidelines
FBSDiff/++ is implemented in Python/PyTorch leveraging HuggingFace diffusers (Stable Diffusion v1.5) and scipy for DCT computations. Key parameters:
- Number of denoising steps, classifier-free guidance (CFG) scale, and calibration ratio
- Frequency thresholds or percentiles tuned for desired attribute transfer
- Region masks and STP available for local edits and style-only generation, respectively
- The method is resolution-agnostic in FBSDiff++, generalizing to arbitrary image sizes without retraining.
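An illustrative parameter set might look like the following (key names are hypothetical; the step count and CFG scale are the common Stable Diffusion v1.5 defaults, and the remaining values are placeholders, not the paper's tuned settings):

```python
# Illustrative hyperparameters (names hypothetical; values are common
# Stable Diffusion defaults or placeholders, not the paper's settings)
config = {
    "num_inference_steps": 50,    # DDIM denoising steps
    "guidance_scale": 7.5,        # classifier-free guidance (CFG) scale
    "calibration_ratio": 0.6,     # fraction of steps with FBS applied
    "mask_type": "low_pass",      # low_pass | mid_pass | high_pass
    "tau_low_pct": 0.1,           # percentile threshold (AdaFBS-style)
}
```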
A typical pipeline involves encoding the source image, DDIM inversion, caching latent features, text-guided sampling with AdaFBS at each step, and decoding the output. Region masking, style-only generation, and other functionalities are achieved through simple modular additions.
7. Significance and Impact
FBSDiff introduces a frequency-domain methodology for factor-controllable text-driven I2I by exploiting the spectral separation of appearance, layout, and contours in DCT space. Its training-free, model-agnostic, and highly efficient implementation establishes a new class of practical, tunable I2I pipelines for both research and application contexts, providing a new avenue for controllable generative image editing with strong empirical performance across tasks and benchmarks (Gao et al., 2024, Gao et al., 27 Jan 2026).