StackMFF V4: Deterministic Multi-Focus Image Fusion

Updated 31 December 2025
  • The paper introduces StackMFF V4, which combines a pretrained MobileNetV3-Large encoder with a Spatial Aggregation Cross-layer Attention (SACA) block to achieve accurate multi-focus image fusion.
  • The framework employs an iterative refinement loop (a single pass in practice) that minimizes artifacts and enhances pixel fidelity while reducing FLOPs to roughly a quarter of StackMFF V3's.
  • StackMFF V4 robustly fuses registered image stacks in near-real-time, enabling seamless integration with generative restoration models like IFControlNet.

StackMFF V4 is a deterministic multi-focus image fusion (MFF) backbone designed to generate all-in-focus images from a stack of partially focused source images. Employed as Stage 1 in the Generative Multi-Focus Image Fusion (GMFF) framework, StackMFF V4 integrates high-resolution feature discrimination, efficient cross-stack information aggregation, and iterative refinement, enabling robust fusion quality at low computational cost. Its architecture leverages a pretrained MobileNetV3-Large encoder, the Spatial Aggregation Cross-layer Attention (SACA) mechanism, and a refinement loop to deliver superior performance on large and complex stacks, as demonstrated by extensive benchmarking against prior StackMFF versions (Xie et al., 25 Dec 2025).

1. Objectives and Design Rationale

StackMFF V4 seeks to produce an all-in-focus image $I_{\mathrm{fus}}$ from a set of $N$ registered partially focused images $\{I_1, \ldots, I_N\}$ of the same scene, subject to three main design criteria:

  • Pixel Fidelity: Preserve the sharpest pixels from each input and avoid "hard selection" artifacts at defocus boundaries.
  • Efficiency: Achieve near-real-time inference for large $N$, scaling better than previous StackMFF variants.
  • Artifact Avoidance: Suppress edge artifacts arising from uncertain focus estimation or abrupt selection.

Earlier versions introduced stepwise improvements: V1 (3D CNN) established feasibility but incurred prohibitive computation, V2 used ULDA-Net for focus regression with limited accuracy, and V3 adopted a pixel-wise cross-layer classification (soft selection) with a per-pixel cross-layer self-attention (PCA) block, reducing but not minimizing FLOPs. V4 introduces:

  1. Enhanced Intra-Layer Discrimination: MobileNetV3-Large, pretrained on ImageNet, enables sharper focus feature extraction compared to ULDA-Net or PFMLP.
  2. SACA Block: Aggregates features spatially before attention, allowing efficient and more context-aware cross-layer feature fusion.
  3. Iterative Refinement: Incorporates a lightweight single-pass refinement loop, facilitating error correction in per-layer focus estimation.

These innovations result in higher SSIM/PSNR and roughly one quarter of the floating point operations (FLOPs) of V3 for the same stack size (Xie et al., 25 Dec 2025).

2. Network Architecture

The StackMFF V4 pipeline processes $N$ registered grayscale or RGB images of shape $\mathbb{R}^{H \times W \times C}$ ($C = 1$ or $3$; $H = W = 384$ for training) through four primary stages:

2.1 Intra-Layer Focus Estimation

  • Backbone: MobileNetV3-Large encoder (ImageNet-pretrained). An initial $3 \times 3$ convolution (stride 2, 16 channels, BatchNorm + h-swish), followed by inverted residual blocks, yields a feature tensor of shape $\lfloor H/16 \rfloor \times \lfloor W/16 \rfloor \times 960$ per input.
  • Decoder Head: A three-stage upsampling path (bilinear $\times 2$ followed by $3 \times 3$ conv, BatchNorm, ReLU) progressively reduces the channel dimension $960 \rightarrow 256 \rightarrow 128 \rightarrow 1$, yielding a single-channel per-layer focus score map $S_i(x, y)$ for every layer $i$ (a PyTorch sketch of this branch follows the list).
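The following minimal PyTorch sketch illustrates this intra-layer branch as described above. The exact backbone strides, intermediate taps, and names (`IntraLayerFocusNet`, `up_block`) are illustrative assumptions, not the released implementation.

```python
# Sketch of the intra-layer branch: pretrained MobileNetV3-Large encoder
# followed by an upsampling decoder that emits 128-channel features and a
# single-channel focus score map per input layer. Details are assumptions.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large, MobileNet_V3_Large_Weights


class IntraLayerFocusNet(nn.Module):
    def __init__(self):
        super().__init__()
        # ImageNet-pretrained encoder; its final feature map has 960 channels.
        backbone = mobilenet_v3_large(weights=MobileNet_V3_Large_Weights.IMAGENET1K_V2)
        self.encoder = backbone.features

        def up_block(c_in, c_out):
            # bilinear x2 upsampling followed by 3x3 conv + BatchNorm + ReLU
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        # Channel reduction 960 -> 256 -> 128, then a 1-channel score head.
        self.up1 = up_block(960, 256)
        self.up2 = up_block(256, 128)
        self.score = nn.Conv2d(128, 1, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (B, 3, H, W)
        f = self.encoder(x)                    # low-resolution 960-channel features
        f = self.up1(f)
        feat = self.up2(f)                     # 128-channel decoder features F_i
        s = self.score(nn.functional.interpolate(
            feat, scale_factor=2, mode="bilinear", align_corners=False))
        return feat, s                         # per-layer features and focus scores
```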

2.2 Inter-Layer Aggregation (SACA)

  • Spatial Aggregation: Average-pool each decoder feature map $F_i$ (128 channels) by a factor of 4, producing $\bar{F}_i \in \mathbb{R}^{(H/4) \times (W/4) \times 128}$.
  • Tokenization: Flatten the spatial dimensions into a sequence of $M = (H/4) \cdot (W/4)$ tokens of size 128 per layer.
  • Cross-Layer Transformer: Stack the $N$ sequences along a new dimension and apply multi-head self-attention across the $N$ layer tokens at each spatial position. This captures cross-image focus correlations with complexity $O(N^2 \cdot M)$.
    • Notes: Pre-attention LayerNorm on tokens; no positional encodings are used, preserving layer-order invariance.
  • Reshaping: Output tokens are bilinearly upsampled (factor 4) and passed through a $1 \times 1$ convolution to restore the $H \times W$ resolution, resulting in refined features $F'_i$ (see the sketch after this list).
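A minimal sketch of the SACA idea as described: pool by a factor of 4, tokenize, attend across the $N$ layers at each pooled position without positional encodings, then upsample and project. The head count, class name `SACA`, and internal details are assumptions.

```python
# Sketch of Spatial Aggregation Cross-layer Attention: attention runs over the
# N layer tokens at each pooled spatial position, giving O(N^2 * M) cost.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SACA(nn.Module):
    def __init__(self, channels=128, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)                     # pre-attention LayerNorm
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats):                                  # feats: (B, N, C, h, w)
        B, N, C, h, w = feats.shape
        pooled = F.avg_pool2d(feats.flatten(0, 1), kernel_size=4)   # spatial aggregation by 4
        hp, wp = pooled.shape[-2:]
        M = hp * wp
        # Tokens: one sequence of length N (the layers) per spatial position.
        tok = pooled.view(B, N, C, M).permute(0, 3, 1, 2).reshape(B * M, N, C)
        tok = self.norm(tok)
        attn_out, _ = self.attn(tok, tok, tok)                 # attention across the N layers
        out = attn_out.reshape(B, M, N, C).permute(0, 2, 3, 1).reshape(B * N, C, hp, wp)
        # Bilinear upsampling (factor 4) back to the pre-pooling resolution.
        out = F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)
        out = self.proj(out)                                   # 1x1 conv projection
        return out.view(B, N, C, h, w)                         # refined features F'_i
```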

2.3 Iterative Refinement Loop

  • Conduct a single repeat (loop count = 1, based on ablation results): concatenate $\{F'_i\}$ with the original decoder features, apply a lightweight $3 \times 3$ convolution, and re-apply SACA. This step corrects the initial focus feature estimates, attenuating the propagation of early focus assignment errors (a minimal sketch follows).
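A hedged sketch of the single refinement pass, reusing the `SACA` module from the previous sketch; the concatenation order and channel counts are assumptions.

```python
# Sketch of the one-pass refinement: merge SACA output with the original
# decoder features via a lightweight 3x3 conv, then re-apply SACA once.
import torch
import torch.nn as nn

refine = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=3, padding=1),  # 128 (F'_i) + 128 (F_i) -> 128
    nn.ReLU(inplace=True),
)

def refinement_pass(feats, saca):
    # feats: (B, N, 128, h, w) original decoder features F_i; saca: a SACA instance
    refined = saca(feats)                                     # first SACA pass -> F'_i
    merged = torch.cat([refined, feats], dim=2)               # (B, N, 256, h, w)
    B, N, C, h, w = merged.shape
    merged = refine(merged.flatten(0, 1)).view(B, N, 128, h, w)
    return saca(merged)                                       # one corrective SACA pass
```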

2.4 Focus Map Generation and Fusion

  • Per-layer Focus Scores: Generate the final $S_i(x, y)$ via a $1 \times 1$ convolution on the refined features.
  • Softmax Normalization:

$$w_i(x, y) = \frac{\exp(S_i(x, y))}{\sum_{j=1}^{N} \exp(S_j(x, y))}$$

  • Deterministic Fusion:

$$I_{\mathrm{fus}}(x, y) = \sum_{i=1}^{N} w_i(x, y) \cdot I_i(x, y)$$
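A minimal sketch of the deterministic fusion step defined by the two equations above; the helper name `fuse_stack` is hypothetical.

```python
# Softmax across the N layer scores, then a per-pixel weighted sum of sources.
import torch

def fuse_stack(images, scores):
    # images: (N, C, H, W) registered source images
    # scores: (N, 1, H, W) per-layer focus scores S_i(x, y)
    weights = torch.softmax(scores, dim=0)      # w_i(x, y), sums to 1 over layers
    return (weights * images).sum(dim=0)        # all-in-focus image I_fus

# Example: fusing a stack of 5 RGB images
stack = torch.rand(5, 3, 384, 384)
scores = torch.rand(5, 1, 384, 384)
fused = fuse_stack(stack, scores)               # (3, 384, 384)
```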

Summary Table: Key Architectural Stages

| Stage | Function | Key Innovations |
|---|---|---|
| Intra-layer Estimation | Per-layer feature extraction and focus score computation | MobileNetV3-Large backbone |
| Inter-layer Fusion | Cross-stack focus context encoding | SACA block with token attention |
| Iterative Refinement | Focus feature correction via loop | One-pass feedback |
| Fusion Output | Softmax-weighted blend of source pixels | Deterministic, artifact-free |

3. Training Regimen and Datasets

3.1 Data Synthesis

  • Datasets: DUTS, NYU Depth V2 (native depth), DIODE, Cityscapes, ADE20K.
  • Depth Maps: Extracted via Depth Anything V2, except for NYU Depth V2.
  • Stack Generation: Multi-focus image stacks are synthesized by stratifying depth, assigning each pixel to the input whose focal plane matches its ground-truth depth. Layer dropout (0–50%) simulates missing focus conditions (a simplified synthesis sketch follows this list).
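The paper's exact defocus and stratification model is not reproduced here; the sketch below merely illustrates the idea with a simple depth-dependent blend between a sharp image and a Gaussian-blurred copy, plus random layer dropout. The function `synthesize_stack` and all its parameters are illustrative assumptions.

```python
# Toy synthesis of a multi-focus stack from an all-in-focus image and a depth
# map: pixels near each focal plane stay sharp, distant pixels are defocused.
import random
import torch
from torchvision.transforms.functional import gaussian_blur

def synthesize_stack(image, depth, num_layers=8, sigma=3.0, dropout_p=0.3):
    # image: (C, H, W) all-in-focus image; depth: (H, W) normalized to [0, 1]
    blurred = gaussian_blur(image, kernel_size=9, sigma=sigma)   # fully defocused copy
    stack = []
    for f in torch.linspace(0, 1, num_layers):
        if random.random() < dropout_p:                          # layer dropout (missing plane)
            continue
        # Blend weight: close to 1 where depth matches the focal plane f.
        w = (1.0 - (depth - f).abs().clamp(0, 1)).pow(4).unsqueeze(0)  # (1, H, W)
        stack.append(w * image + (1 - w) * blurred)
    return torch.stack(stack)                                    # (N, C, H, W) synthetic stack
```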

3.2 Loss Functions

  • Pixel-wise Classification (Cross-Entropy):

$$\mathcal{L}_{\mathrm{cls}} = - \sum_{x, y} \sum_{i=1}^{N} Y_i(x, y) \log w_i(x, y)$$

where $Y_i(x, y) \in \{0, 1\}$ denotes the ground-truth layer assignment.

  • Reconstruction Loss:

$$\mathcal{L}_{\mathrm{rec}} = \sum_{x, y} \left\| I_{\mathrm{fus}}(x, y) - I_{\mathrm{gt}}(x, y) \right\|_1$$

  • Total Loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda \mathcal{L}_{\mathrm{rec}}, \quad \lambda = 1$$
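A minimal sketch of the combined objective, assuming the classification term is implemented as per-pixel cross-entropy over the $N$ layer logits and $\lambda = 1$ as stated; the helper name `stackmff_loss` is hypothetical.

```python
# Per-pixel cross-entropy over the N layers plus L1 reconstruction, lambda = 1.
import torch
import torch.nn.functional as F

def stackmff_loss(scores, target_layer, fused, gt, lam=1.0):
    # scores:       (B, N, H, W) per-layer focus logits S_i(x, y)
    # target_layer: (B, H, W)    ground-truth layer index Y(x, y)
    # fused, gt:    (B, C, H, W) fused output and all-in-focus ground truth
    l_cls = F.cross_entropy(scores, target_layer)   # softmax over the N layers, per pixel
    l_rec = F.l1_loss(fused, gt)
    return l_cls + lam * l_rec
```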

3.3 Optimization and Hyperparameters

  • Optimizer: AdamW
  • Batch size: 12 stacks
  • Initial LR: $1 \times 10^{-3}$, exponential decay $\gamma = 0.9$ per epoch
  • Weight decay: $1 \times 10^{-4}$
  • Epochs: 50 (∼8 hours on 2×A6000), early stopping by validation SSIM
  • Input resolution: $384 \times 384$

Data augmentation includes random flips and cropping. Only 10% of StackMFF V3's data volume is required due to pretrained feature extraction.
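A minimal sketch of the stated optimization setup (AdamW, initial LR $1 \times 10^{-3}$, weight decay $1 \times 10^{-4}$, exponential decay $\gamma = 0.9$ per epoch); the placeholder model and the empty loop body are illustrative only.

```python
# Optimizer and per-epoch exponential LR schedule matching the listed hyperparameters.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)      # placeholder for the StackMFF V4 network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(50):
    # ... one pass over batches of 12 synthesized 384x384 stacks ...
    scheduler.step()                        # gamma = 0.9 decay applied once per epoch
```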

4. Empirical Evaluation and Ablation Studies

Systematic ablation studies demonstrate the impact of architectural components:

| Comparison | Measured Improvement |
|---|---|
| MobileNetV3 vs. PFMLP | SSIM: 0.9733 vs. 0.9718; FLOPs: 0.51G vs. 1.79G |
| SACA vs. PCA | PSNR: +0.08 dB; FLOPs: 0.51G vs. 0.93G |
| Iterative loops (0 → 1) | SSIM: +0.0045 (0.9688 → 0.9733); diminishing returns beyond one loop |
| SACA down-sampling factor 1/4 | Optimal SSIM/PSNR trade-off |

A plausible implication is that SACA substantially enhances computational scalability without diluting fusion quality, and the judicious use of a single refinement loop optimally balances accuracy and efficiency (Xie et al., 25 Dec 2025).

5. Implementation Specifics and Performance

  • Framework: PyTorch
  • Inference speed: ~0.12 s for two $256 \times 256$ images on a single A6000 (4× faster than StackMFF V3).
  • No post-processing: Fused outputs are supplied directly to Stage 2 of GMFF (IFControlNet) when the generative stage is used.
  • Registration: Inputs must be pre-registered; misalignment can degrade results.
  • Open Source: Implementation is available at https://github.com/Xinzhe99/StackMFF-Series

StackMFF V4 is trained on synthetically generated, registered stacks; reproducible experiments are facilitated by explicit data splits, augmentation details, and default hyperparameters.

StackMFF V4 serves as the deterministic backbone in the two-stage GMFF pipeline. The all-in-focus output from StackMFF is subsequently refined via IFControlNet, a latent diffusion model module targeting restoration of content from missing focus planes and suppression of edge artifacts. Each stage operates independently, but the high fidelity of StackMFF V4’s deterministic fusion directly impacts the effectiveness of subsequent generative restoration (Xie et al., 25 Dec 2025).

StackMFF V4’s advancements over StackMFF V1-V3 illustrate the progressive shift from heavy, monolithic architectures (V1), to lightweight regression (V2), to attention-based soft selection (V3), culminating in V4’s synergistic blend of pretrained discriminative encoders, efficient attention, and iterative refinement for artifact-minimized, real-time multi-focus image fusion.
