StackMFF V4: Deterministic Multi-Focus Image Fusion
- The paper introduces StackMFF V4, which integrates a pretrained MobileNetV3-Large encoder with a Spatial Aggregation Cross-layer Attention (SACA) block to achieve accurate multi-focus image fusion.
- The framework employs an iterative refinement loop that minimizes artifacts and enhances pixel fidelity while reducing FLOPs to roughly a quarter of those of its predecessor, StackMFF V3.
- StackMFF V4 robustly fuses registered image stacks in near-real-time, enabling seamless integration with generative restoration models like IFControlNet.
StackMFF V4 is a deterministic multi-focus image fusion (MFF) backbone designed to generate all-in-focus images from a stack of partially focused source images. Employed as Stage 1 in the Generative Multi-Focus Image Fusion (GMFF) framework, StackMFF V4 integrates high-resolution feature discrimination, efficient cross-stack information aggregation, and iterative refinement, enabling robust fusion quality at low computational cost. Its architecture leverages a pretrained MobileNetV3-Large encoder, the Spatial Aggregation Cross-layer Attention (SACA) mechanism, and a refinement loop to deliver superior performance on large and complex stacks, as demonstrated by extensive benchmarking against prior StackMFF versions (Xie et al., 25 Dec 2025).
1. Objectives and Design Rationale
StackMFF V4 seeks to produce an all-in-focus image from a set of registered partially focused images of the same scene, subject to three main design criteria:
- Pixel Fidelity: Preserve the sharpest pixels from each input and avoid "hard selection" artifacts at defocus boundaries.
- Efficiency: Achieve near-real-time inference for large stack sizes $N$, scaling better than previous StackMFF variants.
- Artifact Avoidance: Suppress edge artifacts arising from uncertain focus estimation or abrupt selection.
Earlier versions introduced stepwise improvements: V1 (3D CNN) established feasibility but incurred prohibitive computation, V2 used ULDA-Net for focus regression with limited accuracy, and V3 adopted pixel-wise cross-layer classification (soft selection) with a per-pixel cross-layer self-attention (PCA) block, which reduced FLOPs but still left substantial overhead. V4 introduces:
- Enhanced Intra-Layer Discrimination: MobileNetV3-Large, pretrained on ImageNet, enables sharper focus feature extraction compared to ULDA-Net or PFMLP.
- SACA Block: Aggregates features spatially before attention, allowing efficient and more context-aware cross-layer feature fusion.
- Iterative Refinement: Incorporates a lightweight single-pass refinement loop, facilitating error correction in per-layer focus estimation.
These innovations result in higher SSIM/PSNR and roughly one quarter of the floating point operations (FLOPs) of V3 for the same stack size (Xie et al., 25 Dec 2025).
2. Network Architecture
The StackMFF V4 pipeline processes $N$ registered grayscale or RGB images of shape $H \times W \times C$ (with $C = 1$ or $3$; a fixed resolution is used for training) through four primary stages:
2.1 Intra-Layer Focus Estimation
- Backbone: MobileNetV3-Large encoder (ImageNet-pretrained). An initial $3 \times 3$ convolution ($16$ channels, BatchNorm + h-swish), followed by a sequence of inverted residual blocks, yields a spatially downsampled feature tensor per input.
- Decoder Head: A three-stage upsampling path (bilinear interpolation followed by convolution, BatchNorm, and ReLU) progressively reduces the channel dimension down to the $128$-channel decoder features used by SACA, yielding a single-channel focus score map for each of the $N$ layers.
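For concreteness, the following is a minimal PyTorch sketch of this branch, assuming a torchvision backbone; the decoder channel widths and per-stage upsampling factors are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large, MobileNet_V3_Large_Weights

class FocusBranch(nn.Module):
    """Intra-layer focus branch: pretrained encoder + three-stage decoder."""
    def __init__(self):
        super().__init__()
        backbone = mobilenet_v3_large(weights=MobileNet_V3_Large_Weights.IMAGENET1K_V1)
        self.encoder = backbone.features  # 3x3 stem (16 ch) + inverted residual blocks

        def up_block(c_in, c_out):
            # bilinear upsample followed by conv + BatchNorm + ReLU, as in 2.1
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        # assumed channel schedule ending at the 128-channel decoder features
        self.decoder = nn.Sequential(up_block(960, 512), up_block(512, 256), up_block(256, 128))
        self.score_head = nn.Conv2d(128, 1, kernel_size=1)  # single-channel focus score map

    def forward(self, x):                      # x: (B, 3, H, W), one stack layer
        feats = self.decoder(self.encoder(x))  # (B, 128, H', W')
        return feats, self.score_head(feats)   # decoder features + per-layer score map
```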
2.2 Inter-Layer Aggregation (SACA)
- Spatial Aggregation: Average-pool each decoder feature map ($128$ channels) by a factor of $4$, producing pooled features at one-quarter of the decoder's spatial resolution.
- Tokenization: Flatten spatial dimensions to a sequence of tokens of size $128$ per layer.
- Cross-Layer Transformer: Stack the sequences along a new layer dimension and apply multi-head self-attention across the $N$ layer tokens at each spatial position. This captures cross-image focus correlations with cost linear in the number of pooled spatial positions and quadratic in the stack size $N$.
- Notes: Pre-attention LayerNorm on tokens; no positional encodings are used, preserving layer-order invariance.
- Reshaping: Output tokens are bilinearly upsampled (factor $4$) and fed through a convolution to restore dimensionality, resulting in refined full-resolution features.
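A minimal sketch of the SACA mechanism as described above, assuming spatial sizes divisible by 4; the head count, the output convolution, and the exact reshaping are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SACA(nn.Module):
    """Spatial aggregation, then self-attention across the N stack layers."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # pre-attention LayerNorm on tokens
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, feats):                    # feats: (B, N, C, H, W)
        B, N, C, H, W = feats.shape
        x = F.avg_pool2d(feats.reshape(B * N, C, H, W), kernel_size=4)  # spatial aggregation
        h, w = x.shape[-2:]
        # one C-dim token per (layer, pooled position); attention runs over the
        # N layers at each position, so no positional encoding is needed
        x = x.reshape(B, N, C, h * w).permute(0, 3, 1, 2).reshape(B * h * w, N, C)
        x = self.norm(x)
        x, _ = self.attn(x, x, x)                # cross-layer self-attention
        x = x.reshape(B, h * w, N, C).permute(0, 2, 3, 1).reshape(B * N, C, h, w)
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        return self.proj(x).reshape(B, N, C, H, W)  # refined per-layer features
```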
2.3 Iterative Refinement Loop
- Conduct a single repeat (loop count = 1, based on ablation results): concatenate the SACA output with the original decoder features, apply a lightweight convolution, and re-apply SACA. This iterative step corrects the per-layer focus estimates, attenuating the propagation of initial assignment errors.
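A hedged sketch of this pass, parameterized over any SACA-style module mapping $(B, N, C, H, W)$ features to the same shape; the $1 \times 1$ mixing convolution is an assumed implementation detail.

```python
import torch
import torch.nn as nn

class RefinementLoop(nn.Module):
    """Concatenate SACA output with decoder features, mix, and re-apply SACA."""
    def __init__(self, saca: nn.Module, dim=128, loops=1):
        super().__init__()
        self.saca, self.loops = saca, loops
        self.mix = nn.Conv2d(2 * dim, dim, kernel_size=1)  # lightweight fusion conv

    def forward(self, decoder_feats):                # (B, N, C, H, W)
        x = self.saca(decoder_feats)
        for _ in range(self.loops):                  # loop count = 1 per the ablation
            B, N, C, H, W = x.shape
            cat = torch.cat([x, decoder_feats], dim=2).reshape(B * N, 2 * C, H, W)
            x = self.saca(self.mix(cat).reshape(B, N, C, H, W))
        return x
```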
2.4 Focus Map Generation and Fusion
- Per-layer Focus Scores: Generate the final score maps $s_k(x, y)$ via a final convolution on the refined features.
- Softmax Normalization: $w_k(x, y) = \dfrac{\exp(s_k(x, y))}{\sum_{j=1}^{N} \exp(s_j(x, y))}$
- Deterministic Fusion: $F(x, y) = \sum_{k=1}^{N} w_k(x, y)\, I_k(x, y)$, where $I_k$ denotes the $k$-th source image.
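These two steps reduce to a softmax over the stack dimension followed by a weighted sum; a minimal sketch (tensor shapes are illustrative):

```python
import torch

def fuse_stack(images: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """images: (B, N, C, H, W) registered stack; scores: (B, N, 1, H, W) focus scores."""
    weights = torch.softmax(scores, dim=1)   # w_k(x, y): softmax over the stack dimension
    return (weights * images).sum(dim=1)     # F(x, y) = sum_k w_k(x, y) * I_k(x, y)
```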
Summary Table: Key Architectural Stages
| Stage | Function | Key Innovations |
|---|---|---|
| Intra-layer Estimation | Per-layer feature extraction & focus score computation | MobileNetV3-Large backbone |
| Inter-layer Fusion | Cross-stack focus context encoding | SACA block with cross-layer token attention |
| Iterative Refinement | Focus feature correction via loop | One-pass feedback |
| Fusion Output | Softmax-weighted blend of source pixels | Deterministic, artifact-free |
3. Training Regimen and Datasets
3.1 Data Synthesis
- Datasets: DUTS, NYU Depth V2 (native depth), DIODE, Cityscapes, ADE20K.
- Depth Maps: Extracted via Depth Anything V2, except for NYU Depth V2.
- Stack Generation: Multi-focus image stacks are synthesized by stratifying the depth map into focal planes: each pixel appears sharp in the layer whose focal plane matches its ground-truth depth and defocused elsewhere. Layer dropout (0–50%) simulates missing focal planes.
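As an illustration of this procedure, the sketch below quantizes a normalized depth map into focal planes and applies Gaussian defocus that grows with plane distance; the blur model, the sigma schedule, and the dropout mechanics are assumptions, not the authors' exact pipeline.

```python
import random
import torch
import torchvision.transforms.functional as TF

def synthesize_stack(image: torch.Tensor, depth: torch.Tensor, n_layers: int = 8) -> torch.Tensor:
    """image: (C, H, W) in [0, 1]; depth: (H, W) normalized to [0, 1]."""
    bins = (depth * (n_layers - 1)).round().long()   # per-pixel depth stratum
    stack = []
    for k in range(n_layers):                        # simulate focal plane k
        layer = image.clone()
        for d in range(n_layers):
            if d == k:
                continue                             # stratum k stays sharp at plane k
            sigma = 0.5 * abs(d - k)                 # assumed defocus growth with plane distance
            ksize = 2 * int(3 * sigma) + 1           # odd kernel covering ~3 sigma
            blurred = TF.gaussian_blur(image, [ksize, ksize], [sigma, sigma])
            mask = bins == d
            layer[:, mask] = blurred[:, mask]        # out-of-focus pixels come from the blur
        stack.append(layer)
    p = random.uniform(0.0, 0.5)                     # layer dropout rate in [0, 50%]
    kept = [l for l in stack if random.random() >= p] or stack[:1]
    return torch.stack(kept)                         # (N_kept, C, H, W)
```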
3.2 Loss Functions
- Pixel-wise Classification (Cross-Entropy): $\mathcal{L}_{\mathrm{CE}} = -\frac{1}{HW} \sum_{x, y} \sum_{k=1}^{N} g_k(x, y) \log w_k(x, y)$, where $g_k(x, y)$ denotes the ground-truth layer assignment.
- Reconstruction Loss: a pixel-space distance $\mathcal{L}_{\mathrm{rec}}$ between the fused output $F$ and the ground-truth all-in-focus image.
- Total Loss: $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \, \mathcal{L}_{\mathrm{rec}}$, with $\lambda$ a balancing weight.
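Under the assumption of an $\ell_1$ reconstruction term and a scalar weight $\lambda$ (both illustrative), the objective can be sketched as:

```python
import torch
import torch.nn.functional as F

def total_loss(scores, images, gt_index, gt_aif, lam=1.0):
    """scores: (B, N, 1, H, W) focus logits; images: (B, N, C, H, W);
    gt_index: (B, H, W) ground-truth layer per pixel; gt_aif: (B, C, H, W)."""
    ce = F.cross_entropy(scores.squeeze(2), gt_index)  # pixel-wise layer classification
    weights = torch.softmax(scores, dim=1)             # w_k(x, y)
    fused = (weights * images).sum(dim=1)              # deterministic fusion, as in Section 2.4
    rec = F.l1_loss(fused, gt_aif)                     # assumed l1 reconstruction term
    return ce + lam * rec                              # assumed scalar balance lam
```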
3.3 Optimization and Hyperparameters
- Optimizer: AdamW with weight decay
- Batch size: 12 stacks
- Learning rate: exponentially decayed per epoch from its initial value
- Epochs: 50 (∼8 hours on 2×A6000), early stopping by validation SSIM
- Input resolution: fixed during training
Data augmentation includes random flips and cropping. Only 10% of StackMFF V3's data volume is required due to pretrained feature extraction.
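A minimal setup mirroring this recipe; the learning rate, weight decay, and decay factor below are placeholders, since the exact values are not reproduced here.

```python
import torch

def build_optim(model: torch.nn.Module):
    # AdamW with exponential per-epoch decay, mirroring the stated recipe;
    # lr, weight_decay, and gamma are placeholders, not the paper's exact values
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)
    return opt, sched

# usage: call sched.step() once per epoch after training to apply the decay
```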
4. Empirical Evaluation and Ablation Studies
Systematic ablation studies demonstrate the impact of architectural components:
| Comparison Type | Measured Improvement |
|---|---|
| MobileNetV3 vs PFMLP | SSIM: 0.9733 vs 0.9718; FLOPs: 0.51G vs 1.79G |
| SACA vs PCA | PSNR: +0.08 dB; FLOPs: 0.51G vs 0.93G |
| Iterative Loops (0→1) | SSIM: +0.0045 (0.9688→0.9733), diminishing returns beyond one loop |
| SACA Down-sample 1/4 | Optimal SSIM/PSNR trade-off |
A plausible implication is that SACA substantially enhances computational scalability without diluting fusion quality, and the judicious use of a single refinement loop optimally balances accuracy and efficiency (Xie et al., 25 Dec 2025).
5. Implementation Specifics and Performance
- Framework: PyTorch
- Inference speed: ~0.12 s for a two-image stack on a single A6000 (4× faster than StackMFF V3).
- No post-processing: Fused outputs are supplied directly to Stage 2 of GMFF (IFControlNet) when the full pipeline is used.
- Registration: Inputs must be pre-registered; misalignment can degrade results.
- Open Source: Implementation is available at https://github.com/Xinzhe99/StackMFF-Series
StackMFF V4 is trained on synthetically generated, registered stacks; reproducible experiments are facilitated by explicit data splits, augmentation details, and default hyperparameters.
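To reproduce a wall-clock figure like the ~0.12 s reported above, GPU timing should synchronize around the forward pass; a small illustrative harness follows, where the `model` argument stands in for the actual network rather than the repository's API.

```python
import time
import torch

def time_fusion(model: torch.nn.Module, stack: torch.Tensor) -> float:
    """Wall-clock seconds for one fused forward pass over a registered stack."""
    model.eval()
    with torch.no_grad():
        if stack.is_cuda:
            torch.cuda.synchronize()   # flush pending kernels before starting the clock
        t0 = time.perf_counter()
        _ = model(stack)
        if stack.is_cuda:
            torch.cuda.synchronize()   # wait for the forward pass to actually finish
    return time.perf_counter() - t0
```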
6. Related Work and Integration in GMFF
StackMFF V4 serves as the deterministic backbone in the two-stage GMFF pipeline. The all-in-focus output from StackMFF is subsequently refined via IFControlNet, a latent diffusion model module targeting restoration of content from missing focus planes and suppression of edge artifacts. Each stage operates independently, but the high fidelity of StackMFF V4’s deterministic fusion directly impacts the effectiveness of subsequent generative restoration (Xie et al., 25 Dec 2025).
StackMFF V4’s advancements over StackMFF V1-V3 illustrate the progressive shift from heavy, monolithic architectures (V1), to lightweight regression (V2), to attention-based soft selection (V3), culminating in V4’s synergistic blend of pretrained discriminative encoders, efficient attention, and iterative refinement for artifact-minimized, real-time multi-focus image fusion.