StackMFF V4: Deterministic Multi-Focus Image Fusion
- The paper introduces StackMFF V4, which integrates a pretrained MobileNetV3-Large encoder with a Spatial Aggregation Cross-layer Attention (SACA) block to achieve accurate multi-focus image fusion.
- The framework employs an iterative refinement loop that minimizes artifacts and enhances pixel fidelity while reducing FLOPs to roughly a quarter of those of its predecessor, StackMFF V3.
- StackMFF V4 robustly fuses registered image stacks in near-real-time, enabling seamless integration with generative restoration models like IFControlNet.
StackMFF V4 is a deterministic multi-focus image fusion (MFF) backbone designed to generate all-in-focus images from a stack of partially focused source images. Employed as Stage 1 in the Generative Multi-Focus Image Fusion (GMFF) framework, StackMFF V4 integrates high-resolution feature discrimination, efficient cross-stack information aggregation, and iterative refinement, enabling robust fusion quality at low computational cost. Its architecture leverages a pretrained MobileNetV3-Large encoder, the Spatial Aggregation Cross-layer Attention (SACA) mechanism, and a refinement loop to deliver superior performance on large and complex stacks, as demonstrated by extensive benchmarking against prior StackMFF versions (Xie et al., 25 Dec 2025).
1. Objectives and Design Rationale
StackMFF V4 seeks to produce an all-in-focus image from a set of registered partially focused images of the same scene, subject to three main design criteria:
- Pixel Fidelity: Preserve the sharpest pixels from each input and avoid "hard selection" artifacts at defocus boundaries.
- Efficiency: Achieve near-real-time inference for large stack sizes $N$, scaling better than previous StackMFF variants.
- Artifact Avoidance: Suppress edge artifacts arising from uncertain focus estimation or abrupt selection.
Earlier versions introduced stepwise improvements: V1 (3D CNN) established feasibility but incurred prohibitive computation, V2 used ULDA-Net for focus regression with limited accuracy, and V3 adopted pixel-wise cross-layer classification (soft selection) with a per-pixel cross-layer self-attention (PCA) block, which reduced FLOPs but still left substantial overhead. V4 introduces:
- Enhanced Intra-Layer Discrimination: MobileNetV3-Large, pretrained on ImageNet, enables sharper focus feature extraction compared to ULDA-Net or PFMLP.
- SACA Block: Aggregates features spatially before attention, allowing efficient and more context-aware cross-layer feature fusion.
- Iterative Refinement: Incorporates a lightweight single-pass refinement loop, facilitating error correction in per-layer focus estimation.
These innovations result in higher SSIM/PSNR and roughly one quarter of the floating point operations (FLOPs) of V3 for the same stack size (Xie et al., 25 Dec 2025).
2. Network Architecture
The StackMFF V4 pipeline processes $N$ registered grayscale or RGB images of shape $H \times W \times C$ (with $C = 1$ or $3$; a fixed resolution is used for training) through four primary stages:
2.1 Intra-Layer Focus Estimation
- Backbone: MobileNetV3-Large encoder (ImageNet-pretrained). An initial $3 \times 3$ convolution ($16$ channels, BatchNorm + h-swish), followed by a sequence of inverted residual blocks, yields a spatially downsampled feature tensor per input.
- Decoder Head: A three-stage upsampling path (bilinear interpolation followed by convolution, BatchNorm, and ReLU) progressively reduces the channel dimension down to the $128$-channel decoder features used by SACA, yielding a single-channel focus score map for each of the $N$ layers.
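For concreteness, the following is a minimal PyTorch sketch of this branch, assuming a torchvision backbone; the decoder channel widths and per-stage upsampling factors are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large, MobileNet_V3_Large_Weights

class FocusBranch(nn.Module):
    """Intra-layer focus branch: pretrained encoder + three-stage decoder."""
    def __init__(self):
        super().__init__()
        backbone = mobilenet_v3_large(weights=MobileNet_V3_Large_Weights.IMAGENET1K_V1)
        self.encoder = backbone.features  # 3x3 stem (16 ch) + inverted residual blocks

        def up_block(c_in, c_out):
            # bilinear upsample followed by conv + BatchNorm + ReLU, as in 2.1
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        # assumed channel schedule ending at the 128-channel decoder features
        self.decoder = nn.Sequential(up_block(960, 512), up_block(512, 256), up_block(256, 128))
        self.score_head = nn.Conv2d(128, 1, kernel_size=1)  # single-channel focus score map

    def forward(self, x):                      # x: (B, 3, H, W), one stack layer
        feats = self.decoder(self.encoder(x))  # (B, 128, H', W')
        return feats, self.score_head(feats)   # decoder features + per-layer score map
```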
2.2 Inter-Layer Aggregation (SACA)
- Spatial Aggregation: Average-pool each decoder feature map ($128$ channels) by a factor of $4$, producing pooled features at one-quarter of the decoder's spatial resolution.
- Tokenization: Flatten spatial dimensions to a sequence of tokens of size $128$ per layer.
- Cross-Layer Transformer: Stack the sequences along a new layer dimension and apply multi-head self-attention across the $N$ layer tokens at each spatial position. This captures cross-image focus correlations with cost linear in the number of pooled spatial positions and quadratic in the stack size $N$.
- Notes: Pre-attention LayerNorm on tokens; no positional encodings are used, preserving layer-order invariance.
- Reshaping: Output tokens are bilinearly upsampled (factor $4$) and fed through a convolution to restore dimensionality, resulting in refined full-resolution features.
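A minimal sketch of the SACA mechanism as described above, assuming spatial sizes divisible by 4; the head count, the output convolution, and the exact reshaping are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SACA(nn.Module):
    """Spatial aggregation, then self-attention across the N stack layers."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # pre-attention LayerNorm on tokens
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, feats):                    # feats: (B, N, C, H, W)
        B, N, C, H, W = feats.shape
        x = F.avg_pool2d(feats.reshape(B * N, C, H, W), kernel_size=4)  # spatial aggregation
        h, w = x.shape[-2:]
        # one C-dim token per (layer, pooled position); attention runs over the
        # N layers at each position, so no positional encoding is needed
        x = x.reshape(B, N, C, h * w).permute(0, 3, 1, 2).reshape(B * h * w, N, C)
        x = self.norm(x)
        x, _ = self.attn(x, x, x)                # cross-layer self-attention
        x = x.reshape(B, h * w, N, C).permute(0, 2, 3, 1).reshape(B * N, C, h, w)
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        return self.proj(x).reshape(B, N, C, H, W)  # refined per-layer features
```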
2.3 Iterative Refinement Loop
- Conduct a single repeat (loop count = 1, based on ablation results): concatenate the SACA output with the original decoder features, apply a lightweight convolution, and re-apply SACA. This iterative step corrects the per-layer focus estimates, attenuating the propagation of initial assignment errors.
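A hedged sketch of this pass, parameterized over any SACA-style module mapping $(B, N, C, H, W)$ features to the same shape; the $1 \times 1$ mixing convolution is an assumed implementation detail.

```python
import torch
import torch.nn as nn

class RefinementLoop(nn.Module):
    """Concatenate SACA output with decoder features, mix, and re-apply SACA."""
    def __init__(self, saca: nn.Module, dim=128, loops=1):
        super().__init__()
        self.saca, self.loops = saca, loops
        self.mix = nn.Conv2d(2 * dim, dim, kernel_size=1)  # lightweight fusion conv

    def forward(self, decoder_feats):                # (B, N, C, H, W)
        x = self.saca(decoder_feats)
        for _ in range(self.loops):                  # loop count = 1 per the ablation
            B, N, C, H, W = x.shape
            cat = torch.cat([x, decoder_feats], dim=2).reshape(B * N, 2 * C, H, W)
            x = self.saca(self.mix(cat).reshape(B, N, C, H, W))
        return x
```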
2.4 Focus Map Generation and Fusion
- Per-layer Focus Scores: Generate the final score maps $s_k(x, y)$ via a final convolution on the refined features.
- Softmax Normalization: $w_k(x, y) = \dfrac{\exp(s_k(x, y))}{\sum_{j=1}^{N} \exp(s_j(x, y))}$
- Deterministic Fusion: $F(x, y) = \sum_{k=1}^{N} w_k(x, y)\, I_k(x, y)$, where $I_k$ denotes the $k$-th source image.
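These two steps reduce to a softmax over the stack dimension followed by a weighted sum; a minimal sketch (tensor shapes are illustrative):

```python
import torch

def fuse_stack(images: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """images: (B, N, C, H, W) registered stack; scores: (B, N, 1, H, W) focus scores."""
    weights = torch.softmax(scores, dim=1)   # w_k(x, y): softmax over the stack dimension
    return (weights * images).sum(dim=1)     # F(x, y) = sum_k w_k(x, y) * I_k(x, y)
```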
Summary Table: Key Architectural Stages
| Stage | Function | Key Innovations |
|---|---|---|
| Intra-layer Estimation | Per-layer feature extraction & focus score computation | MobileNetV3-Large backbone |
| Inter-layer Fusion | Cross-stack focus context encoding | SACA block with cross-layer token attention |
| Iterative Refinement | Focus feature correction via loop | One-pass feedback |
| Fusion Output | Softmax-weighted blend of source pixels | Deterministic, artifact-free |
3. Training Regimen and Datasets
3.1 Data Synthesis
- Datasets: DUTS, NYU Depth V2 (native depth), DIODE, Cityscapes, ADE20K.
- Depth Maps: Extracted via Depth Anything V2, except for NYU Depth V2.
- Stack Generation: Multi-focus image stacks are synthesized by stratifying the depth map into focal planes: each pixel appears sharp in the layer whose focal plane matches its ground-truth depth and defocused elsewhere. Layer dropout (0–50%) simulates missing focal planes.
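As an illustration of this procedure, the sketch below quantizes a normalized depth map into focal planes and applies Gaussian defocus that grows with plane distance; the blur model, the sigma schedule, and the dropout mechanics are assumptions, not the authors' exact pipeline.

```python
import random
import torch
import torchvision.transforms.functional as TF

def synthesize_stack(image: torch.Tensor, depth: torch.Tensor, n_layers: int = 8) -> torch.Tensor:
    """image: (C, H, W) in [0, 1]; depth: (H, W) normalized to [0, 1]."""
    bins = (depth * (n_layers - 1)).round().long()   # per-pixel depth stratum
    stack = []
    for k in range(n_layers):                        # simulate focal plane k
        layer = image.clone()
        for d in range(n_layers):
            if d == k:
                continue                             # stratum k stays sharp at plane k
            sigma = 0.5 * abs(d - k)                 # assumed defocus growth with plane distance
            ksize = 2 * int(3 * sigma) + 1           # odd kernel covering ~3 sigma
            blurred = TF.gaussian_blur(image, [ksize, ksize], [sigma, sigma])
            mask = bins == d
            layer[:, mask] = blurred[:, mask]        # out-of-focus pixels come from the blur
        stack.append(layer)
    p = random.uniform(0.0, 0.5)                     # layer dropout rate in [0, 50%]
    kept = [l for l in stack if random.random() >= p] or stack[:1]
    return torch.stack(kept)                         # (N_kept, C, H, W)
```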
3.2 Loss Functions
- Pixel-wise Classification (Cross-Entropy): $\mathcal{L}_{\mathrm{CE}} = -\frac{1}{HW} \sum_{x, y} \sum_{k=1}^{N} g_k(x, y) \log w_k(x, y)$, where $g_k(x, y)$ denotes the ground-truth layer assignment.
- Reconstruction Loss: a pixel-space distance $\mathcal{L}_{\mathrm{rec}}$ between the fused output $F$ and the ground-truth all-in-focus image.
- Total Loss: $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \, \mathcal{L}_{\mathrm{rec}}$, with $\lambda$ a balancing weight.
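Under the assumption of an $\ell_1$ reconstruction term and a scalar weight $\lambda$ (both illustrative), the objective can be sketched as:

```python
import torch
import torch.nn.functional as F

def total_loss(scores, images, gt_index, gt_aif, lam=1.0):
    """scores: (B, N, 1, H, W) focus logits; images: (B, N, C, H, W);
    gt_index: (B, H, W) ground-truth layer per pixel; gt_aif: (B, C, H, W)."""
    ce = F.cross_entropy(scores.squeeze(2), gt_index)  # pixel-wise layer classification
    weights = torch.softmax(scores, dim=1)             # w_k(x, y)
    fused = (weights * images).sum(dim=1)              # deterministic fusion, as in Section 2.4
    rec = F.l1_loss(fused, gt_aif)                     # assumed l1 reconstruction term
    return ce + lam * rec                              # assumed scalar balance lam
```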
3.3 Optimization and Hyperparameters
- Optimizer: AdamW with weight decay
- Batch size: 12 stacks
- Learning rate: exponentially decayed per epoch from its initial value
- Epochs: 50 (∼8 hours on 2×A6000), early stopping by validation SSIM
- Input resolution: fixed during training
Data augmentation includes random flips and cropping. Only 10% of StackMFF V3's data volume is required due to pretrained feature extraction.
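A minimal setup mirroring this recipe; the learning rate, weight decay, and decay factor below are placeholders, since the exact values are not reproduced here.

```python
import torch

def build_optim(model: torch.nn.Module):
    # AdamW with exponential per-epoch decay, mirroring the stated recipe;
    # lr, weight_decay, and gamma are placeholders, not the paper's exact values
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)
    return opt, sched

# usage: call sched.step() once per epoch after training to apply the decay
```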
4. Empirical Evaluation and Ablation Studies
Systematic ablation studies demonstrate the impact of architectural components:
| Comparison Type | Measured Improvement |
|---|---|
| MobileNetV3 vs PFMLP | SSIM: 0.9733 vs 0.9718; FLOPs: 0.51G vs 1.79G |
| SACA vs PCA | PSNR: +0.08 dB; FLOPs: 0.51G vs 0.93G |
| Iterative Loops (0→1) | SSIM: +0.0045 (0.9688→0.9733), diminishing returns beyond one loop |
| SACA Down-sample 1/4 | Optimal SSIM/PSNR trade-off |
A plausible implication is that SACA substantially enhances computational scalability without diluting fusion quality, and the judicious use of a single refinement loop optimally balances accuracy and efficiency (Xie et al., 25 Dec 2025).
5. Implementation Specifics and Performance
- Framework: PyTorch
- Inference speed: ~0.12 s for a two-image stack on a single A6000 (4× faster than StackMFF V3).
- No post-processing: Fused outputs are supplied directly to Stage 2 of GMFF (IFControlNet) when the full pipeline is used.
- Registration: Inputs must be pre-registered; misalignment can degrade results.
- Open Source: Implementation is available at https://github.com/Xinzhe99/StackMFF-Series
StackMFF V4 is trained on synthetically generated, registered stacks; reproducible experiments are facilitated by explicit data splits, augmentation details, and default hyperparameters.
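To reproduce a wall-clock figure like the ~0.12 s reported above, GPU timing should synchronize around the forward pass; a small illustrative harness follows, where the `model` argument stands in for the actual network rather than the repository's API.

```python
import time
import torch

def time_fusion(model: torch.nn.Module, stack: torch.Tensor) -> float:
    """Wall-clock seconds for one fused forward pass over a registered stack."""
    model.eval()
    with torch.no_grad():
        if stack.is_cuda:
            torch.cuda.synchronize()   # flush pending kernels before starting the clock
        t0 = time.perf_counter()
        _ = model(stack)
        if stack.is_cuda:
            torch.cuda.synchronize()   # wait for the forward pass to actually finish
    return time.perf_counter() - t0
```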
6. Related Work and Integration in GMFF
StackMFF V4 serves as the deterministic backbone in the two-stage GMFF pipeline. The all-in-focus output from StackMFF is subsequently refined via IFControlNet, a latent diffusion model module targeting restoration of content from missing focus planes and suppression of edge artifacts. Each stage operates independently, but the high fidelity of StackMFF V4’s deterministic fusion directly impacts the effectiveness of subsequent generative restoration (Xie et al., 25 Dec 2025).
StackMFF V4’s advancements over StackMFF V1-V3 illustrate the progressive shift from heavy, monolithic architectures (V1), to lightweight regression (V2), to attention-based soft selection (V3), culminating in V4’s synergistic blend of pretrained discriminative encoders, efficient attention, and iterative refinement for artifact-minimized, real-time multi-focus image fusion.