CLIPPan: Unsupervised Pansharpening via CLIP

Updated 21 November 2025
  • CLIPPan is an unsupervised pansharpening framework that uses a vision-language model (CLIP) to semantically align multispectral and panchromatic images.
  • It employs a two-stage adaptation with cross-modal adapters and novel semantic loss functions to fuse images without requiring ground truth.
  • Experimental results on QB and WV3 datasets show significant improvements in spectral and spatial metrics, establishing new state-of-the-art performance.

CLIPPan is an unsupervised pansharpening framework that leverages a vision-language model (CLIP) as a semantic supervisor for full-resolution training, addressing the long-standing domain adaptation challenge in pansharpening neural networks. By introducing a two-stage adaptation and supervision process grounded in cross-modal embeddings and protocol-aligned textual guidance, CLIPPan enables arbitrary deep pansharpening backbones to be trained directly on real high-resolution data without requiring ground truth. The approach improves spectral and spatial fidelity across datasets, consistently establishing new state-of-the-art performance in unsupervised pansharpening (Jian et al., 14 Nov 2025).

1. Architecture and Workflow

CLIPPan operates as a two-stage framework:

  • Stage I (Adaptation): Both remote-sensing images (multispectral (MS), panchromatic (PAN), and high-resolution multispectral (HRMS)) and textual prompts encoding image types and fusion protocols are mapped into a shared semantic embedding space. A frozen CLIP backbone (ViT or ResNet-based) is augmented with six Cross-modal Adapter (CA) modules—three within the visual encoder and three in the text transformer—and a modified first convolution to accommodate domain-specific MS inputs (C=4 or 8). Image Fusion Adapter (IFA) and Text Fusion Adapter (TFA) align MS and PAN representations for both modalities.
  • Stage II (Unsupervised Pansharpening): The adapted CLIP is frozen. Any pansharpening backbone (e.g., PNN, ArbRPN, LFormer, PanMamba) processes paired MS and PAN images, outputting a fused high-resolution MS (HRMS) estimate. The backbone’s output, the input images, and their corresponding text prompts are routed through the adapted visual and textual encoders, producing embeddings used to compute a novel semantic alignment loss together with multiple unsupervised visual losses; a data-flow sketch follows this list.
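The following minimal sketch illustrates the Stage II data flow under stated assumptions: stage2_step, the adapted_clip.encode_image/encode_text interface, and loss_fns are illustrative names, not the paper’s released API.

```python
import torch


def stage2_step(backbone, adapted_clip, ms, pan, prompts, loss_fns, lambda_d):
    """One Stage II training step (sketch; interfaces are assumptions).

    ms:  (B, C, h, w) low-resolution multispectral patch (C = 4 or 8)
    pan: (B, 1, H, W) panchromatic patch at full resolution
    """
    # Any pansharpening backbone produces the fused HRMS estimate.
    hrms_hat = backbone(ms, pan)

    # Modality / protocol prompts go through the adapted, frozen text encoder.
    with torch.no_grad():
        f_ms_txt, f_pan_txt, f_wald_txt = adapted_clip.encode_text(prompts)

    # Inputs and fused output go through the adapted visual encoder;
    # gradients flow through f_out_img back into the backbone.
    f_ms_img = adapted_clip.encode_image(ms)
    f_pan_img = adapted_clip.encode_image(pan)
    f_out_img = adapted_clip.encode_image(hrms_hat)

    # Unsupervised visual losses plus the directional semantic loss (Sec. 3).
    loss_vis = loss_fns.visual(hrms_hat, ms, pan)
    loss_sem = loss_fns.semantic(f_out_img, f_ms_img, f_pan_img,
                                 f_ms_txt, f_pan_txt, f_wald_txt)
    return loss_vis + lambda_d * loss_sem
```

Because the adapted CLIP stays frozen while gradients still pass through its visual encoder, the semantic loss can steer any backbone without labels.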

2. CLIP Adaptation Pipeline

The framework’s adaptation phase modifies CLIP for remote sensing:

  • Visual-side modifications: CLIP’s initial embedding layer is replaced by a 3×3 convolution to ingest multi-band inputs. Three CA modules are inserted after transformer blocks (ViT) or residual stages (ResNet), injecting pansharpening-related bias. The IFA merges MS and PAN embeddings for image fusion in the latent space.
  • Text-side modifications: The text encoder is extended with three CA modules and a TFA module to fuse MS and PAN textual features. Text prompts describe image modalities (e.g., “MS image,” “PAN image”) and fusion protocols (notably Wald’s or Khan’s rule descriptions).

Adaptation optimizes three losses:

  • Inter-modal contrastive loss (InterMCL): binds each image modality to its prompt, \mathcal{L}_{\text{inter}} = \frac{1}{3} \sum_{M_1 \ne M_2} \mathcal{L}_{\text{align}}.
  • Intra-modal contrastive loss (IntraMCL): enforces content diversity across patches, \mathcal{L}_{\text{intra}} = -\frac{1}{3N} \sum \log \frac{\exp(\langle F^{I(i)}_{M_1}, F^{I(i)}_{M_2}\rangle/\tau_i)}{\sum_{k} \exp(\langle F^{I(i)}_{M_1}, F^{I(k)}_{M_1}\rangle/\tau_i)}.
  • Fusion-aware alignment loss: aligns the IFA/TFA fused embeddings with the protocol-driven targets, \mathcal{L}_{\text{fusion}} = \|F^{I}_{\text{fuse}} - F^{I}_{\text{HRMS}}\|_1 + \|F^{T}_{\text{fuse}} - F^{T}_{\text{wald}}\|_1.

Only the adapter modules and the initial convolution are fine-tuned.
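A minimal sketch of this parameter-efficient adaptation, assuming generic module handles (clip_model, adapter_modules, first_conv) and a mean-reduced L1 in place of the \|\cdot\|_1 terms; the InterMCL/IntraMCL contrastive terms are omitted for brevity, and the learning rate default simply reuses the reported Adam setting.

```python
import itertools
import torch
import torch.nn.functional as F


def build_stage1_optimizer(clip_model, adapter_modules, first_conv, lr=3e-3):
    """Freeze the pretrained CLIP weights; train only the CA/IFA/TFA adapters
    and the replaced first convolution (module handles are illustrative)."""
    for p in clip_model.parameters():
        p.requires_grad_(False)
    trainable = list(itertools.chain(adapter_modules.parameters(),
                                     first_conv.parameters()))
    for p in trainable:
        p.requires_grad_(True)  # re-enable in case the adapters live inside clip_model
    return torch.optim.Adam(trainable, lr=lr)


def fusion_alignment_loss(f_fuse_img, f_hrms_img, f_fuse_txt, f_wald_txt):
    """Fusion-aware alignment: ||F^I_fuse - F^I_HRMS||_1 + ||F^T_fuse - F^T_wald||_1
    (mean-reduced L1 stands in for the norm)."""
    return F.l1_loss(f_fuse_img, f_hrms_img) + F.l1_loss(f_fuse_txt, f_wald_txt)
```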

3. Semantic Language-Constraint Loss and Auxiliary Losses

The core innovation in Stage II is the directional semantic loss, which aligns pansharpening transformations in image space with their corresponding protocol-derived expectations in text space. Formally, directional shifts \Delta V are defined in both modalities:

\Delta V^{I}_{\text{MS}} = F^I_{\text{out}} - F^I_{\text{MS}}

\Delta V^{T}_{\text{MS}} = F^T_{\text{wald}} - F^T_{\text{MS}}

with the PAN-side shifts \Delta V^{I}_{\text{PAN}} and \Delta V^{T}_{\text{PAN}} defined analogously.

Semantic loss is then

\mathcal{L}_d = 1 - \frac{1}{2}\left[ \langle \Delta V^I_{\text{MS}}, \Delta V^T_{\text{MS}}\rangle + \langle \Delta V^I_{\text{PAN}}, \Delta V^T_{\text{PAN}}\rangle \right]

This loss encourages the fused output to transition in embedding space according to the protocol-encoded text description (“apply Wald’s fusion rule”), achieving protocol-compliant fusion without paired ground truth.
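A sketch of this directional loss, assuming the inner products \langle\cdot,\cdot\rangle are cosine similarities between the embedding shifts (a common convention for CLIP-guided directional losses); tensor names mirror the symbols above.

```python
import torch
import torch.nn.functional as F


def directional_semantic_loss(f_out, f_ms, f_pan, f_ms_txt, f_pan_txt, f_wald_txt):
    """L_d sketch: the image-space shift from each input embedding to the fused
    output should point in the same direction as the text-space shift from that
    input's prompt to the Wald-protocol prompt."""
    dv_ms_img = f_out - f_ms             # Delta V^I_MS
    dv_pan_img = f_out - f_pan           # Delta V^I_PAN
    dv_ms_txt = f_wald_txt - f_ms_txt    # Delta V^T_MS
    dv_pan_txt = f_wald_txt - f_pan_txt  # Delta V^T_PAN

    # <.,.> interpreted here as cosine similarity (assumption).
    sim_ms = F.cosine_similarity(dv_ms_img, dv_ms_txt, dim=-1)
    sim_pan = F.cosine_similarity(dv_pan_img, dv_pan_txt, dim=-1)
    return 1.0 - 0.5 * (sim_ms + sim_pan).mean()
```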

Auxiliary unsupervised losses are combined as \mathcal{L}_{s2} = \mathcal{L}_{\text{spec}} + \mathcal{L}_{\text{spat}} + \mathcal{L}_{\text{QNR}} + \mathcal{L}_{\text{ship}}, where:

  • \mathcal{L}_{\text{spec}} enforces spectral preservation after bicubic downsampling.
  • \mathcal{L}_{\text{spat}} enforces spatial fidelity via a 1×1 convolution and SSIM against the PAN image.
  • \mathcal{L}_{\text{QNR}} applies QNR-based distortion measures (D_\lambda, D_s).
  • \mathcal{L}_{\text{ship}} adds pseudo-supervision from a SHIP network trained at reduced resolution.

The total pansharpening loss is \mathcal{L}_{\text{total}} = \mathcal{L}_{s2} + \lambda_d \mathcal{L}_d.
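The spectral term and the overall composition might look as follows; the L1 comparison after bicubic downsampling, the scale factor of 4, and the default \lambda_d are assumptions for illustration, and the spatial, QNR, and SHIP terms are assumed to be computed elsewhere and passed in.

```python
import torch
import torch.nn.functional as F


def spectral_loss(hrms_hat, ms, scale=4):
    """L_spec sketch: bicubically downsample the fused output back to the MS
    resolution and compare it to the input MS (L1 here is an assumption)."""
    down = F.interpolate(hrms_hat, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)
    return F.l1_loss(down, ms)


def total_loss(l_spec, l_spat, l_qnr, l_ship, l_d, lambda_d=1.0):
    """L_total = (L_spec + L_spat + L_QNR + L_ship) + lambda_d * L_d;
    lambda_d = 1.0 is a placeholder, not a value reported in the paper."""
    l_s2 = l_spec + l_spat + l_qnr + l_ship
    return l_s2 + lambda_d * l_d
```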

4. Datasets, Training Protocols, and Implementation

CLIPPan is validated on QuickBird (QB) and WorldView-3 (WV3) datasets:

  • QB: 4 MS bands; PAN GSD 60 cm; MS GSD 2.4 m; full-resolution training plus Wald's reduced-resolution protocol.
  • WV3: 8 MS bands; PAN GSD 30 cm; MS GSD 1.2 m; full-resolution training plus Wald's reduced-resolution protocol.

Training details:

  • Optimizer: Adam with learning rate 3\times 10^{-3}, batch size 32, and 1000 training iterations (see the sketch after this list).
  • The same settings apply to all pansharpening backbones.
  • Full-resolution training removes the “domain gap” of supervised protocols (train on simulated, test on real) by leveraging direct unsupervised supervision at operational GSDs.
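A training-loop sketch using these reported settings; stage2_step is the Stage II sketch above, and loader, loss_fns, and the lambda_d default are assumed placeholders.

```python
import itertools
import torch


def train_stage2(backbone, adapted_clip, loader, loss_fns,
                 lambda_d=1.0, lr=3e-3, num_iters=1000, device="cuda"):
    """Stage II loop sketch: Adam, lr 3e-3, 1000 iterations; the batch size of
    32 is assumed to be set inside the loader."""
    backbone.to(device).train()
    opt = torch.optim.Adam(backbone.parameters(), lr=lr)
    for ms, pan, prompts in itertools.islice(itertools.cycle(loader), num_iters):
        loss = stage2_step(backbone, adapted_clip,
                           ms.to(device), pan.to(device), prompts,
                           loss_fns, lambda_d)
        opt.zero_grad()
        loss.backward()
        opt.step()
```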

5. Experimental Validation and Ablation Analyses

CLIPPan-integrated (“-C”) variants consistently demonstrate improved fusion performance:

  • On full-resolution QB, ArbRPN-C reduces D_\lambda from 0.0140 to 0.0030 (–79%) and improves QNR by +0.011.
  • On WV3, LFormer-C lowers D_s by roughly 30% and increases QNR from 0.9227 to 0.9572.
  • Reduced-resolution experiments confirm across-the-board improvements in MPSNR, ERGAS, SAM, and Q2n, indicating benefits for both unsupervised and supervised settings.

Qualitatively, CLIP-fused outputs enhance spatial detail (sharper edges, building delineations) and spectral integrity (reduced error in color maps).

Ablations show:

  • Cumulative gains from the successive addition of \mathcal{L}_{\text{QNR}}, \mathcal{L}_{\text{ship}}, and \mathcal{L}_d (e.g., MPSNR up to 34.72 dB, ERGAS down to 4.49, SAM to 5.54°, and Q2n up to 0.7986).
  • Each Stage I adaptation loss (\mathcal{L}_{\text{intra}}, \mathcal{L}_{\text{inter}}, \mathcal{L}_{\text{fusion}}) contributes non-redundant gains.
  • Learnable 1×1 MS compression outperforms PCA, RGB-only, and GBNIR alternatives.
  • Protocol text prompt selection is critical: Wald’s rule outperforms alternatives, while generic “fused image” or noise prompts underperform.

6. Limitations and Future Directions

CLIPPan’s current reliance on hand-designed textual prompts motivates future research in automatic or learnable prompt engineering. While adapter-based fine-tuning is relatively lightweight, further reduction in adaptation overhead (e.g., via LoRA) is identified as a direction. Validation is currently limited to QB and WV3; extension to hyperspectral and broader sensor platforms is noted as necessary.

The semantic loss assumes monotonic “fusion” vectors in CLIP space; richer geometric or region-level semantic constraints (e.g., patch-based Semantic Tile Loss) may provide additional benefits. Alternative supervisory language–vision models (ALIGN, Florence) or foundation multimodal models (DALLE, Stable Diffusion) could furnish more expressive guidance.

7. Significance and Impact

CLIPPan demonstrates that large vision-language models can supply effective, protocol-aligned semantic supervision for pansharpening, learning directly from real full-resolution imagery without ground truth (Jian et al., 14 Nov 2025). This resolves longstanding domain adaptation difficulties and enables pansharpening networks to achieve state-of-the-art fidelity under unsupervised, full-resolution conditions, thereby advancing the intersection of remote-sensing fusion and multimodal deep learning.
