Explicit Prompt-Encoding in CNNs/U-Nets
- Explicit prompt-encoding is a technique that integrates structured auxiliary cues (e.g., masks, vectors, text embeddings) directly into CNN/U-Net models to guide feature extraction and segmentation.
- It employs a range of methods—from simple binary mask channels to advanced transformer fusion of learnable prompts—to improve accuracy and reduce boundary ambiguity.
- Empirical studies demonstrate measurable gains in segmentation performance, faster convergence, and minimal computational overhead using these prompt-based strategies.
Explicit prompt-encoding in convolutional neural networks (CNNs) and U-Net architectures denotes architectural strategies that inject structured, interpretable auxiliary information—often in the form of masks, vectors, language embeddings, or domain prompts—directly into the computational graph as explicit input channels or feature bias layers. The approach augments or guides feature extraction, boundary handling, segmentation specificity, and uncertainty modeling beyond what is possible with conventional pixelwise inputs, zero padding, or class-label supervision. This paradigm spans from minimal binary channel concatenation (for structure-aware feature learning) to large-scale text-driven prompt generation for multi-domain and multi-task adaptation, demonstrating measurable improvements in segmentation, classification, image translation, and robustness to missing data.
1. Architectural Patterns of Explicit Prompt-Encoding
Explicit prompt-encoding mechanisms have evolved from simple indicator variables to complex multimodal fusion:
- Binary/Padding Masks as Channels: Minimal explicit prompt approaches, such as PadChannel, augment the input tensor with a binary mask flagging spatially valid pixels or boundaries. The concatenated tensor allows convolutional filters to selectively modulate responses near image edges, improving feature localization at boundaries. In this scenario, only the initial convolution layer is modified, from $C$ to $C+1$ input channels; subsequent feature propagation is unaltered (Kim, 2023). A minimal code sketch follows this list.
- Learnable Prompt Vectors and Feature Injection: More advanced U-Net-based models (e.g., multi-rater and task-prompted networks) maintain sets of learnable prompt vectors at multiple encoder/decoder stages, with each vector corresponding to a rater, domain, or consensus case. Prompts are injected via concatenation with feature tokens and integrated using Transformer-based implantable blocks, enabling stage-wise, contextual modulation of information flow (Wang et al., 11 Apr 2024).
- Textual Embeddings and Natural-Language Guidance: High-level domain or task descriptors, encoded via CLIP or BERT models, generate dense text prompts aligned to visual feature spaces. These embeddings are projected via learned scale and bias layers, then injected at bottleneck or multi-scale locations through FiLM modulation or cross-attention, providing semantically precise, context-sensitive guidance for segmentation and translation tasks (Wu et al., 25 May 2025, Wang et al., 18 Nov 2024).
- Visual/Numerical Prompts and Affine Modulation: In hybrid/state-space or dual-path networks, the input image itself or meta-information is distilled into scaling and bias vectors, which modulate feature activations at key points. This allows for dynamic re-weighting of global context extraction paths (e.g., Mamba or KAN layers), blending local detail with global structure in segmentation (Zhang et al., 25 Mar 2025).
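As a concrete illustration of the mask-channel pattern above, the following PyTorch sketch shows a first-layer stem that concatenates a padding-indicator channel before convolution. It is a minimal sketch of the idea, not the published PadChannel implementation; the module name, kernel size, and padding width are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskChannelStem(nn.Module):
    """First-layer stem that consumes an extra binary validity-mask channel.

    The mask flags spatially valid (non-padded) pixels, so filters can
    modulate their responses near image borders. Only this layer changes;
    downstream layers are untouched.
    """

    def __init__(self, in_channels: int = 3, out_channels: int = 64, pad: int = 3):
        super().__init__()
        self.pad = pad
        # in_channels + 1: the only architectural change is to the first conv.
        self.conv = nn.Conv2d(in_channels + 1, out_channels,
                              kernel_size=7, stride=2, padding=0, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Build a mask of ones over the valid image area, zero-pad both the
        # image and the mask, then concatenate the mask as an extra channel.
        b, _, h, w = x.shape
        mask = torch.ones(b, 1, h, w, device=x.device, dtype=x.dtype)
        x = F.pad(x, [self.pad] * 4)        # explicit zero padding
        mask = F.pad(mask, [self.pad] * 4)  # padded region -> 0 in the mask
        return self.conv(torch.cat([x, mask], dim=1))

# Usage: replaces a ResNet-style 7x7 stem; the rest of the network is unchanged.
stem = MaskChannelStem()
features = stem(torch.randn(2, 3, 224, 224))  # -> (2, 64, 112, 112)
```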
2. Mathematical Formulation and Encoding Mechanisms
Explicit prompt encoding is mathematically formalized as additional terms or auxiliary inputs in the feature extraction pipeline:
- Channel Concatenation with Indicator Masks: The first convolutional layer computes its pre-activation on the mask-augmented input,
  $$z = W \ast [x;\, m] + b,$$
  where $x$ is the input image, $m$ is the binary indicator mask concatenated along the channel dimension, and $W$, $b$ are the weights and bias of the channel-expanded first layer (Kim, 2023).
- Prompt-Guided Fusion via Transformer Blocks: For each feature map $F_b$ at block $b$, concatenate with the projected prompt $P_b$, then apply multi-head attention and feed-forward processing,
  $$\tilde{F}_b = \mathrm{FFN}\big(\mathrm{MHA}([P_b;\, F_b])\big),$$
  after which the prompt tokens are discarded and the spatial feature map is restored (Wang et al., 11 Apr 2024); a code sketch follows this list.
- Text Prompt Alignment and FiLM Modulation (CLIP-aligned):
  $$\hat{F} = \gamma(t) \odot F + \beta(t),$$
  where $t$ is the projected text embedding and $\gamma(\cdot)$, $\beta(\cdot)$ are learned scale and shift projections. This fuses projected task textual embeddings with visual bottleneck features (Wu et al., 25 May 2025); see the FiLM sketch after this list.
- Cross-Attention for Temporal or Contextual Prompts: For a text embedding $T$ and an image feature $F$, the fusion is
  $$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V, \qquad Q = F W_Q,\; K = T W_K,\; V = T W_V,$$
  so that image queries attend to prompt-derived keys and values (Wang et al., 18 Nov 2024).
- Visual Prompting of State Space Models:
  $$\hat{F} = \gamma \odot F + \beta,$$
  where $(\gamma, \beta)$ are dynamic affine parameters predicted by feeding the input image through a lightweight CNN (Zhang et al., 25 Mar 2025).
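The prompt-guided fusion pattern can be sketched as follows in PyTorch: learnable prompt vectors are prepended to flattened feature tokens and processed by a standard transformer encoder layer. Class and parameter names are illustrative and not taken from the cited multi-rater network.

```python
import torch
import torch.nn as nn

class PromptFusionBlock(nn.Module):
    """Fuse learnable prompt vectors with CNN feature tokens via self-attention.

    Feature maps are flattened into tokens, learnable prompts (one per
    rater/task) are prepended, and a transformer layer lets the prompts
    modulate the features contextually.
    """

    def __init__(self, channels: int, num_prompts: int = 4, num_heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, channels) * 0.02)
        self.block = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)  # (B, P, C)
        fused = self.block(torch.cat([prompts, tokens], dim=1))
        # Discard the prompt tokens and restore the spatial feature map.
        fused_tokens = fused[:, self.prompts.shape[0]:, :]
        return fused_tokens.transpose(1, 2).reshape(b, c, h, w)

block = PromptFusionBlock(channels=256)
out = block(torch.randn(2, 256, 16, 16))  # same shape as the input features
```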
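Similarly, a minimal sketch of the FiLM-style modulation above: a text embedding (e.g., from a frozen CLIP text encoder, here just a fixed-size vector) is projected to per-channel scale and shift vectors applied to the bottleneck features. The same affine pattern covers the visual-prompt case when the scale and shift are predicted from the image by a small CNN instead; dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class TextFiLM(nn.Module):
    """FiLM-style modulation of visual features by a projected text prompt.

    Implements F_hat = gamma(t) * F + beta(t), where t is a prompt embedding.
    """

    def __init__(self, text_dim: int = 512, feat_channels: int = 256):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, feat_channels)
        self.to_beta = nn.Linear(text_dim, feat_channels)

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Per-channel scale and shift, broadcast over the spatial dimensions.
        gamma = self.to_gamma(text_emb).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(text_emb).unsqueeze(-1).unsqueeze(-1)
        return gamma * feat + beta

film = TextFiLM()
bottleneck = torch.randn(2, 256, 16, 16)
text_emb = torch.randn(2, 512)          # placeholder for a CLIP text embedding
modulated = film(bottleneck, text_emb)  # same shape as `bottleneck`
```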
3. Domains and Use-Cases of Prompt Encoding
Explicit prompt encoding has been deployed in diverse settings:
- Boundary and Padding Awareness: PadChannel exemplifies minimal but effective prompt encoding by flagging zero-padded regions, reducing boundary ambiguity, improving feature learning consistency, and decreasing variance in ImageNet models at negligible computational cost (Kim, 2023).
- Ambiguous and Multi-Annotator Medical Segmentation: Multi-rater approaches inject per-rater and consensus learnable prompts, allowing a U-Net to model both agreement and disagreement regions, with fine-tuning requiring only ∼0.3% of backbone parameters, thus being computationally efficient (Wang et al., 11 Apr 2024).
- Domain, Task, and Language Specification: Textual, semantic, or temporal prompts enable generalization to new segmentation or translation tasks, especially when anatomical priors or scan order information is relevant. CDPDNet and TP-UNet exemplify this, using text embeddings to condition on organ identity, scan phase, or slice order, yielding measurable gains in multi-organ and multi-task generalization (Wu et al., 25 May 2025, Wang et al., 18 Nov 2024).
- User-Guided or Interactive Segmentation: Architectures support concatenation of explicit scribbles, bounding-box masks, or spatial priors as input channels, merging high-level user intent or anatomical hints with visual features.
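A minimal sketch of such channel-level prompt concatenation, assuming a user-supplied bounding box rasterized into a binary mask; scribbles or distance maps could be encoded the same way. The function name and box format are hypothetical.

```python
import torch

def add_box_prompt_channel(image: torch.Tensor, box: tuple) -> torch.Tensor:
    """Concatenate a user-supplied bounding-box prompt as an extra input channel.

    `box` is (x0, y0, x1, y1) in pixel coordinates; the rasterized prompt is
    appended along the channel dimension.
    """
    b, _, h, w = image.shape
    x0, y0, x1, y1 = box
    prompt = torch.zeros(b, 1, h, w, device=image.device, dtype=image.dtype)
    prompt[:, :, y0:y1, x0:x1] = 1.0
    return torch.cat([image, prompt], dim=1)

# The downstream U-Net simply expects one more input channel.
augmented = add_box_prompt_channel(torch.randn(1, 3, 256, 256), (64, 64, 192, 192))
```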
4. Empirical Results and Ablation Studies
Empirical studies consistently show that explicit prompt encoding yields measurable improvements:
- Accuracy Gains: Segmentation models with prompt encoding demonstrate up to +4.4% Dice improvement over baselines without such mechanisms. CDPDNet achieves state-of-the-art performance across 11 medical datasets, and prompt-guided Mamba U-Nets show 1–2% Dice score increases, gains comparable to several years of architectural progression (Wu et al., 25 May 2025, Zhang et al., 25 Mar 2025).
- Variance and Convergence: Reduction of run-to-run accuracy variance (–0.013 to –0.066%) and faster convergence (ResNet50+PadChannel reaches key accuracies three epochs earlier) are observed in PadChannel (Kim, 2023).
- Ablation Outcomes: Removing the prompt extraction/fusion capsules degrades PSNR and SSIM by 2–3 dB and 0.03–0.08, respectively, in cross-modal translation (Chen et al., 2023). Prompt tuning in PU-Net outperforms naive head-only fine-tuning and other multi-labeling strategies (Wang et al., 11 Apr 2024).
- Parameter and Compute Overhead: Most explicit prompt-encoding layers involve minimal parameter increases (e.g., PadChannel <0.03%) or permit freezing of core weights, with only prompt vectors being updated at test-time or during rapid adaptation (Kim, 2023, Wang et al., 11 Apr 2024).
| Method | Prompt Type | Accuracy Effect | Compute Overhead |
|---|---|---|---|
| PadChannel | Binary mask channel | +0–0.21% (ImageNet top-1) | <0.03% params |
| PU-Net | Learnable vectors | +0–1.1% Dice (RIGA) | 0.3% updated params |
| CDPDNet | CLIP text/FiLM | State-of-the-art mDice (11 datasets) | Slight (proj. heads) |
| PGM-UNet | Visual prompt scaling | +1–2% Dice (ISIC, DIAS) | +2 CNN heads/branch |
5. Extensions and Open Challenges
Several generalizations and future avenues exist:
- Arbitrary Prompt Types: The strategy is extensible to arbitrary mask types, spatial priors, scribbles, or multi-channel guidance. The addition of separate indicator channels or continuous-valued priors is straightforward.
- Multimodal and Multitask Fusion: Language, visual, and numerical prompt modalities can be injected simultaneously (e.g., textual task description plus anatomical centerline masks).
- Prompt Propagation vs. Early Fusion: The most significant effect is typically from prompt injection at the first encoder block; deeper prompt propagation yields diminishing returns (Kim, 2023).
- Task Scalability and Routing: Explicit weight routing based on prompt embeddings (e.g., intent/target/modality) enables modular, scalable multitask frameworks, as evidenced by LLM-driven routing in MedPrompt for both segmentation and classification (Sobhan et al., 26 Jun 2025); a generic routing sketch follows this list.
- Hierarchical and Adaptive Prompting: Dynamic adjustment of prompt weighting or selection (as in Prompt Extraction Block) allows real-time or context-sensitive adaptation in translation and segmentation.
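The routing idea can be sketched generically as a gating network over task heads driven by a prompt embedding. This is an illustrative pattern under assumed names and dimensions, not the MedPrompt implementation.

```python
import torch
import torch.nn as nn

class PromptRouter(nn.Module):
    """Route features to task-specific heads using a prompt embedding.

    A gating MLP maps the prompt embedding to soft weights over task heads,
    and the head outputs are blended accordingly.
    """

    def __init__(self, prompt_dim: int, feat_channels: int, num_heads: int = 3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(prompt_dim, num_heads), nn.Softmax(dim=-1))
        self.heads = nn.ModuleList(
            [nn.Conv2d(feat_channels, feat_channels, kernel_size=1)
             for _ in range(num_heads)])

    def forward(self, feat: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
        weights = self.gate(prompt_emb)                               # (B, H)
        outputs = torch.stack([head(feat) for head in self.heads], dim=1)
        # Weighted blend of the per-head outputs, broadcast over C, H, W.
        return (weights[:, :, None, None, None] * outputs).sum(dim=1)

router = PromptRouter(prompt_dim=512, feat_channels=64)
routed = router(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
```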
6. Limitations and Broader Impact
While explicit prompt-encoding demonstrates broad utility, several limitations persist:
- Dependency on Prompt Validity: Encoding gains depend on high-fidelity prompt definition; for example, PadChannel requires strict zero-padding—other padding types nullify the mask's utility (Kim, 2023).
- Marginal Gains in Some Settings: In shallow networks or those less reliant on context, prompt injection can yield negligible improvement (e.g., VGG11 in PadChannel saw 0.000% change).
- Computational and Memory Cost: While most architectures minimize overhead, multi-prompt or multimodal fusion (especially with large text encoders or Transformer modules) can non-trivially impact inference latency.
- Prompt Design and Tuning: Selection and encoding of auxiliary guidance (e.g., language templates, spatial priors) require domain knowledge and careful calibration to maximize effect.
A plausible implication is that as architectures integrate more diverse and sophisticated prompts, the boundary between purely visual CNN/U-Net pipelines and hybrid, context-aware, language-driven networks will further blur, enabling more robust and interactive medical imaging applications and generative vision tasks.
Explicit prompt-encoding thus provides a rigorously validated, modular, and extensible mechanism for integrating boundary markers, human guidance, consensus vectors, semantic cues, and temporal priors directly into CNN/U-Net architectures, measurably enhancing segmentation, translation, and classification while enabling scalable, flexible adaptation to new domains and ambiguous labeling scenarios (Kim, 2023, Wang et al., 11 Apr 2024, Wu et al., 25 May 2025, Wang et al., 18 Nov 2024, Zhang et al., 25 Mar 2025, Chen et al., 2023, Sobhan et al., 26 Jun 2025).