Bilateral Prompt Optimization (BiPrompt)
- BiPrompt is a framework that jointly optimizes visual and textual prompts to mitigate spurious correlations and enhance causal, domain-invariant feature extraction.
- It employs structured erasure, balanced prompt normalization, and dual-prompt collaboration to reduce linguistic and visual biases in model predictions.
- Empirical evaluations demonstrate significant improvements in accuracy and robustness, achieving up to 17-point gains in vision-language and language model benchmarks.
A Bilateral Prompt Optimization Framework (BiPrompt) refers to a class of techniques that perform joint, modality-specific adaptation or optimization of prompts on both sides of a model's conditioning: the input ("visual" or "system") prompt and the output ("textual" or "user") prompt. This bilateral strategy addresses key limitations of unilateral prompt techniques that restrict optimization to one modality. BiPrompt aims to maximize robustness, generalization, and causal alignment in multimodal models and LLMs through parallel and often synergistic treatment of both prompt types, especially in test-time or lightweight adaptation scenarios (Gupta et al., 5 Jan 2026, Zhang et al., 21 Jul 2025, Li et al., 17 Mar 2025, Xu et al., 25 Mar 2025).
1. Motivation and Conceptual Foundation
Bilateral Prompt Optimization arose in response to incomplete or suboptimal adaptation in both vision-language models (VLMs) and large language models (LLMs) that rely on prompt-based conditioning. Unimodal debiasing or prompt-tuning approaches, which optimize the visual or textual (system or user) prompt alone, fail to eliminate spurious correlations and may induce unstable adaptation under distribution shift. In VLMs such as CLIP, visual-only debiasing (e.g., masking, SEraser) can reduce reliance on background cues but leaves textual shortcuts unaddressed; conversely, text-only approaches ignore the entangled visual context. In LLMs, tuning the system or user prompt separately yields suboptimal behavioral alignment because the two are mutually interdependent (Gupta et al., 5 Jan 2026, Zhang et al., 21 Jul 2025).
The core goal of BiPrompt is thus joint mitigation of non-causal (spurious) feature reliance in both modalities at once, steering model predictions toward causal, domain-invariant features. In vision-language settings this is quantified by minimizing the conditional mutual information $I(\hat{Y}; S \mid C)$, where $C$ is a vector of causal features (typically foreground semantics), $S$ denotes spurious information (background, texture, context), and $\hat{Y}$ is the model's predicted label (Gupta et al., 5 Jan 2026).
2. Mathematical Structure and Optimization Objectives
A general mathematical framework for BiPrompt can be established as follows:
Let $x$ denote an image input and $t_c$ a class prompt for class $c$. The vision-language model decomposes into a vision encoder $f_v$ and a text encoder $f_t$, with the prediction computed as

$$p(y = c \mid x) \;=\; \frac{\exp\!\big(\langle f_v(x),\, \tilde{f}_t(t_c)\rangle / \tau\big)}{\sum_{c'} \exp\!\big(\langle f_v(x),\, \tilde{f}_t(t_{c'})\rangle / \tau\big)},$$

where $\tau$ is a learnable temperature and $\tilde{f}_t(t_c)$ incorporates a debiased text embedding. The BiPrompt optimization problem is

$$\min_{\theta_v,\, \theta_t}\; \mathcal{L}_{\text{total}}(\theta_v, \theta_t),$$

where $\theta_v$ and $\theta_t$ are the visual and textual prompt parameters, subject to architecture-specific objectives that typically combine:
- Structured erasure loss (visual prompt debiasing): forces consistency of predictions on causal vs. spurious regions.
- Balanced prompt normalization (text prompt debiasing): re-centers text representations to an isotropic space, suppressing linguistic shortcuts.
- Entropy regularization: prevents probability collapse.
- Optional cross-entropy on available labels.
An instantiation of the total loss function is

$$\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{CE}} \;+\; \lambda_{\text{er}}\,\mathcal{L}_{\text{erase}} \;+\; \lambda_{\text{ent}}\,\mathcal{L}_{\text{ent}},$$

with hyperparameters $\lambda_{\text{er}}$ and $\lambda_{\text{ent}}$ tuning the strength of erasure and regularization (Gupta et al., 5 Jan 2026); a minimal sketch of this combination is given below.
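To make the combination concrete, the following is a minimal PyTorch sketch of a test-time loss along these lines. It is an illustration rather than the authors' implementation: the specific consistency and orthogonality terms inside the erasure loss, the anti-collapse entropy term, and the names `lambda_erase` and `lambda_ent` are assumptions chosen for clarity.

```python
import torch
import torch.nn.functional as F

def biprompt_loss(logits_full, logits_fg, logits_bg,
                  labels=None, lambda_erase=1.0, lambda_ent=0.1):
    """Illustrative total loss: structured-erasure term + entropy regularization
    (+ optional cross-entropy when labels are available)."""
    p_full = logits_full.softmax(dim=-1)
    p_fg = logits_fg.softmax(dim=-1)
    p_bg = logits_bg.softmax(dim=-1)

    # Structured erasure: keep the foreground view consistent with the full view,
    # and penalize agreement between the full view and the background-only view.
    consistency = F.kl_div(p_fg.clamp_min(1e-8).log(), p_full, reduction="batchmean")
    orthogonality = (p_bg * p_full).sum(dim=-1).mean()
    loss_erase = consistency + orthogonality

    # Entropy of the batch-averaged prediction; maximizing it (minus sign below)
    # discourages collapse onto a single class.
    p_mean = p_full.mean(dim=0)
    marginal_entropy = -(p_mean * p_mean.clamp_min(1e-8).log()).sum()

    loss = lambda_erase * loss_erase - lambda_ent * marginal_entropy
    if labels is not None:
        loss = loss + F.cross_entropy(logits_full, labels)
    return loss

# Toy usage: random logits for a batch of 4 images over 10 classes
# (full image, foreground view, background view).
logits = [torch.randn(4, 10) for _ in range(3)]
print(biprompt_loss(*logits))
```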
For language-only (LLM) settings, the optimization objective maximizes expected response quality as measured by a model-as-judge,

$$\max_{s,\, g}\; \mathbb{E}_{u}\Big[\, J\big(\mathrm{LLM}(s,\; u \oplus g(u))\big) \Big],$$

where $s$ is the system prompt, $u$ the user prompt, $g(u)$ a complement generated by the generator $g$, and $J$ a judgement score (Zhang et al., 21 Jul 2025).
3. Core Algorithmic Modules
Key modules of BiPrompt frameworks, as exemplified by recent works, include:
Visual/Textual Bimodal Debiasing (VLMs)
- Structured Attention-Guided Erasure: Computes Grad-CAM-based attention maps to isolate image regions. Foreground ($x_{\text{fg}}$) and background ($x_{\text{bg}}$) views are constructed, enabling an erasure loss $\mathcal{L}_{\text{erase}}$ that enforces prediction invariance to causally relevant (foreground) features and orthogonality to spurious (background) features; a sketch of the view construction appears after this list.
- Balanced Prompt Normalization: Static class prompts are replaced by gated combinations of the class template embedding $t_c$ and the embedding-set mean $\bar{t}$,

$$\tilde{t}_c \;=\; \alpha_c\, t_c + (1 - \alpha_c)\,\bar{t},$$

with the gate $\alpha_c$ learned per class (Gupta et al., 5 Jan 2026).
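The two debiasing steps can be sketched as follows. This is a minimal illustration under assumed tensor shapes: the hard attention threshold, the sigmoid gate parameterization, and the names `erasure_views` / `BalancedPromptNorm` are choices made here, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def erasure_views(image, attn_map, thresh=0.5):
    """Split an image into foreground/background views from a Grad-CAM-style
    attention map in [0, 1]; pixels below the threshold are zeroed out."""
    mask = (attn_map > thresh).float()     # (H, W), 1 = attended (causal) region
    fg = image * mask                      # foreground view: keep attended pixels
    bg = image * (1.0 - mask)              # background view: keep the complement
    return fg, bg

class BalancedPromptNorm(torch.nn.Module):
    """Gated mix of each class's template embedding with the set mean,
    re-centering the text representation space (illustrative parameterization)."""
    def __init__(self, num_classes):
        super().__init__()
        self.gate = torch.nn.Parameter(torch.zeros(num_classes, 1))  # one gate per class

    def forward(self, text_emb):                       # text_emb: (C, D)
        mean_emb = text_emb.mean(dim=0, keepdim=True)  # embedding-set mean
        alpha = torch.sigmoid(self.gate)               # learned gate in (0, 1)
        mixed = alpha * text_emb + (1 - alpha) * mean_emb
        return F.normalize(mixed, dim=-1)

# Toy usage: one 3x224x224 image with an attention map, and 10 class embeddings.
img, attn = torch.randn(3, 224, 224), torch.rand(224, 224)
fg_view, bg_view = erasure_views(img, attn)
debiased_text = BalancedPromptNorm(num_classes=10)(torch.randn(10, 512))
```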
Dual-Prompt Collaboration (Two-Branch Prompt Tuning)
- Prompt Decoupling: Two prompt vectors $p_b$ (base) and $p_n$ (new) are instantiated; $p_n$ is frozen to preserve new-class generalization while $p_b$ is trained for base-class specialization. At inference, predictions are made with a convex combination of the two prompts,

$$p_{\text{inf}} \;=\; w\, p_b + (1 - w)\, p_n,$$

with a task-specific weight $w$. Hard negative mining and an InfoNCE loss further drive the purity of the base-specialized prompt $p_b$ (Li et al., 17 Mar 2025); a minimal sketch of the mixing step follows.
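The mixing step can be written in a few lines. Whether the convex combination is applied to prompt context vectors or to downstream logits is architecture-dependent; the sketch below shows the prompt-space reading, with `mix_prompts` and the weight value as illustrative placeholders.

```python
import torch

def mix_prompts(p_base, p_new, w=0.7):
    """Convex combination of the trainable base prompt and the frozen new prompt.
    p_new is detached so that, during tuning, gradients reach only p_base."""
    return w * p_base + (1.0 - w) * p_new.detach()

# Toy usage: 4 learnable context tokens of width 512, mixed with a frozen branch.
p_base = torch.nn.Parameter(torch.randn(4, 512) * 0.02)   # tuned for base classes
p_new = torch.randn(4, 512) * 0.02                         # frozen generalization branch
prompt_for_inference = mix_prompts(p_base, p_new, w=0.7)
```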
Bilateral Prompting in LLMs
- System/User Prompt Optimization: BiPrompt jointly optimizes both the system and user prompts via LLM-as-optimizer and LLM-as-judge loops. Offline, candidate complements are iteratively proposed and ranked; the system prompt pool is likewise evolved through judge-driven selection. Online, a generator produces custom complements for user prompts, either by fine-tuning a small model or via retrieval-based in-context learning (Zhang et al., 21 Jul 2025).
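The offline portion of this loop can be sketched as a plain selection procedure. The callables `propose_candidates`, `llm_answer`, and `judge_score` are placeholders for the LLM-as-optimizer and LLM-as-judge calls; the pool size, round count, and the way complements carry over between rounds are assumptions, not the published algorithm.

```python
def optimize_bilateral_prompts(system_pool, user_queries,
                               propose_candidates, llm_answer, judge_score,
                               rounds=3, keep_top=4):
    """Illustrative offline loop: propose user-prompt complements, score
    (system prompt, user prompt + complement) pairs with a judge, and keep
    the best-scoring system prompts and complements for the next round."""
    complements = {u: "" for u in user_queries}
    for _ in range(rounds):
        scored_systems = []
        for s in system_pool:
            total = 0.0
            for u in user_queries:
                # Rank the current complement against freshly proposed candidates.
                candidates = [complements[u]] + list(propose_candidates(u))
                best_c, best_score = None, float("-inf")
                for c in candidates:
                    score = judge_score(u, llm_answer(s, (u + "\n" + c).strip()))
                    if score > best_score:
                        best_c, best_score = c, score
                complements[u] = best_c
                total += best_score
            scored_systems.append((total / len(user_queries), s))
        # Evolve the system-prompt pool by judge-driven selection.
        scored_systems.sort(key=lambda x: x[0], reverse=True)
        system_pool = [s for _, s in scored_systems[:keep_top]]
    return system_pool, complements
```

Online, the exhaustive candidate loop is replaced by a small generator (fine-tuned or retrieval-based) that emits a complement for each incoming user prompt.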
Segmentation: Mixture-of-Experts Fusion
- BiPrompt-SAM: Parallel point and text prompts are fused using a deterministic gating mechanism based on Intersection over Union (IoU), selecting the best-aligned mask candidate at inference. This gating is non-parametric and interpretable as a simplified MoE (Xu et al., 25 Mar 2025).
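The gating itself reduces to an argmax over mask agreement. The snippet below illustrates the deterministic selection; using the IoU between each point-prompt candidate and the text-prompt mask as the gating score is the simplest reading of the mechanism and is an assumption of this sketch.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union between two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / union if union > 0 else 0.0

def fuse_masks(point_masks, text_mask):
    """Deterministic gating: among the point-prompt candidates, return the one
    that best agrees (by IoU) with the text-prompt mask, i.e. a simplified,
    non-parametric mixture-of-experts selection."""
    scores = [iou(m, text_mask) for m in point_masks]
    best = int(np.argmax(scores))
    return point_masks[best], scores[best]

# Toy usage: three random 64x64 point-prompt candidates vs. one text-prompt mask.
candidates = [np.random.rand(64, 64) > 0.5 for _ in range(3)]
text_mask = np.random.rand(64, 64) > 0.5
chosen, score = fuse_masks(candidates, text_mask)
```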
4. Empirical Evaluation and Benchmarks
Experiments across vision-language and language domains consistently demonstrate BiPrompt’s effectiveness:
| Domain (metric) | BiPrompt | Baseline | Gain |
|---|---|---|---|
| CLIP OOD (Top-1 avg.) | 42.4% | 25.8% (vanilla CLIP); 40.5% (SEraser) | +16.5–17 pts over vanilla |
| Synthetic bias (avg. / worst-group acc.) | 91.3% / 85.0% | 72.3% / 49.5% | +19.0 / +35.5 pts |
| LLM QA (Arena-Hard avg.) | 54.64% (P3) | 41.85% | +12.8 pts |
| Reasoning (GSM8K) | 84.8% (P3) | 81.1% (TextGrad) | +3.7 pts |
| Segmentation (RefCOCO IoU) | 86.5% | 78.3% (EVF-SAM) | +8.2 pts |
Ablation studies confirm the necessity of bilateral optimization: removing visual or textual modules in BiPrompt for VLMs leads to substantial drops in worst-group or average accuracy (up to 10 points). In prompt fusion, tuning only the base or only the new prompt fails to resolve the base-new trade-off (Gupta et al., 5 Jan 2026, Li et al., 17 Mar 2025, Zhang et al., 21 Jul 2025, Xu et al., 25 Mar 2025).
5. Theoretical Perspectives and Analyses
Bilateral prompt schemes are supported empirically and, in some frameworks, analytically:
- Conditional Mutual Information Reduction: For VLMs, structured erasure and isotropic normalization drive $I(\hat{Y}; S \mid C)$ down, decoupling predictions from spurious background or linguistic priors without retraining the backbone (Gupta et al., 5 Jan 2026).
- Feature-Channel Invariance: In dual-branch prompt transfer, weight-mixing $p_b$ and $p_n$ avoids cross-channel interference; updates live in a radial subspace with magnitude modulation but no channel permutation (Li et al., 17 Mar 2025).
- Prompt Pool Stabilization: LLM prompt optimization shows empirical convergence of system and user prompt pools after several thousand LLM generations, yielding robust, transferable prompt templates (Zhang et al., 21 Jul 2025).
- No formal convergence or optimality theorems are provided for joint LLM prompt loops or for multi-modal erasure schemes.
6. Implementation, Limitations, and Practical Recommendations
Implementation of BiPrompt frameworks typically adds minimal overhead: only a small number of gating, prompt, or normalization parameters are tuned at test time, and backbone models remain fixed. Practical considerations include:
- Hyperparameter Sensitivity: The number of adaptation steps (usually 1–5), the learning rate, and the orthogonality trade-off weight are critical. Poor Grad-CAM attention can impair visual debiasing; the isotropy strength governs text normalization (Gupta et al., 5 Jan 2026).
- Compute Cost: LLM-based bilateral optimization requires an offline loop with tens of thousands of model queries, but online adaptation (via fine-tuning or in-context learning) is lightweight (Zhang et al., 21 Jul 2025).
- Prompt Initialization and Dataset Construction: Initialization from high-quality manual prompts and leveraging representative user queries or base-class splits are recommended (Zhang et al., 21 Jul 2025, Li et al., 17 Mar 2025).
- Plug-and-play Usage: BiPrompt can layer onto existing CoOp/CLIP-style prompt learners, segmentation pipelines (SAM + EVF-SAM), or open-source LLMs (Li et al., 17 Mar 2025, Xu et al., 25 Mar 2025).
Common pitfalls include over-pruning in erasure (removing causal features), incorrect hyperparameter ranges, and insufficient candidate diversity in prompt pools.
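As a rough guide, the knobs mentioned above can be collected in a small configuration; the parameter names and default values below are illustrative placeholders tied to the quantities discussed in this section, not published settings.

```python
# Illustrative test-time adaptation settings (placeholder names and values).
BIPROMPT_TTA_CONFIG = {
    "adaptation_steps": 3,     # typically 1-5 gradient steps per test sample/batch
    "learning_rate": 1e-3,     # placeholder; tune per backbone
    "lambda_erase": 1.0,       # strength of the structured erasure term
    "lambda_ent": 0.1,         # entropy regularization weight
    "attn_threshold": 0.5,     # Grad-CAM foreground/background cutoff
    "isotropy_strength": 0.5,  # gate prior for balanced prompt normalization
}
```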
7. Outlook and Extensions
Ongoing and potential future directions for bilateral prompt optimization include:
- Adaptive Attention Schemes: Improved saliency methods or thresholding for more precise visual erasure (Gupta et al., 5 Jan 2026).
- Hierarchical and Layerwise Prompt Fusion: Multi-scale erasure and staged prompt gating could further disentangle spurious and causal cues.
- Generalization to New Tasks: Extension to video, multitask, or multi-modal fusion settings via dynamic, input-dependent gating or extension of prompt complement generation (Gupta et al., 5 Jan 2026, Zhang et al., 21 Jul 2025).
- Theoretical Analysis: A detailed study of the convergence dynamics of $I(\hat{Y}; S \mid C)$ under BiPrompt adaptation remains open.
- Annotation Efficiency: In segmentation, BiPrompt-SAM demonstrates that a single-point plus text prompt substantially lowers labeling burden in clinical or practical workflows (Xu et al., 25 Mar 2025).
Bilateral Prompt Optimization thus provides a unified and flexible strategy across modalities, yielding consistent empirical improvements on robustness, debiasing, and accuracy under distribution shifts without retraining model backbones (Gupta et al., 5 Jan 2026, Zhang et al., 21 Jul 2025, Li et al., 17 Mar 2025, Xu et al., 25 Mar 2025).