Bilateral Prompt Optimization (BiPrompt)
- BiPrompt is a framework that jointly optimizes visual and textual prompts to mitigate spurious correlations and enhance causal, domain-invariant feature extraction.
- It employs structured erasure, balanced prompt normalization, and dual-prompt collaboration to reduce linguistic and visual biases in model predictions.
- Empirical evaluations demonstrate significant improvements in accuracy and robustness, achieving up to 17-point gains in vision-language and language model benchmarks.
A Bilateral Prompt Optimization Framework (BiPrompt) refers to a class of techniques that perform joint, modality-specific adaptation or optimization of prompts on both sides of a model's conditioning: the input ("visual" or "system") prompt and the output ("textual" or "user") prompt. This bilateral strategy addresses key limitations of unilateral prompt techniques that restrict optimization to one modality. BiPrompt aims to maximize robustness, generalization, and causal alignment in multimodal models and LLMs through parallel and often synergistic treatment of both prompt types, especially in test-time or lightweight adaptation scenarios (Gupta et al., 5 Jan 2026, Zhang et al., 21 Jul 2025, Li et al., 17 Mar 2025, Xu et al., 25 Mar 2025).
1. Motivation and Conceptual Foundation
Bilateral Prompt Optimization arose in response to incomplete or suboptimal adaptation in both vision-language models (VLMs) and large language models (LLMs) that rely on prompt-based conditioning. Unimodal debiasing or prompt-tuning approaches, which optimize the visual or textual (system or user) prompt alone, fail to eliminate spurious correlations and may induce unstable adaptation under distribution shift. In VLMs such as CLIP, visual-only debiasing (e.g., masking, SEraser) can reduce reliance on background cues but leaves textual shortcuts unaddressed; conversely, text-only approaches ignore the entangled visual context. In LLMs, tuning the system or user prompt separately yields suboptimal behavioral alignment because the two are mutually interdependent (Gupta et al., 5 Jan 2026, Zhang et al., 21 Jul 2025).
The core goal of BiPrompt is thus joint mitigation of non-causal (spurious) feature reliance in both modalities at once, steering model predictions toward causal, domain-invariant features. In vision-language settings this is quantified by minimizing the conditional mutual information $I(\hat{Y}; S \mid C)$, where $C$ is a vector of causal features (typically foreground semantics), $S$ denotes spurious information (background, texture, context), and $\hat{Y}$ is the model's predicted label (Gupta et al., 5 Jan 2026).
2. Mathematical Structure and Optimization Objectives
A general mathematical framework for BiPrompt can be established as follows:
Let $x$ denote an image input and $t_c$ a class prompt for class $c$. The vision-language model decomposes into a vision encoder $f_v$ and a text encoder $f_t$, with the prediction computed as

$$p(y = c \mid x) \;=\; \frac{\exp\!\big(\langle f_v(x),\, \tilde{f}_t(t_c)\rangle / \tau\big)}{\sum_{c'} \exp\!\big(\langle f_v(x),\, \tilde{f}_t(t_{c'})\rangle / \tau\big)},$$

where $\tau$ is a learnable temperature and $\tilde{f}_t(t_c)$ incorporates a debiased text embedding. The BiPrompt optimization problem is

$$\min_{\theta_v,\, \theta_t}\; \mathcal{L}_{\text{total}}(\theta_v, \theta_t),$$

where $\theta_v$ and $\theta_t$ are the visual and textual prompt parameters, subject to architecture-specific objectives that typically combine:
- Structured erasure loss (visual prompt debiasing): forces consistency of predictions on causal vs. spurious regions.
- Balanced prompt normalization (text prompt debiasing): re-centers text representations to an isotropic space, suppressing linguistic shortcuts.
- Entropy regularization: prevents probability collapse.
- Optional cross-entropy on available labels.
An instantiation of the total loss function is

$$\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{CE}} \;+\; \lambda_{\text{er}}\,\mathcal{L}_{\text{erase}} \;+\; \lambda_{\text{ent}}\,\mathcal{L}_{\text{ent}},$$

with hyperparameters $\lambda_{\text{er}}$ and $\lambda_{\text{ent}}$ tuning the strength of erasure and regularization (Gupta et al., 5 Jan 2026); a minimal sketch of this combination is given below.
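To make the combination concrete, the following is a minimal PyTorch sketch of a test-time loss along these lines. It is an illustration rather than the authors' implementation: the specific consistency and orthogonality terms inside the erasure loss, the anti-collapse entropy term, and the names `lambda_erase` and `lambda_ent` are assumptions chosen for clarity.

```python
import torch
import torch.nn.functional as F

def biprompt_loss(logits_full, logits_fg, logits_bg,
                  labels=None, lambda_erase=1.0, lambda_ent=0.1):
    """Illustrative total loss: structured-erasure term + entropy regularization
    (+ optional cross-entropy when labels are available)."""
    p_full = logits_full.softmax(dim=-1)
    p_fg = logits_fg.softmax(dim=-1)
    p_bg = logits_bg.softmax(dim=-1)

    # Structured erasure: keep the foreground view consistent with the full view,
    # and penalize agreement between the full view and the background-only view.
    consistency = F.kl_div(p_fg.clamp_min(1e-8).log(), p_full, reduction="batchmean")
    orthogonality = (p_bg * p_full).sum(dim=-1).mean()
    loss_erase = consistency + orthogonality

    # Entropy of the batch-averaged prediction; maximizing it (minus sign below)
    # discourages collapse onto a single class.
    p_mean = p_full.mean(dim=0)
    marginal_entropy = -(p_mean * p_mean.clamp_min(1e-8).log()).sum()

    loss = lambda_erase * loss_erase - lambda_ent * marginal_entropy
    if labels is not None:
        loss = loss + F.cross_entropy(logits_full, labels)
    return loss

# Toy usage: random logits for a batch of 4 images over 10 classes
# (full image, foreground view, background view).
logits = [torch.randn(4, 10) for _ in range(3)]
print(biprompt_loss(*logits))
```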
For language-only (LLM) settings, the optimization objective maximizes expected response quality as measured by a model-as-judge,

$$\max_{s,\, g}\; \mathbb{E}_{u}\Big[\, J\big(\mathrm{LLM}(s,\; u \oplus g(u))\big) \Big],$$

where $s$ is the system prompt, $u$ the user prompt, $g(u)$ a complement generated by the generator $g$, and $J$ a judgement score (Zhang et al., 21 Jul 2025).
3. Core Algorithmic Modules
Key modules of BiPrompt frameworks, as exemplified by recent works, include:
Visual/Textual Bimodal Debiasing (VLMs)
- Structured Attention-Guided Erasure: Computes Grad-CAM-based attention maps to isolate image regions. Foreground ($x_{\text{fg}}$) and background ($x_{\text{bg}}$) views are constructed, enabling an erasure loss $\mathcal{L}_{\text{erase}}$ that enforces prediction invariance to causally relevant (foreground) features and orthogonality to spurious (background) features; a sketch of the view construction appears after this list.
- Balanced Prompt Normalization: Static class prompts are replaced by gated combinations of the class template embedding $t_c$ and the embedding-set mean $\bar{t}$,

$$\tilde{t}_c \;=\; \alpha_c\, t_c + (1 - \alpha_c)\,\bar{t},$$

with the gate $\alpha_c$ learned per class (Gupta et al., 5 Jan 2026).
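The two debiasing steps can be sketched as follows. This is a minimal illustration under assumed tensor shapes: the hard attention threshold, the sigmoid gate parameterization, and the names `erasure_views` / `BalancedPromptNorm` are choices made here, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def erasure_views(image, attn_map, thresh=0.5):
    """Split an image into foreground/background views from a Grad-CAM-style
    attention map in [0, 1]; pixels below the threshold are zeroed out."""
    mask = (attn_map > thresh).float()     # (H, W), 1 = attended (causal) region
    fg = image * mask                      # foreground view: keep attended pixels
    bg = image * (1.0 - mask)              # background view: keep the complement
    return fg, bg

class BalancedPromptNorm(torch.nn.Module):
    """Gated mix of each class's template embedding with the set mean,
    re-centering the text representation space (illustrative parameterization)."""
    def __init__(self, num_classes):
        super().__init__()
        self.gate = torch.nn.Parameter(torch.zeros(num_classes, 1))  # one gate per class

    def forward(self, text_emb):                       # text_emb: (C, D)
        mean_emb = text_emb.mean(dim=0, keepdim=True)  # embedding-set mean
        alpha = torch.sigmoid(self.gate)               # learned gate in (0, 1)
        mixed = alpha * text_emb + (1 - alpha) * mean_emb
        return F.normalize(mixed, dim=-1)

# Toy usage: one 3x224x224 image with an attention map, and 10 class embeddings.
img, attn = torch.randn(3, 224, 224), torch.rand(224, 224)
fg_view, bg_view = erasure_views(img, attn)
debiased_text = BalancedPromptNorm(num_classes=10)(torch.randn(10, 512))
```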
Dual-Prompt Collaboration (Two-Branch Prompt Tuning)
- Prompt Decoupling: Two prompt vectors $p_b$ (base) and $p_n$ (new) are instantiated; $p_n$ is frozen to preserve new-class generalization while $p_b$ is trained for base-class specialization. At inference, predictions are made with a convex combination of the two prompts,

$$p_{\text{inf}} \;=\; w\, p_b + (1 - w)\, p_n,$$

with a task-specific weight $w$. Hard negative mining and an InfoNCE loss further drive the purity of the base-specialized prompt $p_b$ (Li et al., 17 Mar 2025); a minimal sketch of the mixing step follows.
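The mixing step can be written in a few lines. Whether the convex combination is applied to prompt context vectors or to downstream logits is architecture-dependent; the sketch below shows the prompt-space reading, with `mix_prompts` and the weight value as illustrative placeholders.

```python
import torch

def mix_prompts(p_base, p_new, w=0.7):
    """Convex combination of the trainable base prompt and the frozen new prompt.
    p_new is detached so that, during tuning, gradients reach only p_base."""
    return w * p_base + (1.0 - w) * p_new.detach()

# Toy usage: 4 learnable context tokens of width 512, mixed with a frozen branch.
p_base = torch.nn.Parameter(torch.randn(4, 512) * 0.02)   # tuned for base classes
p_new = torch.randn(4, 512) * 0.02                         # frozen generalization branch
prompt_for_inference = mix_prompts(p_base, p_new, w=0.7)
```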
Bilateral Prompting in LLMs
- System/User Prompt Optimization: BiPrompt jointly optimizes both the system and user prompts via LLM-as-optimizer and LLM-as-judge loops. Offline, candidate complements are iteratively proposed and ranked; the system prompt pool is likewise evolved through judge-driven selection. Online, a generator produces custom complements for user prompts, either by fine-tuning a small model or via retrieval-based in-context learning (Zhang et al., 21 Jul 2025).
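The offline portion of this loop can be sketched as a plain selection procedure. The callables `propose_candidates`, `llm_answer`, and `judge_score` are placeholders for the LLM-as-optimizer and LLM-as-judge calls; the pool size, round count, and the way complements carry over between rounds are assumptions, not the published algorithm.

```python
def optimize_bilateral_prompts(system_pool, user_queries,
                               propose_candidates, llm_answer, judge_score,
                               rounds=3, keep_top=4):
    """Illustrative offline loop: propose user-prompt complements, score
    (system prompt, user prompt + complement) pairs with a judge, and keep
    the best-scoring system prompts and complements for the next round."""
    complements = {u: "" for u in user_queries}
    for _ in range(rounds):
        scored_systems = []
        for s in system_pool:
            total = 0.0
            for u in user_queries:
                # Rank the current complement against freshly proposed candidates.
                candidates = [complements[u]] + list(propose_candidates(u))
                best_c, best_score = None, float("-inf")
                for c in candidates:
                    score = judge_score(u, llm_answer(s, (u + "\n" + c).strip()))
                    if score > best_score:
                        best_c, best_score = c, score
                complements[u] = best_c
                total += best_score
            scored_systems.append((total / len(user_queries), s))
        # Evolve the system-prompt pool by judge-driven selection.
        scored_systems.sort(key=lambda x: x[0], reverse=True)
        system_pool = [s for _, s in scored_systems[:keep_top]]
    return system_pool, complements
```

Online, the exhaustive candidate loop is replaced by a small generator (fine-tuned or retrieval-based) that emits a complement for each incoming user prompt.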
Segmentation: Mixture-of-Experts Fusion
- BiPrompt-SAM: Parallel point and text prompts are fused using a deterministic gating mechanism based on Intersection over Union (IoU), selecting the best-aligned mask candidate at inference. This gating is non-parametric and interpretable as a simplified MoE (Xu et al., 25 Mar 2025).
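The gating itself reduces to an argmax over mask agreement. The snippet below illustrates the deterministic selection; using the IoU between each point-prompt candidate and the text-prompt mask as the gating score is the simplest reading of the mechanism and is an assumption of this sketch.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union between two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / union if union > 0 else 0.0

def fuse_masks(point_masks, text_mask):
    """Deterministic gating: among the point-prompt candidates, return the one
    that best agrees (by IoU) with the text-prompt mask, i.e. a simplified,
    non-parametric mixture-of-experts selection."""
    scores = [iou(m, text_mask) for m in point_masks]
    best = int(np.argmax(scores))
    return point_masks[best], scores[best]

# Toy usage: three random 64x64 point-prompt candidates vs. one text-prompt mask.
candidates = [np.random.rand(64, 64) > 0.5 for _ in range(3)]
text_mask = np.random.rand(64, 64) > 0.5
chosen, score = fuse_masks(candidates, text_mask)
```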
4. Empirical Evaluation and Benchmarks
Experiments across vision-language and language domains consistently demonstrate BiPrompt’s effectiveness:
| Domain (metric) | BiPrompt | Baseline | Gain |
|---|---|---|---|
| CLIP OOD (Top-1 avg.) | 42.4% | 25.8% (vanilla CLIP); 40.5% (SEraser) | +16.5–17 pts over vanilla |
| Synthetic bias (avg. / worst-group acc.) | 91.3% / 85.0% | 72.3% / 49.5% | +19.0 / +35.5 pts |
| LLM QA (Arena-Hard avg.) | 54.64% (P3) | 41.85% | +12.8 pts |
| Reasoning (GSM8K) | 84.8% (P3) | 81.1% (TextGrad) | +3.7 pts |
| Segmentation (RefCOCO IoU) | 86.5% | 78.3% (EVF-SAM) | +8.2 pts |
Ablation studies confirm the necessity of bilateral optimization: removing visual or textual modules in BiPrompt for VLMs leads to substantial drops in worst-group or average accuracy (up to 10 points). In prompt fusion, tuning only the base or only the new prompt fails to resolve the base-new trade-off (Gupta et al., 5 Jan 2026, Li et al., 17 Mar 2025, Zhang et al., 21 Jul 2025, Xu et al., 25 Mar 2025).
5. Theoretical Perspectives and Analyses
Bilateral prompt schemes are supported empirically and, in some frameworks, analytically:
- Conditional Mutual Information Reduction: For VLMs, structured erasure and isotropic normalization drive $I(\hat{Y}; S \mid C)$ down, decoupling predictions from spurious background or linguistic priors without retraining the backbone (Gupta et al., 5 Jan 2026).
- Feature-Channel Invariance: In dual-branch prompt transfer, weight-mixing $p_b$ and $p_n$ avoids cross-channel interference; updates live in a radial subspace with magnitude modulation but no channel permutation (Li et al., 17 Mar 2025).
- Prompt Pool Stabilization: LLM prompt optimization shows empirical convergence of system and user prompt pools after several thousand LLM generations, yielding robust, transferable prompt templates (Zhang et al., 21 Jul 2025).
- No formal convergence or optimality theorems are provided for joint LLM prompt loops or for multi-modal erasure schemes.
6. Implementation, Limitations, and Practical Recommendations
Implementation of BiPrompt frameworks typically adds minimal overhead: only a small number of gating, prompt, or normalization parameters are tuned at test time, and backbone models remain fixed. Practical considerations include:
- Hyperparameter Sensitivity: The number of adaptation steps (usually 1–5), the learning rate, and the orthogonality trade-off weight are critical. Poor Grad-CAM attention can impair visual debiasing; the isotropy strength governs text normalization (Gupta et al., 5 Jan 2026).
- Compute Cost: LLM-based bilateral optimization requires an offline loop with tens of thousands of model queries, but online adaptation (via fine-tuning or in-context learning) is lightweight (Zhang et al., 21 Jul 2025).
- Prompt Initialization and Dataset Construction: Initialization from high-quality manual prompts and leveraging representative user queries or base-class splits are recommended (Zhang et al., 21 Jul 2025, Li et al., 17 Mar 2025).
- Plug-and-play Usage: BiPrompt can layer onto existing CoOp/CLIP-style prompt learners, segmentation pipelines (SAM + EVF-SAM), or open-source LLMs (Li et al., 17 Mar 2025, Xu et al., 25 Mar 2025).
Common pitfalls include over-pruning in erasure (removing causal features), incorrect hyperparameter ranges, and insufficient candidate diversity in prompt pools.
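As a rough guide, the knobs mentioned above can be collected in a small configuration; the parameter names and default values below are illustrative placeholders tied to the quantities discussed in this section, not published settings.

```python
# Illustrative test-time adaptation settings (placeholder names and values).
BIPROMPT_TTA_CONFIG = {
    "adaptation_steps": 3,     # typically 1-5 gradient steps per test sample/batch
    "learning_rate": 1e-3,     # placeholder; tune per backbone
    "lambda_erase": 1.0,       # strength of the structured erasure term
    "lambda_ent": 0.1,         # entropy regularization weight
    "attn_threshold": 0.5,     # Grad-CAM foreground/background cutoff
    "isotropy_strength": 0.5,  # gate prior for balanced prompt normalization
}
```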
7. Outlook and Extensions
Ongoing and potential future directions for bilateral prompt optimization include:
- Adaptive Attention Schemes: Improved saliency methods or thresholding for more precise visual erasure (Gupta et al., 5 Jan 2026).
- Hierarchical and Layerwise Prompt Fusion: Multi-scale erasure and staged prompt gating could further disentangle spurious and causal cues.
- Generalization to New Tasks: Extension to video, multitask, or multi-modal fusion settings via dynamic, input-dependent gating or extension of prompt complement generation (Gupta et al., 5 Jan 2026, Zhang et al., 21 Jul 2025).
- Theoretical Analysis: A detailed study of the convergence dynamics of $I(\hat{Y}; S \mid C)$ under BiPrompt adaptation remains open.
- Annotation Efficiency: In segmentation, BiPrompt-SAM demonstrates that a single-point plus text prompt substantially lowers labeling burden in clinical or practical workflows (Xu et al., 25 Mar 2025).
Bilateral Prompt Optimization thus provides a unified and flexible strategy across modalities, yielding consistent empirical improvements on robustness, debiasing, and accuracy under distribution shifts without retraining model backbones (Gupta et al., 5 Jan 2026, Zhang et al., 21 Jul 2025, Li et al., 17 Mar 2025, Xu et al., 25 Mar 2025).