Cross-Modality Proxy Query Update
- Cross-Modality Proxy Query Update is a mechanism that leverages proxy queries (embeddings, tokens, or memory points) to bridge and align heterogeneous modalities effectively.
- It employs iterative refinement and selective information transfer—using transformer updates, momentum memory banks, and contrastive losses—to boost performance in tasks like detection, retrieval, and segmentation.
- This approach underpins modern multimodal learning by providing robustness against noise, missing modalities, and domain shifts while enabling decoupled, efficient training.
Cross-Modality Proxy Query Update refers to a broad class of mechanisms designed to bridge, align, or propagate information between heterogeneous modalities by leveraging proxy queries—representative embeddings, tokens, or memory points that serve as intermediaries for supervised or unsupervised cross-modal interaction. These mechanisms are foundational in modern multimodal learning settings such as cross-modal retrieval, detection, segmentation, person re-identification, and multimodal data integration. They provide both algorithmic efficiency (by reducing direct query–gallery or feature–feature connections) and statistical robustness (by suppressing noise, encouraging semantic alignment, or balancing modality-specific information content). The concept encompasses both architectural innovations (how proxies are created and updated) and training or inference strategies (how they are employed during adaptation or alignment).
1. Foundational Principles
Cross-modality proxy query update operates by introducing or updating proxy units—queries, tokens, or memory vectors—that mediate interaction between modalities. The proxy serves as a semantic, instance-level, or cluster-level representative, allowing networks to concentrate cross-modal fusion, alignment, or adaptation on condensed, higher-signal points rather than diffuse full-feature spaces. This facilitates:
- Selective information transfer: Only reliable or high-confidence proxies are used for cross-modal correction or complementarity, thereby controlling noise and reducing negative transfer (Tian et al., 13 Jan 2026, Wu et al., 2024).
- Iterative refinement: Proxies can be refined over multiple stages, via transformer layers, memory-bank updates, or message-passing, enabling progressive semantic convergence (Sun et al., 26 Nov 2025, Wu et al., 2024, Nandy et al., 2024).
- Architectural decoupling: Proxy update often enables coupled yet parameter-independent branches, simplifying training on unpaired data or handling missing modalities (Tian et al., 13 Jan 2026, Seo et al., 3 Oct 2025).
The field encompasses diverse proxy modalities, including object queries in DETR-like models, text-derived modality tokens, memory points in contrastive learning, and integrated latent representations in message-passing frameworks.
2. Proxy Query Construction and Update Methodologies
Approaches to proxy query update fall into several recurring design patterns:
2.1 Query Fusion in Parallel Branch Transformers
In detectors such as Modality-Decoupled Query Fusion (MDQF) for RGB-Thermal tasks, proxies are instantiated as object queries tied to detection proposals. The update flow is as follows (Tian et al., 13 Jan 2026):
- Each branch (e.g., RGB and TIR) maintains its own query set.
- Selection: Top-k high-confidence queries from both branches are selected using score-based ranking.
- Cross-modal adaptation: Selected queries from branch A are projected via a learnable MLP into the space of branch B and vice versa.
- Fusion: Adapted proxies are concatenated and gathered into new proxy sets based on selection indices; the process is symmetric across branches.
- Decoding: Both branches update their object queries by decoding fused proxies with shared proposals, but retain independent parameters, supporting both complementarity (clean conditions) and separation (modality degradation).
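The selection, adaptation, and fusion steps above can be sketched in a few lines. This is a minimal illustration rather than the MDQF implementation: the function names are hypothetical, a single affine map stands in for the learnable MLP, and fusion is reduced to plain concatenation of each branch's top-k queries with the adapted top-k queries of the other branch.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def linear(x, weight, bias):
    """Affine map y = W x + b for one vector (assumed stand-in for the MLP)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse_queries(own_q, own_s, other_q, other_s, k, weight, bias):
    """Keep this branch's top-k queries and append the other branch's top-k
    queries after projecting them into this branch's space."""
    own_top = [own_q[i] for i in top_k_indices(own_s, k)]
    adapted = [linear(other_q[i], weight, bias) for i in top_k_indices(other_s, k)]
    return own_top + adapted


# Toy RGB and TIR query sets with per-query confidence scores.
rgb_q, rgb_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.9, 0.2, 0.6]
tir_q, tir_s = [[2.0, 2.0], [3.0, 0.0], [0.0, 3.0]], [0.1, 0.8, 0.7]
identity_w, zero_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

# The RGB branch receives adapted TIR proxies; the TIR branch would call the
# same function with the arguments swapped and its own projection weights.
fused_rgb = fuse_queries(rgb_q, rgb_s, tir_q, tir_s, 2, identity_w, zero_b)
```

Because each branch calls the function with its own projection weights, the two branches stay parameter-independent while still exchanging high-confidence proxies.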
2.2 Progressive Proxy Propagation in Video-Text Tasks
ProxyFormer for referring video object segmentation initializes and updates a set of proxy queries for each video frame (Sun et al., 26 Nov 2025):
- Initialization: N proxies per frame, randomly initialized.
- Stacked cross-modality encoding: Alternating multi-head attention blocks cycle proxies through:
  - Spatio-temporal self-attention on video features.
  - Cross-attention injecting proxies into video encoding (Proxy→Video).
  - Cross-attention from video and text features back into proxies (Video/Text→Proxy).
- Multi-stage update: After K iterations, proxies are semantically aligned to both video and language, tracking objects with temporal and spatial granularity.
- Downstream selection: Final proxies generate candidate segmentations; the best match to ground truth is selected.
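The multi-stage update loop can be illustrated with a simplified attention routine. The sketch below assumes single-head scaled dot-product attention with a residual connection and no learned projections or normalization; `cross_attend` and `refine_proxies` are hypothetical names, and the real model uses multi-head blocks with trained weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query attends over the context vectors (used as both keys and
    values) and adds the attended vector back as a residual."""
    scale = math.sqrt(len(queries[0]))
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        attended = [sum(w * c[d] for w, c in zip(weights, context))
                    for d in range(len(q))]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def refine_proxies(proxies, video_tokens, text_tokens, num_stages):
    """Run the three attention steps for num_stages iterations."""
    for _ in range(num_stages):
        video_tokens = cross_attend(video_tokens, video_tokens)      # self-attention on video
        video_tokens = cross_attend(video_tokens, proxies)           # Proxy -> Video
        proxies = cross_attend(proxies, video_tokens + text_tokens)  # Video/Text -> Proxy
    return proxies, video_tokens
```

All token lists are assumed to share one embedding dimension, which is why video and text tokens can simply be concatenated as context for the proxy update.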
2.3 Modality Token Integration and Contrastive Alignment
In multimodality medical object detection, proxy queries are realized as compact text-derived "modality tokens" whose alignment is enforced at both the representation and loss-function level (Seo et al., 3 Oct 2025):
- Definition of modality tokens: Each (modality, class) pair yields a CLIP-encoded textual embedding, projected into the visual query space.
- MoCA (self-attention fusion): At each decoder layer, the relevant modality token is appended to the object query set, and a multi-head self-attention update propagates modality context bidirectionally.
- QueryREPA pretraining: Query cluster means are aligned to modality tokens via InfoNCE contrastive loss with modality-balanced batches, encouraging global modality-awareness in detection queries.
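The contrastive alignment step can be made concrete with a generic InfoNCE loss. This is a sketch under stated assumptions, not the QueryREPA code: `info_nce` is a hypothetical name, the temperature value is arbitrary, and row i of `query_means` is assumed to be the positive match for row i of `tokens`, with all other tokens acting as negatives.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(query_means, tokens, temperature=0.1):
    """Mean InfoNCE loss over cosine similarities, where query_means[i]
    should match tokens[i] and every other token is a negative."""
    q = [l2_normalize(v) for v in query_means]
    t = [l2_normalize(v) for v in tokens]
    losses = []
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])  # equals -log softmax(logits)[i]
    return sum(losses) / len(losses)
```

The loss is small when each query cluster mean points in the direction of its own modality token and grows when it is closer to another modality's token.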
2.4 Memory Bank and Prototype Update with Cross-Modal Aggregation
ECUL for visible–infrared person re-identification maintains instance- and cluster-level proxies per modality (Wu et al., 2024):
- Memory banks for each modality: Embedding vectors for all samples (instance level) and clusters (cluster level), updated by:
  - Real-time replacement (early epochs): Proxies are immediately refreshed by batch examples.
  - Linear/momentum fusion (late epochs): Proxies are updated with an exponential moving average whose memory factor increases linearly.
- Cross-modal fusion: Most reliable cross-modal pairs are discovered by reciprocal count-priority ranking and then used to further update memory banks via small-momentum aggregation—only high-confidence associations are aggregated, ensuring clean cross-modality links.
- Contrastive learning: The unified proxies serve as "keys" in contrastive losses across and within modalities.
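The two-step memory update can be sketched as a schedule plus a moving-average rule. The switch epoch, the maximum momentum, and the exact shape of the ramp below are illustrative assumptions rather than values from the paper, and both function names are hypothetical.

```python
def memory_factor(epoch, switch_epoch, total_epochs, max_momentum=0.9):
    """Return 0 in early epochs (plain replacement), then ramp linearly
    from 0 at switch_epoch up to max_momentum at the final epoch."""
    if epoch < switch_epoch:
        return 0.0
    span = max(total_epochs - 1 - switch_epoch, 1)
    return max_momentum * min((epoch - switch_epoch) / span, 1.0)


def update_proxy(proxy, feature, momentum):
    """Exponential moving average of a memory entry; momentum = 0 reduces
    to real-time replacement by the new feature."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(proxy, feature)]


# One memory entry tracked across an 11-epoch run that switches at epoch 5.
proxy = [1.0, 0.0]
for epoch in range(11):
    batch_feature = [0.0, 1.0]  # toy feature for the same identity
    proxy = update_proxy(proxy, batch_feature, memory_factor(epoch, 5, 11))
```

Expressing replacement as the zero-momentum case of the same rule keeps the early and late phases in one code path, so only the schedule has to change.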
2.5 Proxy Query Refinement at Test Time
TCR for cross-modal retrieval under query shift applies proxy-based test-time adaptation (Li et al., 2024):
- Proxy-based candidate selection: For each test query, select the top-scoring gallery samples as proxies (G′).
- Prediction refinement: Adapt the assignment softmax to operate on the proxy subset.
- Constraint-driven adaptation: Reference constraints such as intra-modality uniformity and cross-modal gap are computed over the most source-like proxy-query pairs and enforced via dedicated loss terms, updating only normalization parameters in the query encoder.
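The first two steps, candidate selection and prediction refinement, can be sketched as follows. The sketch assumes unit-normalized embeddings compared by dot product; `proxy_predictions`, `top_m`, and the temperature value are hypothetical choices for illustration and are not taken from TCR.

```python
import math


def proxy_predictions(query, gallery, top_m, temperature=0.05):
    """Select the top_m most similar gallery items as proxies and compute a
    softmax restricted to them, together with the prediction entropy."""
    sims = [sum(a * b for a, b in zip(query, g)) for g in gallery]
    proxy_ids = sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)[:top_m]
    logits = [sims[i] / temperature for i in proxy_ids]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return proxy_ids, probs, entropy
```

Restricting the softmax to the proxy subset keeps distant gallery items from diluting the prediction, and the returned entropy is the kind of quantity a test-time adaptation objective could minimize or reweight.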
2.6 Proxy Query in Multimodal Data Integration via Message Passing
OrchAMP for multimodal data integration employs proxy query updates to infer latent subject representations from observed modalities (Nandy et al., 2024):
- Proxy initialization: Each observed modality is OLS-projected onto the latent space with SVD-aided linear inversion.
- Empirical Bayes update: All partial proxies are fused via a one-step EB denoiser using the joint prior learned from complete data, yielding a refined, integrated proxy embedding.
- Prediction set construction: The proxy query's asymptotic normality allows construction of a valid, ellipsoidal prediction set for the subject's latent state.
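As a deliberately simplified analogue of the fusion and prediction-set steps, the sketch below fuses scalar per-modality estimates under a Gaussian prior with known noise variances. This is an assumption made for illustration only: OrchAMP learns a joint prior from complete data and works with vector-valued latents, whereas here the prior is fixed and the "ellipsoid" collapses to an interval. The function names are hypothetical.

```python
import math


def fuse_gaussian(estimates, noise_vars, prior_mean=0.0, prior_var=1.0):
    """Posterior mean and variance of a scalar latent given independent
    Gaussian observations and a Gaussian prior (precision weighting)."""
    precision = 1.0 / prior_var + sum(1.0 / v for v in noise_vars)
    weighted = prior_mean / prior_var + sum(e / v for e, v in zip(estimates, noise_vars))
    return weighted / precision, 1.0 / precision


def interval(mean, var, z=1.96):
    """Symmetric interval around the posterior mean; z = 1.96 corresponds to
    roughly 95% coverage under the assumed Gaussian model."""
    half = z * math.sqrt(var)
    return mean - half, mean + half


# Two observed modalities give noisy estimates of the same latent coordinate.
mean, var = fuse_gaussian([1.0, 1.0], [1.0, 1.0])
low, high = interval(mean, var)
```

Noisier modalities receive less weight in the fused estimate, and each additional observed modality shrinks the resulting interval.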
3. Loss Functions and Optimization Schemes Anchored to Proxy Update
Cross-modality proxy query update mechanisms are closely linked to specialized training or adaptation objectives designed to enforce alignment, consistency, and robustness. Common loss components include:
- Contrastive loss (InfoNCE-like): Encouraging proxy queries (object-level, cluster-level, or modality-token-averaged) to be close to the right semantic/instance anchors, while pushing apart negatives (Seo et al., 3 Oct 2025, Wu et al., 2024, Sun et al., 26 Nov 2025).
- Uniformity and gap rectification losses: Promoting within-modality uniformity and matching of inter-modality gap statistics at test time (Li et al., 2024).
- Consistency constraints: Ensuring the alignment between semantic proxies (e.g., proxies yielding masks) and global multimodal embeddings (joint video+text), using bidirectional normalized similarities (Sun et al., 26 Nov 2025).
- Noise-robust adaptation: Weighting entropy terms in proxy-based predictions to suppress unreliable adaptation at test time (Li et al., 2024).
These loss constructs steer proxy updates to both represent the desired modality content and maintain effective cross-modal bridges throughout optimization and inference.
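To make the uniformity and gap terms concrete, the sketch below computes two commonly used statistics: a pairwise Gaussian-potential uniformity measure and the distance between modality centroids. Whether TCR uses exactly these definitions is an assumption; the function names and the value `t=2.0` are illustrative.

```python
import math
from itertools import combinations


def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all pairs; lower values mean
    the embeddings are spread more evenly."""
    potentials = [math.exp(-t * sum((a - b) ** 2 for a, b in zip(x, y)))
                  for x, y in combinations(embeddings, 2)]
    return math.log(sum(potentials) / len(potentials))


def modality_gap(queries, gallery):
    """Euclidean distance between the mean query embedding and the mean
    gallery embedding."""
    dim = len(queries[0])
    mean_q = [sum(v[d] for v in queries) / len(queries) for d in range(dim)]
    mean_g = [sum(v[d] for v in gallery) / len(gallery) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_q, mean_g)))
```

A rectification loss could penalize the difference between these statistics on shifted test queries and reference values measured on source-like pairs.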
4. Empirical Benefits and Application Domains
Proxy query update techniques have demonstrated significant empirical gains across a wide span of challenging multimodal tasks. Notable results include:
- RGB-T object detection: MDQF yields 1–2 mAP improvement versus feature- or box-level fusions, and is robust under modality corruption or dropout (Tian et al., 13 Jan 2026).
- Unsupervised visible-infrared re-identification: ECUL’s two-step memory update (TSMem) improves rank-1 accuracy by over 2% and raises all mAP/mINP scores relative to a naïve memory update (Wu et al., 2024).
- Multimodal medical object detection: The MoCA+QueryREPA stack delivers up to +6.1 AP on colonoscopy, with consistent AP gains across all modalities and minimal added latency (Seo et al., 3 Oct 2025).
- Robust retrieval under query shift: TCR achieves an R@1 of 59.1 (vs. next best 49.3) on the COCO-C image-modality benchmark and overall superior results on transfer benchmarks (Li et al., 2024).
- Multimodal data integration and querying: OrchAMP provides theoretically grounded ellipsoidal prediction sets for the inferred latent state of partially observed query subjects (Nandy et al., 2024).
These approaches accommodate missing modalities, domain shift, noisy or adversarial data, and allow for end-to-end or modular deployment across diverse architectures and domains.
5. Theoretical Foundations and Analytical Insights
Proxy query update mechanisms are underpinned by both empirical and theoretical arguments. The geometrical and statistical justifications include:
- State evolution analysis (message passing): Orchestrated AMP provides blockwise Gaussian denoising channels and CLT-based concentration, validating asymptotically sharp proxy estimates and coverage (Nandy et al., 2024).
- Instance/cluster memory stability: The two-step memory strategy balances plasticity (tracking label/feature drift in early epochs) against stability (keeping prototypes consistent and suppressing overfitting late in training) (Wu et al., 2024).
- Uniformity and gap theory: Increasing query embedding dispersion (uniformity) and correcting mean cross-modal gaps improve retrieval recall, but over-concentration or excessive correction is detrimental (Li et al., 2024).
- Information-theoretic alignment: Pretext contrastive losses for query-token alignment maximize a lower bound on mutual information, which offers a theoretical motivation for better feature transferability (Seo et al., 3 Oct 2025).
Proxy queries thus provide a mathematically motivated and empirically supported channel for cross-modality knowledge transfer and adaptation.
6. Limitations and Open Challenges
While cross-modality proxy query update has established itself across tasks and modalities, important open questions remain:
- Scalability to ultra-high-dimensional or highly asynchronous modalities: Practical runtime and memory are bounded by the design of proxy selection, memory banks, and attention mechanisms.
- Generalization to highly imbalanced or partially observed data: Many approaches require careful control of proxy selection to avoid mode collapse or propagation of noise.
- Complex downstream alignment: In multi-hop or multi-modal cascades, it remains unclear how best to mediate via proxies without introducing bottlenecks or information loss.
- Automated proxy schedule and adaptation: Determining optimal switching points (as in TSMem), momentum, or fusion thresholds is subject to empirical tuning rather than universal prescription.
A plausible implication is that future research will develop more adaptive, theoretically supported proxy update schemas tailored to dynamic or streaming cross-modal environments.
Key Papers Cited:
- MDQF and query fusion: (Tian et al., 13 Jan 2026)
- ProxyFormer cross-modality proxy query in RVOS: (Sun et al., 26 Nov 2025)
- MoCA and QueryREPA: (Seo et al., 3 Oct 2025)
- Test-time adaptation via proxy queries: (Li et al., 2024)
- ECUL for unsupervised visible-infrared ReID: (Wu et al., 2024)
- Orchestrated AMP proxy query update: (Nandy et al., 2024)