ProxyFormer: Proxy-Based Transformer Models
- ProxyFormer is a neural architecture that employs proxy-based representations and transformer attention to model multi-modal, spatio-temporal, and geometric data effectively.
- It decomposes input data into semantic regions represented by trainable proxy vectors, enabling efficient point cloud completion and video object segmentation.
- Empirical benchmarks demonstrate state-of-the-art performance with reduced computational load through proxy-conditioned attention and a coarse-to-fine decoding strategy.
ProxyFormer is a class of neural architectures that combine proxy-based representations with transformer attention for tasks requiring the alignment or completion of multi-modal, spatio-temporal, or geometric data. The proxy-based approach introduces learned intermediate representations (“proxies”) that carry semantically or geometrically critical information between observed and inferred segments of the input space, yielding state-of-the-art performance in both point cloud completion (Li et al., 2023) and referring video object segmentation (Sun et al., 26 Nov 2025).
1. Proxy-Based Representation and Problem Decomposition
ProxyFormer architectures decompose the input data into semantically distinct regions and represent these regions through trainable proxy vectors that encapsulate feature and position information. In point cloud completion, the input (an incomplete cloud) is decomposed into “existing” and “missing” parts, with Farthest Point Sampling (FPS) yielding a seed set for each part (Li et al., 2023). These seeds are then mapped, via feature and position extractors, into proxies: Existing Proxies (EP) and Missing Proxies (MP).
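For illustration, below is a minimal NumPy sketch of greedy FPS; the function name, point counts, and seed count are hypothetical, not taken from the paper.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_seeds: int) -> np.ndarray:
    """Greedy FPS: iteratively pick the point farthest from the chosen set.

    points: (N, 3) array of xyz coordinates; returns indices of n_seeds points.
    """
    n = points.shape[0]
    chosen = np.zeros(n_seeds, dtype=np.int64)
    dist = np.full(n, np.inf)
    chosen[0] = 0  # start from an arbitrary point
    for i in range(1, n_seeds):
        # squared distance from every point to the most recently chosen seed
        d = np.sum((points - points[chosen[i - 1]]) ** 2, axis=1)
        dist = np.minimum(dist, d)        # distance to the nearest chosen seed
        chosen[i] = int(np.argmax(dist))  # farthest remaining point
    return chosen

# Hypothetical usage: seed the "existing" part of an incomplete cloud.
existing_part = np.random.rand(2048, 3).astype(np.float32)
seed_idx = farthest_point_sampling(existing_part, n_seeds=128)
seeds = existing_part[seed_idx]  # (128, 3) seeds fed to the proxy extractor
```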
In video-language segmentation, ProxyFormer initializes a set of trainable proxy queries, one group per frame, each representing object candidates for that frame. These proxies act as anchors for alignment and allow dynamic information propagation across encoding stages, bridging visual and textual semantics (Sun et al., 26 Nov 2025).
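A minimal sketch of per-frame proxy query initialization, assuming T frames, Q candidates per frame, and embedding width D (all values hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: T frames, Q object candidates per frame, width D.
T, Q, D = 8, 5, 256
proxy_queries = nn.Parameter(0.02 * torch.randn(T, Q, D))  # trainable proxies
# Each (t, q) slice anchors one object candidate in frame t and is refined
# by the alternating video- and text-conditioned encoding stages.
```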
2. Feature and Position Extraction Mechanisms
Point cloud ProxyFormer utilizes a Feature and Position Extractor (FAPE). Feature extraction first lifts 3D coordinates to higher-dimensional features via MLPs, followed by Point Transformer blocks supporting vector attention and neighborhood aggregation. The position extractor computes per-proxy position encodings by aggregating local geometric and feature differences among neighbors into a transition feature, which is then projected to the proxy's position code.
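A compact PyTorch sketch of this two-branch extraction, assuming an MLP feature lift and k-NN difference pooling for the position code; the class name `SimpleFAPE` and all hyperparameters are hypothetical, and the paper's Point Transformer blocks are omitted for brevity.

```python
import torch
import torch.nn as nn

class SimpleFAPE(nn.Module):
    """Minimal sketch of a Feature And Position Extractor (assumed design):
    an MLP lifts xyz to D-dim features; position codes are pooled from
    geometric and feature differences to k nearest neighbors."""

    def __init__(self, dim: int = 128, k: int = 16):
        super().__init__()
        self.k = k
        self.lift = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.pos_mlp = nn.Sequential(nn.Linear(3 + dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, xyz: torch.Tensor):
        # xyz: (B, N, 3) point coordinates
        feats = self.lift(xyz)                       # (B, N, D)
        d = torch.cdist(xyz, xyz)                    # (B, N, N) pairwise distances
        knn = d.topk(self.k, largest=False).indices  # (B, N, k), includes self
        nbr_xyz = torch.gather(xyz.unsqueeze(1).expand(-1, xyz.size(1), -1, -1),
                               2, knn.unsqueeze(-1).expand(-1, -1, -1, 3))
        nbr_feat = torch.gather(feats.unsqueeze(1).expand(-1, feats.size(1), -1, -1),
                                2, knn.unsqueeze(-1).expand(-1, -1, -1, feats.size(-1)))
        # local geometric and feature differences to each neighbor
        rel = torch.cat([nbr_xyz - xyz.unsqueeze(2), nbr_feat - feats.unsqueeze(2)], dim=-1)
        pos = self.pos_mlp(rel).max(dim=2).values    # (B, N, D) pooled transition code
        return feats, pos
```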
In video segmentation, feature extraction employs backbone networks (e.g., Video Swin for visuals, RoBERTa for text) and recurrent Cross-Modality Interaction Encoding (CMIE) blocks. This dual extraction refines both spatial and temporal semantics across modalities, with proxy queries being continuously updated by both video and text cues (Sun et al., 26 Nov 2025).
3. Transformer-Based Proxy Interaction
ProxyFormer enables bidirectional communication between proxies and input features through attention mechanisms.
- Missing Part Sensitive Transformer (MPST): For point clouds, the missing proxies (MP) attend to the existing proxies (EP) in each transformer block. Standard multi-head attention lets the proxy queries dynamically update their latent feature distribution, sensitizing the completion process to the missing geometric regions (Li et al., 2023).
- Cross-Modality Proxy Query Update: In referring video object segmentation, ProxyFormer alternates proxy-conditioned video encoding (P2V), where the current proxy queries bias video feature attention, with visual-language conditioned proxy encoding (V2P), where video and language features refine the proxies. Decoupled spatio-temporal self-attention further reduces complexity from quadratic to linear in the temporal and spatial dimensions (Sun et al., 26 Nov 2025); a minimal sketch of both mechanisms follows this list.
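Both mechanisms reduce to standard attention primitives. The sketch below, with assumed shapes and module names, factorizes self-attention along the time axis and then the space axis, and ends with a hypothetical P2V-style step in which video tokens cross-attend to proxy queries.

```python
import torch
import torch.nn as nn

class DecoupledSTAttention(nn.Module):
    """Sketch of decoupled spatio-temporal self-attention (assumed form):
    attend along time for each spatial location, then along space within
    each frame, avoiding full (T*S)^2 joint attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, S, D) video tokens; S = H*W spatial positions
        B, T, S, D = x.shape
        # time axis: each spatial position attends over its T frames
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
        xt = self.temporal(xt, xt, xt)[0].reshape(B, S, T, D).permute(0, 2, 1, 3)
        # space axis: each frame attends over its S positions
        xs = xt.reshape(B * T, S, D)
        return self.spatial(xs, xs, xs)[0].reshape(B, T, S, D)

# Hypothetical P2V step: video tokens attend to the current proxy queries.
dst = DecoupledSTAttention()
video = dst(torch.randn(2, 4, 196, 256))     # (B, T, S, D) after self-attention
proxies = torch.randn(2, 4 * 5, 256)         # (B, T*Q, D) proxy queries
cross = nn.MultiheadAttention(256, 8, batch_first=True)
tokens = video.flatten(1, 2)                 # (B, T*S, D)
biased = cross(tokens, proxies, proxies)[0]  # proxy-conditioned video features
```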
4. Proxy Alignment and Semantic Consistency
In ProxyFormer for point clouds, a proxy alignment mechanism penalizes the discrepancy between the predicted missing proxies and proxies generated from the ground-truth missing part; this regularizes the latent space and improves generative fidelity (Li et al., 2023).
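The exact loss form from the paper is not reproduced here; a minimal sketch assuming a simple mean-squared-error alignment between the two proxy sets:

```python
import torch
import torch.nn.functional as F

def proxy_alignment_loss(pred_proxies: torch.Tensor,
                         gt_proxies: torch.Tensor) -> torch.Tensor:
    """Assumed L2 form: penalize the gap between predicted missing proxies
    and proxies extracted from the ground-truth missing part.
    Both tensors: (B, M, D), one row per missing proxy."""
    return F.mse_loss(pred_proxies, gt_proxies)
```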
For video object segmentation, joint semantic consistency (JSC) training ensures that proxy queries are aligned with the fused video-text representation. A symmetric contrastive loss links proxy queries to joint video-text summaries, enforcing that the proxies encapsulate the semantics required for task-relevant alignment (Sun et al., 26 Nov 2025). The total training loss combines the segmentation objective with this consistency term.
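A sketch of one common symmetric contrastive form (InfoNCE in both directions); the pooling into per-clip summaries and the temperature value are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(proxy_summary: torch.Tensor,
                               joint_summary: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Assumed symmetric InfoNCE: matched proxy/joint pairs within a batch
    are positives, all other pairings are negatives."""
    p = F.normalize(proxy_summary, dim=-1)   # (B, D) pooled proxy queries
    j = F.normalize(joint_summary, dim=-1)   # (B, D) fused video-text summary
    logits = p @ j.t() / temperature         # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # symmetric: proxies -> summaries and summaries -> proxies
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```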
5. Coarse-to-Fine Completion and Mask Generation
ProxyFormer employs a coarse-to-fine completion strategy for point clouds. Coarse predictions are produced by pooling the existing proxy features, then refined by feeding the predicted proxies and their positions into a lightweight FoldingNet decoder. The final output merges the completed missing part with all observed input points.
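A minimal FoldingNet-style refinement head, with assumed grid size and feature width (`FoldingHead` is a hypothetical name; the paper's decoder details may differ). Each coarse point is expanded into a small patch by folding a shared 2D grid, conditioned on its proxy feature.

```python
import torch
import torch.nn as nn

class FoldingHead(nn.Module):
    """Sketch: a shared 2D grid is 'folded' into 3D offsets around each
    coarse point, conditioned on that point's proxy feature."""

    def __init__(self, dim: int = 128, grid: int = 4):
        super().__init__()
        g = torch.linspace(-0.05, 0.05, grid)
        gy, gx = torch.meshgrid(g, g, indexing="ij")
        self.register_buffer("grid", torch.stack([gx, gy], -1).view(-1, 2))  # (G, 2)
        self.fold = nn.Sequential(nn.Linear(dim + 2, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, coarse_xyz: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # coarse_xyz: (B, M, 3) coarse missing points; feats: (B, M, D) proxies
        B, M, _ = coarse_xyz.shape
        G = self.grid.size(0)
        grid = self.grid.expand(B, M, G, 2)            # shared 2D seeds per point
        f = feats.unsqueeze(2).expand(-1, -1, G, -1)   # broadcast proxy features
        offsets = self.fold(torch.cat([f, grid], -1))  # (B, M, G, 3) folded offsets
        fine = coarse_xyz.unsqueeze(2) + offsets       # fold around coarse points
        return fine.reshape(B, M * G, 3)
```

The completed cloud is then the concatenation of these refined points with the observed input, as described above.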
For video object segmentation, masks are inferred by collapsing final-stage proxy queries into adaptive segmentation kernels, which are applied to encoded features in a feature pyramid network (FPN)-style decoder, producing frame-wise object masks. Best-matching object trajectories are then selected via query-to-object assignment procedures (Sun et al., 26 Nov 2025).
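A sketch of query-conditioned mask generation under assumed shapes: each final proxy query is projected into a dynamic kernel and applied to per-frame decoder features as a 1x1 convolution (the einsum below); the projection layer and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: D channels, Q queries, T frames, HxW feature resolution.
D, Q, T, H, W = 256, 5, 4, 64, 64
to_kernel = nn.Linear(D, D)            # project each query to kernel weights
queries = torch.randn(T, Q, D)         # final-stage proxy queries
fpn_feats = torch.randn(T, D, H, W)    # FPN-style decoded per-frame features

kernels = to_kernel(queries)           # (T, Q, D) dynamic segmentation kernels
# apply each query's kernel as a 1x1 convolution over its frame's features
mask_logits = torch.einsum("tqd,tdhw->tqhw", kernels, fpn_feats)
masks = mask_logits.sigmoid() > 0.5    # (T, Q, H, W) frame-wise object masks
```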
6. Computational and Empirical Benchmarks
ProxyFormer architectures prioritize computational efficiency via proxy decoupling, lightweight decoding, and multi-stage proxy update schemes. In point cloud completion, reported FLOPs are roughly 10 G in typical settings (PCN dataset), with inference faster than preceding transformer-based models. On PCN, ProxyFormer reaches an average Chamfer Distance of 6.77e-3, on par with the prior best (SeedFormer, 6.74e-3), and sets a new state of the art on DCD at 0.577 (Li et al., 2023).
For video segmentation, decoupled cross-attention reduces FLOPs and inference times; e.g., for Ref-YouTube-VOS, ProxyFormer achieves J&F=63.0% with 1341.6 GFLOPs and 109 ms per sequence, a +3.6% improvement over ReferFormer at marginal cost (Sun et al., 26 Nov 2025).
| Task/Dataset | Metric | ProxyFormer Value | Prior Best |
|---|---|---|---|
| PCN (point clouds) | CD | 6.77e-3 | 6.74e-3 (SeedFormer) |
| PCN | DCD | 0.577 | — (new SOTA) |
| Ref-YouTube-VOS | J&F (%) | 63.0 | 59.4 (ReferFormer) |
| KITTI (LiDAR) | MMD | 0.508 | 0.526 (PoinTr) |
7. Context, Applicability, and Significance
ProxyFormer’s modular structure has demonstrated superiority in both geometric completion (point cloud domains) and cross-modal referential segmentation (video-language domains). Its design isolates task-relevant regions through proxy abstraction, enabling robust inference even with incomplete or ambiguous inputs, and can flexibly extend to settings requiring fine-grained alignment between data modalities or between observed and inferred segments.
A plausible implication is broader utility for ProxyFormer-style frameworks in domains bottlenecked by the difficulty of propagating semantics or geometric structure between observed and inferred regions, such as robot perception, multi-modal scene understanding, and generative modeling for occluded or sequential observations. Systematic ablation studies attribute the performance gains to proxy-conditioned attention, dynamic proxy evolution, and the proxy alignment and semantic consistency losses (Li et al., 2023, Sun et al., 26 Nov 2025). Empirical results across benchmarks confirm the value of these architectural features.