CoPLOT: Collaborative 3D Perception

Updated 24 January 2026
  • CoPLOT is a collaborative perception framework that leverages sparse, semantically salient point-level tokens for efficient multi-agent 3D detection.
  • It replaces dense BEV grids with adaptive 1D token sequences to preserve 3D spatial cues while reducing computation and communication overhead.
  • Its innovative modules—STR, FSSM, and NEA—enable closed-loop alignment and frequency-enhanced state-space modeling, significantly improving detection performance in V2X applications.

CoPLOT is a collaborative perception framework that introduces Point-Level Optimized Tokens as an intermediate representation for multi-agent 3D perception tasks. By replacing traditional 2D bird’s-eye-view (BEV) feature exchanges with sparse, semantically salient point-level tokens, CoPLOT achieves enhanced object recognition and localization performance while substantially reducing computation and communication overhead in vehicle-to-everything (V2X) and similar applications (Li et al., 27 Aug 2025).

1. Motivation and Conceptual Overview

Traditional collaborative perception frameworks transmit dense BEV feature grids, incurring loss of critical vertical structure due to height compression and imposing significant resource demands, especially as grid size grows with multi-agent fusion. These BEV-based methods are also inefficient in representing empty space and fail to retain fine-grained 3D spatial cues, which are essential for precise detection and localization.

CoPLOT addresses these deficiencies by exchanging compact, semantically filtered point-level tokens. Each token encodes a 3D position, local geometric context, and semantic importance. The resulting token representation forms an adaptive 1D sequence tailored per agent, focusing computational and bandwidth resources on object-relevant content. This approach avoids BEV’s height-binning artifacts and scale-induced overhead, enabling resource reduction of up to 80–90% compared to BEV transmission.
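As a rough illustration of where the savings come from, the payload gap between a dense BEV grid and a sparse token set can be sketched numerically. All sizes below are assumptions chosen for illustration, not values reported for CoPLOT:

```python
# Back-of-envelope payload comparison; grid resolution, channel count,
# and precision are assumptions for illustration only.
bev_cells = 256 * 256          # assumed BEV grid resolution
bev_channels = 64              # assumed feature channels per cell
bev_bytes = bev_cells * bev_channels * 2          # fp16 features

num_tokens = 1_000             # top-k tokens kept after semantic filtering
token_dim = 128                # feature dimension d
token_bytes = num_tokens * (token_dim + 3) * 2    # features + xyz, fp16

reduction = 1 - token_bytes / bev_bytes
print(f"BEV: {bev_bytes} B, tokens: {token_bytes} B, saved: {reduction:.1%}")
```

Even with generous token budgets, the sparse message is an order of magnitude smaller than the dense grid, consistent with the stated 80–90% reduction.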

Key distinguishing features include:

  • Compact 1D point token sequences instead of 2D grids
  • Dynamic top-$k$ token selection via semantic filtering
  • Linear-time sequence modeling using state-space models (SSMs) with frequency-domain enhancement
  • Closed-loop, explicit alignment for correction of inter-agent pose noise

2. Point-Native Processing Pipeline

Each agent processes its local observation through a sequence of modules, aligning and fusing information with received tokens from neighboring agents.

2.1. Semantic-Aware Token Reordering (STR)

The unordered point cloud $\mathcal{O}_i$ is first tokenized and feature-encoded:

$$\mathcal{X}_i = \Phi_{pt}(\mathcal{O}_i) \in \mathbb{R}^{l\times d}$$

$$\mathcal{F}_i = \Phi_{enc}(\mathcal{X}_i) \in \mathbb{R}^{k_i\times d}$$

STR serializes the token set $\{x_j, \mathbf{p}_j\}$ into a 1D sequence $F^{ro}$, grouping tokens by semantic and spatial adjacency. Scene-dynamic prompts are constructed as

$$G_s = G_{sp} \otimes G_{ss}$$

with $G_{sp}$ learned per scene and $G_{ss}$ fixed. Each token is injected with the prompt and projected to a group-score vector, followed by softmax grouping and Z-order sorting. Tokens are then assigned semantic-importance scores $\hat{s}_j$ and modulated:

$$F^{s}_j = \left(F^{ro}_j + G_{I^{g}_j}\right)\cdot \hat{s}_j$$

Supervision employs a focal loss over ground-truth object annotations.
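A minimal sketch of the serialization step, assuming tokens are quantized to a 2D grid and ordered along a Morton (Z-order) curve before score modulation. The helper names and the cell size are illustrative, not from the paper:

```python
import numpy as np

def morton_key(ix, iy):
    # Interleave the bits of two 16-bit grid indices into one Z-order key.
    key = 0
    for b in range(16):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

def str_serialize(feats, xy, scores, cell=0.5):
    # Quantize token positions to a grid, sort tokens along the Z-order
    # curve, then modulate features by semantic-importance scores.
    idx = np.floor(xy / cell).astype(np.int64)
    idx -= idx.min(axis=0)                      # shift to non-negative indices
    keys = np.array([morton_key(x, y) for x, y in idx])
    order = np.argsort(keys, kind="stable")
    return feats[order] * scores[order, None], order
```

Z-order sorting keeps spatially adjacent tokens adjacent in the 1D sequence, which is what makes the downstream sequence model effective on point data.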

2.2. Frequency-Enhanced State-Space Model (FSSM)

Token sequences are modeled with SSMs that capture long-range dependencies in $O(N)$ time for sequence length $N$:

$$h_i = \overline{A}_i h_{i-1} + \overline{B}_i\,(\Delta_i x_i)$$

$$y_i = C_i h_i + D x_i$$

To enhance separation between object and background tokens, FSSM computes per-token compact spectral descriptors $Q^{freq}_i$ through local 2D discrete Fourier transforms (DFTs) over scene feature maps. These descriptors are injected into the SSM read-out, enabling modeling across both spatial and spectral domains. Dual-scope modeling considers both the global sequence and local windows.
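The recurrence above can be sketched as a linear-time scan; the spectral descriptor is approximated here by low-frequency magnitudes of a 2D DFT over a local patch. Shapes and the single-channel read-out are simplifying assumptions, not the paper's exact design:

```python
import numpy as np

def ssm_scan(x, A_bar, B_bar, C, delta, D=1.0):
    # h_i = A_bar_i * h_{i-1} + B_bar_i * (delta_i * x_i)
    # y_i = C_i . h_i + D * x_i   (single input channel, d-dim state)
    N, d = A_bar.shape
    h = np.zeros(d)
    y = np.empty(N)
    for i in range(N):
        h = A_bar[i] * h + B_bar[i] * (delta[i] * x[i])
        y[i] = C[i] @ h + D * x[i]
    return y

def spectral_descriptor(patch, k=4):
    # Compact frequency cue: low-frequency magnitudes of a local 2D DFT.
    spec = np.abs(np.fft.fft2(patch))
    return spec[:k, :k].ravel()
```

The scan touches each token once, which is the source of the $O(N)$ claim; a transformer's pairwise attention over the same sequence would cost $O(N^2)$.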

2.3. Neighbor-to-Ego Alignment (NEA)

Upon receipt, neighboring agents’ tokens undergo closed-loop alignment. A global agent-level correction $\Delta\xi^{(j)}$ is computed by comparing fused and per-agent scene features. Neighbor coordinates are transformed into the ego frame using the predicted correction, followed by token-level refinement—residual offsets predicted via lightweight MLPs and further compensated according to local statistics. The process is supervised using a mean-squared-error alignment loss.
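A 2D (SE(2)) sketch of the agent-level correction step, under the assumption that the predicted residual composes with the neighbor's noisy pose before mapping into the ego frame. Function names, the 2D restriction, and the composition order are illustrative assumptions:

```python
import numpy as np

def pose_mat(x, y, yaw):
    # 3x3 homogeneous SE(2) transform for pose (x, y, yaw).
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])

def align_to_ego(points_j, xi_i, xi_j, dxi_j):
    # ego <- world <- (noisy neighbor pose composed with correction),
    # applied to the neighbor's token coordinates.
    T = np.linalg.inv(pose_mat(*xi_i)) @ pose_mat(*xi_j) @ pose_mat(*dxi_j)
    hom = np.hstack([points_j, np.ones((len(points_j), 1))])
    return (hom @ T.T)[:, :2]
```

The subsequent token-level MLP refinement then only has to absorb small residuals that this rigid transform cannot express.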

Complete Pipeline Summary

The full pipeline for agent $i$ can be summarized as follows:

  1. Tokenization ($\Phi_{pt}$), encoding ($\Phi_{enc}$), and STR/FSSM-based sequence processing.
  2. Formation and broadcast of messages $\mathcal{M}_i = [\mathcal{F}_i, \mathcal{P}_i, \xi_i]$.
  3. Reception, NEA-based alignment, and aggregation of neighbor information.
  4. Fusion ($\Phi_{fuse}$) using stacked FSSM blocks.
  5. Task head processing ($\Phi_{task}$), typically for 3D object detection.

3. Algorithmic Structure

The process is implemented as an iterative encoding and fusion pipeline, integrating STR and FSSM throughout. The following high-level pseudocode characterizes the workflow:

X_i = tokenize(O_i)              # grid-sample to l×d tokens
F = X_i
for t in range(T):
    F = STR(F)                   # semantic grouping and reordering
    F = FSSM(F)                  # dual-scope SSM + frequency cues
F_i = top_k(F)                   # keep top-k tokens by semantic score

M_i = {F_i, P_i, ξ_i}            # broadcast message

G_i = [F_i]
for each neighbor j ≠ i:
    receive M_j = {F_j, P_j, ξ_j}
    Δξ_j = Φ_mis(...)            # predicted agent-level pose correction
    P_j = align(P_j, ξ_i, ξ_j, Δξ_j)
    P_j = NEA(P_j, ...)          # token-level refinement
    G_i.append(concat(F_j, P_j))
G_i = concat(G_i)

H_i = FSSM_blocks(G_i)           # stacked fusion blocks (Φ_fuse)
Ŷ_i = TaskNet(H_i)               # task head (Φ_task)

Hyperparameters include $l\approx 10^4$ initial tokens, $d=128$, scene/grouping dimensions, top-$k\approx 1{,}000$, four SDSSB layers, and dual-scope sequence modeling.

4. Empirical Protocol

Evaluation utilizes three benchmark datasets spanning simulation and real-world settings:

  • OPV2V: Simulated LiDAR, 12,000 frames, 230,000 annotations.
  • V2V4Real: Real LiDAR, 20,000 frames, 240,000 annotations.
  • DAIR-V2X: Real, 71,000 frames combining vehicle and RSU sensors.

Primary metrics include 3D detection Average Precision (AP) at IoU thresholds $\{0.5, 0.7\}$, communication bandwidth (bytes/frame), and computational cost (GFLOPs). Baselines for comparison are AttFuse, V2VNet, V2X-ViT, CoBEVT, Where2comm, SiCP, CoSDH, CollaMamba, and CoMamba.

Training employs PyTorch, Adam optimizer (learning rate 1e-3 with decay), 40 epochs, focal loss for classification, and smooth L1 for regression.
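The two training losses have standard closed forms, sketched below; the $\alpha$, $\gamma$, and $\beta$ values shown are common defaults, not confirmed for CoPLOT:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t),
    # down-weighting easy examples to handle class imbalance.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)
    a_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t)))

def smooth_l1(pred, target, beta=1.0):
    # Quadratic below beta, linear above: robust box-regression loss.
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)))
```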

5. Comparative Results and Resource Analysis

CoPLOT consistently surpasses prior models in accuracy and efficiency. On OPV2V, V2V4Real, and DAIR-V2X, AP@0.5/0.7 scores are as follows:

| Method | OPV2V | V2V4Real | DAIR-V2X |
|---|---|---|---|
| CoMamba | 0.947 / 0.892 | 0.582 / 0.372 | 0.662 / 0.442 |
| CoPLOT | 0.973 / 0.934 | 0.644 / 0.447 | 0.761 / 0.593 |

Communication and computation statistics demonstrate substantial efficiency:

  • Communication (OPV2V, bytes per frame, AP@0.5): CoBEVT: $2^{30}$ B → 0.957 | Where2comm: $2^{28}$ B → 0.947 | CoPLOT: $2^{24}$ B → 0.973
  • Compute (OPV2V, GFLOPs, AP@0.5): V2X-ViT: 545 → 0.963 | CollaMamba: 198 → 0.947 | CoPLOT: 176 → 0.973
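The reported payloads imply large bandwidth ratios, which follow directly from the exponents:

```python
# Per-frame payload ratios implied by the figures above.
cobevt, where2comm, coplot = 2 ** 30, 2 ** 28, 2 ** 24
print(cobevt // coplot, where2comm // coplot)  # 64 16
```

That is, CoPLOT transmits 64× less than CoBEVT and 16× less than Where2comm per frame while scoring higher AP.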

Robustness under Gaussian pose noise ($\sigma$ up to 1 m) shows CoPLOT’s AP drops by less than 2 points; ablation of NEA increases the drop to more than 5 points. Ablation studies indicate major performance degradation when removing STR (–6.3 pp), frequency enhancement (–11.9 pp), or the scene prompt (–7.8 pp).

6. Limitations and Implications

Point-level tokens enable full 3D preservation (height, edges), eliminating BEV’s artifact-induced limitations and supporting finer localization. Linear-time SSMs with frequency cues allow long-sequence modeling with $O(N)$ complexity, avoiding the quadratic scaling of transformers. Dynamic top-$k$ filtering restricts resources to approximately 1,000 object tokens per agent, directly accounting for the observed bandwidth and computation reductions. Overhead scales linearly with the sum of per-agent top-$k$ tokens.

Limitations include the use of a static top-$k$ threshold (lacking scene adaptivity), reliance on a 2D convolutional detection head (suboptimal for direct 3D bounding box proposals), and restriction to LiDAR-only input. Planned extensions involve joint communication–computation–accuracy trade-off optimization, adaptive $k$ selection, and integration of multi-modal inputs (e.g., camera + LiDAR) (Li et al., 27 Aug 2025).

In summary, CoPLOT employs semantically filtered, frequency-aware point-level tokens and explicit closed-loop alignment to achieve state-of-the-art collaborative 3D detection accuracy and resource efficiency, establishing a new paradigm beyond BEV for collaborative perception.
