CoPLOT: Collaborative 3D Perception
- CoPLOT is a collaborative perception framework that leverages sparse, semantically salient point-level tokens for efficient multi-agent 3D detection.
- It replaces dense BEV grids with adaptive 1D token sequences to preserve 3D spatial cues while reducing computation and communication overhead.
- Its STR, FSSM, and NEA modules enable closed-loop alignment and frequency-enhanced state-space modeling, significantly improving detection performance in V2X applications.
CoPLOT is a collaborative perception framework that introduces Point-Level Optimized Tokens as an intermediate representation for multi-agent 3D perception tasks. By replacing traditional 2D bird’s-eye-view (BEV) feature exchanges with sparse, semantically salient point-level tokens, CoPLOT achieves enhanced object recognition and localization performance while substantially reducing computation and communication overhead in vehicle-to-everything (V2X) and similar applications (Li et al., 27 Aug 2025).
1. Motivation and Conceptual Overview
Traditional collaborative perception frameworks transmit dense BEV feature grids, incurring loss of critical vertical structure due to height compression and imposing significant resource demands, especially as grid size grows with multi-agent fusion. These BEV-based methods are also inefficient in representing empty space and fail to retain fine-grained 3D spatial cues, which are essential for precise detection and localization.
CoPLOT addresses these deficiencies by exchanging compact, semantically filtered point-level tokens. Each token encodes a 3D position, local geometric context, and semantic importance. The resulting token representation forms an adaptive 1D sequence tailored per agent, focusing computational and bandwidth resources on object-relevant content. This approach avoids BEV’s height-binning artifacts and scale-induced overhead, enabling resource reduction of up to 80–90% compared to BEV transmission.
Key distinguishing features include:
- Compact 1D point token sequences instead of 2D grids
- Dynamic top-k token selection via semantic filtering
- Linear-time sequence modeling using state-space models (SSMs) with frequency-domain enhancement
- Closed-loop, explicit alignment for correction of inter-agent pose noise
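The top-k filtering named above can be sketched as a simple score-based selection. This is an illustrative helper, not CoPLOT's implementation: the feature width and score source are assumptions, and in the paper the scores come from the learned semantic-importance head.

```python
import numpy as np

def select_topk_tokens(features, scores, k=1000):
    """Keep the k most semantically salient point tokens (hypothetical helper)."""
    k = min(k, len(scores))
    idx = np.argsort(scores)[::-1][:k]   # indices of the highest-scoring tokens
    return features[idx], idx

# Toy example: 5,000 candidate tokens with 64-d features (dimensions assumed).
rng = np.random.default_rng(0)
feats = rng.normal(size=(5000, 64))
scores = rng.random(5000)
kept, idx = select_topk_tokens(feats, scores, k=1000)
```

Only the retained tokens (plus their coordinates) need to be broadcast, which is where the bandwidth savings over dense BEV grids come from.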
2. Point-Native Processing Pipeline
Each agent processes its local observation through a sequence of modules, aligning and fusing information with received tokens from neighboring agents.
2.1. Semantic-Aware Token Reordering (STR)
The unordered point cloud is first tokenized and feature-encoded:
STR serializes the token set into a 1D sequence, grouping tokens by semantic and spatial adjacency. Scene-dynamic prompts combine a component learned per scene with a fixed component. Each token is injected with the prompt and projected to a group-score vector, followed by softmax grouping and Z-order sorting. Tokens are then assigned semantic-importance scores and modulated accordingly.
Supervision employs a focal loss over ground-truth object annotations.
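The Z-order (Morton-curve) sorting step can be made concrete with a minimal sketch. The quantization cell size is an assumption, and CoPLOT's actual ordering also folds in the learned group assignments; this only shows the spatial-locality-preserving sort itself.

```python
def morton_key(x, y, bits=10):
    """Interleave the bits of quantized (x, y) cell indices into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)       # x bit goes to even position
        key |= ((y >> b) & 1) << (2 * b + 1)   # y bit goes to odd position
    return key

def z_order_sort(tokens, cell=0.5):
    """Sort tokens (each carrying a 2D position) along a Z-order curve.

    `cell` is a hypothetical quantization step, not a value from the paper.
    """
    def key(t):
        px, py = t["pos"]
        return morton_key(int(px / cell), int(py / cell))
    return sorted(tokens, key=key)

tokens = [{"pos": (0.1, 0.1)}, {"pos": (3.9, 3.9)}, {"pos": (0.1, 0.6)}]
ordered = z_order_sort(tokens)  # nearby tokens end up adjacent in the sequence
```

Sorting by interleaved bits keeps spatially close tokens close in the 1D sequence, which is what lets the subsequent sequence model exploit local structure.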
2.2. Frequency-Enhanced State-Space Model (FSSM)
Token sequences are modeled with SSMs that capture long-range dependencies in time linear in the sequence length.
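The linear-time property comes from the recurrent form of the state-space model: one bounded-cost state update per token. Below is a minimal diagonal-SSM scan; the actual FSSM (a selective, Mamba-style model) makes a, b, c input-dependent, whereas the fixed toy values here are assumptions.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Minimal diagonal state-space recurrence over a token sequence.

    h_t = a * h_{t-1} + b * x_t ;  y_t = <c, h_t>
    One pass over the sequence, hence linear in its length.
    """
    l, d = x.shape
    h = np.zeros(d)
    y = np.empty(l)
    for t in range(l):
        h = a * h + b * x[t]   # O(d) state update per token
        y[t] = c @ h           # scalar read-out
    return y

x = np.ones((8, 4))
y = ssm_scan(x, a=np.full(4, 0.5), b=np.ones(4), c=np.ones(4))
```

With decay a = 0.5 the state converges geometrically, so the read-out saturates instead of growing with sequence length; this bounded memory is what distinguishes SSMs from attention's quadratic all-pairs interaction.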
To enhance separation between object and background tokens, FSSM computes per-token compact spectral descriptors through local 2D discrete Fourier transforms (DFTs) over scene feature maps. These descriptors are injected into the SSM read-out, enabling modeling across both spatial and spectral domains. Dual-scope modeling considers both the global sequence and local windows.
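The per-token spectral descriptor can be sketched as a local windowed 2D DFT. The patch size, coefficient count, and low-frequency selection rule below are illustrative assumptions; the paper specifies only that compact descriptors are computed from local 2D DFTs over scene feature maps.

```python
import numpy as np

def spectral_descriptor(feature_map, center, patch=8, n_coeffs=2):
    """Compact spectral descriptor for one token: magnitudes of the
    lowest-frequency coefficients of a local 2D DFT around the token's
    grid position (patch and coefficient sizes are hypothetical).
    """
    r, c = center
    h = patch // 2
    window = feature_map[r - h:r + h, c - h:c + h]
    spec = np.abs(np.fft.fft2(window))
    # keep the top-left (low-frequency) block as the descriptor
    return spec[:n_coeffs, :n_coeffs].ravel()

fmap = np.random.default_rng(1).normal(size=(64, 64))
desc = spectral_descriptor(fmap, center=(32, 32))  # 4-d descriptor
```

Object regions tend to concentrate energy differently across frequencies than diffuse background, so injecting these magnitudes into the SSM read-out gives the model a cue that is hard to recover from spatial features alone.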
2.3. Neighbor-to-Ego Alignment (NEA)
Upon receipt, neighboring agents’ tokens undergo closed-loop alignment. A global agent-level correction is computed by comparing fused and per-agent scene features. Neighbor coordinates are transformed into the ego frame using the predicted correction, followed by token-level refinement: residual offsets predicted via lightweight MLPs and further compensated according to local statistics. The process is supervised with a mean-squared-error alignment loss.
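The coordinate-transform step of this alignment can be sketched in 2D. The rigid correction below stands in for the predicted agent-level pose fix; the token-level MLP refinement is mocked as a zero residual, so this is a structural sketch rather than CoPLOT's NEA module.

```python
import numpy as np

def apply_pose_correction(points, dtheta, dt):
    """Rotate and translate neighbor token coordinates by a predicted
    agent-level correction (dtheta, dt); 2D sketch of the NEA transform step.
    """
    c, s = np.cos(dtheta), np.sin(dtheta)
    R = np.array([[c, -s],
                  [s,  c]])
    corrected = points @ R.T + dt
    residual = np.zeros_like(corrected)  # placeholder for the MLP-predicted offsets
    return corrected + residual

pts = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = apply_pose_correction(pts, dtheta=np.pi / 2, dt=np.array([1.0, 0.0]))
```

Because the global correction handles the dominant rigid misalignment, the learned residuals only need to absorb small, spatially varying errors, which keeps the refinement MLPs lightweight.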
Complete Pipeline Summary
The full pipeline for each agent i can be summarized as follows:
- Tokenization, encoding, and STR/FSSM-based sequence processing.
- Formation and broadcast of messages M_i = {F_i, P_i, ξ_i}.
- Reception, NEA-based alignment, and aggregation of neighbor information.
- Fusion using stacked FSSM blocks.
- Task head processing, typically for 3D object detection.
3. Algorithmic Structure
The process is implemented as an iterative encoding and fusion pipeline, integrating STR and FSSM throughout. The following high-level pseudocode characterizes the workflow:
```
X_i = tokenize(O_i)          # grid-sample to l×d tokens
F = X_i
for t in range(T):
    F = STR(F)               # semantic grouping, reordering
    F = FSSM(F)              # dual-scope SSM + frequency cues
select top-k tokens by semantic scores → F_i
M_i = {F_i, P_i, ξ_i}
G_i = []
for each neighbor j ≠ i:
    receive M_j = {F_j, P_j, ξ_j}
    Δξ_j = Φ_mis(...)
    aligned_P_j = align(P_j, ξ_i, ξ_j, Δξ_j)
    refined_P_j = NEA(aligned_P_j, ...)
    G_i.append(concat(F_j, refined_P_j))
G_i = concat(F_i, G_i)
H_i = FSSM_blocks(G_i)
Ŷ_i = TaskNet(H_i)
```
Hyperparameters include the per-agent token count, the scene/grouping dimensions, the top-k selection threshold, four SDSSB layers, and dual-scope sequence modeling.
4. Empirical Protocol
Evaluation utilizes three benchmark datasets spanning simulation and real-world settings:
- OPV2V: Simulated LiDAR, 12,000 frames, 230,000 annotations.
- V2V4Real: Real LiDAR, 20,000 frames, 240,000 annotations.
- DAIR-V2X: Real, 71,000 frames combining vehicle and RSU sensors.
Primary metrics include 3D detection Average Precision (AP) at IoU thresholds 0.5 and 0.7, communication bandwidth (bytes/frame), and computational cost (GFLOPs). Baselines for comparison are AttFuse, V2VNet, V2X-ViT, CoBEVT, Where2comm, SiCP, CoSDH, CollaMamba, and CoMamba.
Training employs PyTorch, Adam optimizer (learning rate 1e-3 with decay), 40 epochs, focal loss for classification, and smooth L1 for regression.
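The two loss terms named above are standard and can be written out in a few lines. These are generic numpy reference implementations under common default hyperparameters (alpha, gamma, beta are assumptions), not the exact loss code used in the paper's PyTorch setup.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss on predicted probabilities p and labels y in {0, 1}."""
    pt = np.where(y == 1, p, 1 - p)             # probability of the true class
    a = np.where(y == 1, alpha, 1 - alpha)      # class-balancing weight
    return float(np.mean(-a * (1 - pt) ** gamma * np.log(pt + 1e-12)))

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) regression loss used for box parameters."""
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)))

cls_loss = focal_loss(np.array([0.9, 0.2]), np.array([1, 0]))
reg_loss = smooth_l1(np.array([0.5, 3.0]), np.array([0.0, 0.0]))
```

Focal loss down-weights easy examples via the (1 - pt)^gamma factor, which matters here because the vast majority of point tokens are background; smooth L1 keeps box regression robust to outlier targets.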
5. Comparative Results and Resource Analysis
CoPLOT consistently surpasses prior models in accuracy and efficiency. For OPV2V, V2V4Real, and DAIR-V2X, the AP@0.5/0.7 scores are as follows:
| Method | OPV2V | V2V4Real | DAIR-V2X |
|---|---|---|---|
| CoMamba | 0.947/0.892 | 0.582/0.372 | 0.662/0.442 |
| CoPLOT | 0.973/0.934 | 0.644/0.447 | 0.761/0.593 |
Communication and computation statistics demonstrate substantial efficiency:
- Communication (OPV2V, bits per frame, AP@0.5): CoBEVT: B → 0.957 | Where2comm: B → 0.947 | CoPLOT: B → 0.973
- Compute (OPV2V, GFLOPs, AP@0.5): V2X-ViT: 545 → 0.963 | CollaMamba: 198 → 0.947 | CoPLOT: 176 → 0.973
Robustness under Gaussian pose noise (standard deviation up to 1 m) shows CoPLOT’s AP dropping by less than 2 points; ablating NEA increases the drop to more than 5 points. Ablation studies indicate major performance degradation when removing STR (−6.3 pp), frequency enhancement (−11.9 pp), or the scene prompt (−7.8 pp).
6. Limitations and Implications
Point-level tokens preserve full 3D structure (height, edges), eliminating BEV’s artifact-induced limitations and supporting finer localization. Linear-time SSMs with frequency cues allow long-sequence modeling at cost linear in sequence length, avoiding the quadratic scaling of transformers. Dynamic top-k filtering restricts resources to approximately 1,000 object tokens per agent, directly accounting for the observed bandwidth and computation reductions. Overhead scales linearly with the sum of per-agent top-k token counts.
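A back-of-the-envelope message-size calculation shows why roughly 1,000 tokens per agent keeps bandwidth low. The feature width and fp16 payload below are assumptions for illustration; the paper does not state these values here.

```python
# Hypothetical per-agent message size for ~1,000 retained point tokens.
tokens = 1000
feat_dim = 64          # assumed token feature width
coords = 3             # x, y, z position per token
bytes_per_val = 2      # assumed fp16 payload
payload = tokens * (feat_dim + coords) * bytes_per_val
print(payload)         # total bytes per agent message
```

Under these assumptions each message is on the order of 10^5 bytes, and total overhead grows only with the sum of retained tokens across agents rather than with a fixed dense grid per agent.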
Limitations include the use of a static top-k threshold (lacking scene adaptivity), reliance on a 2D convolutional detection head (suboptimal for direct 3D bounding box proposals), and restriction to LiDAR-only input. Planned extensions involve joint communication–computation–accuracy trade-off optimization, adaptive top-k selection, and integration of multi-modal inputs (e.g., camera + LiDAR) (Li et al., 27 Aug 2025).
In summary, CoPLOT employs semantically filtered, frequency-aware point-level tokens and explicit closed-loop alignment to achieve state-of-the-art collaborative 3D detection accuracy and resource efficiency, establishing a new paradigm beyond BEV for collaborative perception.