Collaborative Perceiver (CoP): 3D Detection & Prediction

Updated 4 July 2026

The paper introduces a vision-based multi-task model combining 3D detection and occupancy prediction with refined local density-aware supervision and voxel-height-guided sampling.
It employs collaborative feature fusion to integrate global BEV context with local height-aware details, achieving higher accuracy (e.g., 49.5% mAP on nuScenes).
CoP also extends to a conceptual framework for multi-agent joint perception and prediction via V2X-enabled collaborative scene completion.

Collaborative Perceiver (CoP) denotes two closely related concepts in recent autonomous-driving literature. In one usage, it refers to a vision-based multi-task framework that jointly performs bird’s-eye-view (BEV) 3D object detection and 3D semantic occupancy prediction by leveraging local density-aware occupancy supervision, voxel-height-guided sampling, and global-local collaborative feature fusion (Yuan et al., 28 Jul 2025). In another, more conceptual usage, it names an agent-centric architecture implied by the framework for Collaborative Joint Perception and Prediction (Co-P&P), in which Vehicle-to-Everything (V2X) communication supports Collaborative Scene Completion (CSC), and a downstream joint Perception and Prediction (P&P) module infers both current scene state and future agent motion from the completed representation (Wan et al., 27 Jan 2025). Taken together, these usages situate CoP within the broader trajectory from collaborative perception toward modular, communication-aware, and prediction-capable multi-agent scene understanding (Ren et al., 2022).

1. Conceptual scope and definitions

The broader foundation for CoP is collaborative perception, also called cooperative perception, in which multiple agents such as vehicles and infrastructure share perception-related information so that every agent can perceive beyond its own line-of-sight and field-of-view (Ren et al., 2022). In formal terms, collaborative perception introduces a communication-and-fusion step where agents share some representation $m_i$ derived from observations $x_i$ or intermediate features $h_i$, and each agent produces a refined output

$y_i^{\text{collab}} = F_i\big(x_i, \{m_j\}_{j \in \mathcal{N}_i}\big).$

This literature frames the participating vehicles and infrastructure as a distributed sensor array, with V2V, V2I, and more generally V2X communication providing the transport layer for collective perception messages (Ren et al., 2022).
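The communication-and-fusion step can be illustrated with a minimal sketch. It assumes each agent's message $m_j$ is simply its own feature vector and that the fusion function $F_i$ is an element-wise mean; both choices, and all names below, are hypothetical stand-ins rather than anything prescribed by the cited survey.

```python
from typing import Dict, List

Feature = List[float]


def fuse_mean(own: Feature, messages: List[Feature]) -> Feature:
    """Element-wise mean of the ego feature and all received messages."""
    stacked = [own] + messages
    return [sum(values) / len(stacked) for values in zip(*stacked)]


def collaborate(
    features: Dict[str, Feature], neighbors: Dict[str, List[str]]
) -> Dict[str, Feature]:
    """Compute y_i = F_i(x_i, {m_j : j in N_i}) for every agent i.

    Assumption: the message m_j is agent j's raw feature vector.
    """
    return {
        agent: fuse_mean(own, [features[j] for j in neighbors.get(agent, [])])
        for agent, own in features.items()
    }


# Three hypothetical agents; the roadside unit receives nothing.
features = {"car_a": [1.0, 0.0], "car_b": [0.0, 1.0], "rsu": [1.0, 1.0]}
neighbors = {"car_a": ["car_b", "rsu"], "car_b": ["car_a"], "rsu": []}
fused = collaborate(features, neighbors)
```

An agent with no neighbors keeps its own feature unchanged, which mirrors the single-vehicle case as the limit of no communication.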

Within this setting, CoP acquires a more specific meaning in the 2025 vision-based paper "Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy" (Yuan et al., 28 Jul 2025). There, CoP is a vision-based multi-view BEV 3D object detector trained jointly with a 3D semantic occupancy prediction task. Its inputs are surround-view RGB images, and its outputs are 3D bounding boxes for traffic objects together with a voxelized 3D semantic occupancy grid over the driving area (Yuan et al., 28 Jul 2025).

A second and more architectural meaning is derived from the Co-P&P framework introduced in "The Components of Collaborative Joint Perception and Prediction -- A Conceptual Framework" (Wan et al., 27 Jan 2025). That paper states that a “Collaborative Perceiver (CoP)” would be an agent that collaboratively acquires a completed representation of the scene through CSC and jointly infers perception and prediction through P&P. It defines this derived notion as an agent-centric system that: first, uses V2X to collaboratively reconstruct a complete, consistent representation of the surrounding environment; and second, from that representation, jointly infers the state and future motion of all relevant agents and the scene, in a way that is modular, scalable, and robust to occlusions, limited FOV, and communication constraints (Wan et al., 27 Jan 2025).

These two usages are not identical. The vision-based CoP of (Yuan et al., 28 Jul 2025) is a concrete multi-task detector-occupancy model. The CoP implied by (Wan et al., 27 Jan 2025) is a higher-level design pattern for collaborative autonomous-driving systems. A plausible implication is that the same name now spans both a specific camera-only architecture and a more general systems concept linking collaboration, scene completion, and downstream inference.

2. Historical and methodological background

Collaborative perception emerged to address core limitations of single-vehicle sensing: occlusion, limited FoV, long-range sparsity, missing modalities, and cost (Ren et al., 2022). The survey literature organizes collaboration according to where it occurs in the perception pipeline: early collaboration shares raw sensor data, intermediate collaboration shares encoder features, late collaboration shares final predictions, and mixed collaboration combines multiple levels (Ren et al., 2022). For CoP, the intermediate regime is especially relevant because it preserves rich spatial and semantic information while remaining more communication-efficient than raw-data exchange.

The survey explicitly identifies feature-level collaboration as the natural regime for a “Collaborative Perceiver” architecture, where agents encode local observations into latent tokens, communicate a subset, and aggregate them by a shared Perceiver-like fusion module (Ren et al., 2022). Representative feature-level systems include F-Cooper, V2VNet, and DiscoNet, while Who2Com and When2Com introduce attention-based policies for deciding who should communicate with whom and when communication should occur (Ren et al., 2022). These methods supply the algorithmic context for later CoP formulations: graph-based fusion, attention-based aggregation, pose-aware alignment, and communication selection all recur in subsequent discussions of collaborative perception and collaborative prediction.

The conceptual Co-P&P paper extends this background by arguing that collaboration should not terminate at detection or segmentation (Wan et al., 27 Jan 2025). Classical autonomous-driving pipelines pass discrete outputs through detection, tracking, prediction, and planning, so perception errors compound across stages. End-to-end perception-and-prediction systems reduce this cumulative error but still operate on ego-only observations. Co-P&P therefore proposes a joint, end-to-end treatment of perception and motion prediction under collaboration, with CSC used upstream to mitigate occlusion and fill missing context before a joint P&P module operates on the completed scene (Wan et al., 27 Jan 2025).

By contrast, the vision-based CoP paper does not address V2X collaboration between multiple physical agents. Its “collaborative” aspect is intra-model and cross-task: the framework mines consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, and fuses global BEV features with local height-aware features to obtain more robust representations (Yuan et al., 28 Jul 2025). This terminological divergence can be misleading. CoP in (Yuan et al., 28 Jul 2025) is collaborative across tasks and feature streams; CoP in (Wan et al., 27 Jan 2025) is collaborative across vehicles, infrastructure, and downstream prediction objectives.

3. Vision-based CoP: architecture and learning pipeline

The concrete CoP framework in (Yuan et al., 28 Jul 2025) is a multi-task learning system for camera-only BEV perception. Standard BEV pipelines such as BEVDet, BEVDepth, BEVDet4D, and BEVFormer extract 2D image features, lift them to 3D frustum or voxel space, collapse along the height axis, and run detection heads in BEV. According to the paper, this height-collapsing causes two principal deficiencies: loss of vertical structure and fine geometry, and loss of intrinsic environmental context such as roads, curbs, sidewalks, islands, medians, building facades, poles, vegetation, and free space versus occupied volume (Yuan et al., 28 Jul 2025). CoP addresses these deficiencies by introducing occupancy prediction as an auxiliary task and by explicitly recovering height-aware local structure.

The high-level pipeline proceeds from multi-camera images

$I \in \mathbb{R}^{N \times H_I \times W_I \times 3}$

through a ResNet + FPN image encoder, producing multi-scale multi-view features $f_i \in \mathbb{R}^{N \times C_i \times H_i \times W_i}$ (Yuan et al., 28 Jul 2025). A Lift-Splat-Shoot view transformation then uses a depth branch and a context branch to generate 3D voxel features

$f_v = \mathcal{F}_{lss}(f_c, f_d, \mathcal{K}^{\text{lidar}}_{\text{cam}}),$

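where $f_c$ and $f_d$ are the features from the context and depth branches, and $\mathcal{K}^{\text{lidar}}_{\text{cam}}$ denotes the camera-to-LiDAR calibration, that is, the extrinsics and intrinsics (Yuan et al., 28 Jul 2025). Collapsing these voxel features along height yields a standard global BEV representation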

$f_g \in \mathbb{R}^{C' \times H' \times W'}.$
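A minimal sketch of the height collapse may help make the information loss concrete. It assumes voxel features are stored as nested lists indexed `[channel][z][y][x]` and that the collapse is a sum over the height axis; the reduction actually used by a given BEV pipeline may differ (for example, flattening height into channels), so the function below is illustrative only.

```python
def collapse_height(voxels):
    """Reduce voxels[c][z][y][x] to bev[c][y][x] by summing over z.

    Assumption: sum pooling stands in for whatever reduction a real
    pipeline applies along the height axis.
    """
    return [
        # zip(*channel) groups the rows that share a y index across all z;
        # zip(*rows) then groups the values that share an x index.
        [[sum(column) for column in zip(*rows)] for rows in zip(*channel)]
        for channel in voxels
    ]


# One channel, two height layers, a 2 x 2 ground-plane grid.
voxels = [[[[1, 2], [3, 4]], [[10, 20], [30, 40]]]]
bev = collapse_height(voxels)
```

After the collapse, the two height layers can no longer be told apart, which is the vertical-structure loss the paper sets out to address.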

The distinctive components of CoP are then introduced. First, voxel-height-guided sampling defines multiple height intervals of interest and pools only the relevant voxels within each interval to obtain a set of local height-aware maps $f_{l_i}$, which are aggregated by Squeeze-and-Excitation style attention into a single local feature map, written $f_l$ here.

Second, a global-local collaborative feature fusion module adaptively combines $f_g$ and $f_l$ to produce a unified BEV feature.

Third, a channel-to-height plugin elevates BEV channels back into the height dimension, so that a detection head can operate on the unified BEV feature and an occupancy head can operate on the resulting voxel feature (Yuan et al., 28 Jul 2025).
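The channel-to-height step can be sketched as a pure reshape. The example assumes a BEV feature stored as `bev[k][y][x]` whose channel index factors as `k = c * Z + z`, so that splitting the channel axis recovers `Z` height layers per output channel; the index ordering and the function name are assumptions for illustration, not the paper's implementation.

```python
def channel_to_height(bev, num_heights):
    """Split bev[c * Z + z][y][x] into voxels[c][z][y][x] without new parameters."""
    if len(bev) % num_heights != 0:
        raise ValueError("channel count must be divisible by num_heights")
    num_channels = len(bev) // num_heights
    return [
        [bev[c * num_heights + z] for z in range(num_heights)]
        for c in range(num_channels)
    ]


# Four BEV channels over a 1 x 1 grid become 2 channels x 2 height layers.
bev = [[[0.0]], [[1.0]], [[2.0]], [[3.0]]]
voxels = channel_to_height(bev, num_heights=2)
```

Because the operation only regroups existing channels, it adds no learnable weights; any capacity for height reasoning has to come from the layers that produce the BEV feature.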

The joint training objective combines the detection and occupancy losses through a task-balancing weight, and the occupancy loss weights each voxel's contribution by a per-voxel local density weight from local density-aware occupancy (Yuan et al., 28 Jul 2025). The detection branch is built on BEVDet4D with BEVDepth-style depth supervision and temporal fusion, while the occupancy branch performs 3D semantic segmentation over the voxel grid (Yuan et al., 28 Jul 2025).
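A density-weighted multi-task objective of this kind can be sketched as follows. The sketch assumes the occupancy term is a cross-entropy averaged over voxels with each voxel scaled by its density weight, and that the task-balancing weight `lam` multiplies the occupancy term; the exact loss in the paper is not reproduced here, and all names are hypothetical.

```python
import math


def weighted_occupancy_loss(probs, labels, density_weights):
    """Density-weighted cross-entropy, averaged over voxels.

    probs[v][k] is the predicted probability of class k at voxel v,
    labels[v] is the ground-truth class, density_weights[v] scales voxel v.
    """
    total = 0.0
    for p, y, w in zip(probs, labels, density_weights):
        total += -w * math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
    return total / len(labels)


def joint_loss(det_loss, occ_loss, lam=1.0):
    """Assumed form: detection loss plus lam times occupancy loss."""
    return det_loss + lam * occ_loss


probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
density_weights = [1.0, 2.0]  # the second voxel counts twice as much
occ = weighted_occupancy_loss(probs, labels, density_weights)
total = joint_loss(det_loss=1.0, occ_loss=occ, lam=0.5)
```

With uniform weights the occupancy term reduces to ordinary cross-entropy, so the density weights can be read as a per-voxel reweighting of a standard segmentation loss.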

The paper reports nuScenes test-set mAP and NDS for CoP, with 49.5% mAP as the headline figure, exceeding BEVFormer, PETRv2, X3KD, and SOGDet in the comparison presented there (Yuan et al., 28 Jul 2025). On the nuScenes validation set with a ResNet-50 backbone and lower input resolution, the paper again reports mAP and NDS for CoP, outperforming BEVDet4D, BEVDepth, AeDet, Dual-BEV, and IA-BEV in the reported table (Yuan et al., 28 Jul 2025).

4. Local Density-aware Occupancy, VHS, and collaborative fusion

A central novelty of the vision-based CoP framework is Local Density-aware Occupancy (LDO) (Yuan et al., 28 Jul 2025). Conventional LiDAR-based occupancy ground truth is typically formed by aggregating sparse LiDAR points, voxelizing them, and marking voxels with any point as occupied. The paper identifies two resulting problems: sparse and non-uniform point density, and the assumption of homogeneous density within occupied voxels, which discards local density variations that correlate with object surfaces, shape fidelity, and visibility (Yuan et al., 28 Jul 2025).

LDO addresses this by generating dense occupancy ground truth from multi-frame LiDAR. Static points are aggregated in world coordinates, while dynamic points are aggregated per object across frames and then mapped back to the target LiDAR coordinate system to form a dense point cloud (Yuan et al., 28 Jul 2025). For each dynamic object, a local density factor is defined per voxel, and these factors are assembled into a global local-density matrix (Yuan et al., 28 Jul 2025). The resulting occupancy tensor contains semantic or occupied-state labels together with density weights. The paper states that this representation encodes surface detail and object shape; the local structure of fine objects such as poles, traffic cones, and pedestrians; environmental structure such as roads, pavements, sidewalks, curbs, and buildings; and implicit visibility and reliability (Yuan et al., 28 Jul 2025).
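The idea of a per-voxel density factor can be illustrated with a small sketch. It assumes the factor is the number of aggregated points in a voxel divided by the largest per-voxel count for the same object; this normalization is a guess made for illustration and should not be read as the paper's definition.

```python
import math
from collections import Counter


def voxel_index(point, voxel_size):
    """Map an (x, y, z) point to the integer index of its voxel."""
    return tuple(int(math.floor(coord / voxel_size)) for coord in point)


def local_density(points, voxel_size):
    """Return {voxel index: density factor in (0, 1]} for one object.

    Assumption: density = point count in the voxel / peak count for the object.
    """
    counts = Counter(voxel_index(p, voxel_size) for p in points)
    if not counts:
        return {}
    peak = max(counts.values())
    return {voxel: n / peak for voxel, n in counts.items()}


# Hypothetical aggregated points for a single object.
points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (0.9, 0.1, 0.1), (1.5, 0.2, 0.2)]
density = local_density(points, voxel_size=1.0)
```

Under this assumption a binary occupancy label would treat both voxels identically, whereas the density factor keeps the information that one of them is supported by three times as many points.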

Voxel-Height-Guided Sampling (VHS) then uses the height distribution of occupied voxels in LDO ground truth, rather than raw LiDAR, as a prior for selecting semantically meaningful height ranges (Yuan et al., 28 Jul 2025). The paper’s example intervals for CoP-Base are organized into Base Layer, Universal Layer, and Extended Focus Layer ranges, each specified in meters. After interval-specific pooling and attention-based aggregation, VHS yields a height-sensitive local feature that complements the height-flattened global BEV representation (Yuan et al., 28 Jul 2025).
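Interval-specific pooling followed by attention-style aggregation can be sketched on a single-channel voxel grid. The intervals below are hypothetical voxel-index ranges, not the paper's metric ranges, and the softmax over squeezed statistics is a parameter-free stand-in for the learned Squeeze-and-Excitation layers.

```python
import math


def pool_interval(voxels, z_lo, z_hi):
    """Average voxels[z][y][x] over z in [z_lo, z_hi) to get map[y][x]."""
    layers = voxels[z_lo:z_hi]
    if not layers:
        raise ValueError("height interval selects no voxel layers")
    return [[sum(col) / len(layers) for col in zip(*rows)] for rows in zip(*layers)]


def squeeze(fmap):
    """Global average of a 2D map (the 'squeeze' statistic)."""
    cells = [value for row in fmap for value in row]
    return sum(cells) / len(cells)


def aggregate(maps):
    """Blend interval maps with softmax weights over their squeezed statistics."""
    scores = [squeeze(m) for m in maps]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # shifted for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        [sum(w * m[y][x] for w, m in zip(weights, maps)) for x in range(len(maps[0][0]))]
        for y in range(len(maps[0]))
    ]
    return fused, weights


# Four height layers over a 1 x 2 ground-plane grid; two hypothetical intervals.
voxels = [[[0.0, 0.0]], [[2.0, 4.0]], [[4.0, 8.0]], [[0.0, 0.0]]]
intervals = [(0, 2), (2, 4)]
local_maps = [pool_interval(voxels, lo, hi) for lo, hi in intervals]
local_feature, weights = aggregate(local_maps)
```

Unlike a full height collapse, each interval map keeps its own response, and the aggregation weights decide how much each height band contributes to the local feature.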

The global-local collaborative feature fusion module combines these two streams in three steps: it first derives contextual summaries of the global and local features, then computes adaptive fusion weights from those summaries, and finally produces the unified feature from the weighted streams. This construction lets some channels be dominated by global BEV cues and others by local height-aware cues (Yuan et al., 28 Jul 2025).
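A channel-wise gated fusion shows how individual channels can lean toward one stream or the other. The sketch assumes the contextual summary is a per-channel global average and that the fusion weight is a sigmoid of the difference between the global and local summaries; a learned module would replace this fixed rule, so the code is an illustration under those assumptions rather than the paper's fusion block.

```python
import math


def channel_mean(fmap):
    """Per-channel contextual summary: global average over the 2D map."""
    cells = [value for row in fmap for value in row]
    return sum(cells) / len(cells)


def fuse_global_local(f_g, f_l):
    """Blend f_g[c][y][x] and f_l[c][y][x] with one gate per channel.

    Assumption: gate = sigmoid(mean(global) - mean(local)); no learned weights.
    """
    fused, gates = [], []
    for g, l in zip(f_g, f_l):
        gate = 1.0 / (1.0 + math.exp(-(channel_mean(g) - channel_mean(l))))
        gates.append(gate)
        fused.append(
            [
                [gate * gv + (1.0 - gate) * lv for gv, lv in zip(g_row, l_row)]
                for g_row, l_row in zip(g, l)
            ]
        )
    return fused, gates


# Two channels over a 1 x 2 grid: channel 0 agrees, channel 1 favors local.
f_g = [[[1.0, 1.0]], [[0.0, 0.0]]]
f_l = [[[1.0, 1.0]], [[2.0, 2.0]]]
unified, gates = fuse_global_local(f_g, f_l)
```

Because the gate is computed per channel, one channel can be an even blend while another is dominated by the local stream.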

The ablation study reported in (Yuan et al., 28 Jul 2025) attributes measurable improvements to each stage. Moving from sparse occupancy to dense occupancy improves mAP, adding LDO increases it further, adding VHS increases it again, and adding CFF yields the final reported mAP and NDS. The paper also reports that LDO-guided multi-layer height sampling with BL + UL + EFL performs best among the tested sampling strategies, and that applying LDO weighting to TPVFormer and SurroundOcc improves their occupancy metrics without extra inference cost (Yuan et al., 28 Jul 2025).

These results indicate that CoP’s notion of collaboration is feature-structural rather than networked: global BEV context, local height-aware detail, detection supervision, and occupancy supervision are integrated into a common representational space. This suggests that the name “Collaborative Perceiver” in (Yuan et al., 28 Jul 2025) emphasizes collaboration between complementary geometric abstractions and tasks.

5. CoP as collaborative scene completion plus joint prediction

The Co-P&P framework in (Wan et al., 27 Jan 2025) provides a different route to CoP. The paper introduces Collaborative Joint Perception and Prediction as a new task for connected autonomous vehicles and infrastructure. Its objective is to jointly perform object detection and motion prediction directly from sensor data and map or traffic context, while using V2X collaboration to fill occlusions and missing context through Collaborative Scene Completion before joint Perception and Prediction (Wan et al., 27 Jan 2025).

The architecture is explicitly organized as a two-core-module design. CSC is task-agnostic and collaboration-centric. It receives ego LiDAR, communicated latent features from other agents’ LiDAR encoders, and the poses of all agents, and outputs a completed LiDAR frame or intermediate representation in ego coordinates (Wan et al., 27 Jan 2025). Conceptually, CSC is a function of these three inputs, in which each partner feature is transformed to the ego frame by a pose-derived transform before fusion (Wan et al., 27 Jan 2025). The paper states that remote agents send intermediate latent features rather than raw point clouds to save bandwidth, and that the ego vehicle plus other vehicles or roadside units reconstruct a completed LiDAR frame or scene representation (Wan et al., 27 Jan 2025).
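The pose-alignment step can be sketched in two dimensions. The example assumes each agent reports a planar pose `(x, y, yaw)` in a shared world frame, represents "features" as scalar values attached to 2D cell centers in the agent's own frame, and fuses overlapping cells by taking the maximum; real systems exchange learned latent features, so every name and the fusion rule here are hypothetical.

```python
import math


def to_world(pose, point):
    """Rotate and translate a point from an agent's frame into the world frame."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * point[0] - s * point[1], y + s * point[0] + c * point[1])


def to_local(pose, point):
    """Inverse of to_world: express a world point in the agent's frame."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    dx, dy = point[0] - x, point[1] - y
    return (c * dx + s * dy, -s * dx + c * dy)


def complete_scene(ego_pose, ego_cells, partners, cell_size=1.0):
    """Fuse ego and partner cells into one grid in ego coordinates (max fusion)."""
    grid = {}

    def splat(point, value):
        key = (round(point[0] / cell_size), round(point[1] / cell_size))
        grid[key] = max(grid.get(key, value), value)

    for point, value in ego_cells.items():
        splat(point, value)
    for pose, cells in partners:
        for point, value in cells.items():
            splat(to_local(ego_pose, to_world(pose, point)), value)
    return grid


ego_pose = (0.0, 0.0, 0.0)
ego_cells = {(2.0, 0.0): 0.3}
# A partner 10 m ahead, facing back toward the ego vehicle.
partner = ((10.0, 0.0, math.pi), {(4.0, 0.0): 0.9, (8.0, 0.0): 0.5})
completed = complete_scene(ego_pose, ego_cells, [partner])
```

The partner contributes a cell the ego did not observe and strengthens one it did, which is the completion effect CSC is meant to provide before any task-specific module runs.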

The downstream P&P module is task-specific but resembles a single-vehicle end-to-end model operating on completed scenes (Wan et al., 27 Jan 2025). It takes the completed LiDAR scene, ego pose, HD map, and traffic light information. Internally it comprises a LiDAR encoder, a temporal encoder using Transformer-style attention, a map encoder, a multi-agent interaction attention stage for map-to-agent and agent-to-agent reasoning, and a decoder that outputs a BEV detection map and a BEV flow field (Wan et al., 27 Jan 2025). The detection output includes map masks and object masks, while the prediction output encodes future motion as BEV flow or trajectories. The paper also references forecasting metrics and writes a combined objective for the two tasks (Wan et al., 27 Jan 2025).

The conceptual contribution is not a benchmarked system but a reframing of the pipeline. Instead of detector $\rightarrow$ tracker $\rightarrow$ predictor, CSC supplies a completed scene to a single neural module that jointly optimizes perception and prediction (Wan et al., 27 Jan 2025). The paper argues that this reduces cumulative errors, allows prediction to operate on rich latent features rather than discrete detections, and exposes visually occluded agents to the forecasting module through collaboration (Wan et al., 27 Jan 2025). It further proposes that a future CoP architecture should include a V2X interface, pose and time alignment, multi-agent feature fusion, domain-invariant intermediate representations, temporal backbones, map encoders, interaction modules, multitask decoders, and resource-aware collaboration triggers (Wan et al., 27 Jan 2025).

Relative to the broader survey, this formulation can be read as an extension of collaborative perception from perception-only outputs to joint scene understanding and forecasting (Ren et al., 2022). A plausible implication is that CoP in this sense functions less as a single network architecture and more as a systems abstraction: collaborative scene completion as a service layer, joint perception-prediction as the consumer, and planning and control downstream.

6. Evaluation, limitations, and research directions

The two CoP lines differ substantially in evaluation maturity. The vision-based CoP in (Yuan et al., 28 Jul 2025) provides experimental results on nuScenes, including quantitative benchmarks, ablations, and qualitative visualizations. The reported test-set results cover both NDS and mAP, with 49.5% mAP as the headline figure, and the paper attributes gains to dense occupancy, LDO, VHS, and CFF (Yuan et al., 28 Jul 2025). Qualitative analyses indicate cleaner and more continuous road surfaces and sidewalks in occupancy, better recovery of vertical structure, tighter 3D boxes with headings aligned to lanes and road directions, and fewer false positives and ghost boxes compared with SOGDet (Yuan et al., 28 Jul 2025).

Its limitations are correspondingly concrete. The paper notes computational cost and memory overhead from occupancy prediction and LDO ground-truth generation, dependence on dense occupancy ground truth constructed offline from multi-frame LiDAR and detailed annotations, and limited study of generalization beyond nuScenes or alternative sensor configurations (Yuan et al., 28 Jul 2025). The suggested directions are more efficient occupancy modeling, extension to multi-modal settings such as LiDAR + camera and radar, better handling of dynamic scenes and temporal consistency in occupancy, and domain adaptation or robustness across environments (Yuan et al., 28 Jul 2025).

By contrast, the Co-P&P-derived CoP remains conceptual and deliberately leaves several issues open (Wan et al., 27 Jan 2025). The paper identifies localization errors, asynchronous observations, domain shift from heterogeneous sensors, dependence on large-scale labeled datasets, communication constraints, and privacy and security as central challenges. It recommends pose refinement, differentiable alignment inside CSC, uncertainty-aware fusion, spatio-temporal scene completion, motion-aware fusion, explicit time-stamped representations, unified intermediate representations that abstract away raw sensor specifics, and large-scale multi-agent P&P datasets with semi-supervised or self-supervised learning (Wan et al., 27 Jan 2025).

These concerns closely align with the collaborative-perception survey, which emphasizes bandwidth limits, latency, packet drops, security threats, localization errors, time synchronization, heterogeneity of sensors, lack of large-scale public datasets, and scalability with many agents (Ren et al., 2022). The survey further suggests that a robust CoP should handle partial, delayed, or missing tokens; use temporal encoding to account for latency; incorporate confidence or integrity scores for received latents; and rely on sparse, scalable fusion mechanisms rather than all-to-all communication (Ren et al., 2022).
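The robustness requirements listed above can be made concrete with a small message-selection sketch. It assumes each received message carries a timestamp and a confidence score, and it drops messages that are too old, time-stamped in the future, or below a confidence threshold before fusion; the field names and threshold values are hypothetical and are not taken from the survey.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    sender: str
    timestamp: float      # seconds, in a clock shared by all agents (assumed)
    confidence: float     # integrity or confidence score in [0, 1] (assumed)
    payload: List[float]  # communicated latent feature


def select_messages(messages, now, max_age=0.2, min_confidence=0.5):
    """Keep fresh, trusted messages and order them newest first."""
    kept = [
        m
        for m in messages
        if 0.0 <= now - m.timestamp <= max_age and m.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda m: m.timestamp, reverse=True)


messages = [
    Message("car_b", 9.95, 0.9, [0.1]),
    Message("rsu", 9.5, 0.95, [0.2]),   # too old
    Message("car_c", 9.9, 0.2, [0.3]),  # low confidence
    Message("car_d", 9.9, 0.8, [0.4]),
]
selected = select_messages(messages, now=10.0)
```

If every message is filtered out, the list is empty and a fusion module can fall back to the ego-only feature, which is one way to tolerate missing tokens without special-casing the network.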

An important misconception is that CoP necessarily denotes V2X-based multi-agent collaboration. In the current literature, that is true for the conceptual line rooted in Co-P&P, but not for the camera-only architecture of (Yuan et al., 28 Jul 2025). Conversely, it would also be inaccurate to treat CoP solely as a detection-occupancy multi-task model, because (Wan et al., 27 Jan 2025) explicitly uses the term as a conceptual bridge toward collaborative joint perception and motion prediction.

7. Position within the literature

Within the camera-only BEV literature, CoP is positioned against methods such as BEVFormer, PETRv2, X3KD, SOGDet, BEVDet4D, BEVDepth, AeDet, Dual-BEV, and IA-BEV (Yuan et al., 28 Jul 2025). Its stated novelty lies in three coupled components: Local Density-aware Occupancy, Voxel-Height-Guided Sampling, and Global-Local Collaborative Feature Fusion. The paper differentiates itself from occupancy-focused methods such as TPVFormer, SurroundOcc, Occ3D, OccNeRF, and GaussianFormer by making occupancy a first-class auxiliary task tightly integrated with detection through shared features and collaborative fusion (Yuan et al., 28 Jul 2025). It also differentiates itself from BEV multi-task systems such as BEVFusion, M2BEV, Dual-BEV, and UniVision by using density-aware occupancy modeling and a more explicit global-local fusion mechanism (Yuan et al., 28 Jul 2025).

Within the collaborative-perception and multi-agent autonomy literature, the Co-P&P-derived CoP is positioned relative to F-Cooper, V2X-ViT, camera-based BEV fusion, Multi-Robot Scene Completion, CoRe, PnPNet, and ViP3D (Wan et al., 27 Jan 2025). Its main conceptual distinction is the decoupling of collaboration into CSC and the use of an end-to-end joint P&P module downstream, rather than a tightly coupled collaborative detector whose outputs are later consumed by a separate forecasting stage (Wan et al., 27 Jan 2025). The survey perspective further situates CoP among intermediate collaboration methods such as V2VNet and DiscoNet, attention-based communication policies such as Who2Com and When2Com, pose-robust methods, and MARL-based resource allocation strategies (Ren et al., 2022).

Taken together, the literature suggests two complementary trajectories under the CoP label. One trajectory seeks richer internal world models for camera-only BEV detection by coupling occupancy and detection supervision (Yuan et al., 28 Jul 2025). The other seeks system-level integration of collaboration, scene completion, perception, and prediction in connected autonomous vehicles (Wan et al., 27 Jan 2025). This suggests that “Collaborative Perceiver” is becoming a unifying term for architectures that replace narrow task heads and incomplete ego views with shared latent scene representations shaped by structural context, uncertainty, and complementary sources of information.