MamV2XCalib: LiDAR-Assisted Roadside Camera Calibration
- MamV2XCalib is a target-less calibration framework that uses vehicle-side LiDAR and multi-scale deep feature extraction to accurately estimate camera rotational parameters.
- It integrates a state space model with temporal fusion and a 4D correlation volume to robustly manage calibration across diverse traffic scenarios.
- Empirical evaluations reveal a significant reduction in rotational error compared to previous methods, demonstrating its scalability for real-world V2X deployments.
MamV2XCalib is a V2X-based, target-less infrastructure camera calibration framework that leverages vehicle-side LiDAR data, multi-scale deep feature extraction, and an efficient state space model to robustly and automatically estimate the rotational parameters of roadside cameras. It is specifically designed to achieve high-precision, scalable calibration in real-world V2X deployments without the need for manual targets, pre-prepared references, or road closures, and demonstrates strong performance across diverse and challenging traffic scenarios (Zhu et al., 31 Jul 2025).
1. Methodological Overview
MamV2XCalib performs infrastructure camera calibration by utilizing LiDAR point clouds collected from passing vehicles. The framework projects these point clouds, using an initial extrinsic guess (T_init), onto the image plane of the roadside camera. The primary objective is to estimate the rotational error (R_error) of the camera, exploiting the practical fact that hinge-mounted infrastructure cameras are subject to minimal translational displacement in operational conditions.
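As a concrete illustration of this projection step, the minimal sketch below maps vehicle-side LiDAR points into the roadside camera with an initial extrinsic guess and the camera intrinsics, producing the sparse depth map consumed by the network. The variable names (T_init, K) and the LiDAR-to-camera convention are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def project_lidar_to_depth(points, T_init, K, h, w):
    """Project vehicle-side LiDAR points (N x 3) into the roadside camera image
    plane using the initial extrinsic guess T_init (4x4, assumed LiDAR->camera)
    and intrinsics K (3x3), producing a sparse depth map. Illustrative sketch only."""
    pts_h = np.hstack([points[:, :3], np.ones((len(points), 1))])  # homogeneous coordinates
    cam = (T_init @ pts_h.T).T[:, :3]                              # points in the camera frame
    cam = cam[cam[:, 2] > 0]                                       # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                    # perspective division
    u, v, z = uv[:, 0].astype(int), uv[:, 1].astype(int), cam[:, 2]
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.zeros((h, w), dtype=np.float32)
    depth[v[valid], u[valid]] = z[valid]   # later points overwrite; a real pipeline would keep the minimum depth
    return depth
```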
The method operates in a fully target-less paradigm. No specific calibration targets, checkerboards, or prepared environments are required; instead, environmental features and objects naturally present in traffic scenarios serve as the reference, as sampled by LiDAR-equipped vehicles during regular operation.
A multi-branch, multi-scale deep feature extraction network processes (i) the infrastructure camera image, (ii) the projected LiDAR-derived depth map, and (iii) auxiliary context features. Feature hierarchies are built with a ResNet-18 backbone and a Feature Pyramid Network (FPN) so that both coarse structure and high-frequency detail are available for matching.
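A minimal sketch of such a pyramid extractor is shown below, built from a torchvision ResNet-18 with hand-rolled lateral and top-down FPN connections; the channel widths and the single-channel depth-map variant are assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ResNet18FPN(nn.Module):
    """ResNet-18 + FPN feature pyramid (illustrative sketch). Returns feature maps
    at 1/4, 1/8, 1/16 and 1/32 resolution with a shared channel width."""
    def __init__(self, out_channels=128, in_channels=3):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        if in_channels != 3:  # e.g. a single-channel projected LiDAR depth map
            backbone.conv1 = nn.Conv2d(in_channels, 64, 7, stride=2, padding=3, bias=False)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in (64, 128, 256, 512)])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in range(4)])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:          # bottom-up pathway
            x = stage(x)
            feats.append(x)
        pyramid = [self.lateral[-1](feats[-1])]
        for i in range(2, -1, -1):         # top-down pathway with lateral connections
            up = F.interpolate(pyramid[0], size=feats[i].shape[-2:], mode="nearest")
            pyramid.insert(0, self.lateral[i](feats[i]) + up)
        return [s(p) for s, p in zip(self.smooth, pyramid)]   # fine-to-coarse feature maps
```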
These features are used to construct a 4D correlation volume, a high-dimensional structure of shape h × w × h × w that encodes the inner-product similarity between every image pixel and every depth-map pixel across the multi-scale representations. An iterative GRU-based refinement module then updates a pixel-level calibration flow field, capturing correspondences that directly constrain the rotational offset between the modalities.
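The all-pairs similarity structure can be summarized in a few lines; the sketch below builds a RAFT-style correlation volume from one image feature map and one depth feature map at a single pyramid level (the normalization and exact memory layout are assumptions).

```python
import torch

def correlation_volume(f_img, f_depth):
    """All-pairs 4D correlation volume of shape (B, h, w, h, w): the scaled
    inner product between every image-feature pixel and every depth-feature pixel."""
    b, c, h, w = f_img.shape
    f1 = f_img.reshape(b, c, h * w)
    f2 = f_depth.reshape(b, c, h * w)
    corr = torch.einsum("bci,bcj->bij", f1, f2) / c ** 0.5   # dot products, scaled by feature dim
    return corr.reshape(b, h, w, h, w)
```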
2. State Space Model Integration
A distinguishing feature of MamV2XCalib is the use of a state space model based on the Mamba architecture for sequential modeling. After the calibration flow has been refined over several frames and iterations, the resulting spatiotemporal calibration maps (indexed by frame T, iteration i, height h, and width w) are split into patches and projected into a latent space by a convolutional encoder E(·). Spatial and temporal positional embeddings (p_s, p_t) are added to preserve order and patch semantics, a dedicated aggregation token z_add is prepended to the sequence, and the full sequence is processed by the bidirectional Mamba block (Mam).
This architecture enables efficient integration of information across both frames and iterative refinements, and it scales linearly with sequence length, a critical consideration for the long temporal windows typical of V2X scenarios.
Finally, the aggregated feature z_add is mapped through two fully connected layers to regress a quaternion representation of the camera’s rotational error.
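The following sketch outlines this aggregation pipeline under stated assumptions: the convolutional patch encoder E(·), positional embeddings p_s and p_t, the prepended z_add token, and the quaternion head follow the description above, but a bidirectional GRU stands in for the Mamba block and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class FlowSequenceAggregator(nn.Module):
    """Sketch of the sequence-aggregation stage: calibration flow maps from T frames
    and I refinement iterations are patch-embedded by a conv encoder E(.), given
    spatial/temporal positional embeddings p_s/p_t, prepended with the aggregation
    token z_add, fused by a sequence model, and regressed to a unit quaternion.
    The nn.GRU below is only a stand-in for the bidirectional Mamba block; all
    layer sizes are illustrative assumptions."""
    def __init__(self, d_model=256, patch=8, seq_len=16, n_patches=128):
        super().__init__()
        self.embed = nn.Conv2d(2, d_model, kernel_size=patch, stride=patch)       # E(.)
        self.pos_spatial = nn.Parameter(torch.zeros(1, 1, n_patches, d_model))    # p_s
        self.pos_temporal = nn.Parameter(torch.zeros(1, seq_len, 1, d_model))     # p_t
        self.z_add = nn.Parameter(torch.zeros(1, 1, d_model))                     # aggregation token
        self.seq_model = nn.GRU(d_model, d_model // 2, bidirectional=True, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 4))

    def forward(self, flows):
        # flows: (B, S, 2, H, W) calibration flow maps, S = T frames x I iterations
        b, s = flows.shape[:2]
        tok = self.embed(flows.flatten(0, 1))                       # (B*S, D, H/patch, W/patch)
        d = tok.shape[1]
        tok = tok.flatten(2).transpose(1, 2).reshape(b, s, -1, d)   # (B, S, P, D)
        tok = (tok + self.pos_spatial + self.pos_temporal).flatten(1, 2)   # (B, S*P, D)
        tok = torch.cat([self.z_add.expand(b, -1, -1), tok], dim=1)        # prepend z_add
        out, _ = self.seq_model(tok)
        q = self.head(out[:, 0])                                    # read out the aggregation token
        return q / q.norm(dim=-1, keepdim=True)                     # unit quaternion for R_error
```

With the default sizes, a window of 4 frames and 4 refinement iterations over 64 × 128 flow maps, `FlowSequenceAggregator()(torch.randn(2, 16, 2, 64, 128))` returns a batch of unit quaternions.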
3. Temporal Information Modeling
Temporal modeling is crucial for robustness in V2X scenarios because (1) vehicle trajectories provide shifting perspectives over time, and (2) occlusions or poor instantaneous overlap between modalities can cause individual frames to be unreliable or ambiguous.
MamV2XCalib constructs sequences of calibration flow maps drawn from multiple consecutive frames and iterative refinement steps. These are concatenated into a spatiotemporal input to the Mamba block, enhanced with positional embeddings to encode frame and patch context. The bidirectional state space model then fuses these multi-frame features, inherently learning to emphasize high-confidence correspondences while discounting temporally inconsistent or noisy information, thereby stabilizing the final rotation estimation.
This temporal fusion directly addresses failure cases common in single-frame or non-recurrent approaches, such as transient occlusions or viewpoint outliers, and reduces both bias and variance in calibration outcomes.
4. Multi-Scale Features and 4D Correlation Volume
MamV2XCalib leverages multi-scale feature extraction for both image and depth modalities, employing feature pyramids to capture cross-level correspondences. The 4D correlation volume is a tensor where each element encodes the inner product between deep features extracted from the image and depth (LiDAR) domains at every spatial position and scale.
This high-dimensional similarity representation enables the calibration module to perform dense matching between modalities despite the limited overlap characteristic of wide-baseline V2X setups (i.e., large angle differences between vehicle-side LiDAR scans and fixed infrastructure camera views).
An iterative GRU-based refinement updates the calibration flow, initialized as a zero flow field, by progressively mining local high-similarity regions of the correlation volume within a lookup window of radius r. Each iteration incrementally reduces the rotational misalignment, since the flow is geometrically interpretable in terms of the camera's extrinsic rotation parameters.
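A sketch of the radius-r lookup feeding each refinement iteration is given below; it samples a (2r+1) × (2r+1) window of the correlation volume around each pixel's current correspondence, in the style of RAFT. The output would then be concatenated with the current flow and passed to a ConvGRU that predicts a flow increment; tensor layouts and the flow channel order are assumptions.

```python
import torch
import torch.nn.functional as F

def lookup_local_corr(corr, flow, r=4):
    """Sample a (2r+1) x (2r+1) window of the 4D correlation volume around each
    pixel's current correspondence (RAFT-style lookup; illustrative sketch).
    corr: (B, h, w, h, w) correlation volume; flow: (B, 2, h, w), channels assumed (dx, dy)."""
    b, h, w = corr.shape[:3]
    corr = corr.reshape(b * h * w, 1, h, w)                       # one (h, w) slice per source pixel
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float().to(flow.device)  # identity correspondences
    centers = (base + flow.permute(0, 2, 3, 1)).reshape(b * h * w, 1, 1, 2)
    dy, dx = torch.meshgrid(torch.arange(-r, r + 1), torch.arange(-r, r + 1), indexing="ij")
    window = torch.stack([dx, dy], dim=-1).float().to(flow.device).view(1, 2 * r + 1, 2 * r + 1, 2)
    coords = centers + window                                     # sampling grid in pixel units
    coords[..., 0] = 2 * coords[..., 0] / (w - 1) - 1             # normalize to [-1, 1] for grid_sample
    coords[..., 1] = 2 * coords[..., 1] / (h - 1) - 1
    patch = F.grid_sample(corr, coords, align_corners=True)       # (B*h*w, 1, 2r+1, 2r+1)
    return patch.reshape(b, h, w, -1).permute(0, 3, 1, 2)         # (B, (2r+1)^2, h, w) lookup features
```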
5. Empirical Performance and Evaluation
MamV2XCalib was evaluated on the V2X-Seq and TUMTraf-V2X datasets, encompassing a diverse range of real-world traffic scenes, diurnal conditions, and sensor configurations.
- On V2X-Seq (initial camera rotation error sampled in [−20°, +20°]), MamV2XCalib achieved a mean residual rotation error of 0.6313° (standard deviation 0.3211°).
- On TUMTraf-V2X (error sampled in [−5°, +5°], including nighttime scenarios), mean error was 0.267°.
- Comparisons with prior art (e.g., LCCNet, a LiDAR-camera calibration method designed for single-vehicle setups) demonstrate both lower mean errors and a substantial reduction in outlier (failure) rates.
- Ablation studies confirm that both the iterative refinement and state space temporal modeling are indispensable: removing the Mamba block or replacing it with less expressive fusion mechanisms results in marked degradation of accuracy and stability.
6. Innovations and Practical Implications
MamV2XCalib introduces several innovations:
- The first infrastructure camera calibration solution leveraging vehicle-generated LiDAR data in a target-less, completely automatic manner suitable for large-scale road networks.
- The use of a 4D correlation volume and feature pyramid architectures enables calibration under sparse or limited image-depth overlaps typical in V2X deployment.
- Temporal fusion via a state space model achieves accuracy and robustness with fewer model parameters than conventional Transformer-based fusion, suggesting a lower computational and memory footprint for deployment.
- The system’s focus on rotational error estimation is particularly well-suited to hinge-mounted cameras, for which translational drift is minimal and the dominant calibration failure mode is orientation (a sketch of applying such a rotation-only correction follows this list).
- The absence of prepared calibration targets or manual procedures allows deployment in busy roads with negligible disruption, and enables continuous or on-demand recalibration with minimal operational overhead.
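For completeness, the sketch below shows how a regressed rotational correction could be composed with the initial extrinsic guess to obtain refined extrinsics; the quaternion ordering (x, y, z, w) and the left-multiplication convention are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def apply_rotation_correction(T_init, q_error):
    """Compose the regressed rotational correction with the initial extrinsic guess.
    T_init: 4x4 LiDAR-to-camera transform; q_error: unit quaternion (x, y, z, w).
    Rotation-only update, as assumed for hinge-mounted cameras; conventions are illustrative."""
    T = T_init.copy()
    R_corr = R.from_quat(q_error).as_matrix()    # 3x3 correction matrix
    T[:3, :3] = R_corr @ T_init[:3, :3]          # correct the rotation; translation unchanged
    return T
```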
Deployment of MamV2XCalib enhances roadside perception reliability, ensures geometric consistency between vehicle and infrastructure sensor modalities, and supports improved object detection, tracking, and cooperative driving behaviors in scenarios that demand robust large-scale V2X fusion.
7. Limitations and Outlook
While MamV2XCalib demonstrates compelling robustness, several considerations may influence deployment and further research:
- The calibration accuracy is predicated on the diversity and quality of vehicle-side LiDAR data traversing the camera’s field of view. Congested or sensor-sparse areas may experience reduced update frequency or require extended integration periods.
- The approach assumes that translational shifts of infrastructure cameras are negligible; significant camera motion or altered mounting geometries outside the rotational subspace would necessitate extensions to recover full extrinsics.
- While the use of the Mamba state space model is computationally efficient and effective for the observed datasets, generalization to ultra-long sequences or denser sensor mesh deployments may require additional architectural scale-up or adaptation.
Nevertheless, MamV2XCalib represents a significant step toward fully autonomous, scalable, infrastructure sensor calibration leveraging vehicle-infrastructure cooperation—a capability essential for the advancement of future intelligent transportation systems (Zhu et al., 31 Jul 2025).