PETRv2+YOLOv8+MLP Framework
- The paper presents a unified multi-stage framework that integrates PETRv2 for 3D lane detection, YOLOv8 for traffic-element detection, and MLP-based heads for topology prediction.
- It employs advanced augmentation, pseudo-labeling, and modular training to achieve state-of-the-art performance on the OpenLaneV2 dataset with 55% OLS.
- The results highlight significant improvements in lane AP, traffic AP, and topology metrics, setting a new benchmark for autonomous driving perception.
PETRv2+YOLOv8+MLP denotes a unified multi-stage framework for topology reasoning in autonomous driving, centered on high-precision detection of road centerlines (lanes) and traffic elements with subsequent topology prediction via multi-layer perceptron (MLP) heads. Developed for the OpenLane Topology Challenge, this system integrates PETRv2 (a 3D-aware query-based detector) for multi-view lane detection, YOLOv8 for monocular traffic-element detection, and bespoke MLP-based heads to infer directed topology graphs reflecting lane-lane and lane-traffic relationships. The pipeline establishes state-of-the-art results, achieving 55% OLS (OpenLane-V2 Score) on the OpenLaneV2 test set, 8 points ahead of the next best submission (Wu et al., 2023).
1. Pipeline Architecture and Overview
The PETRv2+YOLOv8+MLP framework is architected as a three-stage, non-ensembled pipeline:
- PETRv2-based centerline detection: Extracts 3D lane curves from synchronized multi-view images.
- YOLOv8 traffic-element detection: Identifies and localizes traffic elements (e.g., lights, signs) from a single front-view image.
- MLP-based topology heads: Computes lane-lane and lane-traffic relationships, outputting adjacency probabilities for topology graph construction.
Each detector is trained separately and their weights are fixed before topology head training, ensuring modularity and facilitating the analysis of each stage’s contribution to overall system performance.
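Under the modular design described above, the end-to-end flow can be sketched in a few lines; all names here are illustrative, not from the released code:

```python
# Minimal sketch of the three-stage pipeline (names are illustrative).
def run_pipeline(multi_view_images, front_view_image,
                 lane_detector, traffic_detector, topo_heads):
    lanes = lane_detector(multi_view_images)        # Stage 1: PETRv2 centerlines
    elements = traffic_detector(front_view_image)   # Stage 2: YOLOv8 traffic elements
    # Stage 3: frozen detector outputs feed the MLP topology heads.
    ll_probs = topo_heads["lane_lane"](lanes, lanes)
    lt_probs = topo_heads["lane_traffic"](lanes, elements)
    return lanes, elements, ll_probs, lt_probs
```

Because the detectors are frozen before Stage 3, the topology heads can be retrained cheaply whenever either detector improves.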
2. PETRv2 Centerline (Lane) Detection
PETRv2 leverages synchronized multi-view inputs at 1550×2048 resolution and a backbone (ResNet-50, VoVNet, or, optimally, ViT-L). Image features are processed through an FPN-style neck, generating multi-scale feature maps.
Key PETRv2 components:
- 3D position embedding: For each candidate 3D point $p$, its projection into camera $i$ is $\tilde{p}_i = K_i (R_i p + t_i)$, yielding image coordinates $(u_i, v_i)$. Local features are bilinearly sampled at $(u_i, v_i)$ and combined across views with learned weights $w_i$: $f(p) = \sum_i w_i f_i(u_i, v_i)$.
- Transformer decoder: 6 layers operate on 3D-aware features.
- Lane queries: learned 3D seed points; each represents a candidate lane as a Bezier curve with a fixed number of control points, flattened and fed through the decoder.
- Prediction heads:
- A classification head produces class logits for each query (foreground/background).
- A regression head predicts control-point offsets for the Bezier representation.
Loss: a weighted sum of a classification term and a control-point regression term, $\mathcal{L}_{lane} = \lambda_{cls} \mathcal{L}_{cls} + \lambda_{reg} \mathcal{L}_{reg}$.
Training details include bird’s-eye-view augmentation (BDA), HSV jitter, and careful flipping. The ViT-L backbone trained for 48 epochs with BDA achieves 35.28% lane AP on OpenLaneV2 validation.
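The per-view projection underlying the 3D position embedding can be sketched in plain Python; `project_to_camera` and its matrix layout are illustrative, not the paper's implementation:

```python
def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]

def project_to_camera(point, K, R, t):
    """Project a 3D point into one camera view.

    p_cam = R @ p + t (extrinsics), then pixel = K @ p_cam
    normalized by depth. K, R are 3x3; t is a 3-vector.
    """
    p_cam = [a + b for a, b in zip(matvec(R, point), t)]
    x, y, z = matvec(K, p_cam)
    return x / z, y / z  # image coordinates (u, v)
```

In the full model, features from each view are bilinearly sampled at these (u, v) locations and fused with learned per-view weights.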
| Backbone | Epochs | BDA | Val Lane AP (%) |
|---|---|---|---|
| ResNet-50 | 20 | – | 18.15 |
| VoVNet | 20 | – | 21.01 |
| ViT-L | 20 | – | 28.11 |
| ViT-L | 48 | – | 34.16 |
| ViT-L | 48 | ✓ | 35.28 |
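Each lane query decodes to Bezier control points; sampling the curve for evaluation or visualization follows the standard Bernstein form. A minimal sketch (the four-control-point example in the test is illustrative; the paper's exact count is not specified here):

```python
from math import comb

def bezier_points(control_points, num_samples=11):
    """Sample a 3D Bezier curve at evenly spaced parameter values t."""
    n = len(control_points) - 1  # curve degree
    samples = []
    for k in range(num_samples):
        t = k / (num_samples - 1)
        point = [0.0, 0.0, 0.0]
        for i, cp in enumerate(control_points):
            # Bernstein basis B_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i)
            w = comb(n, i) * t ** i * (1 - t) ** (n - i)
            for d in range(3):
                point[d] += w * cp[d]
        samples.append(tuple(point))
    return samples
```

Evenly spaced collinear control points reduce to a straight centerline, which makes the sampling easy to sanity-check.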
3. YOLOv8 Traffic-Element Detection
YOLOv8-x, an anchor-free detector, processes single front-view images (896×1550) using the C2f backbone, PANet neck, and three-head multiscale outputs (strides 8, 16, 32). Modifications and enhancements include:
- Strong augmentation: Mosaic, MixUp, HSV jitter.
- Classification loss reweighting: Foreground class loss multiplied by 2.
- Class resampling: Rare classes are upsampled during training.
- Pseudo-labeling: High-confidence model outputs from validation set incorporated as extra training data.
- Test-time augmentation (TTA): Inference over multiple input scales, with merged predictions.
YOLOv8 loss formulation:
$\mathcal{L} = \lambda_{box} \mathcal{L}_{CIoU} + \lambda_{dfl} \mathcal{L}_{DFL} + \lambda_{cls} \mathcal{L}_{cls}$,
where $\mathcal{L}_{cls}$ is a BCE term, $\mathcal{L}_{CIoU}$ is the complete-IoU box-regression loss, and $\mathcal{L}_{DFL}$ is the distribution focal loss.
Performance increases across sequential ablations, culminating in 79.89% traffic AP on the validation set.
| Modification | Gain (%) | Val Traffic AP (%) |
|---|---|---|
| Baseline | – | 65.32 |
| +Strong aug | +3.77 | 69.09 |
| +Reweight cls | +2.81 | 71.90 |
| +Class-resample | +3.01 | 74.91 |
| +Pseudo labels | +2.95 | 77.86 |
| +TTA | +2.03 | 79.89 |
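The pseudo-labeling step keeps only high-confidence detections as extra labels; a minimal sketch (the 0.7 threshold is an illustrative choice, not the paper's reported value):

```python
def select_pseudo_labels(detections, conf_threshold=0.7):
    """Keep high-confidence detections to serve as extra training labels.

    Each detection is (class_id, confidence, box). The 0.7 threshold
    is illustrative; in practice it is tuned on held-out data.
    """
    return [(cls, box) for cls, conf, box in detections if conf >= conf_threshold]
```

The retained (class, box) pairs are then merged into the training set alongside the original annotations.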
4. MLP-Based Topology Heads
Once lane and traffic detections are obtained, topology prediction is handled by ultra-lightweight MLP heads operating on frozen features.
4.1 Lane-Lane Topology
- Inputs: Decoded lane query features $f_i$; 3D control-point coordinates $c_i$ embedded via a small MLP.
- Feature fusion: $g_i = \mathrm{MLP}(f_i) + \mathrm{MLP}(c_i)$.
- Pairwise feature tensor: concatenation $[g_i; g_j]$ over all ordered lane pairs $(i, j)$.
- MLP topology head: Predicts connection probability $p_{ij} = \sigma(\mathrm{MLP}([g_i; g_j]))$.
- Loss: Focal loss per adjacency entry.
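A minimal sketch of the pairwise scoring, with a single linear layer standing in for the MLP head (real heads use hidden layers; all names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_topology(lane_feats, weights, bias=0.0):
    """Score every ordered lane pair (i, j) from concatenated features.

    lane_feats: list of per-lane feature vectors.
    weights: one weight per concatenated-feature dimension (a linear
    stand-in for the MLP head).
    Returns an N x N matrix of connection probabilities p_ij.
    """
    n = len(lane_feats)
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pair = lane_feats[i] + lane_feats[j]  # concatenation [g_i; g_j]
            score = sum(w * x for w, x in zip(weights, pair)) + bias
            probs[i][j] = sigmoid(score)
    return probs
```

The lane-traffic head has the same shape, with traffic-element features in place of the second lane feature.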
4.2 Lane-Traffic Topology
- Inputs: Lane features $g_i$ and processed traffic features $h_j$ (projected via an MLP from raw detections).
- Pairwise features: concatenation $[g_i; h_j]$ for each lane-traffic pair.
- MLP head: Predicts $q_{ij} = \sigma(\mathrm{MLP}([g_i; h_j]))$.
- Loss: Focal loss across all pairs.
Total topology loss is $\mathcal{L}_{top} = \mathcal{L}_{ll} + \mathcal{L}_{lt}$.
The topology heads are trained for 10 epochs with AdamW.
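Both heads use the standard binary focal loss per adjacency entry; the α and γ values below are the common defaults, not necessarily the paper's settings:

```python
import math

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one adjacency entry (standard formulation).

    p: predicted probability in (0, 1); target: 0 or 1.
    gamma down-weights easy examples; alpha balances the classes.
    """
    if target == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)
```

Focal loss matters here because true connections are sparse: most lane pairs are unconnected, and the modulating factor keeps those easy negatives from dominating the gradient.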
5. Training, Inference, and Data Augmentation
- Stage 1 & 2: Separate training of PETRv2 and YOLOv8 on OpenLaneV2 train set.
- Stage 3: Freeze both detectors, train MLP topology heads.
- Pseudo-labels: High-confidence detections on val included for detector training.
- Augmentation:
- Lane: BDA, HSV jitter, random flip (with directional sign preservation).
- Traffic: Mosaic, MixUp, HSV jitter, resampling of rare classes.
Optimizer configurations:
- PETRv2: AdamW with separate learning rates for the main network and backbone, 48 epochs, with weight decay.
- YOLOv8: Default YOLOv8-x schedule, 20 epochs, fine-tuned from COCO.
- MLP heads: AdamW, 10 epochs.
For inference, PETRv2 extracts centerlines, YOLOv8 detects traffic, and MLP heads predict adjacency probabilities, thresholded (e.g., 0.5) to form directed graphs.
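The final thresholding step reduces to scanning the probability matrix for edges:

```python
def build_topology_graph(probs, threshold=0.5):
    """Convert an N x N probability matrix into directed edges (i -> j)."""
    edges = []
    for i, row in enumerate(probs):
        for j, p in enumerate(row):
            if i != j and p >= threshold:
                edges.append((i, j))
    return edges
```

The same routine applies to the lane-traffic matrix, where rows index lanes and columns index traffic elements (and the `i != j` guard is dropped, since the two index sets differ).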
6. Quantitative Results and Ablations
On OpenLaneV2 validation, the full system achieves:
- Lane AP (DET_l): 35.28%
- Traffic AP (DET_t): 79.89%
- Lane-Lane Topology (TOP_{ll}): 23.01%
- Lane-Traffic Topology (TOP_{lt}): 33.34%
- OLS: $\frac{1}{4}\left(\mathrm{DET}_l + \mathrm{DET}_t + \sqrt{\mathrm{TOP}_{ll}} + \sqrt{\mathrm{TOP}_{lt}}\right) \approx 55\%$
Leaderboard test set entry ("MFV", no ensembling): DET_l = 36%, DET_t = 80%, TOP_{ll} = 23%, TOP_{lt} = 33%, OLS = 55%.
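Assuming the standard OpenLane-V2 score definition (the mean of the two detection APs and the square-rooted topology scores), the reported OLS is consistent with the per-metric numbers:

```python
from math import sqrt

def openlane_v2_score(det_l, det_t, top_ll, top_lt):
    """OLS as the mean of detection APs and square-rooted topology scores.

    All inputs are fractions in [0, 1]. This follows the standard
    OpenLane-V2 metric definition.
    """
    return (det_l + det_t + sqrt(top_ll) + sqrt(top_lt)) / 4.0
```

With (DET_l, DET_t, TOP_ll, TOP_lt) = (0.3528, 0.7989, 0.2301, 0.3334), this yields about 0.552, matching the reported 55% OLS.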
Key ablations:
- Upgrading backbone, increasing epochs: +7.2% lane AP
- BDA: +1.12% lane AP
- Traffic-data augmentation and tricks: +14.6% traffic AP
- Detector gains transfer directly to topology metrics: improving the detectors from (DET_l = 28.1, DET_t = 68.8) to (35.3, 79.9) raises TOP_{ll} by +9.3 and TOP_{lt} by +11.7.
7. Significance and Implications
PETRv2+YOLOv8+MLP demonstrates that high-performance topology reasoning in autonomous driving can be achieved by a multi-stage pipeline that harnesses advanced 3D query-based representations for centerline detection, robust monocular traffic-element detection, and lightweight but effective MLPs for topology graph prediction. Each stage is independently strengthened with targeted augmentation, class rebalancing, and pseudo-labeling strategies, resulting in substantial performance gains that elevate both detection and topology metrics. The strong empirical gains (up to 55% OLS, +8 points over the next best method) set a benchmark for modular, scalable topological lane reasoning approaches in autonomous perception (Wu et al., 2023).