PETRv2+YOLOv8+MLP Framework
- The paper presents a unified multi-stage framework that integrates PETRv2 for 3D lane detection, YOLOv8 for traffic-element detection, and MLP-based heads for topology prediction.
- It employs advanced augmentation, pseudo-labeling, and modular training to achieve state-of-the-art performance on the OpenLaneV2 dataset with 55% OLS.
- The results highlight significant improvements in lane AP, traffic AP, and topology metrics, setting a new benchmark for autonomous driving perception.
PETRv2+YOLOv8+MLP denotes a unified multi-stage framework for topology reasoning in autonomous driving, centered on high-precision detection of road centerlines (lanes) and traffic elements with subsequent topology prediction via multi-layer perceptron (MLP) heads. Developed for the OpenLane Topology Challenge, this system integrates PETRv2 (a 3D-aware query-based detector) for multi-view lane detection, YOLOv8 for monocular traffic-element detection, and bespoke MLP-based heads to infer directed topology graphs reflecting lane-lane and lane-traffic relationships. The pipeline establishes state-of-the-art results, achieving 55% OLS (OpenLane-V2 Score) on the OpenLaneV2 test set, 8 points ahead of the next best submission (Wu et al., 2023).
1. Pipeline Architecture and Overview
The PETRv2+YOLOv8+MLP framework is architected as a three-stage, non-ensembled pipeline:
- PETRv2-based centerline detection: Extracts 3D lane curves from synchronized multi-view images.
- YOLOv8 traffic-element detection: Identifies and localizes traffic elements (e.g., lights, signs) from a single front-view image.
- MLP-based topology heads: Computes lane-lane and lane-traffic relationships, outputting adjacency probabilities for topology graph construction.
Each detector is trained separately and their weights are fixed before topology head training, ensuring modularity and facilitating the analysis of each stage’s contribution to overall system performance.
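Under the modular design described above, the end-to-end flow can be sketched in a few lines; all names here are illustrative, not from the released code:

```python
# Minimal sketch of the three-stage pipeline (names are illustrative).
def run_pipeline(multi_view_images, front_view_image,
                 lane_detector, traffic_detector, topo_heads):
    lanes = lane_detector(multi_view_images)        # Stage 1: PETRv2 centerlines
    elements = traffic_detector(front_view_image)   # Stage 2: YOLOv8 traffic elements
    # Stage 3: frozen detector outputs feed the MLP topology heads.
    ll_probs = topo_heads["lane_lane"](lanes, lanes)
    lt_probs = topo_heads["lane_traffic"](lanes, elements)
    return lanes, elements, ll_probs, lt_probs
```

Because the detectors are frozen before Stage 3, the topology heads can be retrained cheaply whenever either detector improves.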
2. PETRv2 Centerline (Lane) Detection
PETRv2 leverages synchronized multi-view inputs at 1550×2048 resolution and a backbone (ResNet-50, VoVNet, or, optimally, ViT-L). Image features are processed through an FPN-style neck, generating multi-scale feature maps.
Key PETRv2 components:
- 3D position embedding: For each candidate 3D point $p$, its projection into camera $i$ is $\tilde{p}_i = K_i (R_i p + t_i)$, yielding image coordinates $(u_i, v_i)$. Local features are bilinearly sampled at $(u_i, v_i)$ and combined across views with learned weights $w_i$: $f(p) = \sum_i w_i f_i(u_i, v_i)$.
- Transformer decoder: 6 layers operate on 3D-aware features.
- Lane queries: learned 3D seed points; each represents a candidate lane as a Bezier curve with a fixed number of control points, flattened and fed through the decoder.
- Prediction heads:
- A classification head produces class logits for each query (foreground/background).
- A regression head predicts control-point offsets for the Bezier representation.
Loss: a weighted sum of a classification term and a control-point regression term, $\mathcal{L}_{lane} = \lambda_{cls} \mathcal{L}_{cls} + \lambda_{reg} \mathcal{L}_{reg}$.
Training details include bird’s-eye-view augmentation (BDA), HSV jitter, and careful flipping. The ViT-L backbone trained for 48 epochs with BDA achieves 35.28% lane AP on OpenLaneV2 validation.
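The per-view projection underlying the 3D position embedding can be sketched in plain Python; `project_to_camera` and its matrix layout are illustrative, not the paper's implementation:

```python
def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]

def project_to_camera(point, K, R, t):
    """Project a 3D point into one camera view.

    p_cam = R @ p + t (extrinsics), then pixel = K @ p_cam
    normalized by depth. K, R are 3x3; t is a 3-vector.
    """
    p_cam = [a + b for a, b in zip(matvec(R, point), t)]
    x, y, z = matvec(K, p_cam)
    return x / z, y / z  # image coordinates (u, v)
```

In the full model, features from each view are bilinearly sampled at these (u, v) locations and fused with learned per-view weights.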
| Backbone | Epochs | BDA | Val Lane AP (%) |
|---|---|---|---|
| ResNet-50 | 20 | – | 18.15 |
| VoVNet | 20 | – | 21.01 |
| ViT-L | 20 | – | 28.11 |
| ViT-L | 48 | – | 34.16 |
| ViT-L | 48 | ✓ | 35.28 |
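Each lane query decodes to Bezier control points; sampling the curve for evaluation or visualization follows the standard Bernstein form. A minimal sketch (the four-control-point example in the test is illustrative; the paper's exact count is not specified here):

```python
from math import comb

def bezier_points(control_points, num_samples=11):
    """Sample a 3D Bezier curve at evenly spaced parameter values t."""
    n = len(control_points) - 1  # curve degree
    samples = []
    for k in range(num_samples):
        t = k / (num_samples - 1)
        point = [0.0, 0.0, 0.0]
        for i, cp in enumerate(control_points):
            # Bernstein basis B_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i)
            w = comb(n, i) * t ** i * (1 - t) ** (n - i)
            for d in range(3):
                point[d] += w * cp[d]
        samples.append(tuple(point))
    return samples
```

Evenly spaced collinear control points reduce to a straight centerline, which makes the sampling easy to sanity-check.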
3. YOLOv8 Traffic-Element Detection
YOLOv8-x, an anchor-free detector, processes single front-view images (896×1550) using the C2f backbone, PANet neck, and three-head multiscale outputs (strides 8, 16, 32). Modifications and enhancements include:
- Strong augmentation: Mosaic, MixUp, HSV jitter.
- Classification loss reweighting: Foreground class loss multiplied by 2.
- Class resampling: Rare classes are upsampled during training.
- Pseudo-labeling: High-confidence model outputs from validation set incorporated as extra training data.
- Test-time augmentation (TTA): Inference over multiple input scales, with merged predictions.
YOLOv8 loss formulation:
$\mathcal{L} = \lambda_{box} \mathcal{L}_{CIoU} + \lambda_{dfl} \mathcal{L}_{DFL} + \lambda_{cls} \mathcal{L}_{cls}$,
where $\mathcal{L}_{cls}$ is a BCE term, $\mathcal{L}_{CIoU}$ is the complete-IoU box-regression loss, and $\mathcal{L}_{DFL}$ is the distribution focal loss.
Performance increases across sequential ablations, culminating in 79.89% traffic AP on the validation set.
| Modification | Gain (%) | Val Traffic AP (%) |
|---|---|---|
| Baseline | – | 65.32 |
| +Strong aug | +3.77 | 69.09 |
| +Reweight cls | +2.81 | 71.90 |
| +Class-resample | +3.01 | 74.91 |
| +Pseudo labels | +2.95 | 77.86 |
| +TTA | +2.03 | 79.89 |
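The pseudo-labeling step keeps only high-confidence detections as extra labels; a minimal sketch (the 0.7 threshold is an illustrative choice, not the paper's reported value):

```python
def select_pseudo_labels(detections, conf_threshold=0.7):
    """Keep high-confidence detections to serve as extra training labels.

    Each detection is (class_id, confidence, box). The 0.7 threshold
    is illustrative; in practice it is tuned on held-out data.
    """
    return [(cls, box) for cls, conf, box in detections if conf >= conf_threshold]
```

The retained (class, box) pairs are then merged into the training set alongside the original annotations.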
4. MLP-Based Topology Heads
Once lane and traffic detections are obtained, topology prediction is handled by ultra-lightweight MLP heads operating on frozen features.
4.1 Lane-Lane Topology
- Inputs: Decoded lane query features $f_i$; 3D control-point coordinates $c_i$ embedded via a small MLP.
- Feature fusion: $g_i = \mathrm{MLP}(f_i) + \mathrm{MLP}(c_i)$.
- Pairwise feature tensor: concatenation $[g_i; g_j]$ over all ordered lane pairs $(i, j)$.
- MLP topology head: Predicts connection probability $p_{ij} = \sigma(\mathrm{MLP}([g_i; g_j]))$.
- Loss: Focal loss per adjacency entry.
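A minimal sketch of the pairwise scoring, with a single linear layer standing in for the MLP head (real heads use hidden layers; all names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_topology(lane_feats, weights, bias=0.0):
    """Score every ordered lane pair (i, j) from concatenated features.

    lane_feats: list of per-lane feature vectors.
    weights: one weight per concatenated-feature dimension (a linear
    stand-in for the MLP head).
    Returns an N x N matrix of connection probabilities p_ij.
    """
    n = len(lane_feats)
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pair = lane_feats[i] + lane_feats[j]  # concatenation [g_i; g_j]
            score = sum(w * x for w, x in zip(weights, pair)) + bias
            probs[i][j] = sigmoid(score)
    return probs
```

The lane-traffic head has the same shape, with traffic-element features in place of the second lane feature.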
4.2 Lane-Traffic Topology
- Inputs: Lane features $g_i$ and processed traffic features $h_j$ (projected via an MLP from raw detections).
- Pairwise features: concatenation $[g_i; h_j]$ for each lane-traffic pair.
- MLP head: Predicts $q_{ij} = \sigma(\mathrm{MLP}([g_i; h_j]))$.
- Loss: Focal loss across all pairs.
Total topology loss is $\mathcal{L}_{top} = \mathcal{L}_{ll} + \mathcal{L}_{lt}$.
The topology heads are trained for 10 epochs with AdamW.
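Both heads use the standard binary focal loss per adjacency entry; the α and γ values below are the common defaults, not necessarily the paper's settings:

```python
import math

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one adjacency entry (standard formulation).

    p: predicted probability in (0, 1); target: 0 or 1.
    gamma down-weights easy examples; alpha balances the classes.
    """
    if target == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)
```

Focal loss matters here because true connections are sparse: most lane pairs are unconnected, and the modulating factor keeps those easy negatives from dominating the gradient.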
5. Training, Inference, and Data Augmentation
- Stage 1 & 2: Separate training of PETRv2 and YOLOv8 on OpenLaneV2 train set.
- Stage 3: Freeze both detectors, train MLP topology heads.
- Pseudo-labels: High-confidence detections on val included for detector training.
- Augmentation:
- Lane: BDA, HSV jitter, random flip (with directional sign preservation).
- Traffic: Mosaic, MixUp, HSV jitter, resampling of rare classes.
Optimizer configurations:
- PETRv2: AdamW with separate learning rates for the main network and backbone, 48 epochs, with weight decay.
- YOLOv8: Default YOLOv8-x schedule, 20 epochs, fine-tuned from COCO.
- MLP heads: AdamW, 10 epochs.
For inference, PETRv2 extracts centerlines, YOLOv8 detects traffic, and MLP heads predict adjacency probabilities, thresholded (e.g., 0.5) to form directed graphs.
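The final thresholding step reduces to scanning the probability matrix for edges:

```python
def build_topology_graph(probs, threshold=0.5):
    """Convert an N x N probability matrix into directed edges (i -> j)."""
    edges = []
    for i, row in enumerate(probs):
        for j, p in enumerate(row):
            if i != j and p >= threshold:
                edges.append((i, j))
    return edges
```

The same routine applies to the lane-traffic matrix, where rows index lanes and columns index traffic elements (and the `i != j` guard is dropped, since the two index sets differ).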
6. Quantitative Results and Ablations
On OpenLaneV2 validation, the full system achieves:
- Lane AP (DET_l): 35.28%
- Traffic AP (DET_t): 79.89%
- Lane-Lane Topology (TOP_{ll}): 23.01%
- Lane-Traffic Topology (TOP_{lt}): 33.34%
- OLS: $\frac{1}{4}\left(\mathrm{DET}_l + \mathrm{DET}_t + \sqrt{\mathrm{TOP}_{ll}} + \sqrt{\mathrm{TOP}_{lt}}\right) \approx 55\%$
Leaderboard test set entry ("MFV", no ensembling): DET_l = 36%, DET_t = 80%, TOP_{ll} = 23%, TOP_{lt} = 33%, OLS = 55%.
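Assuming the standard OpenLane-V2 score definition (the mean of the two detection APs and the square-rooted topology scores), the reported OLS is consistent with the per-metric numbers:

```python
from math import sqrt

def openlane_v2_score(det_l, det_t, top_ll, top_lt):
    """OLS as the mean of detection APs and square-rooted topology scores.

    All inputs are fractions in [0, 1]. This follows the standard
    OpenLane-V2 metric definition.
    """
    return (det_l + det_t + sqrt(top_ll) + sqrt(top_lt)) / 4.0
```

With (DET_l, DET_t, TOP_ll, TOP_lt) = (0.3528, 0.7989, 0.2301, 0.3334), this yields about 0.552, matching the reported 55% OLS.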
Key ablations:
- Upgrading backbone, increasing epochs: +7.2% lane AP
- BDA: +1.12% lane AP
- Traffic-data augmentation and tricks: +14.6% traffic AP
- Detector gains transfer directly to topology metrics: improving the detectors from (DET_l = 28.1, DET_t = 68.8) to (35.3, 79.9) raises TOP_{ll} by +9.3 and TOP_{lt} by +11.7.
7. Significance and Implications
PETRv2+YOLOv8+MLP demonstrates that high-performance topology reasoning in autonomous driving can be achieved by a multi-stage pipeline that harnesses advanced 3D query-based representations for centerline detection, robust monocular traffic-element detection, and lightweight but effective MLPs for topology graph prediction. Each stage is independently strengthened with targeted augmentation, class rebalancing, and pseudo-labeling strategies, resulting in substantial performance gains that elevate both detection and topology metrics. The strong empirical gains (up to 55% OLS, +8 points over the next best method) set a benchmark for modular, scalable topological lane reasoning approaches in autonomous perception (Wu et al., 2023).