- The paper introduces TopoNet, an end-to-end architecture that integrates multi-view feature extraction, deformable Transformer decoders, and a scene graph neural network for fine-grained driving scene topology reasoning.
- The paper reports significant improvements, including a 15–84% increase in lane centerline detection performance and enhanced TE–LC assignments on the OpenLane-V2 benchmark.
- The paper validates its design choices through extensive ablation studies, demonstrating that scene knowledge graphs and SGNN message passing are key for robust urban scene interpretation.
Graph-Based Topology Reasoning for Driving Scenes: TopoNet
Motivation and Problem Definition
The capability to infer scene-level topology—including the connectivity and semantics of lanes, and the assignment of traffic elements such as lights and signs to lanes—is fundamental for robust autonomous driving in complex urban environments. Traditional map-building and perception techniques are limited in their capacity for fine-grained reasoning over heterogeneous entities, often focusing solely on instance detection or vectorized representations, and neglecting topological relationships and cross-entity assignments.
Figure 1: Topology structure of driving scenes. Relationships among lane centerlines (LL) and between centerlines and traffic elements (LT) are critical for downstream planning and navigation.
TopoNet Architecture
TopoNet is an end-to-end scene graph reasoning architecture designed for bird's eye view (BEV) topology understanding. It models both the perception and relational reasoning tasks, enabling simultaneous detection and topological assignment for both lane centerlines (LC) and traffic elements (TE).
The pipeline consists of four stages: feature extraction from multi-view images, perspective-to-BEV feature transformation, deformable-Transformer-based decoding in two parallel branches (TE and LC), and a Scene Graph Neural Network (SGNN) that models interactions and topological relations via message passing. The graph's edges capture both spatial/structural (LL) and semantic (LT) relationships.
Figure 2: Overview of TopoNet, illustrating multi-stage fusion, deformable Transformer decoders for TE and LC, and a message-passing SGNN for heterogeneous relation learning.
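The four stages above can be sketched end to end with stub functions. This is a minimal, hedged illustration of the data flow only: every helper name (`extract_features`, `view_to_bev`, `decode_queries`, `sgnn_refine`) and all shapes are assumptions for exposition, not the authors' actual API.

```python
# Illustrative sketch of TopoNet's four-stage pipeline with stub functions.
# All names and shapes are assumptions; real stages are neural modules.

def extract_features(images):
    # Stage 1: per-view backbone features (stubbed as tagged strings).
    return [f"feat({img})" for img in images]

def view_to_bev(per_view_feats):
    # Stage 2: perspective-to-BEV transformation (stubbed as a merge).
    return "bev(" + ",".join(per_view_feats) + ")"

def decode_queries(bev, n_queries, branch):
    # Stage 3: deformable-Transformer decoding, one branch per entity type.
    return [f"{branch}_q{i}({bev})" for i in range(n_queries)]

def sgnn_refine(te_queries, lc_queries):
    # Stage 4: SGNN message passing refines both query sets jointly and
    # emits LL (lane-lane) and LT (lane-traffic-element) adjacency.
    ll_adj = [[0] * len(lc_queries) for _ in lc_queries]
    lt_adj = [[0] * len(te_queries) for _ in lc_queries]
    return te_queries, lc_queries, ll_adj, lt_adj

def toponet_forward(images, n_te=2, n_lc=3):
    feats = extract_features(images)
    bev = view_to_bev(feats)
    te = decode_queries(bev, n_te, "te")
    lc = decode_queries(bev, n_lc, "lc")
    return sgnn_refine(te, lc)

te, lc, ll, lt = toponet_forward(["cam_front", "cam_left"])
print(len(te), len(lc), len(ll), len(lt[0]))  # 2 3 3 2
```

Note that the LL matrix is square over LC instances while the LT matrix is rectangular (LC rows, TE columns), reflecting the two distinct relationship types.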
Scene Graph Neural Network: Structure and Knowledge Injection
TopoNet leverages an SGNN that facilitates explicit message passing among detected instances, conditioned on the underlying heterogeneous scene graph. The SGNN propagates information along directed edges corresponding to centerline connections and TE–LC assignments, controlled by adjacency matrices and topology-aware weighting, refining instance embeddings with both local and relational context.
A key innovation is the integration of a scene knowledge graph, which encodes prior knowledge by assigning class-specific learnable weights for each edge type (e.g., TE category, successor/predecessor/self-loop in LC). This allows for directional, semantic-aware aggregation, essential in scenes with complex regulatory elements.
Figure 3: Scene knowledge graph example, demonstrating class- and direction-aware weighting for message passing among LCs and TEs.
The semantic embedding for TE queries ensures that spatially variant but semantically similar detections can influence the connected LC queries, addressing the modality gap and the non-uniform importance across TE types.
Learning, Supervision, and Training Regime
Supervision operates at multiple granularities: detection heads for TEs (2D bounding boxes) and LCs (3D point sequences), and a topology head for pairwise relationship classification. Label assignment utilizes the Hungarian algorithm for instance matching; losses include focal and regression terms for detection, and a focal loss for sparse pairwise topological edge prediction. Supervision is applied at every Transformer decoder and SGNN layer.
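The matching and edge-supervision recipe can be sketched compactly. The brute-force permutation search below is a stand-in for the Hungarian algorithm (valid for tiny instances only), and the focal-loss constants and cost values are illustrative assumptions, not the paper's settings.

```python
import math
from itertools import permutations

# Hedged sketch of the supervision recipe: one-to-one prediction/GT
# matching plus a binary focal loss for the sparse pairwise edge labels.

def match(cost):
    """Return the row->column assignment minimizing total cost.

    Brute force over permutations; the Hungarian algorithm solves the
    same problem in polynomial time for realistic sizes.
    """
    n = len(cost)
    best, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best, best_total = perm, total
    return list(best)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a predicted edge probability p and label y.

    Down-weights easy examples, which suits sparse topology edges where
    most candidate pairs are negatives.
    """
    pt = p if y == 1 else 1.0 - p
    a = alpha if y == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * math.log(pt)

cost = [[0.1, 0.9], [0.8, 0.2]]   # illustrative prediction-vs-GT costs
print(match(cost))                 # [0, 1]
```

The focal term makes confident correct edge predictions contribute almost nothing, so the loss is dominated by the rare hard positives in the sparse adjacency matrices.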
The model targets large-scale BEV perception with a unified ResNet-50 backbone plus FPN, and is trained with AdamW, extensive data augmentation, and iterative cross-branch feature fusion. Hyperparameters controlling the ratio of propagated features (e.g., β_ll, β_lt) are ablated to probe relational sensitivity.
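One plausible reading of the propagation ratio is a convex blend between a query's own features and its aggregated messages; the function name and values below are assumptions for illustration.

```python
# Hedged sketch of the ablated propagation ratio: each refined query is
# a convex blend of its propagated (message) features and its original
# features. beta mirrors the paper's beta_ll / beta_lt hyperparameters;
# scalar features and values are illustrative.

def blend(query, message, beta):
    # beta = 0 keeps the original query; beta = 1 uses only messages.
    return beta * message + (1.0 - beta) * query

q, m = 2.0, 6.0
print(blend(q, m, beta=0.5))  # 4.0
```

Sweeping beta between 0 and 1 is what lets the ablations quantify how much relational context each branch actually needs.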
Quantitative Results and Claims
On the OpenLane-V2 benchmark, TopoNet reports substantial improvements over previous state-of-the-art in both perceptual and topological metrics:
- LC detection (DET_l): TopoNet surpasses the previous state of the art by 15–84%.
- TE–LC topology (TOP_lt): TopoNet outperforms prior approaches by significant margins, especially in assigning traffic lights and signs to lane instances.
- Multi-task OpenLane-V2 Score (OLS): TopoNet yields the highest scores, confirming the synergy between relational reasoning and instance detection.
The model maintains roughly 10 FPS inference on A100-class GPUs at an input size of 512×676, indicating that real-time deployment is feasible.
A comprehensive set of ablation studies removes architectural and relational components one at a time, verifying:
- The contributions of the SGNN and the scene knowledge graph design, whose removal leads to measurable drops in DET_l, TOP_ll, TOP_lt, and OLS.
- Both LL and LT propagation are required; removing either degrades the corresponding topology metric.
- The traffic element embedding module is critical for aligning TE and LC feature spaces.
- Stacking additional GNN layers degrades performance, consistent with over-smoothing that reduces embedding discriminability.
Qualitative analysis on challenging OpenLane-V2 data showcases TopoNet's ability to output complete lane graphs with correct TE–LC assignments even in dense, urban, and occluded settings, while legacy approaches fail to capture such relational patterns.

Figure 4: TopoNet and competitors on OpenLane-V2. TopoNet recovers lane graphs and semantic connections more faithfully at complex urban intersections.
Figure 5: Failure case for TopoNet under large-area occlusion by a bus. Even here, the model avoids invalid topological assignments, indicating robustness against spurious predictions.
Practical and Theoretical Implications
By unifying semantic and structural reasoning in a single architecture with explicit scene graph modeling and inductive relational priors, TopoNet represents a shift from task-specific object detection towards structured perception systems, aligned with the requirements of real-world AV planners and motion predictors. The explicit incorporation of task-relevant priors and knowledge graphs addresses a longstanding limitation of neural methods in modeling topological and regulatory context.
The approach exposes multiple directions for further work, including:
- Joint optimization of post-processing steps, such as merging/pruning lane candidates in an end-to-end fashion.
- Extension to additional traffic element classes and more expressive knowledge graphs.
- Integration of uncertainty modeling for topology under occlusion or annotation error.
- Cross-task transfer to prediction and planning, leveraging richer and more interpretable scene representations.
Conclusion
TopoNet provides a robust solution to driving scene topology reasoning by combining end-to-end instance detection with explicit scene graph modeling and knowledge-driven message passing. Its advances in relational reasoning, semantic embedding, and unified multi-branch propagation demonstrate the necessity and benefit of explicit topology modeling for downstream autonomous driving perception stacks. The work motivates further convergence between scene graph neural methodologies and real-time AV perception/planning pipelines, serving as a strong foundation for future structural and interpretable urban scene understanding systems.