- The paper introduces TopoNet, an end-to-end architecture that integrates multi-view feature extraction, deformable Transformer decoders, and a scene graph neural network for fine-grained driving scene topology reasoning.
- The paper reports significant improvements, including a 15–84% increase in lane centerline detection performance and enhanced TE–LC assignments on the OpenLane-V2 benchmark.
- The paper validates its design choices through extensive ablation studies, demonstrating that scene knowledge graphs and SGNN message passing are key for robust urban scene interpretation.
Graph-Based Topology Reasoning for Driving Scenes: TopoNet
Motivation and Problem Definition
The capability to infer scene-level topology—including the connectivity and semantics of lanes, and the assignment of traffic elements such as lights and signs to lanes—is fundamental for robust autonomous driving in complex urban environments. Traditional map-building and perception techniques are limited in their capacity for fine-grained reasoning over heterogeneous entities, often focusing solely on instance detection or vectorized representations, and neglecting topological relationships and cross-entity assignments.
Figure 1: Topology structure of driving scenes. Relationships among lane centerlines (LL) and between centerlines and traffic elements (LT) are critical for downstream planning and navigation.
TopoNet Architecture
TopoNet is an end-to-end scene graph reasoning architecture designed for bird's eye view (BEV) topology understanding. It models both the perception and relational reasoning tasks, enabling simultaneous detection and topological assignment for both lane centerlines (LC) and traffic elements (TE).
The pipeline consists of four stages: feature extraction from multi-view images, perspective-to-BEV feature transformation, deformable-Transformer-based decoding in two parallel branches (TE and LC), and a Scene Graph Neural Network (SGNN) that models interactions and topological relations via message passing. The graph's edges capture both spatial/structural (LL) and semantic (LT) relationships.
Figure 2: Overview of TopoNet, illustrating multi-stage fusion, deformable Transformer decoders for TE and LC, and a message-passing SGNN for heterogeneous relation learning.
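The four stages above can be sketched end to end with stub functions. This is a minimal, hedged illustration of the data flow only: every helper name (`extract_features`, `view_to_bev`, `decode_queries`, `sgnn_refine`) and all shapes are assumptions for exposition, not the authors' actual API.

```python
# Illustrative sketch of TopoNet's four-stage pipeline with stub functions.
# All names and shapes are assumptions; real stages are neural modules.

def extract_features(images):
    # Stage 1: per-view backbone features (stubbed as tagged strings).
    return [f"feat({img})" for img in images]

def view_to_bev(per_view_feats):
    # Stage 2: perspective-to-BEV transformation (stubbed as a merge).
    return "bev(" + ",".join(per_view_feats) + ")"

def decode_queries(bev, n_queries, branch):
    # Stage 3: deformable-Transformer decoding, one branch per entity type.
    return [f"{branch}_q{i}({bev})" for i in range(n_queries)]

def sgnn_refine(te_queries, lc_queries):
    # Stage 4: SGNN message passing refines both query sets jointly and
    # emits LL (lane-lane) and LT (lane-traffic-element) adjacency.
    ll_adj = [[0] * len(lc_queries) for _ in lc_queries]
    lt_adj = [[0] * len(te_queries) for _ in lc_queries]
    return te_queries, lc_queries, ll_adj, lt_adj

def toponet_forward(images, n_te=2, n_lc=3):
    feats = extract_features(images)
    bev = view_to_bev(feats)
    te = decode_queries(bev, n_te, "te")
    lc = decode_queries(bev, n_lc, "lc")
    return sgnn_refine(te, lc)

te, lc, ll, lt = toponet_forward(["cam_front", "cam_left"])
print(len(te), len(lc), len(ll), len(lt[0]))  # 2 3 3 2
```

Note that the LL matrix is square over LC instances while the LT matrix is rectangular (LC rows, TE columns), reflecting the two distinct relationship types.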
Scene Graph Neural Network: Structure and Knowledge Injection
TopoNet leverages an SGNN that facilitates explicit message passing among detected instances, conditioned on the underlying heterogeneous scene graph. The SGNN propagates information along directed edges corresponding to centerline connections and TE–LC assignments, controlled by adjacency matrices and topology-aware weighting, refining instance embeddings with both local and relational context.
A key innovation is the integration of a scene knowledge graph, which encodes prior knowledge by assigning class-specific learnable weights for each edge type (e.g., TE category, successor/predecessor/self-loop in LC). This allows for directional, semantic-aware aggregation, essential in scenes with complex regulatory elements.
Figure 3: Scene knowledge graph example, demonstrating class- and direction-aware weighting for message passing among LCs and TEs.
The semantic embedding for TE queries ensures that spatially variant but semantically similar detections can influence the connected LC queries, addressing the modality gap and the non-uniform importance across TE types.
Learning, Supervision, and Training Regime
Supervision operates at multiple granularities: detection heads for TEs (2D bounding boxes) and LCs (3D point sequences), and a topology head for pairwise relationship classification. Label assignment utilizes the Hungarian algorithm for instance matching; losses include focal and regression terms for detection, and a focal loss for sparse pairwise topological edge prediction. Supervision is applied at every Transformer decoder and SGNN layer.
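The matching and edge-supervision recipe can be sketched compactly. The brute-force permutation search below is a stand-in for the Hungarian algorithm (valid for tiny instances only), and the focal-loss constants and cost values are illustrative assumptions, not the paper's settings.

```python
import math
from itertools import permutations

# Hedged sketch of the supervision recipe: one-to-one prediction/GT
# matching plus a binary focal loss for the sparse pairwise edge labels.

def match(cost):
    """Return the row->column assignment minimizing total cost.

    Brute force over permutations; the Hungarian algorithm solves the
    same problem in polynomial time for realistic sizes.
    """
    n = len(cost)
    best, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best, best_total = perm, total
    return list(best)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a predicted edge probability p and label y.

    Down-weights easy examples, which suits sparse topology edges where
    most candidate pairs are negatives.
    """
    pt = p if y == 1 else 1.0 - p
    a = alpha if y == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * math.log(pt)

cost = [[0.1, 0.9], [0.8, 0.2]]   # illustrative prediction-vs-GT costs
print(match(cost))                 # [0, 1]
```

The focal term makes confident correct edge predictions contribute almost nothing, so the loss is dominated by the rare hard positives in the sparse adjacency matrices.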
The model targets large-scale BEV perception with a unified ResNet-50 backbone plus FPN, and is trained with AdamW, extensive data augmentation, and iterative cross-branch feature fusion. Hyperparameters controlling the ratio of propagated features (e.g., β_ll, β_lt) are ablated to probe relational sensitivity.
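One plausible reading of the propagation ratio is a convex blend between a query's own features and its aggregated messages; the function name and values below are assumptions for illustration.

```python
# Hedged sketch of the ablated propagation ratio: each refined query is
# a convex blend of its propagated (message) features and its original
# features. beta mirrors the paper's beta_ll / beta_lt hyperparameters;
# scalar features and values are illustrative.

def blend(query, message, beta):
    # beta = 0 keeps the original query; beta = 1 uses only messages.
    return beta * message + (1.0 - beta) * query

q, m = 2.0, 6.0
print(blend(q, m, beta=0.5))  # 4.0
```

Sweeping beta between 0 and 1 is what lets the ablations quantify how much relational context each branch actually needs.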
Quantitative Results and Claims
On the OpenLane-V2 benchmark, TopoNet reports substantial improvements over previous state-of-the-art in both perceptual and topological metrics:
- LC detection (DET_l): TopoNet surpasses the previous state of the art by 15–84%.
- TE–LC topology (TOP_lt): TopoNet outperforms prior approaches by significant margins, especially in assigning traffic lights and signs to lane instances.
- Multi-task OpenLane-V2 Score (OLS): TopoNet yields the highest scores, confirming the synergy between relational reasoning and instance detection.
The model maintains roughly 10 FPS inference on A100-class GPUs at an input size of 512×676, indicating that real-time deployment is feasible.
A comprehensive set of ablation studies removes architectural and relational components one at a time, verifying:
- The contributions of the SGNN and the scene knowledge graph design, whose removal leads to measurable drops in DET_l, TOP_ll, TOP_lt, and OLS.
- Both LL and LT propagation are required; removing either degrades the corresponding topology metric.
- The traffic element embedding module is critical for aligning TE and LC feature spaces.
- Stacking additional GNN layers degrades performance, consistent with over-smoothing that reduces embedding discriminability.
Qualitative analysis on challenging OpenLane-V2 data showcases TopoNet's ability to output complete lane graphs with correct TE–LC assignments even in dense, urban, and occluded settings, while legacy approaches fail to capture such relational patterns.

Figure 4: TopoNet and competitors on OpenLane-V2. TopoNet recovers lane graphs and semantic connections more faithfully at complex urban intersections.
Figure 5: Failure case for TopoNet under large-area occlusion by a bus. Even here, the model avoids invalid topological assignments, indicating robustness against spurious predictions.
Practical and Theoretical Implications
By unifying semantic and structural reasoning in a single architecture with explicit scene graph modeling and inductive relational priors, TopoNet represents a shift from task-specific object detection towards structured perception systems, aligned with the requirements of real-world AV planners and motion predictors. The explicit incorporation of task-relevant priors and knowledge graphs addresses a longstanding limitation of neural methods in modeling topological and regulatory context.
The approach exposes multiple directions for further work, including:
- Joint optimization of post-processing steps, such as merging/pruning lane candidates in an end-to-end fashion.
- Extension to additional traffic element classes and more expressive knowledge graphs.
- Integration of uncertainty modeling for topology under occlusion or annotation error.
- Cross-task transfer to prediction and planning, leveraging richer and more interpretable scene representations.
Conclusion
TopoNet provides a robust solution to driving scene topology reasoning by combining end-to-end instance detection with explicit scene graph modeling and knowledge-driven message passing. Its advances in relational reasoning, semantic embedding, and unified multi-branch propagation demonstrate the necessity and benefit of explicit topology modeling for downstream autonomous driving perception stacks. The work motivates further convergence between scene graph neural methodologies and real-time AV perception/planning pipelines, serving as a strong foundation for future structural and interpretable urban scene understanding systems.