InterFuser: Autonomous Driving Framework

Updated 27 September 2025

InterFuser is an autonomous driving framework that integrates synchronized RGB and LiDAR sensor data using a unified transformer architecture for comprehensive scene understanding.
It produces interpretable semantic outputs, including waypoints and object density maps, which directly constrain control predictions to ensure safety.
Validated in CARLA simulations, InterFuser achieves state-of-the-art performance with enhanced route completion and reduced infractions in complex urban scenarios.

InterFuser is an end-to-end autonomous driving framework designed for safety-critical urban navigation. It fuses multi-modal, multi-view sensor data in a transformer architecture to obtain comprehensive scene understanding and interpretable intermediate representations. Developed to address the explainability and safety challenges of deep learning-based driving agents, InterFuser processes synchronized RGB image streams and above-ground LiDAR bird’s-eye views, and produces both control predictions and semantic outputs that constrain the vehicle to remain within designated safety sets. The approach is validated in the CARLA simulation environment, where it reports state-of-the-art performance across established benchmarks, and it forms the basis for subsequent safety assessment, testing, and differential analysis frameworks.

1. Architectural Overview and Sensor Fusion Mechanism

InterFuser employs a unified transformer-based fusion scheme for multi-modal sensor inputs, integrating three standard RGB camera views (left, center, right), a focused view for traffic lights, and a LiDAR BEV projection into a single high-dimensional representation. Each sensor input is processed by its own CNN backbone, typically a ResNet, and the resulting feature map is reduced in channel dimension by a 1×1 convolution. The features are then spatially tokenized and augmented with fixed 2D sinusoidal positional encodings and sensor-specific learnable embeddings: $v_i^{(x,y)} = z_i^{(x,y)} + s_i + e^{(x,y)}$, where $z_i^{(x,y)}$ is the feature of sensor $i$ at spatial location $(x,y)$, $s_i$ is the learnable embedding of sensor $i$, and $e^{(x,y)}$ is the positional encoding of that location.
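The token construction above can be illustrated with a minimal sketch in plain Python. It is not the InterFuser implementation: the function names `sinusoidal_2d` and `build_tokens`, the nested-list feature layout, and the way channels are split between the x and y axes are all assumptions made for illustration.

```python
import math

def sinusoidal_2d(x, y, dim):
    """Fixed 2D sinusoidal encoding e^(x,y).

    Assumption: the first half of the channels encodes x and the
    second half encodes y, each with interleaved sin/cos terms.
    """
    half = dim // 2

    def axis(pos):
        out = []
        for k in range(0, half, 2):
            freq = 1.0 / (10000 ** (k / half))
            out.extend([math.sin(pos * freq), math.cos(pos * freq)])
        return out[:half]

    return axis(x) + axis(y)

def build_tokens(feature_map, sensor_embedding):
    """Flatten one sensor's feature map into tokens v_i^(x,y).

    feature_map[y][x] is the feature vector z_i^(x,y) and
    sensor_embedding is s_i (learned in practice, a constant here).
    """
    tokens = []
    for y, row in enumerate(feature_map):
        for x, z in enumerate(row):
            e = sinusoidal_2d(x, y, len(z))
            tokens.append([zc + sc + ec for zc, sc, ec in zip(z, sensor_embedding, e)])
    return tokens

# Toy 2x2 feature map with 4 channels per location.
toy_map = [[[0.0] * 4 for _ in range(2)] for _ in range(2)]
toy_tokens = build_tokens(toy_map, [1.0] * 4)
```

Each sensor would get its own call to `build_tokens` with its own embedding, so that tokens from different sensors at the same grid position remain distinguishable after they are pooled together.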

Tokens from all sensors are concatenated and passed through a multi-layer transformer encoder, whose multi-head attention models cross-sensor dependencies and global context. This structure enables the representation of both static and dynamic entities, including rare or adversarial events such as occluded objects and suddenly appearing traffic signals. Unlike hierarchical or separately fused designs, the architecture fuses all sensor modalities in a single end-to-end attention stage.
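To make the cross-sensor attention concrete, the following sketch concatenates toy tokens from two sensors and applies one step of single-head scaled dot-product self-attention. It assumes identity query, key, and value projections and omits the multi-head split, feed-forward layers, and normalization of a real transformer encoder; the names `softmax` and `self_attention` are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: Q, K and V are the tokens themselves (identity
    projections), which keeps the sketch free of learned weights.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

camera_tokens = [[1.0, 0.0], [0.0, 1.0]]  # toy tokens from one camera
lidar_tokens = [[1.0, 1.0]]               # toy token from the LiDAR BEV
fused = self_attention(camera_tokens + lidar_tokens)
```

Because every token attends to every other token in the concatenated sequence, each output row of `fused` mixes information from both the camera and the LiDAR tokens.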

2. Intermediate Semantic Representations and Safety Constraints

A distinguishing feature of InterFuser is its production of interpretable intermediate outputs. The decoder stage receives learned query vectors for waypoints, object density maps, and traffic rules. The object density map has grid cells parametrized with the probability of object presence, offsets, bounding box metrics, heading, and velocity. Traffic rule queries elicit signals indicating light status, stop sign existence, and junction states.
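A small data-structure sketch may help picture the object density map. The `Cell` fields mirror the quantities listed above, but the field names, the 1 m cell size, the 0.5 presence threshold, and the `extract_objects` helper are hypothetical choices for illustration rather than values taken from the source.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    prob: float     # probability that an object occupies the cell
    dx: float       # offset of the object center from the cell center (m)
    dy: float
    length: float   # bounding-box extents (m)
    width: float
    heading: float  # radians
    speed: float    # m/s

def extract_objects(grid, cell_size=1.0, threshold=0.5):
    """Return (x, y, cell) for every cell whose presence probability
    reaches the threshold. Coordinates are in the grid frame."""
    objects = []
    for row_idx, row in enumerate(grid):
        for col_idx, cell in enumerate(row):
            if cell.prob >= threshold:
                x = (col_idx + 0.5) * cell_size + cell.dx
                y = (row_idx + 0.5) * cell_size + cell.dy
                objects.append((x, y, cell))
    return objects

empty = Cell(0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
car = Cell(0.9, 0.2, -0.1, 4.5, 2.0, 0.0, 3.0)
detections = extract_objects([[empty, empty], [car, empty]])
```

A downstream controller could then reason over `detections` (positions, extents, headings, and speeds) instead of over raw network activations, which is what makes this intermediate output inspectable.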

InterFuser’s safety controller uses these semantic maps to constrain longitudinal acceleration and speed via a linear programming formulation. Specifically, the future safe distance estimates $s_t$ (for $t = 1, 2$) derived from the density map are used to solve:

$\begin{aligned}
\text{maximize} \quad & v_d^1 \\
\text{subject to} \quad & (v_0+v_d^1) \cdot T \leq s_1, \\
& (v_0+v_d^1) \cdot T + (v_d^1+v_d^2) \cdot T \leq s_2, \\
& |v_d^1-v_0| / T \leq a_{\max}, \\
& |v_d^2-v_d^1| / T \leq a_{\max}, \\
& 0 \leq v_d^1, v_d^2 \leq v_{\max}
\end{aligned}$

Here, $v_0$ is the current speed, $v_d^1$ and $v_d^2$ are the desired speeds for the next two time steps, $T$ is the time step, $a_{\max}$ is the maximum allowed acceleration, and $v_{\max}$ is the maximum allowed speed. Solving this problem ensures collision avoidance while complying with scene constraints.
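The optimization is small enough to solve without a general LP solver. The sketch below is one possible closed-form solution, not the authors' code. It assumes the acceleration constraints read $|\Delta v| / T \leq a_{\max}$, the dimensionally consistent form, and it uses the fact that $v_d^2$ appears only in the constraints, so it can be set as low as allowed, $v_d^2 = \max(0, v_d^1 - a_{\max} T)$. Returning `None` when no speed satisfies all constraints is an assumption about how infeasibility is signaled.

```python
def desired_speed(v0, s1, s2, T, a_max, v_max):
    """Return (v_d^1, v_d^2) maximizing v_d^1, or None if infeasible.

    Assumption: acceleration constraints are |dv| / T <= a_max.
    """
    dv = a_max * T       # largest speed change allowed per step
    c = s2 / T - v0      # second distance constraint: 2*v1 + v2 <= c

    # With v2 = max(0, v1 - dv), the second distance constraint is
    # equivalent to 2*v1 <= c and 3*v1 - dv <= c.
    upper = min(
        v_max,           # speed limit
        v0 + dv,         # cannot accelerate faster than a_max
        s1 / T - v0,     # first distance constraint
        c / 2.0,
        (c + dv) / 3.0,
    )
    lower = max(0.0, v0 - dv)  # cannot brake harder than a_max
    if upper < lower:
        return None            # no admissible speed: caller must handle
    v1 = upper
    v2 = max(0.0, v1 - dv)
    return v1, v2

free_road = desired_speed(v0=5.0, s1=20.0, s2=40.0, T=0.5, a_max=4.0, v_max=10.0)
obstacle = desired_speed(v0=5.0, s1=4.5, s2=8.0, T=0.5, a_max=4.0, v_max=10.0)
```

With a free road the result is limited by the acceleration bound (`free_road` is `(7.0, 5.0)`), while a nearby obstacle makes the first distance constraint binding and forces a slowdown (`obstacle` is `(4.0, 2.0)`).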

3. Benchmark Performance and Comparative Evaluation

InterFuser reported the highest driving score (76.18) on the public CARLA Leaderboard, accompanied by superior route completion and infraction metrics. Ablation studies indicate that adding LiDAR and additional camera views each reduces infractions such as collisions and red-light violations. Compared with competing end-to-end models (TransFuser, LBC, NEAT), InterFuser consistently yields better safety and task completion results, which is attributed to its transformer-driven global fusion and its use of interpretable outputs in downstream control.

Empirical studies on the Town05 and 42 Routes benchmarks reinforce these observations, with InterFuser demonstrating robust performance in complex scenarios, including occlusion and adversarial events, that prior approaches handle less effectively.

4. Technical Specifics and Loss Formulations

The training loss is a weighted sum of three components:

$\mathcal{L} = \lambda_{pt} \mathcal{L}_{pt} + \lambda_{map} \mathcal{L}_{map} + \lambda_{tf} \mathcal{L}_{tf}$

where $\mathcal{L}_{pt}$ is the $L_1$ loss on the predicted waypoints, $\mathcal{L}_{map}$ covers object existence and meta-information in the object density grid, and $\mathcal{L}_{tf}$ measures the accuracy of the traffic rule signals. The decoder queries carry individualized positional encodings to break permutation invariance, and lateral steering is set by a PID controller that tracks the headings of the predicted waypoints.
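The weighted loss and the PID-based steering can be sketched as follows. This is an illustration under assumptions, not the published implementation: the summed reduction in the waypoint loss, the default weights of 1.0, the PID gains, the ego-frame convention (x forward, y lateral), and the steering sign are all hypothetical.

```python
import math

def l1_waypoint_loss(pred, target):
    """L1 distance between predicted and ground-truth (x, y) waypoints,
    summed over waypoints (the reduction is an assumption)."""
    return sum(abs(px - tx) + abs(py - ty)
               for (px, py), (tx, ty) in zip(pred, target))

def total_loss(l_pt, l_map, l_tf, lam_pt=1.0, lam_map=1.0, lam_tf=1.0):
    """Weighted sum of the three loss terms; weights are placeholders."""
    return lam_pt * l_pt + lam_map * l_map + lam_tf * l_tf

class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def steer_from_waypoint(pid, waypoint, dt):
    """Steering command from the heading error toward an ego-frame
    waypoint, clipped to [-1, 1]. Sign convention is illustrative."""
    heading_error = math.atan2(waypoint[1], waypoint[0])
    return max(-1.0, min(1.0, pid.step(heading_error, dt)))

loss = total_loss(
    l1_waypoint_loss([(1.0, 2.0), (3.0, 4.0)], [(1.5, 2.0), (3.0, 3.0)]),
    l_map=2.0, l_tf=0.5, lam_map=0.5, lam_tf=2.0,
)
steer = steer_from_waypoint(PID(kp=1.0, ki=0.0, kd=0.0), (1.0, 0.0), dt=0.1)
```

A waypoint straight ahead gives zero heading error and therefore zero steering, while a waypoint far to the side saturates the command at the clip limit.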

The backbone modularity allows for expansion with additional sensors or modalities, maintaining architectural flexibility and scalability.

5. Safety Assessment, Testing, and Limitations

Subsequent research leverages InterFuser as a reference agent in multiple system-level test frameworks, notably EpiTESTER (Lu et al., 2023), CoCoMEGA (Yousefizadeh et al., 5 Dec 2024), CCTest (Li et al., 21 Jan 2025), and CoCoMagic (Yousefizadeh et al., 20 Sep 2025). These frameworks use evolutionary and co-evolutionary search algorithms, metamorphic relations, and interpretability modules to rigorously probe safety, uncover non-deterministic decision boundaries, and diagnose behavioral divergences across system versions.

CCTest reveals that InterFuser’s otherwise reliable behavior in nominal scenarios can degrade in critical situations, specifically merging and lane-change contexts near dynamic safety margins, where it displays non-deterministic transitions and accident-prone verdicts. CoCoMEGA and CoCoMagic expose subtle metamorphic violations resulting from minor sensor input perturbations, with up to 287% more distinct high-severity behavioral differences identified compared to baselines. These findings indicate that aggregate leaderboard metrics may overestimate InterFuser’s safety, and suggest combining qualitative and quantitative evaluations for robust assurance.
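The idea of a metamorphic violation can be illustrated with a toy check. This is a generic sketch of a metamorphic relation ("a small input perturbation should not change the output by more than a tolerance"), not the procedure used by CoCoMEGA or CoCoMagic; the `toy_agent`, the perturbation, and the tolerance are all hypothetical.

```python
def violates_relation(agent, observation, perturb, tolerance):
    """True if perturbing the input changes the agent's output by more
    than the tolerance, i.e. the metamorphic relation is violated."""
    baseline = agent(observation)
    follow_up = agent(perturb(observation))
    return abs(follow_up - baseline) > tolerance

def toy_agent(distances):
    """Hypothetical brittle policy: stop if any obstacle is closer
    than 5 m, otherwise drive at 8 m/s."""
    return 0.0 if min(distances) < 5.0 else 8.0

def shift(distances):
    """Hypothetical sensor perturbation: add 1 cm to every reading."""
    return [d + 0.01 for d in distances]

near_boundary = violates_relation(toy_agent, [4.995, 20.0], shift, tolerance=1.0)
far_from_boundary = violates_relation(toy_agent, [10.0, 20.0], shift, tolerance=1.0)
```

Near the decision boundary a 1 cm change flips the toy policy from stopping to driving, which is the kind of divergence such test frameworks search for, while the same perturbation far from the boundary has no effect.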

6. Interpretability and Diagnostics in Deployment

InterFuser’s production of safety mind maps and semantic traffic rules underpins diagnostic routines in both simulated and real-world deployments. Interpretability approaches such as RuleFit generate compact rule sets from test cases, elucidating root causes behind behavioral divergences and supporting targeted debugging efforts. This supports scalability to larger sensor suites and adaptation of the safety module to real-world edge-case scenarios, enhancing transparency and facilitating evidence-driven updates to the safety controller.

7. Future Directions and Scalability

Prospective lines of work outlined in the original publication (Shao et al., 2022), as well as in the context of extended testing frameworks, include integration of probabilistic trajectory predictors and advanced control schemes (e.g., reachability-based planners), further scaling to complex multi-sensor environments, and deployment in real-world heterogeneous traffic environments. The architecture remains adaptable, with both the core transformer fusion technique and the interpretable output modules designed for expansion and integration with evolving safety assessment methodologies.

In summary, InterFuser has established itself as a benchmark autonomous driving agent, integrating interpretable sensor fusion at the architectural level and acting as the reference point for current research in system-level safety evaluation, metamorphic testing, and explainable AI-driven control for autonomous vehicles.