- The paper introduces a unified, compact multi-task model that fuses RGB, DVS, and multi-layer LiDAR data via intermediate fusion to support semantic segmentation, depth estimation, LiDAR segmentation, and bird's eye view projection (BEVP).
- It employs an innovative adaptive loss weighting strategy based on a modified GradNorm algorithm to balance gradient conflicts, reducing metric variance while enhancing overall performance.
- Experiments on simulation and real-world datasets show the model achieves competitive accuracy with less than 2% of the parameter count of the baseline models, enabling efficient deployment on edge devices.
Compact Multi-Task Autonomous Driving Perception via Balanced Learning and Multi-Sensor Fusion
Introduction
This work addresses two inefficiencies in autonomous driving perception pipelines: the proliferation of single-task models and the high resource cost of state-of-the-art deep learning architectures. The authors propose a unified, compact deep multi-task learning (MTL) model that simultaneously handles semantic segmentation, depth estimation, LiDAR segmentation, and bird's eye view projection (BEVP), leveraging a diverse sensor suite (RGB, DVS, LiDAR) and multi-modal fusion. The chief methodological contributions are an adaptive loss weighting strategy based on a modified GradNorm algorithm, which targets gradient conflicts in MTL, and an intermediate fusion mechanism for integrating multi-modal, multi-view inputs.
Methodology
Model Architecture
The network follows a modular encoder-decoder design with a task-specific decoder for each output modality and spatial view. The RGB and DVS input streams serve the dense per-pixel tasks in four ego-vehicle-oriented views (Front, Left, Right, Rear), while LiDAR provides global context from a top-down perspective, supporting both the point-based LiDAR segmentation (LS) and BEVP tasks. Extensive skip connections inspired by U-Net maximize information transfer from encoders to decoders. Sensor fusion is implemented via intermediate fusion layers: representations from the different modalities are concatenated at designated bottlenecks, exposing the decoders to a fused latent space.
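The concatenation step at a fusion bottleneck can be sketched as follows. This is a minimal illustration rather than the authors' implementation: feature maps are plain nested lists, the helper names (`fuse_at_bottleneck`, `constant_map`) and channel counts are hypothetical, and the per-modality features are assumed to have already been brought to a common spatial size.

```python
from typing import List, Tuple

# A feature map is indexed as [channel][row][column].
FeatureMap = List[List[List[float]]]


def spatial_shape(fm: FeatureMap) -> Tuple[int, int]:
    """Return (height, width) of a feature map."""
    return len(fm[0]), len(fm[0][0])


def fuse_at_bottleneck(feature_maps: List[FeatureMap]) -> FeatureMap:
    """Concatenate per-modality bottleneck features along the channel axis.

    Assumption: all inputs share the same spatial size; how the real model
    aligns modalities before fusion is not specified here.
    """
    if not feature_maps:
        raise ValueError("at least one feature map is required")
    reference = spatial_shape(feature_maps[0])
    fused: FeatureMap = []
    for fm in feature_maps:
        if spatial_shape(fm) != reference:
            raise ValueError("spatial sizes must match before concatenation")
        fused.extend(fm)  # stack this modality's channels after the previous ones
    return fused


def constant_map(channels: int, height: int, width: int, value: float) -> FeatureMap:
    """Build a toy feature map filled with a single value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


# Hypothetical encoder outputs for one camera view.
rgb_latent = constant_map(channels=4, height=2, width=3, value=0.1)
dvs_latent = constant_map(channels=2, height=2, width=3, value=0.2)

fused = fuse_at_bottleneck([rgb_latent, dvs_latent])
print(len(fused), spatial_shape(fused))  # 6 channels, spatial size (2, 3)
```

Every decoder attached to this bottleneck then receives the full fused channel stack, which is what lets one modality compensate for another.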
A critical architectural innovation is the pre-processing and representation of LiDAR data. The authors advocate for a multi-channel 3D tensor encoding (15 layers), which preserves vertical structure and distinguishes object classes by their height, in contrast to prior 2D projections which may obscure critical semantic cues.
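To make the contrast with a 2D projection concrete, the sketch below bins LiDAR points into 15 height layers of a top-down grid. Only the layer count comes from the summary above; the binary occupancy encoding, grid size, spatial ranges, and the function names are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, ego-centred


def encode_lidar(points: Sequence[Point],
                 grid_size: int = 8,      # hypothetical top-down resolution
                 xy_range: float = 16.0,  # hypothetical half-extent in x and y
                 z_min: float = -2.0,     # hypothetical height bounds
                 z_max: float = 4.0,
                 num_layers: int = 15) -> List[List[List[int]]]:
    """Return a binary occupancy tensor indexed as [layer][row][col]."""
    tensor = [[[0] * grid_size for _ in range(grid_size)]
              for _ in range(num_layers)]
    cell = 2 * xy_range / grid_size
    layer_height = (z_max - z_min) / num_layers
    for x, y, z in points:
        inside = (-xy_range <= x < xy_range and -xy_range <= y < xy_range
                  and z_min <= z < z_max)
        if not inside:
            continue  # drop points outside the encoded volume
        # min() guards against floating-point rounding at the upper edges.
        col = min(int((x + xy_range) / cell), grid_size - 1)
        row = min(int((y + xy_range) / cell), grid_size - 1)
        layer = min(int((z - z_min) / layer_height), num_layers - 1)
        tensor[layer][row][col] = 1
    return tensor


def collapse_to_single_layer(tensor: List[List[List[int]]]) -> List[List[int]]:
    """Flatten the height axis, mimicking a plain 2D top-down projection."""
    rows, cols = len(tensor[0]), len(tensor[0][0])
    return [[max(layer[r][c] for layer in tensor) for c in range(cols)]
            for r in range(rows)]


# Two returns over the same ground cell: one near the road, one high up.
points = [(1.0, 1.0, -1.5), (1.0, 1.0, 3.0), (100.0, 0.0, 0.0)]
layers = encode_lidar(points)
flat = collapse_to_single_layer(layers)

occupied_voxels = sum(v for layer in layers for row in layer for v in row)
occupied_cells = sum(v for row in flat for v in row)
print(occupied_voxels, occupied_cells)  # 2 voxels, but only 1 cell once flattened
```

The low and the high return stay distinguishable in the layered tensor, whereas the flattened grid merges them into one occupied cell.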
Adaptive Loss Weighting
A key technical obstacle in MTL is the imbalance between divergent task losses, which can drive the network to converge on a single task while under-training the others. Manual loss weight tuning becomes computationally infeasible as task count and heterogeneity increase, so the proposed framework extends GradNorm loss balancing instead. The modification updates the loss weights less frequently, once per epoch after the last training batch rather than at every step, and accommodates the network’s multi-modal, multi-bottleneck structure. The algorithm computes each task's relative inverse training rate and rescales gradient norms so that tasks train at comparable rates. The resulting system stabilizes optimization dynamics and improves joint performance without hand-tuning.
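A once-per-epoch weight update in the spirit of GradNorm might look like the sketch below. It follows the standard GradNorm quantities (loss ratios, relative inverse training rates, target gradient norms) but is not the paper's exact modification: the function name, the sign-based step, the hyperparameter values, and the assumption that each task's unweighted gradient norm is measured on the last batch of the epoch are all illustrative.

```python
from typing import List, Sequence


def gradnorm_epoch_update(weights: Sequence[float],
                          grad_norms: Sequence[float],
                          losses: Sequence[float],
                          initial_losses: Sequence[float],
                          alpha: float = 1.5,   # hypothetical restoring strength
                          lr: float = 0.025) -> List[float]:  # hypothetical step size
    """One GradNorm-style update of the task loss weights.

    grad_norms[i] is the norm of the gradient of the *unweighted* loss of
    task i with respect to the shared parameters (assumed to be measured
    on the last training batch of the epoch).
    """
    n = len(weights)
    # Loss ratio L_i(t) / L_i(0): a smaller value means the task trained faster.
    loss_ratios = [l / l0 for l, l0 in zip(losses, initial_losses)]
    mean_ratio = sum(loss_ratios) / n
    # Relative inverse training rate r_i: above 1 means the task lags behind.
    inv_rates = [r / mean_ratio for r in loss_ratios]

    weighted_norms = [w * g for w, g in zip(weights, grad_norms)]
    mean_norm = sum(weighted_norms) / n
    # Target norm for each task, treated as a constant during the update.
    targets = [mean_norm * r ** alpha for r in inv_rates]

    updated = []
    for w, g, norm, target in zip(weights, grad_norms, weighted_norms, targets):
        # d/dw |w * g - target| = sign(w * g - target) * g
        sign = (norm > target) - (norm < target)
        updated.append(max(w - lr * sign * g, 1e-6))  # keep weights positive

    # Renormalise so the weights sum to the number of tasks.
    scale = n / sum(updated)
    return [w * scale for w in updated]


# Called once, after the last training batch of an epoch (toy numbers).
new_weights = gradnorm_epoch_update(
    weights=[1.0, 1.0, 1.0, 1.0],
    grad_norms=[1.0, 1.0, 1.0, 1.0],
    losses=[0.2, 0.5, 0.5, 0.8],        # task 0 has improved most, task 3 least
    initial_losses=[1.0, 1.0, 1.0, 1.0],
)
print(new_weights)
```

In this toy call the slowest task (index 3) ends up with a larger weight than the fastest one (index 0), which is the balancing behaviour the method relies on.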
Experimental Validation
Evaluation is conducted on three diverse CARLA simulation datasets and the nuScenes-lidarseg real-world dataset, covering varied weather, illumination, and scene complexity. Metrics include per-task IoU for semantic segmentation (SS), LS, and BEVP; MAE for depth estimation (DE); a total metric (summed error across tasks); and metric variance (the variance of performance across tasks). Comparisons cover both accuracy and resource requirements (model size, parameter count, GPU utilization, inference speed).
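The sketch below shows one way these metrics could be computed on toy data. Two points are assumptions rather than details from the summary: IoU is converted to an error as 1 - IoU so that it can be summed with MAE, and the metric variance is taken as the population variance of the per-task errors.

```python
from statistics import pvariance
from typing import Dict, Sequence


def binary_iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection over union of two flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, target) if p == 1 or t == 1)
    # Convention assumed here: two empty masks count as perfect agreement.
    return intersection / union if union else 1.0


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and reference depth values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def summarize(task_errors: Dict[str, float]) -> Dict[str, float]:
    """Total metric (sum of per-task errors) and variance across tasks."""
    errors = list(task_errors.values())
    return {"total": sum(errors), "variance": pvariance(errors)}


# Toy predictions for one segmentation task and one depth task.
ss_iou = binary_iou(pred=[1, 1, 0, 0], target=[1, 0, 0, 0])               # 1 / 2
de_mae = mean_absolute_error(pred=[0.5, 0.25], target=[0.25, 0.25])       # 0.125

summary = summarize({"SS": 1.0 - ss_iou, "DE": de_mae})
print(ss_iou, de_mae, summary)
```

A lower total indicates better joint performance, and a lower variance indicates that no single task is trading off heavily against the others.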
Key empirical outcomes:
- The 15-layer LiDAR representation model consistently outperforms its 1-layer counterpart across simulation and real-world datasets, particularly elevating LS and BEVP performance.
- Adaptive loss weighting (modified GradNorm) reduces the total metric and variance, with a trade-off: performance on certain tasks (DE or SS) may decrease slightly, but overall task discrepancy is minimized and LS/BEVP are notably enhanced.
- Compared with dedicated single-task and multi-task baselines (PolarNet for LS, Chen et al. for BEVP, vanilla GradNorm for SS+DE), the proposed model achieves better or comparable accuracy with less than 2% of the parameter count, lower GPU usage, and substantially faster per-frame inference.
- On the largest and most diverse dataset (CARLA set C), PolarNet achieves marginal gains in LS, but the unified model remains highly competitive, highlighting the value of shared representation and fusion over brute-force parameter scaling.
Implications and Future Directions
This research supports the feasibility and desirability of compact, multi-task architectures for autonomous vehicle perception, achieving the efficiency gains needed for deployment on edge devices without a substantial loss in predictive fidelity. The demonstration that multi-modal intermediate fusion and joint loss balancing can match or surpass ensembles of task-specialized networks suggests a shift toward integrated, resource-efficient solutions.
On a theoretical level, the extension of loss balancing algorithms to complex multi-modality, multi-bottleneck architectures provides a template for future MTL work, especially as task diversity increases. The use of adaptive, data-driven balancing obviates manual loss weighting, paving the way for more robust, generalizable systems.
Practically, such architectures make it plausible to run a full perception stack on embedded hardware, supporting real-time operation and energy efficiency. The generalization to nuScenes-lidarseg, even without event camera (DVS) data, underscores the model’s adaptability to the sensors available on different platforms.
Future research avenues include:
- Automated architecture search for optimal branch sharing and fusion locations, further minimizing manual design.
- End-to-end integration incorporating planning and control modules, seeking a fully differentiable, single-pass autonomous driving pipeline.
- Extension to additional sensor modalities (e.g., radar) and new downstream tasks such as detection, tracking, or affordance estimation.
Conclusion
The presented system establishes a rigorous baseline for compact autonomous driving perception networks, demonstrating that judicious sensor fusion, multi-task learning, and dynamic loss balancing can jointly optimize accuracy, computational cost, and deployability. These results represent a significant step toward practical, scalable multi-task learning platforms in automotive AI, with implications for both research and real-world deployment scenarios.
Reference: "Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion" (2606.02979)