LeTFuser: Light-weight End-to-end Transformer-Based Sensor Fusion for Autonomous Driving with Multi-Task Learning (2310.13135v3)
Abstract: In end-to-end autonomous driving, existing sensor fusion techniques and navigational control methods for imitation learning prove inadequate in challenging situations that involve numerous dynamic agents. To address this issue, we introduce LeTFuser, a lightweight transformer-based algorithm for fusing multiple RGB-D camera representations that performs perception and control tasks simultaneously via multi-task learning. Our model comprises two modules. The first is the perception module, responsible for encoding the observation data obtained from the RGB-D cameras; it employs the Convolutional vision Transformer (CvT) \cite{wu2021cvt} to extract and fuse features from multiple RGB cameras, exploiting the local and global feature extraction capabilities of the convolution and transformer components, respectively. The encoded features, combined with representations of the static and dynamic environment, are then used by our control module to predict waypoints and vehicular controls (e.g., steering, throttle, and brake). We generate the vehicular control commands in two ways: the first uses a PID algorithm to follow the predicted waypoints on the fly, whereas the second directly predicts the control policy from the measurement features and environmental state. We evaluate the model and conduct a comparative analysis with recent models on the CARLA simulator, using scenarios ranging from normal to adversarial conditions to emulate real-world situations. Our method achieves better or comparable results with respect to the baselines in terms of driving ability. The code is available at \url{https://github.com/pagand/e2etransfuser/tree/cvpr-w} to facilitate future studies.
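To make the first control method concrete, the following is a minimal sketch of how a PID controller can turn predicted ego-frame waypoints and the current speed into steering, throttle, and brake commands. Every name, gain, and geometric convention here is an assumption for illustration, not the authors' exact implementation; the linked repository contains the real code.

```python
# Hedged sketch of the waypoint-following control path (method 1).
# PIDController, control_from_waypoints, and all gains/conventions are
# illustrative assumptions, not the paper's actual API.
import numpy as np

class PIDController:
    """PID with a sliding window for the integral and derivative terms."""
    def __init__(self, kp=1.0, ki=0.0, kd=0.0, n=20):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.window = [0.0] * n  # buffer of recent errors

    def step(self, error):
        self.window.append(error)
        self.window.pop(0)
        integral = float(np.mean(self.window))
        derivative = self.window[-1] - self.window[-2]
        return self.kp * error + self.ki * integral + self.kd * derivative

def control_from_waypoints(waypoints, speed, turn_pid, speed_pid,
                           max_throttle=0.75, brake_ratio=0.4):
    """Map predicted ego-frame waypoints and current speed (m/s) to
    (steer, throttle, brake). `waypoints` is a (K, 2) array with
    x forward and y left; K >= 2 is assumed."""
    # Lateral control: steer toward the midpoint of the first two
    # waypoints; the heading error feeds the steering PID.
    aim = 0.5 * (waypoints[0] + waypoints[1])
    heading_error = np.arctan2(aim[1], aim[0])
    steer = float(np.clip(turn_pid.step(heading_error), -1.0, 1.0))

    # Longitudinal control: the spacing between consecutive waypoints
    # encodes a desired speed (assuming ~0.5 s between waypoints).
    desired_speed = float(np.linalg.norm(waypoints[1] - waypoints[0])) * 2.0
    brake = desired_speed < brake_ratio * speed  # clearly too fast -> brake
    throttle = float(np.clip(speed_pid.step(desired_speed - speed),
                             0.0, max_throttle))
    return steer, 0.0 if brake else throttle, float(brake)

# Example usage with made-up waypoints standing in for network output:
wps = np.array([[1.5, 0.1], [3.0, 0.3], [4.4, 0.6], [5.7, 1.0]])
steer, throttle, brake = control_from_waypoints(
    wps, speed=5.0,
    turn_pid=PIDController(kp=1.25, ki=0.75, kd=0.3),
    speed_pid=PIDController(kp=5.0, ki=0.5, kd=1.0))
```

The second method described in the abstract would instead regress steer, throttle, and brake directly from the fused measurement and environment features, trading the interpretability of the PID path for tighter coupling with the learned representation.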