Driving with InternVL: Oustanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024 (2412.07247v1)
Abstract: This technical report describes the methods we employed for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We utilized a powerful open-source multimodal model, InternVL-1.5, and conducted a full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes. To effectively handle the multi-view images of nuScenes and seamlessly inherit InternVL's outstanding multimodal understanding capabilities, we formatted and concatenated the multi-view images in a specific manner. This ensured that the final model could meet the specific requirements of the competition task while leveraging InternVL's powerful image understanding capabilities. Meanwhile, we designed a simple automatic annotation strategy that converts the center points of objects in DriveLM-nuScenes into corresponding bounding boxes. As a result, our single model achieved a score of 0.6002 on the final leadboard.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.