UnO: Unsupervised Occupancy Fields for Perception and Forecasting

Published 12 Jun 2024 in cs.CV, cs.AI, cs.LG, and cs.RO (arXiv:2406.08691v1)

Abstract: Perceiving the world and forecasting its future state is a critical task for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world -- traditionally with object detections and trajectory predictions, or temporal bird's-eye-view (BEV) occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. This unsupervised world model can be easily and effectively transferred to downstream tasks. We tackle point cloud forecasting by adding a lightweight learned renderer and achieve state-of-the-art performance in Argoverse 2, nuScenes, and KITTI. To further showcase its transferability, we fine-tune our model for BEV semantic occupancy forecasting and show that it outperforms the fully supervised state-of-the-art, especially when labeled data is scarce. Finally, when compared to prior state-of-the-art on spatio-temporal geometric occupancy prediction, our 4D world model achieves a much higher recall of objects from classes relevant to self-driving.




Summary

The paper introduces a self-supervised approach that learns a continuous 4D occupancy field from LiDAR data, giving a world model that transfers to point cloud forecasting and BEV semantic occupancy forecasting.
It employs a ResNet-based BEV encoder and an implicit deformable attention decoder to overcome limitations of voxel discretization.
The model reports state-of-the-art point cloud forecasting results on Argoverse 2, nuScenes, and KITTI, and reduces reliance on labeled data.

UnO: Unsupervised Occupancy Fields for Perception and Forecasting

The paper "UnO: Unsupervised Occupancy Fields for Perception and Forecasting" proposes modeling the driving scene with unsupervised learning from LiDAR data. The primary objective is to learn a continuous 4D occupancy field, spanning three spatial dimensions and time, in place of the object detections and trajectory predictions that supervised pipelines learn from annotated labels. The paper targets two limitations of supervised methods: annotations are expensive, and they are typically restricted to a predefined set of categories that does not cover everything a vehicle may encounter on the road.

Methodology

The proposed model, UnO (Unsupervised Occupancy), differs from supervised approaches by training on the large volume of unlabeled LiDAR data that vehicles already collect. UnO uses self-supervision to derive occupancy labels directly from the sensor returns, so it does not need exhaustive, predefined object annotations. Because the training signal is not tied to a fixed label set, the model can represent whatever occupies space in the scene rather than only the annotated categories.

LiDAR Data Processing

UnO first processes a sequence of historical LiDAR sweeps by voxelizing the point clouds into a BEV (Bird’s-Eye View) representation. This 2D BEV feature map is then encoded using a ResNet-based convolutional neural network, resulting in a multi-scale feature representation that captures relevant spatial features at different resolutions. This representation is crucial for the downstream occupancy and forecasting tasks.
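The voxelization step can be illustrated with a minimal sketch. The grid extent, cell size, and the choice of a per-cell point count as the feature are assumptions made for illustration; the summary does not specify the paper's actual voxel features, and a real pipeline would stack several sweeps and use an array library.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, sensor frame


def voxelize_bev(
    points: Sequence[Point],
    x_range: Tuple[float, float] = (-4.0, 4.0),  # hypothetical extent
    y_range: Tuple[float, float] = (-4.0, 4.0),  # hypothetical extent
    cell_size: float = 2.0,                      # hypothetical resolution
) -> List[List[int]]:
    """Count LiDAR points per BEV cell; grid[row][col] indexes (y, x)."""
    n_cols = int((x_range[1] - x_range[0]) / cell_size)
    n_rows = int((y_range[1] - y_range[0]) / cell_size)
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y, _z in points:
        col = math.floor((x - x_range[0]) / cell_size)
        row = math.floor((y - y_range[0]) / cell_size)
        # Points outside the grid extent are dropped.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col] += 1
    return grid


sweep = [(-3.0, -3.0, 0.5), (-2.5, -3.5, 1.0), (1.0, 1.0, 0.0), (10.0, 0.0, 0.0)]
bev = voxelize_bev(sweep)
```

The height coordinate is ignored here; a fuller version might bin it into channels so the 2D encoder still sees vertical structure.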

Implicit Occupancy Decoder

To predict the occupancy at any continuous point in space and time, UnO employs an implicit occupancy decoder that uses deformable attention mechanisms. This decoder takes the spatial and temporal coordinates of the query points and produces occupancy probabilities. Such an approach ensures that the model can operate with fine spatial granularity and continuous time resolution, addressing limitations posed by voxel-based discretization.
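To show what querying a field at a continuous point means, the sketch below reads a feature grid at non-integer coordinates with bilinear interpolation and maps the result, together with the query time, to a probability. This is a deliberately simplified stand-in and not the paper's decoder: plain bilinear sampling replaces deformable attention, and the weights `w_feat`, `w_time`, and `bias` are hypothetical constants rather than learned parameters.

```python
import math
from typing import Sequence


def bilinear_sample(grid: Sequence[Sequence[float]], x: float, y: float) -> float:
    """Interpolate a scalar grid (at least 2x2) at continuous cell coordinates.

    x is the column coordinate and y is the row coordinate.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    # Clamp the query so the four neighbouring cells always exist.
    x = min(max(x, 0.0), n_cols - 1.0)
    y = min(max(y, 0.0), n_rows - 1.0)
    x0 = min(int(x), n_cols - 2)
    y0 = min(int(y), n_rows - 2)
    fx, fy = x - x0, y - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x0 + 1] * fx
    bottom = grid[y0 + 1][x0] * (1 - fx) + grid[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bottom * fy


def occupancy_probability(
    grid: Sequence[Sequence[float]],
    x: float,
    y: float,
    t: float,
    w_feat: float = 1.0,   # hypothetical weight on the sampled feature
    w_time: float = -0.5,  # hypothetical weight on the query time
    bias: float = 0.0,
) -> float:
    """Map a continuous (x, y, t) query to a probability with a sigmoid."""
    logit = w_feat * bilinear_sample(grid, x, y) + w_time * t + bias
    return 1.0 / (1.0 + math.exp(-logit))


features = [[0.0, 2.0], [4.0, 6.0]]
p = occupancy_probability(features, x=0.5, y=0.5, t=1.0)
```

Because the query coordinates are floats, nothing forces predictions onto the cell centres of the input grid, which is the property the implicit decoder relies on.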

Training and Objectives

UnO is trained with a binary cross-entropy loss computed over a set of positive (occupied) and negative (unoccupied) query points derived from the LiDAR data. This formulation lets the model learn from the occupancy patterns implicit in the LiDAR returns without explicit object labels. The paper argues that training this way captures the geometry and dynamics of the scene, along with enough semantic structure to transfer to downstream forecasting tasks.
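One plausible way to derive such query points from a single LiDAR ray is sketched below, followed by the loss itself. The sampling scheme is an assumption for illustration: points between the sensor and the return are labeled free, and one point in a small buffer just behind the return is labeled occupied. The names `n_free` and `buffer` are hypothetical, and the summary does not state the paper's exact sampling parameters.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def sample_ray_queries(
    origin: Vec3,
    hit: Vec3,
    n_free: int = 4,       # hypothetical number of free-space samples
    buffer: float = 0.1,   # hypothetical depth (m) treated as occupied
    rng: Optional[random.Random] = None,
) -> List[Tuple[Vec3, int]]:
    """Return (point, label) pairs for one ray: 0 = free, 1 = occupied."""
    rng = rng or random.Random(0)
    direction = tuple(h - o for h, o in zip(hit, origin))
    length = math.sqrt(sum(d * d for d in direction))
    unit = tuple(d / length for d in direction)
    queries = []
    for _ in range(n_free):
        depth = rng.uniform(0.0, length)  # between sensor and return
        queries.append((tuple(o + depth * u for o, u in zip(origin, unit)), 0))
    depth = length + rng.uniform(0.0, buffer)  # just behind the return
    queries.append((tuple(o + depth * u for o, u in zip(origin, unit)), 1))
    return queries


def binary_cross_entropy(
    probs: Sequence[float], labels: Sequence[int], eps: float = 1e-7
) -> float:
    """Mean binary cross-entropy between predicted probabilities and labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


queries = sample_ray_queries((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
labels = [label for _, label in queries]
loss = binary_cross_entropy([0.5] * len(labels), labels)
```

A model that predicts 0.5 everywhere gets a loss of ln 2 per point, which is a useful baseline when checking that training is making progress.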

Applications and Transferability

To illustrate the efficacy and versatility of UnO, the paper evaluates its performance on two primary tasks: point cloud forecasting and BEV semantic occupancy forecasting.

Point Cloud Forecasting

UnO achieves state-of-the-art performance on several benchmarks, including Argoverse 2, nuScenes, and KITTI, outperforming other self-supervised models like 4D-Occ. The model's ability to predict future LiDAR returns accurately from past data is measured using metrics such as Chamfer Distance, Near Field Chamfer Distance, and depth error metrics. Notably, UnO excels in capturing the dynamic behavior of objects, a critical factor for real-world applications in autonomous driving.
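For readers unfamiliar with the metric, here is a minimal sketch of a symmetric Chamfer Distance between a predicted and a ground-truth point cloud. Conventions differ between benchmarks (squared versus unsquared distances, sum versus average of the two directions); this version uses squared distances and averages the two directions, which is an assumption and may not match the exact definition used in the paper's tables.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_nearest_sq(source: Sequence[Point], target: Sequence[Point]) -> float:
    """Average squared distance from each source point to its nearest target."""
    return sum(min(squared_distance(p, q) for q in target) for p in source) / len(source)


def chamfer_distance(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance; brute force, O(len(pred) * len(truth))."""
    return 0.5 * (mean_nearest_sq(pred, truth) + mean_nearest_sq(truth, pred))


predicted = [(0.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
cd = chamfer_distance(predicted, ground_truth)
```

The second term penalizes ground-truth points that the prediction misses, so a forecast cannot score well simply by predicting a few confident points.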

BEV Semantic Occupancy Forecasting

When the pre-trained UnO is fine-tuned on labeled semantic data, it outperforms the fully supervised state of the art in BEV semantic occupancy forecasting. The paper reports that the gains are largest when labeled training data is scarce. This result suggests that UnO can reduce reliance on extensive labeled datasets, lowering the barrier to deploying autonomous systems in diverse environments.

Evaluation on Relevant Object Classes

The study evaluates how well UnO recalls occupancy for different object classes by analyzing its unsupervised predictions inside annotated 3D bounding boxes. Compared with the prior state of the art on spatio-temporal geometric occupancy prediction, UnO achieves much higher recall on classes relevant to self-driving, with the largest gains on small and rare objects. The labels are used only for evaluation here, so the result indicates that the unsupervised field captures object extent and motion for classes it was never told about.
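The idea of measuring recall inside annotated boxes can be sketched as follows. This is an illustrative simplification, not the paper's evaluation protocol: boxes are axis-aligned (real annotations are usually oriented), and the occupancy `threshold` is a hypothetical parameter.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner), axis-aligned


def in_box(point: Point, box: Box) -> bool:
    lo, hi = box
    return all(l <= c <= h for c, l, h in zip(point, lo, hi))


def occupancy_recall(
    points: Sequence[Point],
    probs: Sequence[float],
    boxes: Sequence[Box],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> Optional[float]:
    """Fraction of query points inside any box that are predicted occupied.

    Returns None when no query point falls inside a box.
    """
    inside = [p for pt, p in zip(points, probs) if any(in_box(pt, b) for b in boxes)]
    if not inside:
        return None
    return sum(1 for p in inside if p >= threshold) / len(inside)


boxes = [((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))]
query_points = [(1.0, 1.0, 1.0), (1.5, 0.5, 0.5), (5.0, 5.0, 5.0)]
predicted = [0.9, 0.2, 0.8]
recall = occupancy_recall(query_points, predicted, boxes)
```

Computing the same quantity per class, using only the boxes of that class, would give the kind of class-wise breakdown the evaluation describes.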

Implications and Future Work

UnO is a step towards more robust and scalable perception systems for autonomous vehicles. The unsupervised approach reduces dependency on labeled data and, because it is not tied to predefined categories, extends the model's coverage to objects and scenarios that annotations miss, which matters for safety. Future research can build on UnO's framework by integrating additional sensor modalities, predicting occupancy at finer granularity, and applying the approach in other domains that require dynamic environment modeling.

Conclusion

The "UnO: Unsupervised Occupancy Fields for Perception and Forecasting" paper advances unsupervised occupancy prediction for autonomous driving and shows that the learned field transfers well to downstream tasks. Its results on point cloud forecasting, BEV semantic occupancy forecasting, and class-wise occupancy recall indicate that a world model trained with minimal supervision can support more cost-effective self-driving systems.
