- The paper presents KITTI-360, a dataset that addresses limitations in semantic labeling by offering consistent 2D and 3D annotations spanning over 150,000 images and 1 billion 3D points.
- The paper details a novel annotation methodology using a WebGL-based tool and a non-local multi-field CRF model to ensure consistent labeling between 2D and 3D domains.
- The paper establishes benchmarks for semantic segmentation, instance segmentation, and 3D scene understanding, providing a robust framework to advance autonomous driving research.
KITTI-360: A Dataset for Urban Scene Understanding
The paper "KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D" introduces a comprehensive dataset aimed at advancing research in autonomous driving through enhanced scene understanding. The authors present KITTI-360 as an extension of the KITTI dataset, addressing limitations in semantic labeling by providing richer 2D and 3D annotations, novel input modalities, and reliable localization data.
Dataset Characteristics
KITTI-360 broadens urban scene understanding by offering extensive semantic annotations in both the 2D image domain and the 3D point-cloud domain. It comprises over 150,000 images and 1 billion 3D points, labeled with a single, consistent set of semantic classes across both domains. The dataset combines multiple sensor modalities, fisheye and perspective stereo cameras together with LiDAR scans, enabling 360-degree perception of the environment.
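To make the multi-sensor setup concrete, the following is a minimal sketch, with hypothetical field names rather than the official development-kit API, of how one synchronized frame of such a dataset could be represented:

```python
# A minimal sketch (field names are hypothetical, not the official devkit) of
# one synchronized KITTI-360-style frame: perspective stereo images, two
# fisheye side views, a LiDAR sweep, and the vehicle pose used for
# reliable localization.
from dataclasses import dataclass
import numpy as np

@dataclass
class Frame:
    timestamp: float
    stereo_left: np.ndarray    # (H, W, 3) perspective image
    stereo_right: np.ndarray   # (H, W, 3) perspective image
    fisheye_left: np.ndarray   # (H, W, 3) side-facing fisheye image
    fisheye_right: np.ndarray  # (H, W, 3) side-facing fisheye image
    lidar_points: np.ndarray   # (N, 4) x, y, z, reflectance
    pose: np.ndarray           # (4, 4) world-from-vehicle transform
```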
Annotation Process
A notable contribution is a WebGL-based annotation tool that makes labeling in 3D space efficient. Annotators place bounding primitives (e.g., cuboids) around objects in 3D, and these primitives serve as the single source of truth from which consistent 2D and 3D labels are derived. The annotations cover both static and dynamic scene elements, with semi-automatic procedures used to annotate moving objects efficiently.
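The snippet below is a simplified illustration, not the authors' tool, of how labels attached to oriented 3D cuboids can be propagated to raw LiDAR points. The primitive representation (dictionaries with center, rotation, size, and label) is an assumption made for the example:

```python
# A minimal sketch of label propagation from 3D bounding primitives to points:
# each point inherits the label of the cuboid that contains it.
import numpy as np

def points_in_cuboid(points, center, rotation, size):
    """Return a boolean mask of points lying inside an oriented cuboid.

    points:   (N, 3) points in world coordinates
    center:   (3,) cuboid center
    rotation: (3, 3) rotation from cuboid frame to world frame
    size:     (3,) full extents along the cuboid's local axes
    """
    # Express points in the cuboid's local frame: x_local = R^T (x_world - c).
    local = (points - center) @ rotation
    # Inside iff every local coordinate is within half the extent.
    return np.all(np.abs(local) <= size / 2.0, axis=1)

def label_points(points, primitives, unlabeled=0):
    """Assign each point the label of the first primitive that contains it."""
    labels = np.full(len(points), unlabeled, dtype=np.int64)
    for prim in primitives:  # prim: {"center", "rotation", "size", "label"}
        mask = points_in_cuboid(points, prim["center"], prim["rotation"], prim["size"])
        labels[mask & (labels == unlabeled)] = prim["label"]
    return labels
```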
Methodology
To transfer the 3D annotations into dense 2D semantic and instance labels, the authors formulate a non-local multi-field Conditional Random Field (CRF) model that reasons jointly about semantics and instances across the 2D and 3D domains. The model combines geometric constraints derived from the projected 3D primitives with learned priors from CNNs, yielding denser and more coherent annotations than either cue alone.
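As a rough intuition for how two unary cues can be fused under a smoothness prior, here is a heavily simplified mean-field sketch. It is not the paper's non-local multi-field CRF; the Potts pairwise term, the neighborhood structure, and all weights are assumptions chosen for illustration:

```python
# Simplified mean-field inference fusing two unary sources (projected 3D
# primitives and a 2D CNN) with a Potts smoothness term over pixel neighbors.
import numpy as np

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary_3d, unary_cnn, neighbors, w3d=1.0, wcnn=1.0, wpair=0.5, iters=5):
    """unary_3d, unary_cnn: (N, L) negative log-probabilities per pixel and label.
    neighbors: list where neighbors[i] is an index array of pixels adjacent to i."""
    N, L = unary_cnn.shape
    energy = w3d * unary_3d + wcnn * unary_cnn
    Q = softmax(-energy)                      # initialise from the combined unaries
    for _ in range(iters):
        pairwise = np.zeros((N, L))
        for i, nbrs in enumerate(neighbors):
            # Potts penalty: disagreeing with neighbours raises the energy.
            pairwise[i] = wpair * (len(nbrs) - Q[nbrs].sum(axis=0))
        Q = softmax(-(energy + pairwise))
    return Q.argmax(axis=1)                   # per-pixel label estimate
```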
Benchmarks and Baselines
KITTI-360 establishes several benchmarks for tasks such as semantic segmentation, instance segmentation, and 3D scene understanding:
- 2D/3D Semantic Segmentation: Baselines include FCN and PSPNet for 2D, and PointNet/PointNet++ for 3D; results are reported with the standard mean IoU metric (see the sketch after this list).
- Instance Segmentation: Evaluated with Mask R-CNN, highlighting how difficult it is to delineate individual object instances in cluttered urban scenes.
- 3D Tasks: Include 3D bounding box detection and semantic scene completion, while the dataset's reliable localization also supports tasks such as SLAM.
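For reference, semantic segmentation benchmarks of this kind are typically scored with mean intersection-over-union (mIoU). The sketch below shows the standard computation; details such as the ignored label value and the class set are assumptions rather than the official evaluation protocol:

```python
# A minimal sketch of the mean intersection-over-union (mIoU) metric used to
# score semantic segmentation predictions against ground-truth labels.
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    """pred, gt: flat integer label arrays of equal length."""
    valid = gt != ignore_label
    pred, gt = pred[valid], gt[valid]
    # Confusion matrix via bincount over joint (gt, pred) indices.
    conf = np.bincount(gt * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)                                   # true positives per class
    denom = conf.sum(axis=0) + conf.sum(axis=1) - tp     # tp + fp + fn
    iou = tp / np.maximum(denom, 1)
    return iou[denom > 0].mean()                         # average over observed classes
```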
Implications and Future Directions
KITTI-360's integration of diverse modalities and consistent annotations across 2D and 3D presents significant potential to drive innovation in autonomous systems. The dataset's depth and breadth open avenues for research in cross-disciplinary domains such as computer vision, graphics, and robotics. Future work could explore improvements in scene completion and novel view synthesis, leveraging KITTI-360's extensive annotations to refine autonomous driving models.
In conclusion, KITTI-360 represents a robust framework for advancing autonomous driving research, setting benchmarks that encourage the development of sophisticated algorithms capable of real-world application. The dataset not only addresses existing challenges but also lays a foundation for future explorations into fully autonomous systems.