Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection (2504.21614v1)

Published 30 Apr 2025 in cs.CV

Abstract: With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine

Summary

Iterative Model Improvement Through Open-Vocabulary Data Selection in the Mcity Data Engine

The paper "Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection" presents a framework for selecting and labeling training data in machine learning applications, particularly in Intelligent Transportation Systems (ITS). Vehicle fleets and roadside perception systems generate ever-growing volumes of raw data, which makes it difficult to detect and select rare and novel classes, a problem made worse by long-tail data distributions.

The authors introduce the Mcity Data Engine (MDE), an open-source system that supports the entire data-driven model development cycle, from data acquisition to model deployment. Its modules cover data acquisition, storage, selection, labeling, training, validation, and deployment, with an emphasis on detecting rare and novel classes through open-vocabulary data selection.

Methodological Highlights

The MDE supports various data sources, prioritizing vision datasets and data from the University of Michigan's Smart Intersections Project in Ann Arbor, Michigan. The data engine converts datasets into a unified Voxel51 format, and integrated access schemes, including AWS S3, support efficient storage and processing.

A key methodological innovation is the MDE's ensemble of open-vocabulary object detection models for data selection, which lets users query classes of interest in natural language. The ensemble draws on models such as OWL-ViT, Grounding-DINO, and OmDet-Turbo; the models best suited to a given task are identified by evaluating them on a seed dataset. Selection uses a consensus-based method across the models' outputs, which filters noisy detections and mitigates false positives and false negatives.
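The summary does not spell out how the consensus is computed, so the following is only a minimal sketch of one plausible reading: a detection is kept when at least `min_models` distinct models report the same label with sufficiently overlapping boxes. The `Detection` type, the function names, the model name strings, and the thresholds are all assumptions made for illustration, not the MDE's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    model: str   # name of the detector that produced the box (hypothetical)
    label: str   # class name taken from the natural-language query
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def consensus_detections(detections, min_models=2, iou_threshold=0.5):
    """Keep detections confirmed by at least `min_models` distinct models.

    Assumed agreement rule: same label and IoU >= `iou_threshold`.
    """
    kept = []
    for det in detections:
        agreeing = {det.model}
        for other in detections:
            if (other.model != det.model
                    and other.label == det.label
                    and iou(det.box, other.box) >= iou_threshold):
                agreeing.add(other.model)
        if len(agreeing) >= min_models:
            kept.append(det)
    return kept


def select_sample(detections, min_models=2, iou_threshold=0.5):
    """A sample is selected if any of its detections survives the consensus."""
    return bool(consensus_detections(detections, min_models, iou_threshold))
```

Raising `min_models` trades recall for precision: a detection that only one model reports is discarded as likely noise, while an object that several models find independently is kept.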

For data labeling, the MDE integrates with CVAT for human-assisted labeling and supports automated labeling with pre-trained models for segmentation and depth estimation. It also uses zero-shot classification models to align datasets that follow different class label schemes.

Evaluation and Results

The paper evaluates the MDE on a real-world application: Vulnerable Road User (VRU) detection in fisheye camera data. An ensemble of models, chosen through systematic evaluation on a seed dataset containing VRU instances, demonstrates superior recall, and the ensemble consensus process achieves high true-positive detection rates. These results support the engine's use of open-vocabulary models for detecting rare instances and indicate substantial improvements in data selection efficacy.
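The seed-dataset evaluation can be pictured as ranking candidate ensembles by how many known VRU samples they recover. The sketch below assumes sample-level evaluation: each model contributes the set of sample IDs it flagged, and `positives` is the set of seed samples known to contain VRUs. The function names and this evaluation granularity are assumptions for illustration; the paper's exact protocol may differ.

```python
from itertools import combinations


def recall_precision(selected, positives):
    """Sample-level recall and precision of a selection against ground truth."""
    true_positives = len(selected & positives)
    recall = true_positives / len(positives) if positives else 0.0
    precision = true_positives / len(selected) if selected else 0.0
    return recall, precision


def consensus_selection(flags_by_model, models, min_models):
    """Sample IDs flagged by at least `min_models` of the given models."""
    counts = {}
    for model in models:
        for sample_id in flags_by_model[model]:
            counts[sample_id] = counts.get(sample_id, 0) + 1
    return {sid for sid, count in counts.items() if count >= min_models}


def rank_ensembles(flags_by_model, positives, size, min_models):
    """Score every ensemble of `size` models; best recall first, then precision."""
    results = []
    for combo in combinations(sorted(flags_by_model), size):
        selected = consensus_selection(flags_by_model, combo, min_models)
        recall, precision = recall_precision(selected, positives)
        results.append((combo, recall, precision))
    return sorted(results, key=lambda r: (r[1], r[2]), reverse=True)
```

Calling `rank_ensembles(flags_by_model, positives, size=2, min_models=1)` would compare all two-model ensembles under a union rule, while `min_models=2` would require both models to agree on a sample.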

Implications and Future Directions

The availability of the MDE as an open-source solution has implications for both research and practice. For academia, the engine's support for the full data-driven development cycle offers a way to improve existing ITS models. In practice, the MDE's deployment capabilities align with real-world systems and can aid applications such as near-miss detection and roadside assistance systems.

The paper suggests future work on improving the robustness of model predictions and on integrating additional operational design domains and real-world roadside perception systems. The work advances the use of state-of-the-art object detection models and data-centric development processes to improve ITS.

In conclusion, this research contributes a toolset for iterative model improvement in ITS and related fields. Its open-vocabulary approach to data selection addresses the challenges of long-tail data distributions and makes large datasets more accessible and useful for refining machine learning models.