- The paper's main contribution is a system that leverages inexpensive, compressed CNNs at ingest time to build an index of detected objects for efficient querying.
- The methodology clusters similar video objects to minimize redundant processing, making queries roughly 37 times faster than a query-all baseline.
- Experimental results demonstrate on average 58x GPU cost savings over ingest-all processing and at least 95% precision and recall, showcasing its practical impact for scalable video analytics.
Focus: Querying Large Video Datasets with Low Latency and Low Cost
This paper presents a novel system, termed Focus, designed to address the challenge of querying large-scale video datasets with low-latency, low-cost query responses. The problem is rooted in the widespread deployment of cameras for surveillance and traffic management, which generates enormous volumes of video data. These datasets require efficient mechanisms for querying specific objects, a process traditionally hindered by the high computational cost and latency of complex convolutional neural networks (CNNs).
Focus strategically balances ingest-time computational cost against query-time efficiency, combining cheap ingest techniques with more expensive video processing only at query time. Key components include cheaper, compressed CNNs used at ingest time to build an index of detected objects across video frames, and a clustering mechanism that minimizes redundant processing at query time by exploiting the similarity between objects.
Technical Details
- Ingest-Time Optimization:
- The system employs cheaper, specialized CNNs to keep ingest-time operations inexpensive; these are obtained by compressing full CNNs (reducing the number of layers) and fine-tuning them to the specific video context.
- At ingest, the top-K candidate class labels produced by these cheaper CNNs are indexed for each object, which preserves recall when candidates are later verified against the more expensive, high-fidelity CNN at query time (see the first sketch after this list).
- Query-Time Efficiency:
- During query execution, instead of classifying each candidate independently, Focus clusters similar objects based on the feature vectors extracted by the cheaper CNNs. Only representative objects (cluster centroids) are classified with the more resource-intensive CNN, and their labels are propagated to the rest of the cluster, sharply reducing computational overhead (see the second sketch after this list).
- Balanced Ingest and Query Costs:
- The system selects the ingest CNN and clustering parameters by weighing the trade-off between ingest cost and query latency (see the third sketch after this list). This adaptability suits application scenarios with very different query characteristics, such as traffic monitoring or event-based surveillance.
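The ingest-time indexing idea can be illustrated with a short sketch. This is not the paper's implementation: the `cheap_cnn` callable, the object-detection step, and the choice of K are illustrative assumptions.

```python
# Hypothetical sketch of ingest-time top-K indexing. The cheap_cnn callable,
# the object-detection step, and TOP_K are illustrative assumptions.
from collections import defaultdict

import numpy as np

TOP_K = 4  # index several candidate labels per object to preserve recall


def ingest_frame(frame_id, object_crops, cheap_cnn, index):
    """Run the compressed CNN on each detected object and index its top-K labels.

    object_crops: list of cropped object images from one frame.
    cheap_cnn: returns a probability vector over object classes for a crop.
    index: maps class id -> list of (frame_id, obj_id) candidates.
    """
    for obj_id, crop in enumerate(object_crops):
        probs = cheap_cnn(crop)                  # cheap, specialized CNN
        top_k = np.argsort(probs)[::-1][:TOP_K]  # most likely K classes
        for cls in top_k:
            index[int(cls)].append((frame_id, obj_id))


# Usage: build the index while the video streams in.
index = defaultdict(list)
# for frame_id, frame in enumerate(video_frames):
#     ingest_frame(frame_id, detect_objects(frame), cheap_cnn, index)
```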
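Query-time clustering can be sketched in the same spirit. The use of k-means, the cluster count, and the `expensive_cnn` callable are assumptions for illustration rather than the paper's exact algorithm.

```python
# Hypothetical sketch of query-time clustering: classify only one representative
# per cluster with the expensive CNN and propagate its label to the whole cluster.
import numpy as np
from sklearn.cluster import KMeans


def answer_query(crops, features, expensive_cnn, target_class, n_clusters=50):
    """Return indices of candidate objects judged to match target_class.

    crops: candidate object images retrieved via the ingest index.
    features: matrix of cheap-CNN feature vectors, one row per candidate.
    """
    n_clusters = min(n_clusters, len(crops))
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)

    matches = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Use the member closest to the centroid as the cluster representative.
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        rep = members[np.argmin(dists)]
        if expensive_cnn(crops[rep]) == target_class:  # costly CNN runs once per cluster
            matches.extend(int(i) for i in members)    # all members inherit the verdict
    return matches
```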
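The ingest/query cost trade-off can be framed as minimizing a simple combined cost. The candidate configurations and per-frame cost numbers below are stand-ins, not values from the paper.

```python
# Hypothetical cost model for choosing ingest-time parameters; the configurations
# and per-frame GPU costs are made-up illustrative numbers.

def total_cost(ingest_per_frame, query_per_frame, frames, query_fraction):
    """GPU cost of ingesting every frame plus answering the expected queries."""
    return ingest_per_frame * frames + query_per_frame * frames * query_fraction


# A cheaper ingest CNN pushes more work (larger top-K, more candidates) to query
# time; a heavier one does more work up front.
configs = {
    "cheap_ingest":    {"ingest": 0.2, "query": 3.0},
    "moderate_ingest": {"ingest": 0.6, "query": 1.0},
    "heavy_ingest":    {"ingest": 1.5, "query": 0.3},
}

frames, query_fraction = 1_000_000, 0.05   # e.g. only 5% of the video is ever queried
best = min(configs, key=lambda n: total_cost(configs[n]["ingest"],
                                             configs[n]["query"],
                                             frames, query_fraction))
print("chosen configuration:", best)
```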
Results and Implications
Experimental evaluations of Focus on datasets spanning traffic, surveillance, and news videos demonstrate substantial reductions in computational cost. On average, the system is 58 times cheaper in GPU usage than an ingest-all strategy while answering queries roughly 37 times faster than a query-all approach, and it consistently achieves at least 95% precision and recall.
The implications of this research extend to practical applications that demand scalable and efficient video analytics. In particular, the ability to query large video datasets swiftly and without prohibitive cost presents significant opportunities in urban management, security, and data-driven decision-making. The methodology also encourages reconsidering the traditional split between ingest and query phases, suggesting avenues for more integrated video processing frameworks.
Future Directions
Future developments could focus on refining the system's clustering algorithms based on live data characteristics and addressing dynamic changes in video stream contents without substantial retraining costs. Additionally, enhancing the sophistication of model specialization for specific video sources could further improve accuracy, thereby reducing dependency on high-complexity CNNs at query time.
This paper provides a significant contribution towards achieving efficient, low-latency querying in large-scale video datasets and presents a foundation for subsequent research into adaptive, context-aware video data processing systems.