
Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering (2508.21773v1)

Published 29 Aug 2025 in cs.CV, cs.AI, and cs.LG

Abstract: We propose a realistic scenario for unsupervised video learning in which neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos are a complex and rich spatio-temporal medium, widely used in many applications but insufficiently explored in unsupervised continual learning. Prior studies have focused only on supervised continual learning, relying on knowledge of labels and task boundaries, even though labeled data is costly and impractical to obtain. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises additional challenges due to the computational and memory requirements of processing videos compared to images. We introduce a general benchmark experimental protocol for uVCL that considers the learning of unstructured video data categories during each task. We propose using Kernel Density Estimation (KDE) of deep embedded video features, extracted by unsupervised video transformer networks, as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for incoming new-task data, dynamically enabling the expansion of memory clusters in order to capture new knowledge when learning a succession of tasks. We leverage transfer learning from previous tasks as the initial state for knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances model performance when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, UCF101, HMDB51, and Something-Something V2, without using any labels or class boundaries.



Summary

  • The paper presents a novel approach for unsupervised video continual learning by integrating deep embedded video features with a KDE-based clustering mechanism.
  • It employs video transformers and dynamic memory augmentation to efficiently detect novel data and mitigate catastrophic forgetting.
  • Experimental results on UCF101, HMDB51, and SSv2 demonstrate robust performance with high cluster accuracy and effective novelty detection.

Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering

Introduction

This paper presents a new approach to unsupervised video continual learning (uVCL), introducing a non-parametric system for handling the complexities of unsupervised video data. Unlike previous methods, which focus mostly on supervised continual learning, this research tackles the unsupervised setup, where neither task boundaries nor labels are provided. Using a Kernel Density Estimation (KDE)-based method, the work aims to maintain stability while continuously learning new tasks.

Proposed Method

The framework, uVCL-KDE, combines video transformer networks for feature extraction with KDE-based clustering. Unsupervised video transformer models extract deep embedded video features, which are then represented probabilistically with KDE to form clusters that capture the data distribution (Figure 1).

Figure 1: Overview of the proposed unsupervised video continual learning based on the Kernel Density Estimation (uVCL-KDE).

This methodology introduces a dynamic memory augmentation strategy, balancing memory efficiency with the need to preserve previously learned tasks. The KDE-based system allows flexible clustering without predefined class boundaries, supporting continual learning where the data distribution inherently changes over time.
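
As a rough illustration of this non-parametric representation, the sketch below scores a feature vector under a Gaussian KDE fitted on a cluster's stored exemplars. It assumes pre-extracted transformer features and an isotropic Gaussian kernel; the function name and array shapes are illustrative, not the paper's implementation.

```python
import numpy as np


def kde_log_density(x, memory, bandwidth):
    """Log-density of feature vector x under a Gaussian KDE fitted on the
    stored exemplar features `memory` (shape: [n_exemplars, dim])."""
    d = memory.shape[1]
    # One isotropic Gaussian kernel centred on each stored exemplar.
    sq_dists = np.sum((memory - x) ** 2, axis=1)
    log_kernels = -sq_dists / (2.0 * bandwidth ** 2)
    # Normalising constant of a d-dimensional Gaussian with covariance h^2 I.
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2)
    # Numerically stable log-mean-exp over the n kernels.
    m = log_kernels.max()
    return m + np.log(np.exp(log_kernels - m).sum()) - np.log(len(memory)) + log_norm
```

Higher values mean the sample is well supported by that cluster's exemplars; a score of this kind is what a novelty criterion (sketched further below) can threshold.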

Experimental Results

The experimental evaluation was conducted on three datasets, UCF101, HMDB51, and SSv2, without using class labels. uVCL-KDE handled unlabeled data robustly, efficiently managing cluster expansion through feature novelty detection. The results showed that the model dynamically adjusts to new data while mitigating the effects of catastrophic forgetting (Figure 2).

Figure 2: uVCL results on UCF101, HMDB51, and SSv2 on the first data fold. The bandwidth h used for mean-shift clustering is specified in brackets for each method.

One noticeable outcome was the ability to maintain high cluster accuracy, confirming the efficacy of clustering and novelty detection mechanisms within the framework. The integration of a linear RBF layer further enhances cluster separability.
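
The bandwidth h quoted in Figure 2 controls the kernel width of the mean-shift step that groups embedded features into clusters. Below is a minimal sketch of that step using scikit-learn's MeanShift on toy 2-D stand-ins for the deep features; the synthetic data and bandwidth value are arbitrary, not the paper's setup.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Toy 2-D stand-ins for deep embedded video features: three blobs.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2))
                      for c in (0.0, 4.0, 8.0)])

# `bandwidth` plays the role of the h reported in brackets in Figure 2.
labels = MeanShift(bandwidth=1.5).fit_predict(features)
print("clusters found:", len(np.unique(labels)))
```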

Novelty Detection and Memory Management

A cornerstone of the method is its ability to assign novel data to existing clusters or form new ones, based on a KDE-derived novelty metric. This is crucial in video learning scenarios where new categories continuously emerge. In addition, a memory management technique replays critical features, protecting prior learning from interference by new tasks (Figure 3).
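
A schematic of this expand-or-absorb rule, reusing the kde_log_density sketch above: a sample joins its best-matching cluster only when that cluster's KDE assigns it enough density, and otherwise seeds a new cluster. The threshold log_tau and the list-of-lists memory layout are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np


def assign_or_expand(x, clusters, bandwidth, log_tau):
    """Absorb feature x into its best-matching memory cluster, or expand
    the memory with a new cluster when x looks novel under every KDE."""
    if clusters:
        # Score x under the KDE of each cluster's stored exemplars
        # (kde_log_density as sketched earlier).
        scores = [kde_log_density(x, np.asarray(c), bandwidth) for c in clusters]
        best = int(np.argmax(scores))
        if scores[best] >= log_tau:   # familiar pattern: reinforce that cluster
            clusters[best].append(x)
            return best
    clusters.append([x])              # novel pattern: dynamic memory expansion
    return len(clusters) - 1
```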

Figure 3: Evaluation of Backward Forgetting (BWF) demonstrates the system's ability to retain knowledge of previous tasks through controlled memory replay mechanisms.
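
For orientation, one common way to compute Backward Forgetting from a task-accuracy matrix is sketched below, where acc[i, j] is the accuracy on task j measured right after learning task i; the paper's exact formulation and sign convention may differ.

```python
import numpy as np


def backward_forgetting(acc):
    """BWF from a task-accuracy matrix acc of shape (T, T); only the
    lower-triangular entries (j <= i) are meaningful."""
    T = acc.shape[0]
    # Per-task drop: accuracy just after learning it minus final accuracy.
    drops = [acc[i, i] - acc[T - 1, i] for i in range(T - 1)]
    return float(np.mean(drops))
```

Under this convention, a value near zero indicates little forgetting, while larger values mean earlier tasks degraded as new ones were learned.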

Conclusion

The proposed unsupervised continual learning model marks a considerable milestone in handling real-world video datasets without reliance on labeled data. By leveraging non-parametric deep embedded clustering, it copes effectively with dynamic and complex video data. Future work may focus on refining the novelty detection criterion and memory buffer dynamics to further improve learning and resource management (Figure 4).

Figure 4: Visualization of latent space using t-SNE, showing the cluster formation in a 2D reduced space after learning all tasks. This visualization helps verify the semantic consistency of identified clusters.
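
A projection like the one in Figure 4 can be reproduced along these lines with scikit-learn's TSNE; the feature matrix and cluster assignments below are random placeholders for the learned embeddings and discovered clusters, and the perplexity is a common default rather than the paper's setting.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Placeholders for the learned embeddings and their discovered cluster ids.
features = rng.normal(size=(300, 64))
cluster_ids = rng.integers(0, 5, size=300)

# Reduce to 2-D for inspection of cluster formation.
xy = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(features)
plt.scatter(xy[:, 0], xy[:, 1], c=cluster_ids, s=8, cmap="tab10")
plt.title("t-SNE of deep embedded video features (illustrative)")
plt.show()
```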

This approach provides a scalable, efficient, and effective solution for the video domain, marking a significant step forward in AI video processing and in handling unsupervised, evolving data in practical applications.
