Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach (2312.14138v1)
Abstract: Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes the snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in inaccurate separation of foreground and background (F&B) snippets. To alleviate this problem, we propose to explore the underlying structure among the snippets by resorting to unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-based F&B separation algorithm. It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background. As there are no ground-truth labels to train these two components, we introduce a unified self-labeling mechanism based on optimal transport to produce high-quality pseudo-labels that match several plausible prior distributions. This ensures that the cluster assignments of the snippets can be accurately associated with their F&B labels, thereby boosting the F&B separation. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves promising performance on all three benchmarks while being significantly more lightweight than previous methods. Code is available at https://github.com/Qinying-Liu/CASE
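The optimal-transport self-labeling step mentioned in the abstract can be illustrated with a short Sinkhorn-Knopp sketch. This is not the authors' implementation (see their repository for that); it only shows the general idea of turning snippet-to-cluster scores into soft pseudo-labels whose cluster proportions match a chosen prior. The function name `sinkhorn_pseudo_labels`, the temperature `epsilon`, the iteration count, and the example prior are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_pseudo_labels(logits, prior, epsilon=0.5, n_iters=100):
    """Soft pseudo-labels via Sinkhorn-Knopp scaling (illustrative sketch).

    logits: (N, K) array of snippet-to-cluster scores (hypothetical input).
    prior:  (K,) array of target cluster proportions, summing to 1.
    Returns an (N, K) array whose rows each sum to 1 and whose column
    means approximately match `prior`.
    """
    n, _ = logits.shape
    # Gibbs kernel; subtracting the global max keeps exp() from overflowing.
    q = np.exp((logits - logits.max()) / epsilon)
    q /= q.sum()
    row_marginal = np.full(n, 1.0 / n)  # every snippet carries equal mass
    for _ in range(n_iters):
        # Rescale columns so the cluster masses match the prior.
        q *= (prior / q.sum(axis=0))[None, :]
        # Rescale rows so every snippet keeps mass 1/N.
        q *= (row_marginal / q.sum(axis=1))[:, None]
    return q * n  # multiply by N so each row is a distribution over clusters


# Usage with random scores and an assumed, non-uniform prior.
rng = np.random.default_rng(0)
example_logits = rng.normal(size=(200, 4))
example_prior = np.array([0.4, 0.3, 0.2, 0.1])
pseudo_labels = sinkhorn_pseudo_labels(example_logits, example_prior)
```

A smaller `epsilon` gives sharper, closer to one-hot assignments, at the cost of slower convergence of the scaling iterations.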