Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach (2312.14138v1)

Published 21 Dec 2023 in cs.CV

Abstract: Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes the snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in inaccurate separation of foreground and background (F&B) snippets. To alleviate this problem, we propose to explore the underlying structure among the snippets by resorting to unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-based F&B separation algorithm. It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background. As there are no ground-truth labels to train these two components, we introduce a unified self-labeling mechanism based on optimal transport to produce high-quality pseudo-labels that match several plausible prior distributions. This ensures that the cluster assignments of the snippets can be accurately associated with their F&B labels, thereby boosting the F&B separation. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves promising performance on all three benchmarks while being significantly more lightweight than previous methods. Code is available at https://github.com/Qinying-Liu/CASE

Summary

  • The paper introduces a clustering-centric strategy that uses unsupervised snippet clustering to separate foreground from background snippets, instead of relying mainly on a video classification loss.
  • It employs an optimal transport-based self-labeling mechanism to generate pseudo-labels that match plausible prior distributions, so that snippet cluster assignments can be accurately associated with foreground and background labels.
  • Evaluations on THUMOS14, ActivityNet v1.2 and ActivityNet v1.3 show promising performance, with a model that the authors report as significantly more lightweight than previous weakly-supervised methods.

Overview of "Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach"

The paper introduces a clustering-based approach to weakly-supervised temporal action localization (WTAL). WTAL aims to localize action instances in videos using only video-level action labels, without snippet-level or frame-level annotations. The central difficulty is to separate foreground (action) snippets from background snippets when only video-level labels are available. Existing methods mainly follow a localization-by-classification pipeline, in which snippet-level predictions are optimized through a video classification loss. Because classification and detection are different objectives, this formulation tends to separate foreground and background inaccurately. The paper instead explores the underlying structure among snippets through unsupervised snippet clustering, reducing the reliance on the video classification loss.
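
To make the localization-by-classification pipeline concrete, the following is a minimal sketch of how snippet-level scores can be reduced to a single video-level prediction and trained with a video classification loss. It is an illustration under stated assumptions, not the paper's code: top-k temporal pooling is one common aggregation choice that the text above does not specify, and the names `video_level_scores`, `video_classification_loss`, and the toy scores are hypothetical.

```python
import math


def video_level_scores(snippet_scores, k):
    """Average the k highest snippet scores per class (top-k temporal pooling).

    snippet_scores: list of T rows, one per snippet, each with C class scores.
    The pooling rule is an assumption made for illustration.
    """
    num_classes = len(snippet_scores[0])
    pooled = []
    for c in range(num_classes):
        top = sorted((row[c] for row in snippet_scores), reverse=True)[:k]
        pooled.append(sum(top) / len(top))
    return pooled


def video_classification_loss(pooled, labels):
    """Cross-entropy between softmax(pooled) and the normalized video-level labels."""
    m = max(pooled)
    log_z = m + math.log(sum(math.exp(s - m) for s in pooled))
    total = sum(labels)
    return -sum(y / total * (s - log_z) for s, y in zip(pooled, labels))


# Toy video: 4 snippets, 2 action classes (hypothetical scores).
snippet_scores = [[3.0, 0.0], [2.0, 0.1], [0.0, 0.2], [-1.0, 0.0]]
pooled = video_level_scores(snippet_scores, k=2)
loss = video_classification_loss(pooled, labels=[1, 0])
print(pooled, loss)
```

In this sketch the loss only sees the pooled video-level scores, so individual snippets are supervised indirectly. That indirection is the gap between classification and detection that the paper sets out to address.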

Key Contributions

  1. Clustering-based F&B Separation: The method makes snippet clustering the basis for separating foreground and background (F&B) snippets. A snippet clustering component groups the snippets into multiple latent clusters, and a cluster classification component then classifies each cluster as foreground or background.
  2. Self-labeling Mechanism via Optimal Transport: Because there are no ground-truth labels for training these two components, the authors introduce a unified self-labeling mechanism. It uses optimal transport to generate high-quality pseudo-labels that match several plausible prior distributions, which allows the cluster assignments of the snippets to be accurately associated with their F&B labels.
  3. Efficiency and Performance: On THUMOS14 and ActivityNet v1.2/v1.3, the method achieves promising performance while being, according to the authors, significantly more lightweight than previous methods.
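
The optimal transport step in the second contribution can be sketched with Sinkhorn-Knopp iterations, a standard solver for entropy-regularized optimal transport. This is a minimal illustration, not the authors' implementation (their code is in the CASE repository linked in the abstract): the choice of solver, the function name `sinkhorn_pseudo_labels`, the values of `epsilon` and `n_iters`, and the toy logits and prior are all assumptions.

```python
import math


def sinkhorn_pseudo_labels(logits, cluster_prior, n_iters=100, epsilon=0.5):
    """Soft pseudo-labels for N snippets over K clusters.

    logits: list of N rows, each with K scores (higher means a better match).
    cluster_prior: K positive numbers summing to 1, the assumed share of
        snippets that each cluster should receive.
    epsilon, n_iters: hypothetical regularization strength and iteration count.
    """
    n, k = len(logits), len(cluster_prior)
    # Gibbs kernel; subtracting the row max only rescales rows, which the
    # row normalization below absorbs, and it keeps exp() from overflowing.
    q = [[math.exp((s - max(row)) / epsilon) for s in row] for row in logits]
    for _ in range(n_iters):
        # Scale columns so that cluster j holds total mass cluster_prior[j].
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * cluster_prior[j] / col[j] for j in range(k)]
             for i in range(n)]
        # Scale rows so that every snippet holds mass 1 / N.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    # Multiply by N so each row is a probability distribution over clusters.
    return [[v * n for v in row] for row in q]


# Toy example: the raw scores favor cluster 0 for three of four snippets,
# but the assumed prior asks for an even split between the two clusters.
logits = [[2.0, 0.5], [1.8, 0.3], [1.9, 0.1], [0.2, 1.5]]
prior = [0.5, 0.5]
pseudo = sinkhorn_pseudo_labels(logits, prior)
for row in pseudo:
    print([round(v, 3) for v in row])
```

Each row of the result sums to one, and the average of each column approaches the corresponding entry of `prior`. Matching a prior in this way is what keeps the pseudo-labels from collapsing onto a single cluster; the same pattern could be applied to a two-way foreground/background assignment with a different prior.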

Theoretical and Practical Implications

  • By treating F&B separation as a clustering problem, the approach suggests that unsupervised clustering can reduce reliance on traditional classification losses in tasks with limited supervisory signals.
  • Using optimal transport for pseudo-label generation may carry over to tasks beyond WTAL, wherever a similar separation is required without extensive labeled data.
  • Practically, the reduced requirement for manually annotated training data streamlines the deployment of action localization frameworks, potentially enabling scalability in commercial and real-world video processing applications.

Future Developments

The incorporation of self-supervised learning techniques like contrastive clustering or advanced self-supervised loss functions might further enhance the clustering accuracy, especially as the need for high-resolution temporal localization in varied video types grows. Also, extending this framework to multi-modal datasets could leverage rich cross-modal information inherent in complex environments, promising further robustness in temporal action localization tasks. Although the current method primarily addresses two-class separation (foreground and background), exploring its extension towards multi-class segmentation would represent a logical progression, expanding its applicability in comprehensive video understanding pipelines.