Boundary-Recovering Network for Temporal Action Detection (2408.09354v1)
Abstract: Temporal action detection (TAD) is challenging, yet fundamental for real-world video applications. The large variation in the temporal scale of actions is one of the primary difficulties in TAD. Naturally, multi-scale features have potential for localizing actions of diverse lengths, as widely exploited in object detection. Nevertheless, unlike objects in images, actions have greater ambiguity in their boundaries: small neighboring objects are not mistaken for a single large one, whereas short adjoining actions can be misinterpreted as one long action. In a coarse-to-fine feature pyramid built via pooling, these vague action boundaries can fade out, which we call the 'vanishing boundary problem'. To this end, we propose the Boundary-Recovering Network (BRN) to address the vanishing boundary problem. BRN constructs scale-time features by introducing a new axis, called the scale dimension, obtained by interpolating multi-scale features to the same temporal length. On top of the scale-time features, scale-time blocks learn to exchange features across scale levels, which effectively mitigates the issue. Extensive experiments demonstrate that our model outperforms the state of the art on two challenging benchmarks, ActivityNet-v1.3 and THUMOS14, while markedly reducing the degree of the vanishing boundary problem.
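The scale-time feature construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes linear interpolation along the temporal axis and a list of per-level feature maps of shape (T_l, C); the function name and shapes are hypothetical.

```python
import numpy as np

def build_scale_time_features(pyramid, target_len=None):
    """Stack multi-scale temporal features along a new 'scale' axis.

    pyramid: list of arrays, one per pyramid level, each of shape (T_l, C).
    Every level is linearly interpolated along time to `target_len`
    (by default, the finest level's length), then all levels are stacked
    into a scale-time tensor of shape (num_levels, target_len, C).
    """
    if target_len is None:
        target_len = max(f.shape[0] for f in pyramid)
    resized = []
    for feat in pyramid:
        t_l, c = feat.shape
        src = np.linspace(0.0, 1.0, t_l)
        dst = np.linspace(0.0, 1.0, target_len)
        # Interpolate each channel independently along the temporal axis.
        up = np.stack([np.interp(dst, src, feat[:, ch]) for ch in range(c)], axis=1)
        resized.append(up)
    return np.stack(resized, axis=0)  # (num_levels, target_len, C)

# Example: a 3-level pyramid with temporal lengths 16, 8, 4 and 2 channels
# yields a scale-time tensor of shape (3, 16, 2).
pyramid = [np.random.rand(16, 2), np.random.rand(8, 2), np.random.rand(4, 2)]
scale_time = build_scale_time_features(pyramid)
```

On top of such a tensor, the paper's scale-time blocks would then attend or convolve along the new scale axis as well as time, letting coarse levels recover boundary detail from fine levels.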