Action Sensitivity Learning for the Ego4D Episodic Memory Challenge 2023 (2306.09172v2)

Published 15 Jun 2023 in cs.CV

Abstract: This report presents the ReLER submission to two tracks of the Ego4D Episodic Memory Benchmark at CVPR 2023: Natural Language Queries and Moment Queries. The solution builds on our proposed Action Sensitivity Learning (ASL) framework to better capture the discrepant information across frames. We further incorporate a series of stronger video features and fusion strategies. Our method achieves an average mAP of 29.34, ranking 1st in the Moment Queries Challenge, and a mean R1 of 19.79, ranking 2nd in the Natural Language Queries Challenge. Our code will be released.

