
Multi-View Video-Based Learning: Leveraging Weak Labels for Frame-Level Perception

Published 18 Mar 2024 in cs.CV | (2403.11616v2)

Abstract: When training a video-based action recognition model that accepts multi-view video, annotating frame-level labels is tedious and difficult, whereas annotating sequence-level labels is relatively easy. Such coarse annotations are called weak labels. However, training a multi-view video-based action recognition model with weak labels for frame-level perception is challenging. In this paper, we propose a novel learning framework in which the weak labels are first used to train a multi-view video-based base model, which is subsequently used for downstream frame-level perception tasks. The base model is trained to obtain an individual latent embedding for each view in the multi-view input. To train the model with weak labels, we propose a novel latent loss function. We also propose a model that uses the view-specific latent embeddings for downstream frame-level action recognition and detection tasks. The proposed framework is evaluated on the MM Office dataset against several baseline algorithms. The results show that the proposed base model is trained effectively using weak labels and that the latent embeddings help the downstream models improve accuracy.
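
To make the weak-label setting concrete, the following is a minimal sketch of one generic way to train frame-level scores from a sequence-level label: pool the per-frame logits over time and apply binary cross-entropy to the pooled prediction. This is a common multiple-instance-style baseline and is not the latent loss proposed in the paper, which the abstract does not specify. The function names, the max-pooling choice, and the toy numbers are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def sequence_probs(frame_logits):
    """Max-pool per-frame logits over time, then convert to probabilities.

    frame_logits: list of T frames, each a list of C class logits.
    Returns one probability per class for the whole sequence.
    """
    num_classes = len(frame_logits[0])
    return [
        sigmoid(max(frame[c] for frame in frame_logits))
        for c in range(num_classes)
    ]


def weak_label_loss(frame_logits, weak_labels):
    """Mean binary cross-entropy between pooled predictions and weak labels.

    weak_labels: list of C values in {0, 1}, one per class, stating only
    whether the action occurs somewhere in the sequence (assumed format).
    """
    eps = 1e-12  # guards against log(0)
    probs = sequence_probs(frame_logits)
    terms = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(probs, weak_labels)
    ]
    return sum(terms) / len(terms)


# Hypothetical toy input: 4 frames, 2 action classes.
frame_logits = [[-3.0, -2.0], [4.0, -3.0], [3.5, -2.5], [-2.0, -4.0]]
weak_labels = [1, 0]  # class 0 occurs somewhere in the clip, class 1 does not

loss = weak_label_loss(frame_logits, weak_labels)

# Frame-level predictions are read off the same per-frame logits at test time.
frame_preds = [[int(sigmoid(z) > 0.5) for z in frame] for frame in frame_logits]
```

In this sketch only the sequence-level label enters the loss, yet the per-frame logits remain available for frame-level prediction, which is the gap the abstract describes between weak supervision and frame-level perception.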

Authors (2)
