Video Relationship Detection Using Mixture of Experts (2403.03994v1)
Abstract: Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. First, there is a computational and inference gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts on and to represent it through language. Second, classifiers trained as a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD extracts relationships from visual processing as language triplets of the form <subject, predicate, object>. Leveraging recent advancements in visual relationship detection, MoE-VRD addresses the requirement for action recognition in establishing relationships between subjects (acting) and objects (being acted upon). In contrast to a single monolithic network, MoE-VRD employs multiple small models as experts, whose outputs are aggregated. Each expert in MoE-VRD specializes in visual relationship learning and object tagging. By utilizing a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and significantly enhances neural network capacity without increasing computational complexity. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.
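The sparsely-gated mixture-of-experts mechanism described above can be illustrated with a minimal sketch: a gating network scores all experts, only the top-k experts are evaluated (conditional computation), and their outputs are combined with renormalized gate weights. This is a toy NumPy illustration under assumed shapes and linear experts, not the paper's implementation; the names `SparseMoE` and `forward` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SparseMoE:
    """Toy sparsely-gated mixture of experts: only the top-k experts
    chosen by the gating network are evaluated for a given input."""

    def __init__(self, n_experts, d_in, d_out, k=2):
        self.k = k
        # Each expert is a small linear model here, standing in for a
        # small visual-relationship classifier in the paper's setting.
        self.experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
        self.gate = rng.normal(size=(d_in, n_experts))

    def forward(self, x):
        scores = x @ self.gate                      # gating logits, one per expert
        top = np.argsort(scores)[-self.k:]          # indices of the top-k experts
        weights = softmax(scores[top])              # renormalize over selected experts
        # Conditional computation: only the selected experts are run,
        # so capacity grows with n_experts while per-input cost stays ~k experts.
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, top))

moe = SparseMoE(n_experts=8, d_in=16, d_out=4, k=2)
y = moe.forward(rng.normal(size=16))
print(y.shape)  # (4,)
```

Because the gate's weights are renormalized over only the selected experts, the combined output remains a convex combination of expert outputs regardless of how many experts exist in total.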