YOLOR-Based Multi-Task Learning (2309.16921v1)
Abstract: Multi-task learning (MTL) aims to learn multiple tasks with a single model and to improve all of them jointly, on the assumption that the tasks generalize and share semantics. Reducing conflicts between tasks during joint learning is difficult and generally requires careful network design and extremely large models. We propose building on You Only Learn One Representation (YOLOR), a network architecture specifically designed for multi-task learning. YOLOR leverages both explicit knowledge, from data observations, and implicit knowledge, from learned latents, to improve a shared representation while minimizing the number of training parameters. However, YOLOR and its follow-up, YOLOv7, were trained on only two tasks at a time. In this paper, we jointly train object detection, instance segmentation, semantic segmentation, and image captioning. We analyze the tradeoffs involved and attempt to maximize the sharing of semantic information across tasks. Through our architecture and training strategies, our method achieves competitive performance on all tasks while maintaining a low parameter count and without any pre-training. We will release code soon.
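The core YOLOR idea the abstract refers to — combining explicit knowledge (features computed from the input) with implicit knowledge (learned latent vectors that are independent of the input) — can be sketched in a few lines. This is a minimal, framework-free illustration of the additive and multiplicative combination operators described in the YOLOR paper; all names here are illustrative, and in practice the implicit latents would be trainable parameters broadcast over channel dimensions.

```python
# Sketch of YOLOR-style implicit knowledge (illustrative only).
# Explicit knowledge: a feature vector computed from the input data.
# Implicit knowledge: a learned latent vector, independent of the input,
# combined with the explicit features by addition or by multiplication.

def combine_additive(explicit, implicit):
    """x' = x + z: shift the shared representation for a given task."""
    return [x + z for x, z in zip(explicit, implicit)]

def combine_multiplicative(explicit, implicit):
    """x' = x * z: rescale channels of the shared representation."""
    return [x * z for x, z in zip(explicit, implicit)]

# Toy example: a 4-dim explicit feature vector and two learned latents.
features = [0.5, -1.0, 2.0, 0.0]   # explicit (from data observation)
z_add    = [0.25, 0.5, -0.5, 0.25] # implicit latent for an additive head
z_mul    = [1.0, 0.5, 2.0, 4.0]    # implicit latent for a multiplicative head

print(combine_additive(features, z_add))        # [0.75, -0.5, 1.5, 0.25]
print(combine_multiplicative(features, z_mul))  # [0.5, -0.5, 4.0, 0.0]
```

Because the latents are shared-backbone parameters rather than per-input activations, they add very few weights per task, which is how YOLOR keeps the parameter count low while serving multiple heads.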
- Exploiting task relatedness for multiple task learning. In Bernhard Schölkopf and Manfred K. Warmuth (eds.), Learning Theory and Kernel Machines, pp. 567–580, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 978-3-540-45167-9.
- YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
- Yolact: Real-time instance segmentation. In ICCV, 2019.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
- COCO-Stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
- Panoptic segmentation-based attention for image captioning. Applied Sciences, 10:391, 01 2020. doi:10.3390/app10010391.
- Rich Caruana. Multitask learning. Machine Learning, 28, 07 1997. doi:10.1023/A:1007379606734.
- A unified sequence interface for vision tasks. arXiv preprint arXiv:2206.07669, 2022.
- Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015. URL http://arxiv.org/abs/1504.00325.
- Vision transformer adapter for dense predictions. In ICLR, 2023.
- Michael Crawshaw. Multi-task learning with deep neural networks: A survey. CoRR, abs/2009.09796, 2020. URL https://arxiv.org/abs/2009.09796.
- Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
- An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
- Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 845–850, Beijing, China, July 2015. Association for Computational Linguistics. doi:10.3115/v1/P15-2139. URL https://aclanthology.org/P15-2139.
- Efficiently identifying task groupings for multi-task learning. CoRR, abs/2109.04617, 2021. URL https://arxiv.org/abs/2109.04617.
- Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
- Tom Heskes. Solving a huge number of similar tasks: A combination of multi-task learning and a hierarchical bayesian approach. Proceedings of the 15th International Conference on Machine Learning, 07 1998.
- Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014. URL http://arxiv.org/abs/1412.2306.
- Panoptic segmentation. CoRR, abs/1801.00868, 2018. URL http://arxiv.org/abs/1801.00868.
- Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020.
- On better exploring and exploiting task relationships in multi-task learning: Joint model and feature learning. CoRR, abs/1904.01747, 2019. URL http://arxiv.org/abs/1904.01747.
- Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/abs/1405.0312.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
- Language models are unsupervised multitask learners. 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. CoRR, abs/1603.01249, 2016. URL http://arxiv.org/abs/1603.01249.
- Sebastian Ruder. An overview of multi-task learning in deep neural networks. CoRR, abs/1706.05098, 2017. URL http://arxiv.org/abs/1706.05098.
- CATR: Image captioning with transformers. GitHub repository, https://github.com/saahiluppal/catr, 2018.
- Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
- Videvo. Crosswalk In Hong Kong. https://www.videvo.net/video/crosswalk-in-hong-kong/286959/. [Accessed 09-09-2023].
- Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014. URL http://arxiv.org/abs/1411.4555.
- YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6 2023a.
- Designing network design strategies through gradient path analysis. Journal of Information Science and Engineering, 39(4):975–995, 7 2023b. URL https://jise.iis.sinica.edu.tw/JISESearch/pages/View/PaperView.jsf?keyId=193_2660.
- You only learn one representation: Unified network for multiple tasks. Journal of Information Science and Engineering, 39(3):691–709, 5 2023c. URL https://jise.iis.sinica.edu.tw/JISESearch/pages/View/PaperView.jsf?keyId=192_2655.
- Internimage: Exploring large-scale vision foundation models with deformable convolutions. arXiv preprint arXiv:2211.05778, 2022.
- Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
- Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015. URL http://arxiv.org/abs/1502.03044.
- Image captioning in the transformer age, 2022.
- Trace norm regularised deep multi-task learning. CoRR, abs/1606.04038, 2016. URL http://arxiv.org/abs/1606.04038.
- mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1Ddp1-Rb.
- VinVL: Making visual representations matter in vision-language models. In CVPR, 2021.
- Yu Zhang and Qiang Yang. A survey on multi-task learning. CoRR, abs/1707.08114, 2017. URL http://arxiv.org/abs/1707.08114.