AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One (2312.06709v5)
Abstract: A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs such as CLIP, DINOv2, and SAM are trained with distinct objectives and exhibit unique characteristics across downstream tasks. We find that, despite their conceptual differences, these models can be effectively merged into a unified model through multi-teacher distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One). This integrative approach not only surpasses the performance of individual teacher models but also amalgamates their distinctive features, such as zero-shot vision-language comprehension, detailed pixel-level understanding, and open-vocabulary segmentation capabilities. In pursuit of the most hardware-efficient backbone, we evaluated numerous architectures in our multi-teacher distillation pipeline using the same training recipe. This led to the development of a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models. Our comprehensive benchmarking covers downstream tasks including ImageNet classification, ADE20k semantic segmentation, COCO object detection, and integration into the LLaVA-1.5 vision-language framework. Code: https://github.com/NVlabs/RADIO
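The core recipe lends itself to a compact prototype: a single student backbone is trained to regress the features of several frozen teachers, with one lightweight adaptor head per teacher mapping the shared student representation into that teacher's feature space. The PyTorch sketch below illustrates such a multi-teacher feature-distillation step; the class name, feature dimensions, cosine loss, and equal loss weighting are illustrative assumptions, not the exact AM-RADIO training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistiller(nn.Module):
    """Minimal sketch of multi-teacher feature distillation (assumed setup, not the exact RADIO recipe)."""

    def __init__(self, student, teachers, student_dim, teacher_dims):
        super().__init__()
        self.student = student
        self.teachers = nn.ModuleDict(teachers)
        for teacher in self.teachers.values():
            teacher.requires_grad_(False)          # teachers stay frozen
        # One linear adaptor per teacher maps the shared student features
        # into that teacher's feature space.
        self.adaptors = nn.ModuleDict({
            name: nn.Linear(student_dim, dim) for name, dim in teacher_dims.items()
        })

    def forward(self, images):
        feats = self.student(images)               # (B, student_dim) pooled student features
        loss = images.new_zeros(())
        for name, teacher in self.teachers.items():
            with torch.no_grad():
                target = teacher(images)           # (B, teacher_dims[name]) frozen target
            pred = self.adaptors[name](feats)
            # Cosine distance keeps differently scaled teacher features on a comparable footing.
            loss = loss + (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
        return loss / len(self.teachers)

# Usage (hypothetical backbones that return pooled feature vectors):
#   distiller = MultiTeacherDistiller(student,
#                                     {"clip": clip_enc, "dino": dino_enc, "sam": sam_enc},
#                                     student_dim=768,
#                                     teacher_dims={"clip": 1024, "dino": 1536, "sam": 256})
#   loss = distiller(images); loss.backward()      # optimize only student + adaptor parameters
```

A faithful reproduction would also match the teachers' dense spatial features rather than only a pooled vector, since the pixel-level understanding mentioned above is carried by those token-wise features; that and per-teacher loss balancing are omitted here for brevity.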
- Variational information distillation for knowledge transfer. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9155–9163, Los Alamitos, CA, USA, 2019. IEEE Computer Society.
- Ensemble knowledge distillation for learning improved and efficient networks. In European Conference on Artificial Intelligence, 2019.
- Foundational models defining a new era in vision: A survey and outlook, 2023.
- Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.
- Knowledge distillation: A good teacher is patient and consistent. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10915–10924, Los Alamitos, CA, USA, 2022. IEEE Computer Society.
- High-performance large-scale image recognition without normalization, 2021.
- EfficientViT: Multi-scale linear attention for high-resolution dense prediction, 2023.
- Emerging properties in self-supervised vision transformers, 2021.
- Reproducible scaling laws for contrastive language-image learning, 2022.
- Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7482–7491, Los Alamitos, CA, USA, 2018. IEEE Computer Society.
- MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
- The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
- Efficient knowledge distillation from an ensemble of teachers. In Interspeech, 2017.
- Datacomp: In search of the next generation of multimodal datasets, 2023.
- Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
- Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- FasterViT: Fast vision transformers with hierarchical attention, 2023.
- A comprehensive overhaul of feature distillation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1921–1930, Los Alamitos, CA, USA, 2019. IEEE Computer Society.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Learning anytime predictions in neural networks via adaptive loss balancing. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI Press, 2019.
- Like what you like: Knowledge distill via neuron selectivity transfer. CoRR, abs/1707.01219, 2017.
- GQA: A new dataset for compositional question answering over real-world images. CoRR, abs/1902.09506, 2019.
- OpenCLIP, 2021.
- Ultralytics YOLOv8, 2023.
- Region-aware pretraining for open-vocabulary object detection with vision transformers, 2023.
- Paraphrasing complex network: Network compression via factor transfer. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, page 2765–2774, Red Hook, NY, USA, 2018. Curran Associates Inc.
- Segment anything, 2023.
- Knowledge distillation by on-the-fly native ensemble, 2018.
- Exploring plain vision transformer backbones for object detection, 2022a.
- MViTv2: Improved multiscale vision transformers for classification and detection, 2022b.
- Improved baselines with visual instruction tuning, 2023a.
- Visual instruction tuning, 2023b.
- Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing, 415:106–113, 2020.
- Swin Transformer V2: Scaling up capacity and resolution, 2022a.
- A ConvNet for the 2020s, 2022b.
- Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
- Improved knowledge distillation via teacher assistant. In AAAI Conference on Artificial Intelligence, 2019.
- DINOv2: Learning robust visual features without supervision, 2023.
- EdgeViTs: Competing light-weight CNNs on mobile devices with vision transformers. In ECCV, 2022.
- Feature-level ensemble knowledge distillation for aggregating knowledge from multiple networks. In European Conference on Artificial Intelligence, 2020.
- Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Designing network design spaces, 2020.
- FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.
- LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs, 2021.
- Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Dynamic network quantization for efficient video inference. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7355–7365, Los Alamitos, CA, USA, 2021. IEEE Computer Society.
- Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
- EfficientNetV2: Smaller models and faster training. CoRR, abs/2104.00298, 2021.
- MaxViT: Multi-axis vision transformer, 2022.
- SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding, 2023.
- Contrastive learning rivals masked image modeling in fine-tuning via feature distillation, 2022.
- Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Self-training with Noisy Student improves ImageNet classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2020.
- Demystifying CLIP data, 2023.
- EvalAI: Towards better evaluation systems for AI agents, 2019.
- Model compression with two-stage multi-teacher knowledge distillation for web question answering system. In Proceedings of the 13th International Conference on Web Search and Data Mining, page 690–698, New York, NY, USA, 2020. Association for Computing Machinery.
- Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 1285–1294, New York, NY, USA, 2017. Association for Computing Machinery.
- MetaFormer is actually what you need for vision, 2022.
- Reinforced multi-teacher selection for knowledge distillation, 2020.
- Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
- Highlight every step: Knowledge distillation via collaborative teaching. IEEE Transactions on Cybernetics, 52(4):2070–2081, 2022.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
- Scene parsing through ADE20K dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5122–5130, 2017.
- Konrad Zuchniak. Multi-teacher knowledge distillation as an effective method for compressing ensembles of neural networks, 2023.