Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task Learning (2410.14633v2)
Abstract: Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, due to inherent representation biases originating from their different training paradigms, VFMs exhibit distinct advantages and disadvantages across vision tasks. Although amalgamating the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting these biases remains a significant challenge. In this paper, we propose a novel and versatile "Swiss Army Knife" (SAK) solution, which adaptively distills knowledge from a committee of VFMs to enhance multi-task learning. Unlike existing methods that use a single backbone for knowledge transfer, our approach preserves the unique representation bias of each teacher by pairing lightweight Teacher-Specific Adapter Path modules with a shared Teacher-Agnostic Stem. By dynamically selecting and combining representations with Mixture-of-Representations Routers, SAK synergizes the complementary strengths of multiple VFMs. Extensive experiments show that SAK outperforms the prior state of the art in multi-task learning by 10% on the NYUD-v2 benchmark, while also providing a flexible and robust framework that can readily accommodate more advanced model designs. Project page: https://innovator-zero.github.io/SAK/ .
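The gating idea behind the Mixture-of-Representations Routers can be illustrated with a minimal, framework-free sketch: each teacher's adapter path yields a feature vector, and a router produces softmax weights that combine them. This is an illustrative toy (function names, toy dimensions, and fixed gate logits are assumptions, not the authors' implementation, which predicts routing weights from the input):

```python
import math

def softmax(logits):
    # Numerically stable softmax over the router's per-teacher scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_representations(teacher_feats, gate_logits):
    """Fuse per-teacher feature vectors with softmax gating weights.

    teacher_feats: K feature vectors of equal length (one per VFM teacher).
    gate_logits:   K router scores (in SAK these would be input-dependent).
    """
    weights = softmax(gate_logits)
    dim = len(teacher_feats[0])
    return [sum(w * feats[d] for w, feats in zip(weights, teacher_feats))
            for d in range(dim)]

# Toy example: three teachers (e.g., DINOv2-, CLIP-, SAM-style features),
# 4-dimensional feature vectors; equal logits give uniform fusion weights.
feats = [[1.0, 0.0, 0.0, 2.0],
         [0.0, 1.0, 0.0, 2.0],
         [0.0, 0.0, 1.0, 2.0]]
fused = mixture_of_representations(feats, gate_logits=[0.0, 0.0, 0.0])
```

With equal gate logits, each teacher contributes a weight of 1/3; a task-specific router would instead learn to up-weight the teacher whose representation bias suits that task.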