Task Indicating Transformer for Task-conditional Dense Predictions (2403.00327v1)
Abstract: Task-conditional models form a distinctive stream of efficient multi-task learning. Existing works face a critical limitation in learning task-agnostic and task-specific representations: CNN-based architectures fall short in global context modeling, and their decoders lack multi-scale feature interaction. In this paper, we introduce a novel task-conditional framework, the Task Indicating Transformer (TIT), to tackle these challenges. We design a Mix Task Adapter module within the transformer block that incorporates a Task Indicating Matrix through matrix decomposition, enhancing long-range dependency modeling and enabling parameter-efficient feature adaptation by capturing intra- and inter-task features. Moreover, we propose a Task Gate Decoder module that harnesses a Task Indicating Vector and a gating mechanism for adaptive multi-scale feature refinement guided by task embeddings. Experiments on two public multi-task dense prediction benchmarks, NYUD-v2 and PASCAL-Context, demonstrate that our approach surpasses state-of-the-art task-conditional methods.
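To make the adapter idea concrete, below is a minimal numpy sketch of a task-conditional bottleneck adapter in the spirit the abstract describes: shared down/up projections model task-agnostic structure, while a per-task matrix obtained from a low-rank decomposition ("Task Indicating Matrix") conditions the bottleneck on the task. All dimensions, the initialization scale, and the exact placement of the task matrix are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, r, num_tasks = 64, 8, 3   # hidden size, bottleneck rank, number of tasks

# Shared (task-agnostic) bottleneck projections
W_down = rng.standard_normal((d_model, r)) * 0.1
W_up = rng.standard_normal((r, d_model)) * 0.1

# Per-task "indicating" matrix stored as a low-rank product A_t @ B_t
A = rng.standard_normal((num_tasks, r, r)) * 0.1
B = rng.standard_normal((num_tasks, r, r)) * 0.1


def mix_task_adapter(x, task_id):
    """Adapt token features x of shape (n_tokens, d_model) for one task.

    The shared projections are reused by every task; only the small
    r x r task matrix switches, keeping the per-task parameter cost low.
    """
    h = x @ W_down                     # project into the bottleneck
    h = h @ (A[task_id] @ B[task_id])  # task-specific mixing in the bottleneck
    h = np.maximum(h, 0.0)             # nonlinearity
    return x + h @ W_up                # residual adaptation back to d_model


x = rng.standard_normal((16, d_model))
y0 = mix_task_adapter(x, 0)
y1 = mix_task_adapter(x, 1)
print(y0.shape)                        # same shape as the input tokens
print(np.allclose(y0, y1))            # different tasks yield different features
```

Note how the per-task cost is only `2 * r * r` parameters, versus `2 * d_model * r` for the shared projections, which is the usual motivation for decomposing the task-specific part into a small low-rank factor.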