- The paper introduces Multi-Task Self-Training (MuST), a method that trains a general vision model by using teacher models on labeled data to generate pseudo-labels on unlabeled data, which then trains a student model across diverse tasks.
- MuST-trained models outperform both specialized supervised baselines and state-of-the-art self-supervised baselines across six vision tasks, and their performance continues to improve as more unlabeled data is added.
- The representations learned by MuST exhibit strong transferability to new tasks and datasets, highlighting the method's potential for creating more adaptive and generalizable vision models.
Multi-Task Self-Training for Learning General Representations
The paper "Multi-Task Self-Training for Learning General Representations" introduces Multi-Task Self-Training (MuST), an approach for improving the versatility of computer vision models. The central goal is to train a single model that performs well across a diverse set of vision tasks by making efficient use of both labeled and unlabeled data.
The proposed methodology involves three main steps. First, specialized teacher models are trained independently on labeled datasets for specific tasks, such as object classification or detection. These teachers are then used to generate pseudo-labels on a separate, unlabeled dataset. Finally, the pseudo-labeled dataset is used to train a single student model in a multi-task setup. Because the pseudo-labels come from diverse tasks, the student model learns general feature representations rather than task-specific ones.
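The three steps above can be sketched in code. The snippet below is a minimal illustration of the data flow, not the paper's implementation: the "teachers" are stand-in functions (real teachers would be trained networks), and the shapes, task names, and label formats are all assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy unlabeled image set (8 images, 16x16 RGB). Shapes are illustrative.
N, H, W, C = 8, 16, 16, 3
unlabeled_images = rng.normal(size=(N, H, W, C))

# Stand-in "teachers": in MuST these are specialized models trained on
# labeled data; here each is a deterministic function that produces a
# label of the right shape for its task, just to show the data flow.
def classification_teacher(img):
    return int(abs(img.sum()) * 1000) % 10          # pseudo class id in [0, 10)

def segmentation_teacher(img):
    return (img.sum(axis=-1) > 0).astype(np.int64)  # per-pixel map, (H, W)

def depth_teacher(img):
    return np.abs(img).mean(axis=-1)                # dense depth map, (H, W)

teachers = {
    "classification": classification_teacher,
    "segmentation": segmentation_teacher,
    "depth": depth_teacher,
}

# Step 2: every teacher labels the *same* unlabeled images, so each image
# ends up carrying one pseudo-label per task.
pseudo_dataset = [
    {"image": img, "labels": {task: t(img) for task, t in teachers.items()}}
    for img in unlabeled_images
]

# Step 3 (not implemented here): a student with a shared backbone and one
# head per task would minimize the sum of per-task losses on this dataset.
print(sorted(pseudo_dataset[0]["labels"]))        # ['classification', 'depth', 'segmentation']
print(pseudo_dataset[0]["labels"]["depth"].shape)  # (16, 16)
```

The key structural point is that the multi-task pseudo-labeled dataset gives every image a label for every task, which is what lets a single student train jointly on all of them.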
Numerical Results and Evaluation
The effectiveness of MuST is demonstrated through comprehensive experiments spanning six vision tasks: image classification, object detection, semantic segmentation, depth estimation, surface normal estimation, and 3D geometry estimation. Key findings include:
- Outperforming Baselines: MuST-trained models surpass both specialized supervised models and state-of-the-art self-supervised models, particularly when scaled to large datasets.
- Scalability: The approach shows considerable scalability with unlabeled or partially labeled datasets, suggesting that MuST efficiently aggregates information from various tasks.
- Transfer Learning: The paper shows that representations learned with MuST transfer well across different tasks and datasets, validating the method's generalization capabilities.
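A common way to test such transferability is a linear probe: freeze the pretrained backbone and train only a small head on a new task. The sketch below illustrates that protocol under stated assumptions: the "backbone" is a fixed random projection standing in for pretrained features (a real experiment would load checkpoint weights), and the dataset, sizes, and learning rate are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical frozen backbone: a fixed random ReLU projection standing in
# for pretrained features. All names and sizes here are illustrative.
D_IN, D_FEAT, N_CLASSES, N = 32, 16, 4, 200
W_backbone = rng.normal(size=(D_IN, D_FEAT)) / np.sqrt(D_IN)

def backbone(x):
    return np.maximum(x @ W_backbone, 0.0)

# Toy downstream task: random inputs and labels.
X = rng.normal(size=(N, D_IN))
y = rng.integers(0, N_CLASSES, size=N)
feats = backbone(X)  # computed once; the backbone stays frozen

def loss_and_grad(W):
    """Softmax cross-entropy of a linear head, plus its gradient."""
    logits = feats @ W
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()
    probs[np.arange(N), y] -= 1.0
    return loss, feats.T @ probs / N

# Train only the linear head ("linear probe") with plain gradient descent.
W_head = np.zeros((D_FEAT, N_CLASSES))
initial_loss, _ = loss_and_grad(W_head)
for _ in range(300):
    _, grad = loss_and_grad(W_head)
    W_head -= 0.2 * grad
final_loss, _ = loss_and_grad(W_head)
print(final_loss < initial_loss)  # True: the probe fits on top of frozen features
```

The better the frozen representations, the lower the probe's loss on real data; comparing probe accuracy across pretraining methods is how claims like the transfer result above are typically quantified.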
Implications and Future Developments
The implications of this research are significant for computer vision and representation learning. By showing that self-training can consolidate supervision from multiple tasks into a single generalist model, the paper opens a path toward more adaptive and generalizable vision models. The potential to enhance strong pre-existing models such as ALIGN by re-training them with MuST further demonstrates the technique's versatility.
The practical applications of MuST are broad. In scenarios where labeled data is hard to obtain, pseudo-labeling as explored in this paper offers a way to improve model performance without direct annotation effort. The flexibility of MuST in handling multiple vision tasks concurrently also points toward more efficient and cost-effective solutions in AI-driven industries.
Conclusion
In conclusion, Multi-Task Self-Training marks a significant contribution to vision-based machine learning. The research shows that bridging the supervised and self-supervised learning paradigms through multi-task pseudo-labeling can yield robust, scalable models for complex visual tasks. Future work might explore integrating self-supervised techniques with MuST to further broaden its applicability in rapidly evolving AI landscapes.