- The paper presents a novel unification of visual task outputs, using soft tokens that represent outputs as probability distributions over codebook entries rather than discrete one-hot labels.
- It employs mask augmentation during VQ-VAE training to improve robustness to corrupted or incomplete labels, achieving an RMSE of 0.275 on NYUv2 depth estimation.
- The unified framework pairs a Swin Transformer-based encoder-decoder task solver with a lightweight VQ-VAE tokenizer that adds only 2M parameters and 0.06G FLOPs.
Overview of "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token"
The paper "All in Tokens: Unifying Output Space of Visual Tasks via Soft Token" presents a novel approach for developing a unified model to handle diverse visual tasks. Unlike typical natural language tasks where outputs are restricted to a set of discrete tokens, visual tasks have complex and varied output spaces, challenging the creation of a unified model. This research employs a single model to address two representative visual tasks: instance segmentation and depth estimation, demonstrating promising results using innovative methods such as soft tokens and mask augmentation.
Key Concepts and Techniques
- Soft Tokens: The paper introduces soft tokens to represent task outputs. In contrast to hard tokens, which one-hot encode outputs, soft tokens assign a probability to each codebook embedding. The resulting interpolable output space improves both next-token inference and output decoding, and an auxiliary loss on the decoded output enables end-to-end learning (see the soft-token sketch after this list).
- Mask Augmentation: Many visual tasks, such as depth estimation, involve corrupted or undefined labels. The authors propose mask augmentation to address this: portions of the input are randomly masked during VQ-VAE training while the reconstruction is still supervised with the full ground truth, so the model learns to recover accurate depth maps from incomplete data (a training-step sketch also follows this list).
- Architecture and Framework: The framework employs VQ-VAE for output tokenization and a Swin Transformer-based encoder-decoder model as the task solver. The VQ-VAE encoder and decoder act as tokenizer and detokenizer, encoding task outputs into tokens and reconstructing them back into task-specific formats.
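The following minimal sketch illustrates the soft-token idea in PyTorch. The tensor shapes, the `vq_decoder` call, and the L1 auxiliary loss are illustrative assumptions rather than the authors' exact implementation; the point is that a probability-weighted mixture of codebook embeddings keeps the pipeline differentiable end-to-end.

```python
import torch
import torch.nn.functional as F

def soft_token_embedding(logits, codebook):
    """Soft-token lookup: weight every codebook entry by its predicted
    probability instead of committing to the argmax (hard token).

    logits:   (B, N, K) per-position scores over K codebook entries
    codebook: (K, D) VQ-VAE codebook embeddings
    returns:  (B, N, D) interpolated embeddings fed to the detokenizer
    """
    probs = F.softmax(logits, dim=-1)   # (B, N, K) soft assignment
    return probs @ codebook             # probability-weighted mixture

def auxiliary_task_loss(logits, codebook, vq_decoder, target):
    # Hypothetical auxiliary loss: decode the soft embeddings with the
    # VQ-VAE decoder and compare against the ground-truth task output,
    # so gradients flow end-to-end through the task solver.
    soft_emb = soft_token_embedding(logits, codebook)
    pred = vq_decoder(soft_emb)         # e.g. a reconstructed depth map
    return F.l1_loss(pred, target)
```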
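Mask augmentation can be pictured with a similarly hedged training-step sketch. The patch-wise masking scheme, the mask ratio, and the assumption that the VQ-VAE returns a reconstruction plus a quantization loss are illustrative choices, not the paper's exact recipe; the essential point is that the input is corrupted while the loss still targets the complete ground truth.

```python
import torch
import torch.nn.functional as F

def masked_vqvae_step(vqvae, depth_gt, mask_ratio=0.5, patch=16):
    """One illustrative VQ-VAE training step with mask augmentation.

    Random patches of the ground-truth depth map are zeroed out before
    encoding, while the reconstruction loss is still computed against
    the complete map, encouraging robustness to corrupted labels.
    depth_gt: (B, 1, H, W) ground-truth depth maps.
    """
    B, _, H, W = depth_gt.shape
    # Patch-level binary keep-mask, upsampled to pixel resolution.
    keep = (torch.rand(B, 1, H // patch, W // patch, device=depth_gt.device)
            > mask_ratio).float()
    keep = F.interpolate(keep, size=(H, W), mode="nearest")
    corrupted = depth_gt * keep               # mask out random regions
    recon, vq_loss = vqvae(corrupted)         # encode, quantize, decode (assumed API)
    recon_loss = F.l1_loss(recon, depth_gt)   # supervise with full ground truth
    return recon_loss + vq_loss
```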
Numerical Results and Claims
The proposed method achieves an RMSE of 0.275 on NYUv2 depth estimation, surpassing previous state-of-the-art results on this benchmark, and the claim is backed by extensive evaluations in the paper. The lightweight VQ-VAE requires only 2M parameters and 0.06G FLOPs, indicating that the unified approach adds little overhead and is practical to deploy.
Implications and Future Directions
The implications of this research are substantial, both theoretically and practically. A unified output space has the potential to simplify the modeling of multi-task visual networks and to reduce the computation needed to solve diverse visual tasks. The authors suggest the framework can be extended to other visual tasks, presenting an exciting avenue for future research in AI model unification.
One particularly interesting future direction is exploring the balance between auto-regressive and parallel decoding. The paper's initial investigations into parallel decoding show promising results, pointing towards more efficient ways of handling complex visual tasks; the two decoding modes are contrasted in the sketch below.
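The contrast between the two decoding modes can be summarized in a short sketch. The `task_solver` interface (image features plus a token sequence in, per-position logits out), the greedy argmax selection, and the single mask-placeholder pass for parallel decoding are assumptions for illustration only, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def decode_autoregressive(task_solver, image_feats, seq_len, start_token):
    """Token-by-token decoding: each step conditions on previously
    generated tokens, costing O(seq_len) decoder passes."""
    tokens = [start_token]
    for _ in range(seq_len):
        logits = task_solver(image_feats, torch.tensor([tokens]))  # (1, t, K)
        tokens.append(int(logits[0, -1].argmax()))                 # greedy pick
    return tokens[1:]

@torch.no_grad()
def decode_parallel(task_solver, image_feats, seq_len, mask_token):
    """Parallel decoding: predict every position in a single pass from
    mask placeholders; far cheaper, at some cost in accuracy."""
    placeholders = torch.full((1, seq_len), mask_token)
    logits = task_solver(image_feats, placeholders)                # (1, seq_len, K)
    return logits.argmax(dim=-1)[0].tolist()
```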
Conclusion
This paper provides meaningful contributions to the field of unified AI models for visual tasks, leveraging soft tokens and innovative training techniques to address the inherent diversity of visual task outputs. Future research expanding on these findings could significantly advance the field, offering more harmonized and efficient solutions for multi-task visual processing.
The general-purpose task solver, AiT, exemplifies a solid step forward in the quest for unified visual modeling, encouraging further exploration and validation in varied applications.