Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2101.11986v3)

Published 28 Jan 2021 in cs.CV

Abstract: Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0\% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3\% top1 accuracy in image resolution 384$\times$384 on ImageNet. (Code: https://github.com/yitu-opensource/T2T-ViT)
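The core of the T2T transformation described in the abstract is re-tokenization with overlapping windows: the token sequence is reshaped back into a 2-D grid, and each k x k neighborhood (with stride smaller than k, so neighborhoods overlap) is concatenated into one longer token. Below is a minimal NumPy sketch of a single such step, assuming a square grid; the function name and padding choices are illustrative, not the authors' code (which uses PyTorch's `nn.Unfold`).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def soft_split(tokens, h, w, k=3, s=2, p=1):
    """One Tokens-to-Token step (illustrative sketch).

    Reshape the (n, c) token sequence into an (h, w, c) grid, zero-pad by
    p, then gather every k x k window at stride s. Each window's tokens
    are concatenated into one token of length k*k*c, so local structure
    among neighboring tokens is mixed in while the token count shrinks.
    """
    n, c = tokens.shape
    assert n == h * w
    grid = np.pad(tokens.reshape(h, w, c), ((p, p), (p, p), (0, 0)))
    # all k x k neighborhoods over the two spatial axes, subsampled by s
    win = sliding_window_view(grid, (k, k), axis=(0, 1))[::s, ::s]
    h2, w2 = win.shape[:2]
    return win.reshape(h2 * w2, k * k * c), h2, w2

# 14x14 grid of 64-dim tokens -> 7x7 grid of 576-dim tokens
out, h2, w2 = soft_split(np.random.randn(14 * 14, 64), 14, 14)
print(out.shape, (h2, w2))  # (49, 576) (7, 7)
```

Because the stride (2) is smaller than the window (3), adjacent output tokens share input tokens, which is what lets the T2T module model local structure that ViT's non-overlapping patch split misses.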

Guidelines for Preparing ICCV Proceedings in \LaTeX

This document offers detailed instructions on preparing papers for submission to the IEEE International Conference on Computer Vision (ICCV) using the \LaTeX\ document preparation system. The paper's authors provide a comprehensive set of guidelines to ensure consistency and quality across all submissions.

Key Topics Covered

The paper is structured to address several critical areas as outlined below:

  1. Abstract and Introduction: The abstract must be in 10-point, fully-justified italic text, positioned at the top of the left-hand column. The introduction begins two lines below the abstract, presenting a high-level overview of the requirements for manuscript submission to the IEEE Computer Society Press.
  2. Language and Dual Submission: Authors are reminded that all submissions must be in English. The guidelines also highlight the rules against dual submissions, advising authors to refer to the ICCV web page for detailed policies.
  3. Paper Length: Submissions should not exceed eight pages, excluding references. Importantly, no exceptions for overlength papers will be made, emphasizing the need to adhere strictly to formatting guidelines.
  4. Formatting Elements:
    • The Ruler: A printed ruler is included in the \LaTeX\ style for review purposes, enabling reviewers to reference specific lines.
    • Mathematics: All sections and displayed equations must be numbered for clarity and ease of reference.
    • Blind Review: Instructions for maintaining anonymity during the review process are provided. This includes appropriate citation practices and the handling of technical reports.
  5. Figures and Tables: The paper includes guidance on figure and table captions, advising on font sizes and placement to ensure clarity and consistency.
  6. Miscellaneous: The paper touches on various formatting issues, including the appropriate use of spacing, capitalization, and special macros.
  7. Type Styles and Fonts: Detailed specifications on type styles and fonts are provided, emphasizing the use of Times Roman or its closest available alternative.
  8. Footnotes and References: Authors are cautioned to use footnotes sparingly and are given precise formatting instructions for bibliographical references, which should be listed in 9-point Times.
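Several of the rules above map directly onto a small amount of LaTeX. The fragment below is a hedged sketch of how such a document might be set up using only standard LaTeX constructs; the actual ICCV template ships its own style file with its own macros, so treat this as illustrative rather than the official preamble.

```latex
% Illustrative only -- the official ICCV style file supplies the real macros.
\documentclass[10pt,twocolumn,letterpaper]{article}
\usepackage{times}            % Times Roman, per the type-style guideline

\begin{document}
\title{Paper Title}
\maketitle

\begin{abstract}
\itshape                      % 10-point, fully-justified italic abstract
The abstract text goes here, at the top of the left-hand column.
\end{abstract}

\section{Introduction}        % sections are numbered
Displayed equations are numbered so reviewers can reference them:
\begin{equation}
E = mc^2
\end{equation}

{\small                       % references set in 9-point type
\bibliographystyle{ieee}
\bibliography{egbib}}
\end{document}
```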

Practical and Theoretical Implications

Adhering to these guidelines serves multiple purposes:

  • Uniformity: Ensures that all submissions appear consistent, facilitating a smoother review process.
  • Clarity and Professionalism: Well-formatted papers are easier to read and reflect the level of professionalism expected at high-impact conferences like ICCV.
  • Review Efficiency: By providing specific instructions on elements like the printed ruler and blind review, the guidelines aim to make the review process more efficient and less error-prone.

Potential Developments

Future iterations of these guidelines could benefit from more detailed instructions on the incorporation of interactive and multimedia elements, reflecting the growing importance of these in modern research presentations. Additionally, guidelines could evolve to offer more automated validation tools, ensuring that submissions meet formatting standards without requiring extensive manual checks.

In conclusion, this document lays a robust foundation for preparing research papers for ICCV, contributing significantly to the standardization and quality control of submissions. The detailed instructions provided here are indispensable for authors aiming to present their work at this premier venue in computer vision.

Authors (9)
  1. Li Yuan (141 papers)
  2. Yunpeng Chen (36 papers)
  3. Tao Wang (700 papers)
  4. Weihao Yu (36 papers)
  5. Yujun Shi (23 papers)
  6. Zihang Jiang (28 papers)
  7. Francis EH Tay (5 papers)
  8. Jiashi Feng (295 papers)
  9. Shuicheng Yan (275 papers)
Citations (1,725)