Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design (2305.13035v5)

Published 22 May 2023 in cs.CV and cs.LG

Abstract: Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, while also incurring less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path for a more informed scaling.

Analysis of "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design"

The paper "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design" proposes a novel approach to determining the optimal architectural shape for compute-efficient Vision Transformers (ViTs). The emphasis is on deriving compute-optimal model shapes—specifically their width and depth—by advancing the existing methodologies that focus solely on model sizes quantified by parameter counts. This approach introduces the shape-optimized ViT, or SoViT, which demonstrates competitive performance with significantly larger models trained under similar compute budgets. Specifically, SoViT-400m/14 attains a fine-tuning accuracy of 90.3% on the ILSVRC2012 benchmark, surpassing models like ViT-g/14 and approaching the performance of ViT-G/14 while incurring less than half the inference cost.

Contributions and Methodology

The paper addresses the growing trend of scaling model size without regard for computational constraints by aligning model design with predefined compute budgets. It critiques the conventional reliance on parameter count as the sole indicator of model size, advocating instead for consideration of "shape dimensions" such as depth and width. Previous studies found that transformer shape had minimal impact on scaling behavior in language modeling, but this paper emphasizes its critical role in vision tasks.
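
To make the "shape versus size" distinction concrete, here is a minimal sketch (not from the paper) that estimates a ViT encoder's parameter count from its shape dimensions, using the standard approximation of about 4 * width^2 attention parameters and 2 * width * mlp_dim MLP parameters per block. The two example shapes are hypothetical, chosen only to show that quite different width/depth/MLP settings can land on roughly the same parameter count, which is why parameter count alone underdetermines the architecture.

```python
from dataclasses import dataclass

@dataclass
class ViTShape:
    """Shape dimensions of a ViT encoder (illustrative helper, not from the paper)."""
    width: int    # token embedding dimension
    depth: int    # number of transformer blocks
    mlp_dim: int  # hidden size of the MLP sub-block

def approx_params(shape: ViTShape) -> int:
    # Per block: ~4 * width^2 for the Q/K/V/output projections and
    # ~2 * width * mlp_dim for the two MLP matrices; biases, layer norms,
    # embeddings and the classification head are small and ignored here.
    per_block = 4 * shape.width**2 + 2 * shape.width * shape.mlp_dim
    return shape.depth * per_block

# Two hypothetical shapes with similar parameter counts but different geometry.
wide_shallow = ViTShape(width=1536, depth=24, mlp_dim=6144)
narrow_deep = ViTShape(width=1152, depth=42, mlp_dim=4608)

for s in (wide_shallow, narrow_deep):
    print(s, f"-> ~{approx_params(s) / 1e6:.0f}M parameters")
```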

The authors employ a structured methodology to derive a scaling strategy for ViT architecture:

  1. Joint Functional Formulation: The authors propose a joint functional form that considers both compute and individual shape dimensions (width, depth, MLP size). This approach extends existing literature on compute-optimal scaling by encompassing multiple dimensions rather than focusing on a single parameter count.
  2. Star and Grid Sweeps: A two-step procedure of star and grid sweeps significantly reduces the number of computationally costly experiments required. The star sweep varies one shape dimension at a time around a central architecture, so each dimension's scaling behavior can be estimated independently, while the grid sweep identifies a baseline compute-optimal architecture at small compute budgets.
  3. Analytical Scaling Laws: Scaling exponents are then fit to the sweep results; these exponents determine how strongly each shape dimension should grow as the compute budget increases (a toy illustration of this fitting step follows the list).
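
As a rough illustration of the last step, the sketch below (my own, not the authors' code) fits a power law x*(t) ≈ g * t^s to hypothetical star-sweep results for a single shape dimension and extrapolates its compute-optimal value to a larger budget. The data points, compute units, and names are placeholders; the paper fits its own functional form and reports the resulting exponents.

```python
import numpy as np

# Hypothetical star-sweep outcome for one shape dimension (e.g. width):
# each pair is (pre-training compute, dimension value that minimised
# validation loss at that compute). All numbers are made up.
compute = np.array([1e9, 4e9, 1.6e10, 6.4e10])
best_dim = np.array([384.0, 512.0, 704.0, 960.0])

# Fit x*(t) ~= g * t**s by linear regression in log-log space.
s, log_g = np.polyfit(np.log(compute), np.log(best_dim), deg=1)
g = np.exp(log_g)
print(f"fitted scaling exponent s = {s:.3f}")

# Extrapolate the compute-optimal value of this dimension to a larger budget.
target_compute = 1e12
print(f"predicted optimal dimension at 1e12 units of compute: {g * target_compute**s:.0f}")
```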

Experimental Results

The practical impact of their optimized approach is empirically validated across various vision tasks. The SoViT-400m/14 shows competitive performance not only in standard image classification benchmarks but also in diversified tasks like image captioning, optical character recognition, and multimodal VQA (Visual Question Answering). Furthermore, despite being smaller, SoViT models demonstrate the ability to match or exceed much larger ViT models in zero-shot classification and multi-task decoding efficiency. These results challenge existing norms suggesting that simply enlarging models leads to better performance, highlighting instead a more sophisticated balance between model shape and compute resource allocation.

Implications and Future Directions

This research prompts a reevaluation of strategies for designing neural architectures for large-scale vision tasks. By optimizing model shape alongside compute, practitioners can realize efficiency gains that matter given the increasing computational and environmental costs of training large AI models.

The findings encourage broader exploration of multi-dimensional scaling laws in other AI domains beyond vision and propose an alternative path in model development that emphasizes efficiency over mere scale. It would be interesting to see future research that further generalizes this approach and possibly extends it to LLMs or other architectures like recurrent or convolutional neural networks. Additionally, deploying such optimized models in real-world systems where computational resources are at a premium could be a fruitful area for exploration.

In conclusion, the paper brings forward an important perspective on how AI models, particularly transformers, can be reimagined to better align with the realities of available computational power while maintaining state-of-the-art performance.

Authors (4)
  1. Ibrahim Alabdulmohsin (31 papers)
  2. Xiaohua Zhai (51 papers)
  3. Alexander Kolesnikov (44 papers)
  4. Lucas Beyer (46 papers)