Rotary Position Embedding for Vision Transformer (2403.13298v2)

Published 20 Mar 2024 in cs.CV and cs.LG

Abstract: Rotary Position Embedding (RoPE) performs remarkably well on LLMs, especially for length extrapolation of Transformers. However, the impact of RoPE on computer vision domains has been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvements on ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines for applying RoPE to ViT, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit
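The abstract does not spell out the 2D formulation, so the sketch below is only a rough illustration of one common way to extend RoPE to a 2D patch grid: splitting each attention head's channels between the x and y axes (axial RoPE) and rotating query/key channel pairs by position-dependent angles. Function names, the frequency base, and the channel split are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch

def build_2d_axial_rope(h, w, head_dim, base=100.0):
    # Axial 2D RoPE (illustrative): half of the rotated channels follow the
    # x coordinate, the other half the y coordinate. Assumes head_dim % 4 == 0.
    dim_half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, dim_half, 2).float() / dim_half))
    ys, xs = torch.meshgrid(
        torch.arange(h).float(), torch.arange(w).float(), indexing="ij"
    )
    angles_x = xs.flatten()[:, None] * freqs[None, :]  # (h*w, dim_half/2)
    angles_y = ys.flatten()[:, None] * freqs[None, :]  # (h*w, dim_half/2)
    angles = torch.cat([angles_x, angles_y], dim=-1)   # (h*w, head_dim/2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    # x: (batch, heads, tokens, head_dim); rotate consecutive channel pairs
    # by the per-token angles, as in standard RoPE.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

if __name__ == "__main__":
    # Example: a 14x14 patch grid with 8 heads of dimension 64.
    cos, sin = build_2d_axial_rope(h=14, w=14, head_dim=64)
    q = torch.randn(2, 8, 14 * 14, 64)
    q_rot = apply_rope(q, cos, sin)
    print(q_rot.shape)  # torch.Size([2, 8, 196, 64])
```

Because the rotation depends only on patch coordinates, the same tables can be rebuilt for a larger grid at inference time, which is the resolution-extrapolation property the abstract highlights.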

Authors (4)
  1. Byeongho Heo (33 papers)
  2. Song Park (12 papers)
  3. Dongyoon Han (49 papers)
  4. Sangdoo Yun (71 papers)
Citations (13)