
Sub-token ViT Embedding via Stochastic Resonance Transformers (2310.03967v2)

Published 6 Oct 2023 in cs.CV and cs.AI

Abstract: Vision Transformer (ViT) architectures represent images as collections of high-dimensional vectorized tokens, each corresponding to a rectangular non-overlapping patch. This representation trades spatial granularity for embedding dimensionality, yielding feature maps that are semantically rich but spatially coarsely quantized. To recover spatial details beneficial to fine-grained inference tasks, we propose a training-free method inspired by "stochastic resonance". Specifically, we apply sub-token spatial transformations to the input data and aggregate the resulting ViT features after applying the inverse transformation. The resulting "Stochastic Resonance Transformer" (SRT) retains the rich semantic information of the original representation but grounds it on a finer-scale spatial domain, partly mitigating the coarsening effect of spatial tokenization. SRT is applicable to any layer of any ViT architecture, consistently boosting performance on several tasks including segmentation, classification, and depth estimation by up to 14.9% without any fine-tuning.
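The procedure in the abstract — shift the input by sub-token offsets, extract coarse token features, invert the shift on the fine grid, and average — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `dummy_vit_features` is a hypothetical stand-in for a real ViT encoder (it mean-pools each patch into one "token"), and cyclic `np.roll` shifts are an assumption in place of whatever padding/cropping the authors use at image borders.

```python
import numpy as np

def dummy_vit_features(image, patch=4):
    # Hypothetical stand-in for a ViT encoder: mean-pools each
    # non-overlapping patch x patch block into a single "token" value,
    # producing a coarse (H/patch, W/patch) feature map.
    H, W = image.shape
    return image.reshape(H // patch, patch, W // patch, patch).mean(axis=(1, 3))

def srt_features(image, patch=4, shifts=None):
    """Training-free sub-token ensembling in the spirit of SRT:
    perturb the input with sub-token spatial shifts, extract coarse
    features, apply the inverse shift on the fine grid, and average."""
    if shifts is None:
        # All sub-token offsets within one patch.
        shifts = [(dy, dx) for dy in range(patch) for dx in range(patch)]
    H, W = image.shape
    acc = np.zeros((H, W))
    for dy, dx in shifts:
        shifted = np.roll(image, shift=(-dy, -dx), axis=(0, 1))  # sub-token transform
        tokens = dummy_vit_features(shifted, patch)              # coarse feature map
        fine = np.kron(tokens, np.ones((patch, patch)))          # upsample to pixel grid
        acc += np.roll(fine, shift=(dy, dx), axis=(0, 1))        # inverse transform
    return acc / len(shifts)
```

Each individual pass still produces a coarsely quantized map; averaging the un-shifted passes is what grounds the token features on the finer pixel grid, analogous to how noise plus averaging recovers sub-threshold signals in classical stochastic resonance.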
