Adaptive Patching for High-resolution Image Segmentation with Transformers (2404.09707v1)

Published 15 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Attention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g. microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention-based model if we are to use the smaller patch sizes that are favorable for segmentation. The typical workarounds are either custom, complex multi-resolution models or approximate attention schemes. We instead take inspiration from Adaptive Mesh Refinement (AMR) methods in HPC and adaptively patch the images, as a pre-processing step, based on the image details, reducing the number of patches fed to the model by orders of magnitude. This method has negligible overhead and works seamlessly with any attention-based model, i.e., it is a pre-processing step that any attention-based model can adopt without friction. We demonstrate superior segmentation quality over SoTA segmentation models on real-world pathology datasets while gaining a geomean speedup of $6.9\times$ for resolutions up to $64K^2$, on up to $2,048$ GPUs.
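
The adaptive patching idea described in the abstract can be illustrated with a quadtree-style subdivision: detailed regions are split into small patches while smooth regions are covered by a few large ones, shrinking the token sequence fed to the transformer. The sketch below is illustrative only; it assumes a simple per-patch detail criterion (pixel-intensity variance) and square, power-of-two images, whereas the paper's actual criterion, patch sizes, and implementation may differ.

```python
import numpy as np

def adaptive_patches(img, min_size=16, max_size=256, detail_thresh=50.0):
    """Quadtree-style adaptive patching of a square grayscale image.

    Blocks with high pixel-intensity variance (a stand-in detail criterion)
    are recursively subdivided down to min_size; smooth blocks are emitted
    as single large patches, so detailed regions get fine patches and
    uniform regions get coarse ones.
    """
    patches = []  # (row, col, size) for each emitted patch

    def split(y, x, size):
        block = img[y:y + size, x:x + size]
        # Subdivide while the block is oversized or still shows detail.
        if size > min_size and (size > max_size or block.var() > detail_thresh):
            half = size // 2
            for dy in (0, half):
                for dx in (0, half):
                    split(y + dy, x + dx, half)
        else:
            patches.append((y, x, size))

    # Assumes a square image whose side is a power-of-two multiple of min_size.
    split(0, 0, img.shape[0])
    return patches

# Toy example: a mostly uniform 1024x1024 image with one detailed corner.
img = np.zeros((1024, 1024), dtype=np.float32)
img[:128, :128] = np.random.rand(128, 128) * 255
patches = adaptive_patches(img)
print(f"{len(patches)} adaptive patches vs. {(1024 // 16) ** 2} uniform 16x16 patches")
```

Each emitted patch can then be resampled to a fixed token size before being fed to the encoder, which is what keeps the sequence length (and the quadratic attention cost) low for mostly homogeneous high-resolution images.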

