
Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search (2403.10413v1)

Published 15 Mar 2024 in cs.CV

Abstract: Image segmentation is one of the most fundamental problems in computer vision and has drawn a great deal of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require many trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. Instead, we develop a multi-target multi-branch supernet method that not only fully utilizes the advantages of high-resolution features, but also finds the proper locations for placing multi-head self-attention modules. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models obtained via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of lightweight convolution layers and memory-efficient self-attention layers across branches at different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at https://github.com/MarvinYu1995/HyCTAS.
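The abstract's multi-objective search returns architectures on the Pareto frontier of latency (lower is better) and mIoU (higher is better). A minimal sketch of that selection step, with hypothetical candidate values not taken from the HyCTAS codebase:

```python
# Pareto-front filtering over candidate architectures, each scored as
# (latency_ms, miou). A candidate is kept if no other candidate is at
# least as good on both objectives and strictly better on one.
# Values below are illustrative, not measured HyCTAS results.

def pareto_front(candidates):
    """Return the non-dominated subset of (latency_ms, miou) pairs."""
    front = []
    for i, (lat_i, miou_i) in enumerate(candidates):
        dominated = any(
            lat_j <= lat_i and miou_j >= miou_i
            and (lat_j < lat_i or miou_j > miou_i)
            for j, (lat_j, miou_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((lat_i, miou_i))
    return front

# Toy candidates, e.g. sampled subnets from a supernet search.
archs = [(12.0, 70.1), (15.0, 72.3), (14.0, 71.0), (20.0, 72.0), (25.0, 73.5)]
print(pareto_front(archs))
# (20.0, 72.0) is dropped: (15.0, 72.3) is both faster and more accurate.
```

A real NAS loop would apply this filter to measured (latency, mIoU) scores each generation, e.g. as the non-dominated sort inside an NSGA-II-style evolutionary search.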

Authors (6)
  1. Hongyuan Yu (21 papers)
  2. Cheng Wan (48 papers)
  3. Mengchen Liu (48 papers)
  4. Dongdong Chen (164 papers)
  5. Bin Xiao (93 papers)
  6. Xiyang Dai (53 papers)
