NOAH: Learning Pairwise Object Category Attentions for Image Classification (2402.02377v1)
Abstract: A modern deep neural network (DNN) for image classification tasks typically consists of two parts: a backbone for feature extraction, and a head for feature encoding and class prediction. We observe that the head structures of mainstream DNNs adopt a similar feature encoding pipeline, exploiting global feature dependencies while disregarding local ones. In this paper, we revisit the feature encoding problem and propose Non-glObal Attentive Head (NOAH), which relies on a new form of dot-product attention called pairwise object category attention (POCA), efficiently exploiting spatially dense category-specific attentions to augment classification performance. NOAH introduces a neat combination of feature split, transform and merge operations to learn POCAs at local to global scales. As a drop-in design, NOAH can easily replace the existing heads of various types of DNNs, improving classification performance while maintaining similar model efficiency. We validate the effectiveness of NOAH on the ImageNet classification benchmark with 25 DNN architectures spanning convolutional neural networks, vision transformers and multi-layer perceptrons. In general, NOAH significantly improves the performance of lightweight DNNs, e.g., showing 3.14%|5.3%|1.9% top-1 accuracy gains for MobileNetV2 (0.5x)|DeiT-Tiny (0.5x)|gMLP-Tiny (0.5x). NOAH also generalizes well to medium-size and large-size DNNs. We further show that NOAH retains its efficacy on other popular multi-class and multi-label image classification benchmarks and under different training regimes, e.g., showing 3.6%|1.1% mAP gains for ResNet101|ViT-Large on the MS-COCO dataset. Project page: https://github.com/OSVAI/NOAH.
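To make the described head structure concrete, below is a minimal PyTorch sketch of what a POCA-style head could look like, reconstructed from the abstract alone: features are split along channels, each split computes spatially dense, category-specific attention maps via dot-product-style scoring, and the per-split class scores are merged. The class name `POCAHeadSketch`, the two 1x1-conv branches, and the default split count are assumptions for illustration, not the authors' implementation (which is available at the project page above).

```python
# Hypothetical sketch of a POCA-style head, based only on the abstract's
# description (feature split, spatially dense category-specific attention,
# merge). Names and structure are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class POCAHeadSketch(nn.Module):
    def __init__(self, in_channels: int, num_classes: int, num_splits: int = 2):
        super().__init__()
        assert in_channels % num_splits == 0
        self.num_splits = num_splits
        c = in_channels // num_splits
        # Per-split 1x1 convs: one branch yields per-location class logits,
        # the other yields per-location, per-class attention logits.
        self.logit_convs = nn.ModuleList(
            nn.Conv2d(c, num_classes, kernel_size=1) for _ in range(num_splits))
        self.attn_convs = nn.ModuleList(
            nn.Conv2d(c, num_classes, kernel_size=1) for _ in range(num_splits))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) backbone feature map.
        splits = torch.chunk(x, self.num_splits, dim=1)    # feature split
        scores = []
        for feat, lc, ac in zip(splits, self.logit_convs, self.attn_convs):
            logits = lc(feat).flatten(2)                   # (B, K, H*W)
            attn = F.softmax(ac(feat).flatten(2), dim=-1)  # spatial attention per class
            # Pairwise aggregation: each class k pools its own logit map
            # with its own attention map over all spatial positions.
            scores.append((logits * attn).sum(dim=-1))     # (B, K)
        return torch.stack(scores, dim=0).mean(dim=0)      # merge across splits

# Usage: drop-in replacement for a global-average-pool + linear head.
# head = POCAHeadSketch(in_channels=1280, num_classes=1000)
# preds = head(backbone_features)  # backbone_features: (B, 1280, 7, 7)
```

The point this sketch illustrates is the abstract's contrast with conventional heads: rather than averaging the feature map globally before a single linear classifier, each class aggregates evidence with its own spatial attention map, so local dependencies are not averaged away.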