Independently Keypoint Learning for Small Object Semantic Correspondence (2404.02678v1)
Abstract: Semantic correspondence, the task of establishing correspondences between a pair of images of the same category or of similar scenes, remains challenging because of large intra-class appearance variation. In this paper, we introduce a novel problem called 'Small Object Semantic Correspondence (SOSC)'. This problem is challenging because the keypoints of small objects lie close together, which causes their features to fuse. Corresponding keypoints are difficult to identify from the fused features, and the individual keypoints are themselves difficult to recognize. To address this challenge, we propose the Keypoint Bounding box-centered Cropping (KBC) method, which increases the spatial separation between the keypoints of small objects and thereby facilitates independent learning of these keypoints. The KBC method is integrated into our proposed inference pipeline and can be easily incorporated into other methods, resulting in significant performance gains. Additionally, we introduce a novel framework, named KBCNet, which serves as our baseline model. KBCNet comprises a Cross-Scale Feature Alignment (CSFA) module and an efficient 4D convolutional decoder. The CSFA module aligns multi-scale features, enriching keypoint representations by integrating fine-grained features with deep semantic features. The decoder, built on efficient 4D convolution, ensures efficiency and rapid convergence. To validate the effectiveness of our method, we conduct extensive experiments on three widely used benchmarks: PF-PASCAL, PF-WILLOW, and SPair-71k. Our KBC method yields a performance improvement of 7.5% on the SPair-71k dataset.
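The idea behind KBC, as the abstract describes it, can be illustrated with a minimal sketch: crop a window centered on the bounding box of an object's keypoints and rescale it, so that keypoints that were a few pixels apart end up farther apart. This is not the authors' implementation. The function name `keypoint_bbox_crop`, the `margin` and `out_size` parameters, the square window, and the nearest-neighbour resize are all assumptions made for illustration.

```python
import numpy as np

def keypoint_bbox_crop(image, keypoints, margin=0.2, out_size=256):
    """Hypothetical helper: crop around the keypoints' bounding box and rescale.

    image:     H x W (x C) array.
    keypoints: N x 2 array-like of (x, y) pixel coordinates inside the image.
    margin:    assumed relative padding added on each side of the box.
    out_size:  assumed side length of the square output.
    Returns the resized crop and the keypoints in crop coordinates.
    """
    h, w = image.shape[:2]
    kps = np.asarray(keypoints, dtype=np.float64)

    # Bounding box of the keypoints and its center.
    x_min, y_min = kps.min(axis=0)
    x_max, y_max = kps.max(axis=0)
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

    # Square window: longest box side plus padding, at least one pixel wide.
    side = max(max(x_max - x_min, y_max - y_min) * (1.0 + 2.0 * margin), 1.0)
    half = side / 2.0

    # Clip the window to the image bounds.
    x0 = int(np.floor(max(cx - half, 0)))
    y0 = int(np.floor(max(cy - half, 0)))
    x1 = int(np.ceil(min(cx + half, w)))
    y1 = int(np.ceil(min(cy + half, h)))
    crop = image[y0:y1, x0:x1]

    # Nearest-neighbour resize via index lookup (keeps the sketch NumPy-only).
    ch, cw = crop.shape[:2]
    rows = np.minimum((np.arange(out_size) * ch / out_size).astype(int), ch - 1)
    cols = np.minimum((np.arange(out_size) * cw / out_size).astype(int), cw - 1)
    resized = crop[rows][:, cols]

    # Map keypoints into the resized crop's coordinate frame.
    scale = np.array([out_size / cw, out_size / ch])
    new_kps = (kps - np.array([x0, y0])) * scale
    return resized, new_kps

# Usage: three keypoints packed into a 20-pixel region of a 200 x 100 image.
image = np.zeros((100, 200, 3), dtype=np.uint8)
keypoints = [[90, 40], [110, 40], [100, 60]]
crop, crop_kps = keypoint_bbox_crop(image, keypoints)
```

Under these assumptions, the pixel distance between any two keypoints grows by roughly the ratio of `out_size` to the crop side, which is the separation effect the abstract attributes to KBC.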