Learning State-Invariant Representations of Objects from Image Collections with State, Pose, and Viewpoint Changes (2404.06470v1)
Abstract: We add one more invariance, state invariance, to the invariances more commonly used when learning object representations for recognition and retrieval. By state invariance we mean robustness to changes in the structural form of an object, such as when an umbrella is folded or an item of clothing is tossed on the floor. Since humans generally have no difficulty recognizing objects despite such state changes, we are naturally faced with the question of whether a neural architecture can be devised with similar abilities. To that end, we present a novel dataset, ObjectsWithStateChange, that captures state and pose variations in object images recorded from arbitrary viewpoints. We believe this dataset will facilitate research in fine-grained recognition and retrieval of objects that are capable of state changes. The goal of such research is to train models that generate object embeddings invariant to state changes while also remaining invariant to transformations induced by changes in viewpoint, pose, illumination, and so on. To demonstrate the usefulness of ObjectsWithStateChange, we also propose a curriculum learning strategy that uses the similarity relationships in the learned embedding space after each epoch to guide the training process. The model learns discriminative features by comparing visually similar objects within and across categories, encouraging it to differentiate between objects that may be hard to distinguish because of changes in their state. We believe this strategy enhances the model's ability to capture discriminative features for fine-grained tasks involving objects with state changes, leading to performance improvements on object-level tasks not only on our new dataset but also on two other challenging multi-view datasets, ModelNet40 and ObjectPI.
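The epoch-wise curriculum step can be pictured with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: after each epoch, per-image embeddings are mean-pooled into one prototype per object, cosine similarity between prototypes identifies each object's most confusable neighbors, and those hard pairs would bias sampling in the next epoch. The function names (`object_prototypes`, `build_hard_pairs`), the prototype pooling, and the parameter `k_neighbors` are all hypothetical choices for illustration.

```python
import numpy as np

def object_prototypes(embeddings, object_ids):
    """Mean-pool per-image embeddings into one L2-normalized prototype per object."""
    ids = np.unique(object_ids)
    protos = np.stack([embeddings[object_ids == i].mean(axis=0) for i in ids])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)  # cosine space
    return ids, protos

def build_hard_pairs(embeddings, object_ids, k_neighbors=5):
    """After an epoch, pair each object with its k most similar other objects
    in the current embedding space; these confusable pairs would then guide
    sampling for the next epoch (hypothetical curriculum step)."""
    ids, protos = object_prototypes(embeddings, object_ids)
    sims = protos @ protos.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-similarity
    pairs = []
    for row, obj in enumerate(ids):
        nearest = np.argsort(-sims[row])[:k_neighbors]
        pairs.extend((obj, ids[j]) for j in nearest)
    return pairs

# Usage with dummy data standing in for the model's post-epoch embeddings:
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64)).astype(np.float32)  # 200 images, 64-D embeddings
obj = rng.integers(0, 20, size=200)                  # 20 distinct objects
hard = build_hard_pairs(emb, obj, k_neighbors=3)
# A batch sampler would oversample images from these hard pairs so the model
# is repeatedly asked to separate objects it currently confuses.
```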
- Rohan Sarkar
- Avinash Kak