
Self-supervised Photographic Image Layout Representation Learning (2403.03740v2)

Published 6 Mar 2024 in cs.CV and cs.MM

Abstract: In image layout representation learning, the task of translating image layouts into succinct vector forms is increasingly significant across diverse applications such as image retrieval, manipulation, and generation. Most existing approaches rely heavily on costly labeled datasets and fail to adapt their modeling and learning methods to the specific characteristics of photographic image layouts, making the learning process for such layouts suboptimal. In our research, we address these challenges directly. We define basic layout primitives that encapsulate various levels of layout information, and we map these primitives, along with their interconnections, onto a heterogeneous graph structure engineered to capture the layout information in the pixel domain explicitly. We then introduce novel pretext tasks, coupled with customized loss functions, designed for effective self-supervised learning of these layout graphs. Building on this foundation, we develop an autoencoder-based network architecture that compresses these heterogeneous layout graphs into compact, low-dimensional layout representations. Additionally, we introduce the LODB dataset, which features a broader range of layout categories and richer semantics, serving as a comprehensive benchmark for evaluating layout representation learning methods. Extensive experimentation on this dataset demonstrates the superior performance of our approach in photographic image layout representation learning.
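
The pipeline described above, layout primitives assembled into a graph that an autoencoder compresses into a single layout vector, can be illustrated with a minimal sketch. The code below is a hypothetical plain-PyTorch toy, not the authors' implementation: the primitive features (bounding-box geometry plus a class id), the graph construction, and the adjacency-reconstruction pretext loss are all illustrative assumptions, and the graph here is homogeneous rather than heterogeneous for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the compress-and-reconstruct idea from the
# abstract: nodes are layout primitives (e.g. detected regions described
# by box geometry and a class id), edges encode their interconnections,
# and a graph autoencoder yields one vector per layout. This is NOT the
# paper's architecture.

class GraphLayer(nn.Module):
    """One round of mean-aggregation message passing."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        # adj_norm: (N, N) row-normalized adjacency; x: (N, in_dim)
        return torch.relu(self.lin(adj_norm @ x))

class LayoutAutoencoder(nn.Module):
    def __init__(self, feat_dim=5, hidden_dim=64, code_dim=32):
        super().__init__()
        self.enc1 = GraphLayer(feat_dim, hidden_dim)
        self.enc2 = GraphLayer(hidden_dim, code_dim)

    def forward(self, x, adj_norm):
        h = self.enc1(x, adj_norm)
        z = self.enc2(h, adj_norm)   # per-primitive codes, (N, code_dim)
        layout_vec = z.mean(dim=0)   # graph-level layout representation
        adj_logits = z @ z.t()       # inner-product edge decoder
        return layout_vec, adj_logits

# Toy usage: 6 primitives, each described by (x, y, w, h, class_id).
N = 6
feats = torch.randn(N, 5)
adj = (torch.rand(N, N) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()                        # symmetrize
adj_norm = adj / adj.sum(dim=1, keepdim=True).clamp(min=1)

model = LayoutAutoencoder()
layout_vec, adj_logits = model(feats, adj_norm)

# Self-supervised pretext loss: reconstruct which primitives are linked.
loss = F.binary_cross_entropy_with_logits(adj_logits, adj)
loss.backward()
print(layout_vec.shape)  # torch.Size([32])
```

The inner-product decoder and mean pooling stand in for the paper's customized pretext tasks and losses; the point is only that a self-supervised reconstruction signal defined over the layout graph can train a compact layout embedding without labels.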

