
Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers (2404.07292v1)

Published 10 Apr 2024 in cs.CV

Abstract: Solving image and video jigsaw puzzles poses the challenging task of rearranging image fragments or video frames from unordered sequences to restore meaningful images and video sequences. Existing approaches often hinge on discriminative models tasked with predicting either the absolute positions of puzzle elements or the permutation actions applied to the original data. Unfortunately, these methods face limitations in effectively solving puzzles with a large number of elements. In this paper, we propose JPDVT, an innovative approach that harnesses diffusion transformers to address this challenge. Specifically, we generate positional information for image patches or video frames, conditioned on their underlying visual content. This information is then employed to accurately assemble the puzzle pieces in their correct positions, even in scenarios involving missing pieces. Our method achieves state-of-the-art performance on several datasets.
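The final assembly step described in the abstract, placing each piece according to the positional information generated for it, can be illustrated with a minimal sketch. This is not the JPDVT implementation: the diffusion transformer is not shown, `pred_xy` is a hypothetical stand-in for the (row, column) coordinates such a model might produce for each shuffled patch, and the greedy nearest-cell matching in `assign_patches_to_grid` is an assumption made for illustration only.

```python
import numpy as np

def assign_patches_to_grid(pred_xy, grid_size):
    """Greedily match each patch to a unique grid cell.

    pred_xy   : (m, 2) array of hypothetical predicted (row, col) positions,
                one per available patch; m may be smaller than grid_size**2
                when pieces are missing.
    grid_size : side length of the square puzzle grid.
    Returns an (m,) integer array of row-major cell indices.
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    n_cells = grid_size * grid_size
    # Integer coordinates of every cell, in row-major order.
    cells = np.array(
        [(r, c) for r in range(grid_size) for c in range(grid_size)],
        dtype=float,
    )
    # Squared distance between every patch prediction and every cell.
    dist = ((pred_xy[:, None, :] - cells[None, :, :]) ** 2).sum(axis=-1)

    assignment = np.full(len(pred_xy), -1, dtype=int)
    cell_used = np.zeros(n_cells, dtype=bool)
    # Visit (patch, cell) pairs from closest to farthest and keep the
    # first valid pairing for each patch.
    for flat in np.argsort(dist, axis=None):
        patch, cell = divmod(int(flat), n_cells)
        if assignment[patch] == -1 and not cell_used[cell]:
            assignment[patch] = cell
            cell_used[cell] = True
    return assignment

# A 2x2 puzzle with one piece missing: three noisy position predictions.
pred = [[1.1, 0.9], [0.1, -0.1], [0.9, 0.2]]
placement = assign_patches_to_grid(pred, grid_size=2)
print(placement)
```

Because each cell is claimed at most once, unassigned cells correspond to missing pieces. Greedy matching is only one option; an optimal assignment solver such as `scipy.optimize.linear_sum_assignment` could be substituted if a globally minimal total distance is wanted.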

