
DiffAugment: Diffusion based Long-Tailed Visual Relationship Recognition (2401.01387v2)

Published 1 Jan 2024 in cs.CV

Abstract: The task of Visual Relationship Recognition (VRR) aims to identify relationships between two interacting objects in an image and is particularly challenging due to the wide and highly imbalanced distribution of <subject, relation, object> triplets. To overcome the resultant performance bias in existing VRR approaches, we introduce DiffAugment, a method that first augments the tail classes in the linguistic space by making use of WordNet and then uses the generative power of Diffusion Models to expand the visual space for minority classes. We propose a novel hardness-aware component in diffusion, based on the hardness of each <S,R,O> triplet, and demonstrate the effectiveness of hardness-aware diffusion in generating visual embeddings for the tail classes. We also propose a novel subject- and object-based seeding strategy for diffusion sampling, which improves the discriminative capability of the generated visual embeddings. Extensive experimentation on the GQA-LT dataset shows favorable gains in the subject/object and relation average per-class accuracy when using diffusion-augmented samples.
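The abstract reports results as average per-class accuracy. As a minimal sketch of why that metric matters for long-tailed data, the snippet below computes it under the usual definition (accuracy is computed separately for each class, then averaged with equal weight). The function name and the toy labels are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from collections import defaultdict


def average_per_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies; every class counts equally.

    Assumed standard definition, not necessarily the paper's exact protocol.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Only classes that appear in y_true contribute to the average.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Made-up relation labels: "on" plays a head class, "riding" a tail class.
y_true = ["on"] * 8 + ["riding"] * 2
y_pred = ["on"] * 8 + ["on", "riding"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = average_per_class_accuracy(y_true, y_pred)
print(overall, per_class)
```

In this toy case the head class dominates plain accuracy, while the per-class average is pulled down by the poorly recognized tail class, which is the kind of bias the abstract says the method targets.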

