Image Clustering Conditioned on Text Criteria

Published 27 Oct 2023 in cs.CV and cs.AI (arXiv:2310.18297v4)

Abstract: Classical clustering methods do not provide users with direct control of the clustering results, and the clustering results may not be consistent with the relevant criterion that a user has in mind. In this work, we present a new methodology for performing image clustering based on user-specified text criteria by leveraging modern vision-language models and large language models. We call our method Image Clustering Conditioned on Text Criteria (IC|TC), and it represents a different paradigm of image clustering. IC|TC requires a minimal and practical degree of human intervention and grants the user significant control over the clustering results in return. Our experiments show that IC|TC can effectively cluster images with various criteria, such as human action, physical location, or the person's mood, while significantly outperforming baselines.
