Image Clustering Conditioned on Text Criteria
Abstract: Classical clustering methods give users no direct control over the clustering results, so the results may not be consistent with the criterion the user has in mind. In this work, we present a new methodology for clustering images according to user-specified text criteria by leveraging modern vision-language models (VLMs) and large language models (LLMs). We call our method Image Clustering Conditioned on Text Criteria (IC|TC), and it represents a different paradigm of image clustering. IC|TC requires a minimal and practical degree of human intervention and, in return, grants the user significant control over the clustering results. Our experiments show that IC|TC can effectively cluster images under various criteria, such as human action, physical location, or the person's mood, while significantly outperforming baselines.
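To make the idea of clustering "conditioned on a text criterion" concrete, here is a minimal Python sketch. The abstract does not spell out the individual stages, so the three-stage split used here (describe each image with respect to the criterion, propose cluster names, assign each image to a name) is an assumption, and every function and field name (`cluster_by_text_criterion`, `describe`, `propose_labels`, `assign_label`, the `action` and `location` fields) is hypothetical. The vision-language model and the language model are replaced by toy stand-ins so the example runs offline.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


def cluster_by_text_criterion(
    images: List[Any],
    criterion: str,
    describe: Callable[[Any, str], str],
    propose_labels: Callable[[List[str], str], List[str]],
    assign_label: Callable[[str, List[str], str], str],
) -> Dict[str, List[Any]]:
    """Group images under a user-supplied text criterion (illustrative only)."""
    # Stage 1 (assumed): a vision-language model describes each image
    # with respect to the criterion.
    descriptions = [describe(image, criterion) for image in images]
    # Stage 2 (assumed): a language model proposes cluster names
    # from the collected descriptions.
    labels = propose_labels(descriptions, criterion)
    # Stage 3 (assumed): a language model assigns each description
    # to one of the proposed cluster names.
    clusters: Dict[str, List[Any]] = defaultdict(list)
    for image, description in zip(images, descriptions):
        clusters[assign_label(description, labels, criterion)].append(image)
    return dict(clusters)


# Toy stand-ins for the models; a real system would call a VLM and an LLM here.
def toy_describe(image: dict, criterion: str) -> str:
    return image[criterion]  # reads a pre-written annotation


def toy_propose_labels(descriptions: List[str], criterion: str) -> List[str]:
    return sorted(set(descriptions))


def toy_assign_label(description: str, labels: List[str], criterion: str) -> str:
    return description if description in labels else labels[0]


photos = [
    {"name": "a.jpg", "action": "cooking", "location": "kitchen"},
    {"name": "b.jpg", "action": "reading", "location": "kitchen"},
    {"name": "c.jpg", "action": "cooking", "location": "garden"},
]


def names(clusters: Dict[str, List[dict]]) -> Dict[str, List[str]]:
    return {label: [p["name"] for p in members] for label, members in clusters.items()}


by_action = names(cluster_by_text_criterion(
    photos, "action", toy_describe, toy_propose_labels, toy_assign_label))
by_location = names(cluster_by_text_criterion(
    photos, "location", toy_describe, toy_propose_labels, toy_assign_label))
```

The same three photos are partitioned differently depending on whether the criterion is `action` or `location`, which is the kind of user control the abstract describes. Passing the models in as callables is only a convenience of this sketch, so that the stand-ins can later be swapped for real model calls.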