Thinking Like an Annotator: Generation of Dataset Labeling Instructions (2306.14035v1)
Abstract: Large-scale datasets are essential to modern-day deep learning. Advocates argue that understanding these methods requires dataset transparency (e.g. "dataset curation, motivation, composition, collection process, etc."). However, almost no one has suggested releasing the detailed definitions and visual category examples provided to annotators, information critical to understanding the structure of the annotations present in each dataset. These labels are at the heart of public datasets, yet few datasets include the instructions used to generate them. We introduce a new task, Labeling Instruction Generation, to address the lack of publicly available labeling instructions. In Labeling Instruction Generation, we take a reasonably annotated dataset and: 1) generate a set of examples that are visually representative of each category in the dataset; 2) provide a text label that corresponds to each of the examples. We introduce a framework that requires no model training to solve this task and that includes a newly created rapid retrieval system leveraging a large, pre-trained vision-and-language model. This framework acts as a proxy to human annotators and can help both to generate a final labeling instruction set and to evaluate its quality. Our framework generates multiple diverse visual and text representations of dataset categories. The optimized instruction set outperforms our strongest baseline across 5 folds by 7.06 mAP on NuImages and 12.9 mAP on COCO.
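The abstract does not specify the retrieval system's implementation, but its core step is easy to picture. Below is a minimal sketch, assuming a CLIP-style model (via `open_clip`) as the pre-trained vision-and-language model and FAISS for rapid similarity search; the paper's actual model, prompt templates, and scoring may differ, and the inputs `crop_paths` and `category_names` are hypothetical placeholders. The idea: embed annotated image crops, then retrieve, for each category's text label, the crops most aligned with it as candidate (image, text) pairs for a labeling instruction set.

```python
# Sketch of a retrieval-based proxy for labeling-instruction generation.
# Assumptions (not from the paper): CLIP ViT-B/32 via open_clip as the
# vision-and-language model, FAISS for similarity search, and hypothetical
# inputs `crop_paths` / `category_names`.
import faiss
import open_clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

def embed_images(paths):
    """Return L2-normalized image embeddings, one row per crop."""
    feats = []
    with torch.no_grad():
        for p in paths:
            x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0).to(device)
            f = model.encode_image(x)
            feats.append(torch.nn.functional.normalize(f, dim=-1).cpu())
    return torch.cat(feats).numpy().astype("float32")

def embed_texts(names):
    """Return L2-normalized text embeddings from a simple prompt template."""
    tokens = tokenizer([f"a photo of a {n}" for n in names]).to(device)
    with torch.no_grad():
        f = torch.nn.functional.normalize(model.encode_text(tokens), dim=-1)
    return f.cpu().numpy().astype("float32")

# Hypothetical inputs: crops cut from an annotated dataset and its category names.
crop_paths = ["crop_0001.jpg", "crop_0002.jpg", "crop_0003.jpg"]
category_names = ["car", "pedestrian"]

image_feats = embed_images(crop_paths)
index = faiss.IndexFlatIP(image_feats.shape[1])  # inner product = cosine on unit vectors
index.add(image_feats)

# For each category, the k most text-aligned crops form candidate
# (visual example, text label) pairs for the instruction set.
k = min(5, index.ntotal)
_, ids = index.search(embed_texts(category_names), k)
for name, row in zip(category_names, ids):
    print(name, [crop_paths[i] for i in row])
```

A flat inner-product index is exact; at COCO or NuImages scale, an IVF or quantized FAISS index would trade a little recall for much faster search.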