ARAIDA: Analogical Reasoning-Augmented Interactive Data Annotation (2405.11912v2)
Abstract: Human annotation is a time-consuming task that requires a significant amount of effort. To address this issue, interactive data annotation utilizes an annotation model to provide suggestions for humans to approve or correct. However, annotation models trained with limited labeled data are prone to generating incorrect suggestions, leading to extra human correction effort. To tackle this challenge, we propose Araida, an analogical reasoning-based approach that enhances automatic annotation accuracy in the interactive data annotation setting and reduces the need for human corrections. Araida involves an error-aware integration strategy that dynamically coordinates an annotation model and a k-nearest neighbors (KNN) model, giving more importance to KNN's predictions when predictions from the annotation model are deemed inaccurate. Empirical studies demonstrate that Araida is adaptable to different annotation tasks and models. On average, it reduces human correction labor by 11.02% compared to vanilla interactive data annotation methods.
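The error-aware integration idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how predictions are combined or how inaccuracy is detected, so the linear blend below, the `error_score` input (a hypothetical estimate of how likely the annotation model's suggestion is wrong), and the helper names `knn_distribution`, `integrate`, and `suggest` are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_distribution(query, examples, k=3):
    """Label distribution from the k nearest human-labeled examples.

    `examples` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    total = sum(votes.values())
    return {label: count / total for label, count in votes.items()}


def integrate(model_probs, knn_probs, error_score):
    """Blend two label distributions (assumed linear interpolation).

    `error_score` in [0, 1] is a hypothetical estimate of how likely the
    annotation model's suggestion is wrong; higher values shift weight to KNN.
    """
    if not 0.0 <= error_score <= 1.0:
        raise ValueError("error_score must lie in [0, 1]")
    labels = set(model_probs) | set(knn_probs)
    return {
        label: (1.0 - error_score) * model_probs.get(label, 0.0)
        + error_score * knn_probs.get(label, 0.0)
        for label in labels
    }


def suggest(model_probs, knn_probs, error_score):
    """Return the label with the highest blended probability."""
    blended = integrate(model_probs, knn_probs, error_score)
    return max(blended, key=blended.get)


# Toy data: 2-D features with labels previously confirmed by a human.
labeled = [((0.0, 0.0), "neg"), ((0.1, 0.2), "neg"), ((1.0, 1.0), "pos"),
           ((0.9, 1.1), "pos"), ((1.2, 0.8), "pos")]
knn_probs = knn_distribution((1.0, 0.9), labeled, k=3)
model_probs = {"neg": 0.6, "pos": 0.4}  # the model leans toward "neg"

trusted = suggest(model_probs, knn_probs, error_score=0.1)
distrusted = suggest(model_probs, knn_probs, error_score=0.8)
```

In this toy setting the suggestion follows the annotation model when `error_score` is low and switches to the label favored by the nearest labeled neighbors when it is high. In the approach the abstract describes, that weighting is decided dynamically rather than supplied by hand.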