Gesture-aware Interactive Machine Teaching with In-situ Object Annotations: An Examination
The paper by Zhongyi Zhou and Koji Yatani introduces "LookHere," an innovative system that enhances Interactive Machine Teaching (IMT) by incorporating gesture-aware, real-time object annotation into the model training process. This research attempts to overcome limitations of existing vision-based IMT systems, which often overlook object annotations or rely on cumbersome post-hoc annotation processes, thereby compromising usability and model accuracy.
Key Contributions
- In-situ Object Annotation with LookHere: The primary advancement in this paper is the development of a vision-based IMT system, LookHere, which integrates in-situ object annotations guided by users’ deictic gestures. This integration significantly streamlines the model creation process for non-expert users by eliminating the need for time-consuming post-hoc annotations. The system leverages natural human interactions, using gestures as intuitive prompts for object identification.
- HuTics Dataset: Central to this implementation is the novel HuTics dataset, which includes 2040 images of various deictic gestures performed by 170 participants. The diversity of the dataset was pivotal for training models to recognize deictic gestures and highlight the referenced objects in real time, further enhancing the practical effectiveness of LookHere.
- Improvement in Annotation and Model Accuracy: By using deictic gestures to segment objects in real time, LookHere reduced the time needed to produce annotated training data by a factor of 16.3 compared with workflows that require post-hoc annotation. Models trained with LookHere also achieved notably higher segmentation accuracy (ΔmIoU=0.466), indicating that in-situ annotations improve model learning outcomes (a sketch of the mIoU computation follows this list).
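The accuracy gain above is reported as a difference in mean Intersection-over-Union (mIoU). As a point of reference, the following is a minimal sketch of how mIoU is commonly computed for binary segmentation masks; the NumPy-based functions and their names are illustrative and are not taken from the paper's code.

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-7) -> float:
    """Intersection-over-Union for one pair of binary masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection / (union + eps))

def mean_iou(pred_masks, gt_masks) -> float:
    """Average IoU over a set of predicted/ground-truth mask pairs."""
    scores = [iou(p, g) for p, g in zip(pred_masks, gt_masks)]
    return float(np.mean(scores))

# A ΔmIoU of 0.466 means the mean IoU under one training condition is
# 0.466 higher than under the baseline condition on the same test masks.
```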
Methodology
The LookHere system uses a two-stage segmentation process: a preliminary hand-segmentation step followed by object-focused segmentation with a U-Net-style architecture trained on the HuTics dataset. This design keeps the model focused on salient features directly related to the objects of interest, guided by users' natural gestures.
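To make the two-stage idea concrete, the sketch below shows one way such a pipeline could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the layer widths, the `TinyUNet` class, and the choice to feed the stage-1 hand mask as a fourth input channel are all assumptions made here for clarity.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions with ReLU, the basic U-Net building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Small U-Net-style encoder-decoder for binary object segmentation.

    Takes 4 input channels: RGB plus a hand-mask channel produced by a
    separate hand-segmentation stage (stage 1 of the two-stage pipeline).
    """
    def __init__(self, in_ch: int = 4):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, 1, 1)  # per-pixel object logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)

# Stage 1 (hand segmentation) is represented here by a placeholder mask.
rgb = torch.rand(1, 3, 256, 256)        # camera frame
hand_mask = torch.rand(1, 1, 256, 256)  # output of a hand segmenter
model = TinyUNet(in_ch=4)
object_logits = model(torch.cat([rgb, hand_mask], dim=1))  # (1, 1, 256, 256)
```

Conditioning the object segmenter on the hand mask is one straightforward way to let deictic gestures guide attention toward the referenced object; the paper's actual conditioning mechanism may differ.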
Evaluation
The performance of LookHere was assessed through a comprehensive user study comparing it against traditional annotation techniques, such as contour-based and click-based methods. The study results highlighted LookHere’s capability to reduce the instructional workload while maintaining competitive, if not superior, model accuracy across different tasks.
Implications and Future Directions
This research simplifies the model training pipeline for end users while improving model accuracy, a critical balance in user-centered AI applications. While the approach demonstrates substantial improvements, future work could investigate further optimizations in model training, explore broader datasets and gesture vocabularies, and integrate additional modalities such as voice commands to complement gesture interaction.
Additionally, improving the robustness of gesture recognition in varied environmental contexts could extend LookHere's applicability to more complex, real-world scenarios. Overall, this work not only advances the field of IMT by addressing usability concerns but also lays a foundation for future exploration of multi-modal interaction in collaborations between users and ML models.