HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild (2404.13819v1)
Abstract: We address the challenging task of identifying, segmenting, and tracking hand-held objects, which is crucial for applications such as human action segmentation and performance evaluation. This task is particularly challenging due to heavy occlusion, rapid motion, and the transitory nature of objects being hand-held, where an object may be held, released, and subsequently picked up again. To tackle these challenges, we have developed a novel transformer-based architecture called HOIST-Former. HOIST-Former is adept at spatially and temporally segmenting hands and objects by iteratively pooling features from each other, ensuring that the processes of identification, segmentation, and tracking of hand-held objects depend on the hands' positions and their contextual appearance. We further refine HOIST-Former with a contact loss that focuses on areas where hands are in contact with objects. We also contribute an in-the-wild video dataset called HOIST, which comprises 4,125 videos complete with bounding boxes, segmentation masks, and tracking IDs for hand-held objects. Through experiments on the HOIST dataset and two additional public datasets, we demonstrate the efficacy of HOIST-Former in segmenting and tracking hand-held objects.