
Improving Continuous Sign Language Recognition with Adapted Image Models (2404.08226v1)

Published 12 Apr 2024 in cs.CV

Abstract: The abundance of web-scale weakly labelled image-text pairs has greatly facilitated the development of large-scale vision-language models (e.g., CLIP), which show impressive generalization across a series of downstream tasks. However, the massive model size and the scarcity of available downstream data make it impractical to fine-tune the whole model. Moreover, fully fine-tuning the model tends to forget the generic knowledge acquired during pretraining and to overfit the downstream data. To adapt these large vision-language models (e.g., CLIP) to continuous sign language recognition (CSLR) with high efficiency while preserving their generalizability, we propose a novel strategy, AdaptSign. Specifically, CLIP is adopted as the visual backbone, with its parameters kept frozen, to extract frame-wise features, and a set of learnable modules is introduced to model spatial sign variations and capture temporal sign movements. The introduced modules are quite lightweight, adding only 3.2% extra computation. Because the CLIP backbone stays frozen, the generic knowledge acquired during pretraining is well preserved. Extensive experiments show that, despite being efficient, AdaptSign outperforms existing methods across a series of CSLR benchmarks, including PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. Visualizations show that AdaptSign learns to dynamically attend to informative spatial regions and cross-frame trajectories in sign videos.
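
The core recipe in the abstract — a frozen pretrained image encoder producing frame-wise features, refined by small trainable modules — is a standard parameter-efficient adaptation pattern. Below is a minimal PyTorch sketch of that general pattern, not the paper's exact architecture: the module names (TemporalAdapter, FrozenBackboneWithAdapter), the bottleneck size, and the use of a depthwise temporal convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Hypothetical lightweight adapter: a bottleneck MLP with a
    depthwise temporal convolution that mixes information across frames."""
    def __init__(self, dim: int, bottleneck: int = 64, kernel: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project features down
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=kernel // 2, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)     # project back up
        self.act = nn.GELU()

    def forward(self, x):                  # x: (T, dim) frame-wise features
        h = self.act(self.down(x))         # (T, bottleneck)
        h = h.transpose(0, 1).unsqueeze(0)               # (1, bottleneck, T)
        h = self.temporal(h).squeeze(0).transpose(0, 1)  # back to (T, bottleneck)
        return x + self.up(self.act(h))    # residual: adapter refines, not replaces

class FrozenBackboneWithAdapter(nn.Module):
    """Frozen image encoder plus a trainable adapter, in the spirit of AdaptSign."""
    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # freeze all pretrained weights
            p.requires_grad = False
        self.adapter = TemporalAdapter(dim)

    def forward(self, frames):             # frames: (T, 3, H, W) sign video clip
        with torch.no_grad():              # frame-wise features, no gradients
            feats = self.backbone(frames)  # assumed to yield (T, dim)
        return self.adapter(feats)         # only the adapter is optimized
```

With this setup only the adapter's parameters receive gradients, so an optimizer built over model.adapter.parameters() trains a small fraction of the total weights, which is what keeps the extra compute and memory small relative to full fine-tuning while the pretrained knowledge stays intact in the frozen backbone.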
