Representing Online Handwriting for Recognition in Large Vision-Language Models (2402.15307v1)
Abstract: The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that includes a time-ordered sequence of strokes both as text and as an image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. We demonstrate wide applicability through results with two different VLM families on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.
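To make the idea of "strokes as text" concrete, the following is a minimal sketch of how a time-ordered list of pen strokes might be serialized into a string that a VLM's text tokenizer could consume. The abstract does not specify the token format, so everything here is an assumption for illustration: the function name `ink_to_text`, the `<stroke>` separator, the grid size, and the normalization scheme are hypothetical and not taken from the paper. The image half of the representation would be rendered separately and is not shown.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y) pen position
Stroke = List[Point]          # points in the order they were drawn


def ink_to_text(strokes: List[Stroke], grid: int = 224) -> str:
    """Serialize time-ordered strokes as quantized coordinates in plain text.

    Hypothetical format: each stroke starts with a "<stroke>" marker,
    followed by "x y" integer pairs on a grid x grid canvas.
    """
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    min_x, min_y = min(xs), min(ys)

    # Scale by the larger extent so the aspect ratio is preserved;
    # fall back to 1.0 when the ink is a single point.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = (grid - 1) / span

    parts: List[str] = []
    for stroke in strokes:
        parts.append("<stroke>")  # assumed separator, not from the paper
        for x, y in stroke:
            qx = round((x - min_x) * scale)
            qy = round((y - min_y) * scale)
            parts.append(f"{qx} {qy}")
    return " ".join(parts)


if __name__ == "__main__":
    ink = [[(0.0, 0.0), (10.0, 0.0)], [(5.0, 5.0)]]
    print(ink_to_text(ink, grid=11))
```

Keeping the strokes in drawing order preserves the temporal information that a rendered image alone discards, which is the distinction the abstract draws between online recognition and naive OCR.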