Proper exploitation of both modalities in MICL for HTR
Determine training objectives and architectural mechanisms that enable multimodal in-context learning (MICL) models for handwritten text recognition (HTR) to jointly and effectively exploit both the visual modality (handwritten line images) and the textual modality (in-context transcriptions), avoiding over-reliance on textual patterns and enabling robust real-world deployment.
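To make the setup concrete, the sketch below shows one plausible way a MICL prompt for HTR could be assembled: each in-context example interleaves a line image with its transcription, and the query image is appended last with no accompanying text, so the model must ground its prediction in the visual input rather than copy textual patterns from the context. The `ContextExample` class, the dictionary-based prompt format, and `build_micl_prompt` are all hypothetical illustrations, not an API from the cited paper.

```python
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class ContextExample:
    """One in-context example: a handwritten line image paired with its transcription."""
    image_id: str        # placeholder reference to the line image (visual modality)
    transcription: str   # ground-truth text for that line (textual modality)

def build_micl_prompt(examples: List[ContextExample], query_image_id: str) -> List[Dict[str, str]]:
    """Assemble a hypothetical interleaved image/text sequence for a MICL-style HTR model.

    Each context example contributes an image slot followed by its transcription;
    the query image comes last with no paired text, forcing the model to rely on
    both modalities of the context rather than on textual patterns alone.
    """
    prompt: List[Dict[str, str]] = []
    for ex in examples:
        prompt.append({"type": "image", "ref": ex.image_id})
        prompt.append({"type": "text", "content": ex.transcription})
    prompt.append({"type": "image", "ref": query_image_id})
    return prompt

# Example: two writer-specific context lines, then the query line to transcribe.
prompt = build_micl_prompt(
    [ContextExample("line_01.png", "the quick brown fox"),
     ContextExample("line_02.png", "jumps over the lazy dog")],
    query_image_id="line_03.png",
)
```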
References
As a result, the proper exploitation of both modalities remains an open challenge for deploying MICL methods in real-world HTR scenarios.
— Few-shot Writer Adaptation via Multimodal In-Context Learning
(2603.29450 - Simon et al., 31 Mar 2026) in Section 2.3, Multimodal In-Context Learning