Proper exploitation of both modalities in MICL for HTR

Determine training objectives and architectural mechanisms that enable multimodal in-context learning (MICL) models for handwritten text recognition (HTR) to exploit both the visual modality (handwritten line images) and the textual modality (in-context transcriptions) jointly and effectively, avoiding over-reliance on textual patterns and supporting robust real-world deployment.

Background

Multimodal In-Context Learning (MICL) extends in-context learning to vision-language settings by conditioning on joint visual and textual examples. Prior analyses indicate that many MICL systems rely disproportionately on textual cues while underutilizing visual information.
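
Concretely, a MICL prompt for HTR interleaves a few support pairs (a handwritten line image plus its transcription) and ends with a query image to be transcribed. The sketch below is a minimal illustration under assumed conventions; the message schema, function names, and file names are hypothetical, not an API from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    image_path: str       # handwritten line image from the target writer
    transcription: str    # its ground-truth transcription

def build_micl_prompt(support: List[Example], query_image_path: str) -> List[dict]:
    """Interleave (image, transcription) support pairs, then append the
    query image so the model must produce its transcription from both
    the visual evidence and the textual context."""
    prompt: List[dict] = []
    for ex in support:
        prompt.append({"type": "image", "path": ex.image_path})
        prompt.append({"type": "text", "content": ex.transcription})
    prompt.append({"type": "image", "path": query_image_path})
    return prompt

# Hypothetical usage: a few same-writer lines serve as in-context examples.
support = [
    Example("writer07_line01.png", "the quick brown fox"),
    Example("writer07_line02.png", "jumps over the lazy dog"),
]
prompt = build_micl_prompt(support, "writer07_query.png")
```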

This imbalance is especially problematic for handwritten text recognition, which depends on fine-grained visual features of characters. The paper highlights that achieving balanced and effective use of both modalities remains unresolved for practical deployment in real-world HTR.
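
One way to make the over-reliance claim measurable is a masking ablation: evaluate the same prompts with the images replaced by blanks and compare character error rates (CER). If accuracy barely degrades, the model is completing textual patterns rather than reading the images. The sketch below assumes a hypothetical `model.transcribe(prompt)` interface over prompts in the format above; it is a diagnostic recipe, not an experiment from the paper.

```python
def cer(pred: str, ref: str) -> float:
    """Character error rate via a single-row Levenshtein distance."""
    m, n = len(ref), len(pred)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                            # deletion
                dp[j - 1] + 1,                        # insertion
                prev + (ref[i - 1] != pred[j - 1]),   # substitution
            )
    return dp[n] / max(m, 1)

def modality_reliance_gap(model, prompts, refs, blank_image_path):
    """CER increase when every image in the prompt is replaced by a blank.
    A small gap suggests the model leans on textual patterns, not vision."""
    def mean_cer(use_blanks: bool) -> float:
        total = 0.0
        for prompt, ref in zip(prompts, refs):
            if use_blanks:
                prompt = [
                    {**msg, "path": blank_image_path} if msg["type"] == "image" else msg
                    for msg in prompt
                ]
            total += cer(model.transcribe(prompt), ref)  # hypothetical API
        return total / len(refs)
    # Large positive gap = the model genuinely uses the visual modality.
    return mean_cer(use_blanks=True) - mean_cer(use_blanks=False)
```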

References

"As a result, the proper exploitation of both modalities remains an open challenge for deploying MICL methods in real-world HTR scenarios."

Few-shot Writer Adaptation via Multimodal In-Context Learning (2603.29450, Simon et al., 31 Mar 2026), Section 2.3: Multimodal In-Context Learning