CASHG: Context-Aware Stylized Online Handwriting Generation

Published 2 Apr 2026 in cs.CV and cs.LG | (2604.02103v2)

Abstract: Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer's style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor–current transitions, complemented by gated context fusion for sentence-level context. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.


Authors (4)

Summary

The paper introduces a Transformer-based framework that explicitly models inter-character connectivity using a bigram-aware sliding-window decoder.
It employs character context encoding with gated fusion to adapt local handwriting styles and preserve writer-specific cues.
A new suite of connectivity and spacing metrics is proposed and validated through both human evaluation and benchmark comparisons.

CASHG: Context-Aware Stylized Online Handwriting Generation (2604.02103)

Introduction and Motivation

Generating realistic sentence-level online handwriting—where the output is a sequence of pen trajectories, not simply static glyphs—requires comprehensive modeling of both the local shape of characters and the detailed, style-consistent transitions between them. The challenge extends far beyond plausible glyphs: authentic handwriting exhibits context-dependent inter-character phenomena, such as kerning, word and character spacing, and cursive-like connections. Previous neural handwriting generators typically treat these properties as implicit outcomes of sequence modeling, which proves insufficient as compositional diversity increases and with longer sentence contexts. Such methods often fail to maintain consistent spacing or realistic connectivity beyond the glyph level, resulting in visually jarring boundary artifacts in synthesized handwriting.

CASHG (Context-Aware Stylized Online Handwriting Generation) addresses these deficiencies by targeting sentence-level generation that preserves both local and global writer style, including nuanced inter-character connectivity. It combines disentangled style and content representations, boundary-aware decoding strategies, and new evaluation protocols for spacing and continuity.

Model Architecture

Character Context Encoding and Memory

To support high-fidelity style conditioning and connectivity modeling, CASHG employs a Character Context Encoder that generates two decoupled conditioning signals per character: a deterministic Character-Identity Embedding (for content specification) and a position-dependent context memory (for context-aware adaptation). Both are produced via a shared multilingual text encoder (e.g., CANINE or ByT5), with the context branch incorporating a lightweight Transformer encoder to capture surrounding sentence context and position-dependent variations.
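The split between the two conditioning signals can be illustrated with a toy sketch. It is not the paper's encoder: the sine-based `identity_embedding`, the single unparameterized self-attention pass in `context_memory`, and the dimension `DIM` are all hypothetical stand-ins for the shared text encoder and the lightweight Transformer branch. The only property it aims to show is that the identity signal depends on the character alone, while the context signal changes with the surrounding sentence.

```python
import math

DIM = 8  # hypothetical embedding width, chosen only for the illustration


def identity_embedding(ch):
    """Deterministic per-character vector: depends only on the code point."""
    cp = ord(ch)
    return [math.sin(cp * (k + 1) * 0.1) for k in range(DIM)]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def context_memory(sentence):
    """One vector per position, mixing in the rest of the sentence.

    Stand-in for the context branch: position-tagged identity embeddings
    passed through one self-attention step with no learned weights.
    """
    tokens = []
    for i, ch in enumerate(sentence):
        emb = identity_embedding(ch)
        pos = [math.cos(i / (10 ** (k / DIM))) for k in range(DIM)]
        tokens.append([a + b for a, b in zip(emb, pos)])

    memory = []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(DIM)
            for key in tokens
        ]
        weights = softmax(scores)
        memory.append([
            sum(weights[j] * tokens[j][d] for j in range(len(tokens)))
            for d in range(DIM)
        ])
    return memory
```

In this sketch the letter "a" has the same identity embedding in "cat" and "bat", but its context-memory vector differs between the two words because the preceding character differs.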

References are encoded into Writer-style and Glyph-style memory modules, each supplying distinct stylization cues: Writer-style memory captures global writer traits, while Glyph-style memory encodes character-level stylistic identities.

Figure 1: Overview of CASHG—reference handwriting images are encoded into writer- and glyph-style memories, and character context encoding supports both sentence-dependent and isolated-character conditioning.

Bigram-Aware Sliding-Window Transformer Decoder

A key architectural innovation is the bigram-aware sliding-window Transformer decoder (Bi-SWT). For each character position, CASHG constructs a local decoding window comprising the predecessor and current character (with their trajectories or predictions), and conditions decoding on this bigram. This design explicitly supervises and regularizes local boundary dynamics (kerning, joins, spacing) while retaining robustness to the compositional sparsity inherent in sentence-scale data. Longer-range context is integrated via the context memory with a gating mechanism, thus preventing excessive context fusion from overwhelming stylization signals.
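A minimal sketch of the two ideas in this paragraph, under stated assumptions: `bigram_windows` pairs each character with its predecessor, and `gated_fusion` adds a context vector scaled by a sigmoid gate. The start token `<bos>`, the scalar gate, and the additive form are illustrative choices; the source does not specify how the gate is parameterized.

```python
import math


def bigram_windows(text, bos="<bos>"):
    """Pair every character with its predecessor.

    The first character has no predecessor, so it is paired with a
    start token (a hypothetical convention for this sketch).
    """
    predecessors = [bos] + list(text[:-1])
    return list(zip(predecessors, text))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local, context, gate_logit):
    """Add sentence-level context to a local feature vector.

    The gate lies in (0, 1); a strongly negative logit effectively
    switches the context off, so it cannot swamp the local signal.
    """
    g = sigmoid(gate_logit)
    return [l + g * c for l, c in zip(local, context)]


windows = bigram_windows("cat")
# windows == [("<bos>", "c"), ("c", "a"), ("a", "t")]
```

With `gate_logit = 0` the gate is exactly 0.5, so `gated_fusion([1.0, 2.0], [2.0, 4.0], 0.0)` returns `[2.0, 4.0]`.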

Figure 2: Bigram-aware sliding-window Transformer decoding. Each character is modeled in the local context of its predecessor, allowing the model to directly handle kerning and cursive joins through context fusion.

Autoregressive trajectory generation is realized via a GMM-based output head, directly modeling distribution over 2D point displacements and explicitly classifying pen state (including a "CursiveEOC" label for pen-down joins). Regularization for vertical boundary alignment (Vertical Drift Loss) is imposed at bigram boundaries to ensure stable vertical positioning across characters.
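The two output terms can be sketched as follows. This is an illustration under assumptions, not the paper's head: the mixture uses diagonal covariance for brevity, and the label set in `PEN_STATES` is hypothetical apart from "CursiveEOC", which is the only pen-state name given in the text.

```python
import math


def gmm_nll(dx, dy, weights, means, stds):
    """Negative log-likelihood of one 2D displacement (dx, dy) under a
    Gaussian mixture with diagonal covariance (an assumed simplification)."""
    log_terms = []
    for w, (mx, my), (sx, sy) in zip(weights, means, stds):
        log_pdf = (
            -0.5 * ((dx - mx) / sx) ** 2
            - 0.5 * ((dy - my) / sy) ** 2
            - math.log(2.0 * math.pi * sx * sy)
        )
        log_terms.append(math.log(w) + log_pdf)
    # log-sum-exp over components for numerical stability
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


# Hypothetical label set; only "CursiveEOC" appears in the source.
PEN_STATES = ["PenDown", "PenUp", "CursiveEOC", "EOC", "EOS"]


def pen_state_cross_entropy(logits, target):
    """Cross-entropy of unnormalized pen-state logits against one label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[PEN_STATES.index(target)]
```

As a sanity check, a single standard-normal component evaluated at its mean gives an NLL of log(2π), and uniform logits over the five assumed states give a cross-entropy of log(5).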

Training and Curriculum

Because sentence-level datasets cover only a sparse subset of possible character transitions, CASHG is trained with a three-stage curriculum: isolated glyphs $\rightarrow$ bigrams $\rightarrow$ sentences. This progression strengthens the model’s ability to generalize from local transitions to longer contexts and reduces boundary artifacts at later stages. The optimization objective across all stages combines a masked GMM negative log-likelihood, cross-entropy over pen states, and the Vertical Drift Loss.
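The curriculum and the combined objective might be organized along these lines. Everything here is an assumption made for illustration: the equal-length step schedule in `stage_for_step`, the squared-difference form of `vertical_drift`, and the weights `w_pen` and `w_drift` are not taken from the source, which names the loss terms but does not give their formulas or weighting.

```python
STAGES = ("glyph", "bigram", "sentence")


def stage_for_step(step, steps_per_stage):
    """Map a training step to a curriculum stage (hypothetical schedule:
    equal-length stages, staying on the last stage once it is reached)."""
    index = min(step // steps_per_stage, len(STAGES) - 1)
    return STAGES[index]


def vertical_drift(pred_dy, ref_dy):
    """Assumed form of a vertical drift penalty: squared difference between
    predicted and reference net vertical displacement over a bigram."""
    return (sum(pred_dy) - sum(ref_dy)) ** 2


def total_loss(nll_per_point, mask, pen_ce, drift, w_pen=1.0, w_drift=1.0):
    """Masked mean of per-point NLL plus weighted pen-state and drift terms.

    mask holds 1 for real trajectory points and 0 for padding.
    """
    valid = sum(mask)
    masked_nll = sum(l * m for l, m in zip(nll_per_point, mask)) / max(valid, 1)
    return masked_nll + w_pen * pen_ce + w_drift * drift
```

The mask keeps padded points out of the average, so a large NLL value at a padded position does not affect the result.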

Explicit Connectivity and Spacing Evaluation

The paper introduces a new evaluation protocol, Connectivity and Spacing Metrics (CSM), to quantify generation quality at character boundaries and in spacing, aspects that DTW-based global trajectory similarity scores do not isolate. CSM captures:

$\mathrm{F1}_{\mathrm{Cursive}}$: boundary-level accuracy for pen-down (cursive-like) joins
CRE: writer-level rate agreement of cursive boundaries
KGS: pairwise kerning gap similarity
SSS: inter-word spacing similarity

This suite enables systematic assessment of boundary phenomena, decoupling glyph shape error from connectivity/spacing errors which are especially visually salient in sentence-level writing.
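The first two quantities in the list above can be approximated with a short sketch. The F1 computation is the standard one applied to per-boundary labels (True meaning a pen-down join); `cursive_rate` is only a building block for a rate comparison, and how CRE turns rates into a score is not stated in the text, so no formula for it is attempted here. KGS and SSS are likewise omitted because their definitions are not given.

```python
def cursive_f1(pred, ref):
    """Standard F1 over boundary labels, where True marks a pen-down join.

    pred and ref are equal-length sequences of booleans, one per
    character boundary.
    """
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


def cursive_rate(labels):
    """Fraction of boundaries that are pen-down joins."""
    return sum(1 for x in labels if x) / len(labels) if labels else 0.0


pred = [True, True, False, False]
ref = [True, False, True, False]
score = cursive_f1(pred, ref)
rate_gap = abs(cursive_rate(pred) - cursive_rate(ref))
```

The example shows why both views are useful: the generated and reference samples join half of their boundaries, so the rates agree, yet the joins land on different boundaries and the boundary-level F1 is only 0.5.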

Experimental Results

Human and Quantitative Evaluation

Human judges made pairwise preference judgments between CASHG and state-of-the-art baselines (DSD, DeepWriting, OLHWG) on style, connectivity, and spacing. CASHG was consistently preferred, particularly with respect to inter-character connectivity and spacing, corroborating the CSM results.

Figure 3: Human evaluation of perceptual similarity—CASHG is preferred across style, connectivity, and spacing, corroborating metric-based findings.

On sentence-level English and Chinese benchmarks, CASHG achieves:

Strong improvements in CSM (higher values indicating improved realistic spacing/connectivity)
Competitive or improved DTW scores (lower is better)
Lower variance in DTW and CSM, demonstrating stable, style-consistent generation
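For reference, the DTW scores mentioned above rest on the classic dynamic-programming alignment between two point sequences. The sketch below is the textbook algorithm with Euclidean point cost; any normalization or preprocessing used in the paper's evaluation protocol is not described in the text and is not reproduced here.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two 2D point sequences.

    Each step may advance in a, in b, or in both; the cost of a step is
    the Euclidean distance between the aligned points.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


stroke = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
coarse = [(0.0, 0.0), (2.0, 0.0)]
```

Here `dtw_distance(stroke, stroke)` is 0.0 and `dtw_distance(stroke, coarse)` is 1.0, since the middle point must be matched to one of the two endpoints at distance 1. Because DTW aggregates over the whole trajectory, a small number of wrong joins or gaps changes it little, which is the motivation given for pairing it with boundary-level metrics.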

Qualitative inspections show that baseline decoders often exhibit rigid or broken inter-character spacing/joins, while CASHG maintains consistent writer-specific boundary phenomena.

Figure 4: Qualitative comparisons—CASHG exhibits superior preservation of writer glyph style, local connectivity, and spacing versus DSD and OLHWG baselines in both English and Chinese.

Ablation studies confirm that both the Bi-SWT decoder and the gated context fusion are critical for high scores in connectivity and spacing metrics—removing either module yields degraded CSM, even when global trajectory fidelity is retained.

Implications and Future Directions

CASHG demonstrates that explicit modeling of inter-character connectivity and boundary-local phenomena is essential in online handwriting generation, especially as generation length and compositional diversity increase. The architectural modularity (disentangled content/style, bigram-local modeling) and explicit connectivity-aware objectives provide a reproducible pathway for future models to improve generative realism, including extending the framework to scripts with more complex contextual or diacritic phenomena.

The introduction of CSM exposes a gap in standard evaluation protocols, suggesting a parallel need for boundary-sensitive metrics in other sequence or layout generative tasks (e.g., music, gesture, or motion synthesis).

Practically, CASHG's architecture—free from reliance on rendered content templates and robust to style input sparsity—facilitates deployment in applications ranging from personalized note-taking avatars to context- and style-consistent digital document generation.

Conclusion

CASHG establishes a robust framework for context-aware, stylized online handwriting generation with explicit inter-character connectivity modeling. By combining curriculum training, bigram-conditioned decoding, and sentence-dependent context fusion, CASHG surpasses prior art in boundary realism, as validated by the introduced CSM metric family and human studies. This work sets a new baseline for both architecture and evaluation in sentence-level online handwriting generation and motivates research into even richer compositional and stylistic controls.
