Hieroglyphic Stroke Analyzer (HieroSA)
- Hieroglyphic Stroke Analyzer (HieroSA) is a reinforcement learning framework that extracts normalized stroke primitives from binarized glyph images without manual annotation.
- It utilizes coordinate normalization and Group Relative Policy Optimization to accurately convert glyph structures into explicit line-segment representations.
- The framework outperforms conventional models in coverage and validity, enhancing OCR accuracy and enabling unsupervised analysis of diverse scripts.
Hieroglyphic Stroke Analyzer (HieroSA) is a generalizable reinforcement learning-based framework for character-level structural analysis of logographic and hieroglyphic scripts. It enables Multimodal LLMs (MLLMs) to derive explicit stroke-level decompositions from character bitmaps without manual annotation or language-specific priors, representing glyphs as interpretable line segments in a normalized coordinate system. HieroSA bypasses the limitations of conventional LLM and MLLM approaches, which treat characters as text tokens or raw images without explicit modeling of their internal geometric and compositional logic. The framework has demonstrated strong performance in capturing structural and semantic properties of hieroglyphs across diverse scripts, including ancient Chinese Oracle Bone Script (OBS), Egyptian hieroglyphs, modern Chinese, and Japanese Kanji (Luo et al., 9 Jan 2026).
1. Motivation and Problem Context
Logographic and hieroglyphic writing systems encode information not only in glyph identity but also in their internal stroke arrangement, orientation, and connectivity. This structural composition often maps directly to semantic and cultural functions within a script. However, current LLMs disregard stroke geometry by operating at the text token level, and MLLMs process glyphs only as pixel grids, remaining “structurally blind” to stroke primitives. Previous stroke-based analyses typically require script-specific inventories or labor-intensive annotation, limiting generalization to lesser-documented or unknown scripts. HieroSA addresses this bottleneck by offering a method to recover geometric stroke representations directly from binarized character bitmaps, without handcrafted data or annotated traces.
2. Methodological Framework
HieroSA operates as a reinforcement-learning pipeline that converts a binarized glyph image, formatted as black strokes on a white background, into a sequence of line-segment stroke primitives in normalized coordinates. The sequence of operations includes:
- Binarization of the input glyph image.
- Overlaying a coordinate grid to facilitate stable regression of endpoint locations.
- Encoding the image and grid within the Qwen3-VL-4B-Instruct vision-language backbone.
- Autoregressive prediction of stroke primitives, each parametrized by its two endpoints.
- Computation of a reward measuring spatial coverage of predicted strokes over black-pixel regions, plus a format-conformity bonus.
- At inference, parsing of model outputs into an explicit stroke set for downstream structural analysis.
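The final parsing step in the list above can be illustrated with a minimal sketch. The textual output format assumed here (one stroke per line, written as `(x1, y1) -> (x2, y2)`) and the name `parse_strokes` are hypothetical; the source does not specify the exact format the model emits.

```python
import re
from typing import List, Tuple

Stroke = Tuple[Tuple[float, float], Tuple[float, float]]

# Hypothetical output format (an assumption, not taken from the source):
# one stroke per line, written as "(x1, y1) -> (x2, y2)".
_STROKE_RE = re.compile(
    r"\(\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\)\s*->\s*"
    r"\(\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\)"
)


def parse_strokes(text: str) -> List[Stroke]:
    """Extract line segments from model text, skipping malformed lines."""
    strokes: List[Stroke] = []
    for line in text.splitlines():
        match = _STROKE_RE.search(line)
        if match is None:
            continue  # not a stroke line
        try:
            x1, y1, x2, y2 = (float(g) for g in match.groups())
        except ValueError:
            continue  # e.g. "1.2.3" matched the pattern but is not a number
        strokes.append(((x1, y1), (x2, y2)))
    return strokes
```

Skipping malformed lines rather than raising keeps the parser usable on partially well-formed outputs; a format-conformity reward could instead count such lines as failures.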
The framework employs Group Relative Policy Optimization (GRPO) to maximize the reward, which balances coverage accuracy against conformity to the output format. Overlaying a faint coordinate grid assists the model in localizing endpoints, as confirmed by ablation.
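As a rough illustration of the group-relative part of GRPO: the rewards of all rollouts sampled for the same glyph are standardized against each other, so a rollout is reinforced only if it beats its own group. The sketch below shows this advantage computation alone, not the full clipped policy objective; the function name and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one rollout group (same input glyph).

    Uses the population standard deviation; implementations differ on this
    detail, so treat it as an assumption of the sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: a group of 8 rollouts for one glyph (reward values are made up).
group = [0.10, 0.35, 0.80, 0.42, 0.05, 0.77, 0.50, 0.61]
advantages = group_relative_advantages(group)
```

Because advantages are relative to the group mean, no separate value network is needed; the group itself acts as the baseline.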
3. Mathematical Formulation
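Coordinate normalization maps pixel positions into a canonical square by scaling each coordinate by the corresponding image dimension (width for the horizontal axis, height for the vertical axis), so that all stroke endpoints are expressed in the same coordinate space regardless of image size. Glyphs are represented as sets of line segments, each defined by its two endpoints, which permits approximation of curved features by short consecutive segments.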
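A minimal sketch of this normalization, assuming the canonical square is the unit square and that each coordinate is simply divided by the matching image dimension. The target range and the function names are assumptions; the source only states that endpoints live in a canonical square.

```python
from typing import Tuple


def normalize_point(x: float, y: float, width: int, height: int) -> Tuple[float, float]:
    """Map a pixel position to the unit square (assumed target range)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return x / width, y / height


def denormalize_point(u: float, v: float, width: int, height: int) -> Tuple[float, float]:
    """Inverse mapping, used to draw predicted strokes back onto the bitmap."""
    return u * width, v * height


# Example: the point (64, 32) in a 128x128 bitmap.
u, v = normalize_point(64, 32, 128, 128)
```

Working in normalized coordinates lets the same predicted stroke set be rendered at any resolution.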
Stroke validation samples equidistant points along each segment, with the number of samples chosen so that the spacing between neighbouring samples stays below a fixed bound. A stroke with any sampled point outside the black-pixel region is marked invalid. Coverage is estimated by computing the tangent and normal vectors at each sample and extending along the normal directions until the background is reached. The normal extensions are truncated by a threshold, and the stroke is additionally extended along its tangent by a fixed margin, so that each stroke yields an approximating polygon. Strokes are then accepted sequentially: a stroke is accepted only if it covers at least a minimum fraction of the previously uncovered black pixels.
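The validity check described above can be sketched as follows. The sketch covers only the sampling and the inside-ink test, not the polygon construction. It assumes the glyph is given as a nested list with 1 for black (ink) pixels and 0 for background, that endpoints are in pixel coordinates, and that the spacing bound is one pixel; these choices and all names are illustrative.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def sample_segment(p0: Point, p1: Point, max_spacing: float = 1.0) -> List[Point]:
    """Equidistant samples from p0 to p1 with spacing at most max_spacing."""
    length = math.hypot(p1[0] - p0[0], p1[1] - p0[1])
    n = max(1, math.ceil(length / max_spacing))  # number of intervals
    return [
        (p0[0] + (p1[0] - p0[0]) * i / n, p0[1] + (p1[1] - p0[1]) * i / n)
        for i in range(n + 1)
    ]


def is_valid_stroke(mask: List[List[int]], p0: Point, p1: Point,
                    max_spacing: float = 1.0) -> bool:
    """A stroke is valid only if every sample falls on an ink pixel."""
    height, width = len(mask), len(mask[0])
    for x, y in sample_segment(p0, p1, max_spacing):
        col, row = math.floor(x), math.floor(y)
        if not (0 <= row < height and 0 <= col < width):
            return False  # sample left the image
        if mask[row][col] == 0:
            return False  # sample landed on background
    return True


# Example: a 5x5 glyph containing a single horizontal bar in row 2.
bar = [[0] * 5, [0] * 5, [1] * 5, [0] * 5, [0] * 5]
along_bar = is_valid_stroke(bar, (0.5, 2.5), (4.5, 2.5))
diagonal = is_valid_stroke(bar, (0.5, 0.5), (4.5, 4.5))
```

Bounding the sample spacing by the pixel size means a segment cannot jump over a background gap without at least one sample landing in it.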
The coverage of the accepted strokes is aggregated into a stroke-coverage reward, which is combined with the format reward to give the overall reward that drives GRPO optimization.
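The sequential acceptance rule and the combination of coverage and format rewards can be sketched as below. This is a simplified reading under stated assumptions: each stroke is represented directly by the set of ink pixels its polygon covers, the acceptance threshold `min_gain` and the weight `format_weight` are placeholder values, the combination is taken to be additive, and the invalid-stroke penalty is left out because the source does not say how it enters.

```python
from typing import Iterable, List, Set, Tuple

Pixel = Tuple[int, int]


def coverage_reward(ink: Set[Pixel], stroke_pixels: Iterable[Set[Pixel]],
                    min_gain: float = 0.05) -> Tuple[float, List[int]]:
    """Greedy sequential acceptance of strokes.

    A stroke is accepted only if it covers at least `min_gain` (placeholder
    value) of the ink pixels that are still uncovered when it is considered.
    Returns the covered fraction of all ink pixels and the accepted indices.
    """
    uncovered = set(ink)
    accepted: List[int] = []
    for idx, pixels in enumerate(stroke_pixels):
        if not uncovered:
            break  # nothing left to cover
        gain = pixels & uncovered
        if len(gain) / len(uncovered) >= min_gain:
            accepted.append(idx)
            uncovered -= gain
    covered_fraction = 1.0 - len(uncovered) / len(ink) if ink else 0.0
    return covered_fraction, accepted


def total_reward(coverage: float, format_ok: bool, format_weight: float = 0.1) -> float:
    """Coverage plus a format bonus (additive form and weight are assumed)."""
    return coverage + (format_weight if format_ok else 0.0)


# Example: a 10-pixel bar, covered by two halves with one duplicate stroke.
ink = {(0, i) for i in range(10)}
left = {(0, i) for i in range(5)}
right = {(0, i) for i in range(5, 10)}
coverage, accepted = coverage_reward(ink, [left, left, right])
```

Measuring gain against the still-uncovered pixels means a duplicate stroke contributes nothing and is rejected, which discourages the policy from repeating segments to inflate its output.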
4. Empirical Evaluation and Comparative Analysis
The HieroSA framework was trained on 12,000 images each from six Chinese and six Japanese fonts, as well as publicly available Oracle Bone Script bitmaps, all without stroke annotation. Models were trained separately for each script over two epochs (≈22 h on 8×NVIDIA A800 GPUs) with a GRPO batch size of 32 and 8 rollouts. The remaining reward hyperparameters, the learning rate, and the resolution of the coordinate grid overlaid on glyph images are as reported in (Luo et al., 9 Jan 2026).
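To make the coordinate overlay concrete, the following sketch draws a faint grid onto a grayscale glyph image. The grid resolution (`cells`), the gray level of the lines, and the choice to leave ink pixels untouched are all assumptions for illustration; the image is a plain nested list with 0 for black ink and 255 for white background.

```python
from typing import List


def overlay_grid(gray: List[List[int]], cells: int = 10,
                 line_value: int = 200) -> List[List[int]]:
    """Return a copy of the image with faint grid lines on background pixels.

    `cells` is the number of grid cells per axis (placeholder value); ink
    pixels are never overwritten, so the stroke shape is preserved.
    """
    height, width = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    grid_rows = {round(i * (height - 1) / cells) for i in range(cells + 1)}
    grid_cols = {round(j * (width - 1) / cells) for j in range(cells + 1)}
    for r in range(height):
        for c in range(width):
            if (r in grid_rows or c in grid_cols) and out[r][c] == 255:
                out[r][c] = line_value
    return out


# Example: an 11x11 white image with one ink pixel at the centre.
image = [[255] * 11 for _ in range(11)]
image[5][5] = 0
with_grid = overlay_grid(image, cells=2)
```

Keeping the lines faint and off the ink keeps the binarized stroke mask recoverable by a simple threshold.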
Testing was performed on 1,000 unseen-font images per script, reporting RE (test-time reward), CO (%) (percent of black-pixel area covered), and IS (%) (fraction of invalid strokes). HieroSA was compared to GPT-5, Claude Sonnet 4, and Qwen3-VL-4B zero-shot stroke parsers.
| Model | RE | CO (%) | IS (%) |
|---|---|---|---|
| GPT-5 (ZH) | 0.133 | 3.6 | 88.2 |
| Qwen3-VL-4B (ZH) | 0.032 | 0.5 | 97.9 |
| HieroSA (ZH) | 0.837 | 78.5 | 6.1 |
| HieroSA (JA) | 0.756 | 72.2 | 10.2 |
| HieroSA (OBS) | 0.446 | 64.6 | 23.1 |
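HieroSA outperformed the zero-shot baselines by over 60 percentage points in coverage and reduced the invalid-stroke rate by more than 80 percentage points. On Chinese, for example, coverage rises from 3.6% (GPT-5) to 78.5%, while the invalid-stroke rate falls from 88.2% to 6.1%. Cross-script training (e.g., ZH→JA) retained robust performance.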
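The percentage-point gaps for Chinese can be checked directly from the table rows. The snippet below only restates the reported numbers and computes their differences; the dictionary layout and the function name are illustrative.

```python
# Reported Chinese (ZH) results copied from the table above.
results = {
    "GPT-5 (ZH)": {"RE": 0.133, "CO": 3.6, "IS": 88.2},
    "Qwen3-VL-4B (ZH)": {"RE": 0.032, "CO": 0.5, "IS": 97.9},
    "HieroSA (ZH)": {"RE": 0.837, "CO": 78.5, "IS": 6.1},
}


def point_gap(model: str, baseline: str, metric: str) -> float:
    """Difference model - baseline in percentage points, rounded to 0.1."""
    return round(results[model][metric] - results[baseline][metric], 1)


for baseline in ("GPT-5 (ZH)", "Qwen3-VL-4B (ZH)"):
    coverage_gain = point_gap("HieroSA (ZH)", baseline, "CO")
    invalid_drop = -point_gap("HieroSA (ZH)", baseline, "IS")
    print(f"vs {baseline}: +{coverage_gain} pp coverage, -{invalid_drop} pp invalid strokes")
```

Against both Chinese baselines in the table, the coverage gain exceeds 60 points and the drop in invalid strokes exceeds 80 points, consistent with the summary claim.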
5. Ablation Studies and Analysis
Ablation experiments detailed the effects of key hyperparameters:
- Invalid-stroke penalty: a weak penalty yielded high invalid-stroke rates (IS 64%, CO 42%), an intermediate setting provided balanced outcomes (CO 22%, IS 45%), and an excessive penalty degraded both metrics (CO 2.7%, IS 75.7%).
- Segment endpoints: optimal decomposition was achieved with two points per stroke (endpoints); three or four points degraded performance (lower RE and CO, higher IS).
- Format reward weight: an intermediate value was optimal; too small a weight led to output parsing errors, while too large a weight diverted the model's attention from coverage.
- Coordinate overlay: inclusion increased coverage by ∼10% and decreased invalid strokes by ∼5%.
6. Qualitative Observations and Representational Outcomes
HieroSA provided consistent and interpretable line-segment decompositions across scripts:
- Chinese “日”: four orthogonal segments aligned with geometric structure.
- Japanese Kanji “木”: vertical, horizontal, and two diagonals, reflecting semantic skeleton.
- Oracle Bone Script: highly pictographic glyphs parsed into a succinct sequence of segments tracing main contours.
Figure 1 in (Luo et al., 9 Jan 2026) illustrates oracle bone glyph segmentation in normalized coordinate space. This suggests applicability even to highly irregular and archaic writing forms.
7. Limitations, Generalization, and Future Directions
HieroSA dispenses with script-specific priors and stroke inventories, supporting application to under-documented scripts such as Dongba or Egyptian hieroglyphs. Performance varies with glyph complexity and geometric noise, and the current training data and model scale cover only a moderate range of glyph diversity. Proposed extensions include structure-aware denoising, larger or ensemble models for improved stability, stepwise filtering or ranking of candidate strokes during exploration, and expansion to spline-based primitives to better model curvature.
A plausible implication is that the explicit structuring of glyphs at the stroke level facilitates downstream applications: improved Optical Character Recognition (OCR) accuracy (+1 percentage point) and structure-guided retrieval of semantically related glyphs across heterogeneous scripts. This positions HieroSA as a potential core tool in graphematics and unsupervised script analysis (Luo et al., 9 Jan 2026).