Char-SAM: Zero-Shot Text Segmentation
- Char-SAM is an automated, training-free pipeline that uses character-level visual prompts to generate precise pixel-level scene text segmentation masks.
- It employs a two-stage refinement process—character bounding-box refinement and glyph-driven positive/negative point selection—to mitigate over- and under-segmentation errors.
- Evaluations on datasets like TextSeg and COCO_TS demonstrate that Char-SAM achieves near-supervised performance without the need for manual pixel annotations.
Char-SAM is an automated, training-free annotation pipeline for scene text segmentation that leverages the Segment Anything Model (SAM) with enhanced character-level visual prompts. Addressing the limitations of both word-level and character-level bounding box prompts for SAM, Char-SAM introduces a two-stage refinement mechanism that enables accurate, large-scale generation of pixel-level text segmentation masks across diverse real-world datasets, without requiring ground-truth pixel annotations or retraining (Xie et al., 27 Dec 2024).
1. Motivation and Problem Setting
Pixel-level scene text segmentation is critical for downstream applications such as text erasure, localized editing, and fine-grained OCR refinement, but producing such masks is substantially more labor-intensive than annotating bounding boxes at the word or character level. The Segment Anything Model (SAM) can convert bounding boxes into masks in a zero-shot manner, but using coarse word-level prompts leads SAM to merge adjacent characters, while naïve character-level boxes yield over-segmentation (e.g., mistakenly filling in the loop of an “A” or “D”) and under-segmentation (missing or broken glyph components). Char-SAM addresses these shortcomings by programmatically refining prompt granularity and incorporating character glyph information to guide the mask generation process – thus reducing annotation cost while markedly improving mask quality (Xie et al., 27 Dec 2024).
2. System Architecture and Pipeline
Char-SAM is structured in three principal stages:
- Input Assembly: The system takes as input an image $I$, a set of word-level bounding boxes $\{B_i\}$, and corresponding textual transcriptions $\{T_i\}$ from an existing detection dataset.
- Character Bounding-box Refinement (CBR):
- For each word box $B_i$ and its transcription $T_i$, a text detector (CRAFT) is run over $I$ cropped to $B_i$ to produce candidate character boxes $C_i$.
- If $|C_i| \neq |T_i|$, indicating a mismatch between the number of detected boxes and ground-truth characters, the word-level mask $M_0$ from SAM is split via watershed segmentation on the negative distance transform to obtain $|T_i|$ connected regions.
- Refined boxes are assigned to characters using bipartite matching to minimize spatial assignment costs.
- Character Glyph Refinement (CGR):
- For each character box $c_{i,j}$ with label $t_{i,j}$, a set of pre-rendered and binarized glyph templates $\{G_k\}_{k=1}^{K}$ over $K$ fonts is used.
- Votes are accumulated across templates at each pixel; pixels with high vote rates ($v(p) \geq \tau_{+}$, typically $\tau_{+} = 0.6$) constitute positive points $P^{+}$, while those with low vote rates ($v(p) \leq \tau_{-}$) form negative points $P^{-}$.
- The SAM prompt for this character becomes $(c_{i,j}, P^{+}, P^{-})$; SAM then produces a fine-grained pixel mask $m_{i,j}$.
This fully automated pipeline is training-free and leverages the bbox-to-mask conversion capability of SAM, augmented by precise character box delineation and glyph-informed point prompts.
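The bipartite assignment step in CBR can be sketched in a few lines; the cost function below (distance from each detected box centre to the centre of an evenly spaced character slot inside the word box) is an illustrative assumption, not necessarily the paper's exact cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_boxes_to_chars(char_boxes, word_box, n_chars):
    """Assign detected character boxes to transcription slots.

    char_boxes: list of (x0, y0, x1, y1) boxes from the detector.
    word_box:   (x0, y0, x1, y1) of the enclosing word.
    n_chars:    length of the ground-truth transcription.

    Cost (illustrative): horizontal distance between each detected box
    centre and the centre of the j-th of n_chars equal-width slots.
    Returns {slot_index: box} for the min-cost matching.
    """
    wx0, _, wx1, _ = word_box
    slot_w = (wx1 - wx0) / n_chars
    slot_centers = wx0 + slot_w * (np.arange(n_chars) + 0.5)
    box_centers = np.array([(b[0] + b[2]) / 2.0 for b in char_boxes])
    # cost[i, j] = |centre of detected box i - centre of slot j|
    cost = np.abs(box_centers[:, None] - slot_centers[None, :])
    rows, cols = linear_sum_assignment(cost)  # min-cost bipartite matching
    return dict(zip(cols.tolist(), [char_boxes[r] for r in rows]))
```

With the Hungarian algorithm (`linear_sum_assignment`), each transcription position receives exactly one detected box even when the detector returns them out of reading order.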
3. Mathematical Formulation and Prompt Generation
Let CRAFT detect a preliminary set $C_i$ of character boxes. If $|C_i|$ does not match the target sequence length $|T_i|$, SAM is invoked on the word box $B_i$ to produce a coarse mask $M_0$, which is split as follows:

$D \leftarrow \mathrm{distance\_transform}(M_0), \qquad R_1, \dots, R_{|T_i|} \leftarrow \mathrm{watershed}(-D)$

The bounding boxes of the watershed regions $R_j$ serve as refined character boxes. Final character boxes are assigned by bipartite matching to the transcription sequence.
In the CGR module, glyph templates $G_k$ (where $G_k(p) = 1$ if pixel $p$ is glyph foreground, $0$ otherwise) over $K$ fonts define a vote rate at each pixel $p$:

$v(p) = \frac{1}{K} \sum_{k=1}^{K} G_k(p)$

Pixels with $v(p)$ above the high threshold become positive points and pixels below the low threshold become negative points. These points, along with the refined character box, construct the full prompt for SAM.
No supervised loss is used; all mask outputs are generated without further training.
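The vote-rate computation and point selection reduce to a few lines of NumPy; the low threshold `lo` below is an assumed illustrative value, since only the high threshold (0.6) is specified:

```python
import numpy as np

def glyph_vote_points(templates, hi=0.6, lo=0.1):
    """Select positive/negative point prompts from binarized glyph templates.

    templates: array of shape (K, H, W), each a 0/1 glyph mask rendered in a
               different font and resized to the character box.
    hi:        vote-rate threshold for positive points (0.6 in the paper).
    lo:        vote-rate threshold for negative points (illustrative value;
               the paper's low threshold is not specified here).

    Returns (positive, negative): arrays of (row, col) pixel coordinates.
    """
    votes = templates.mean(axis=0)        # v(p) = (1/K) * sum_k G_k(p)
    positive = np.argwhere(votes >= hi)   # reliably inside the glyph
    negative = np.argwhere(votes <= lo)   # reliably background (e.g., holes)
    return positive, negative
```

Pixels where the fonts disagree (vote rates between the two thresholds) are simply left unprompted, so SAM resolves them from image evidence.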
4. Exploiting SAM: Bbox-to-mask and Visual Prompting
SAM’s promptable box-to-mask interface is the backbone of Char-SAM. Moving from word-level to character-level box prompts prevents adjacent characters from being merged, and the glyph-driven positive/negative points correct two systematic error modes:
- Positive points enable the recovery of missing strokes, especially in thin-font or low-contrast regions.
- Negative points mitigate over-segmentation, effectively excluding interior cuts and holes within letterforms such as “A” and “D”.
This approach exploits SAM’s general segmentation ability while incorporating lightweight scene text domain knowledge, all without retraining the model.
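Packing the refined box and glyph-vote points into the array layout a promptable segmenter expects can be sketched as follows; with the `segment_anything` package, these arrays would feed `SamPredictor.predict(point_coords=..., point_labels=..., box=...)`, where label 1 marks foreground and 0 background:

```python
import numpy as np

def build_sam_prompt(char_box, pos_pts, neg_pts):
    """Pack one character's prompt into SAM-style arrays (a sketch).

    char_box: (x0, y0, x1, y1) refined character box.
    pos_pts:  (N+, 2) array of foreground point coordinates (x, y).
    neg_pts:  (N-, 2) array of background point coordinates (x, y).
    """
    coords = np.concatenate([pos_pts, neg_pts]).astype(np.float32)
    # 1 = foreground point, 0 = background point
    labels = np.concatenate([np.ones(len(pos_pts)), np.zeros(len(neg_pts))])
    box = np.asarray(char_box, dtype=np.float32)
    return {"point_coords": coords, "point_labels": labels, "box": box}
```

Note that `glyph_vote_points`-style (row, col) coordinates would need to be flipped to (x, y) before packing.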
5. Experimentation: Datasets, Protocols, and Implementation
Char-SAM was evaluated on the following datasets:
- TextSeg (Xu et al., CVPR 2021): 4,024 images with complete word- and character-level box and pixel-mask annotations, supporting ablation and zero-shot evaluation.
- COCO-Text → COCO_TS: 14,690 images, previously annotated with weakly supervised pseudo masks.
- MLT17 → MLT_S: 6,896 images, similarly pseudo-labeled.
Char-SAM outputs refined segmentation annotations (COCO_TS_refined, MLT_S_refined), demonstrating higher mask fidelity compared to previous pseudo-annotation pipelines.
Implementation uses the ViT-B variant of SAM, the CRAFT text detector pretrained on SynthText, and a glyph template library (80 fonts in the final configuration, vote threshold 0.6).
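One way to build a single entry of such a glyph template library, using Pillow's built-in bitmap font for illustration (a real library would iterate over many TTF files via `ImageFont.truetype`):

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_glyph_template(char, size=(32, 32)):
    """Render one character and binarize it into a 0/1 glyph mask.

    Uses Pillow's default font for illustration; a full template library
    would render the same character across many fonts and resize each
    mask to the target character box.
    """
    img = Image.new("L", size, 0)                     # black canvas
    draw = ImageDraw.Draw(img)
    draw.text((2, 2), char, fill=255, font=ImageFont.load_default())
    return (np.asarray(img) > 127).astype(np.uint8)   # binarize
```

Stacking such masks over fonts yields the `(K, H, W)` template array that the vote-rate computation consumes.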
6. Results: Quantitative and Qualitative Performance
Extensive benchmarking on TextSeg demonstrates Char-SAM’s efficacy.
| Setting | fgIoU (%) | F (%) |
|---|---|---|
| Word-box baseline | 78.30 | 83.52 |
| + CBR only | 81.85 | 87.81 |
| + CBR + CGR (40 fonts) | 84.75 | 92.01 |
| + CBR + CGR (80 fonts, final) | 84.80 | 92.15 |
| Best supervised (TFT, multi-scale) | 87.11 | 93.10 |
Char-SAM’s zero-shot pipeline achieves an F-measure of 92.15% (single scale) compared to the supervised SOTA (93.10%), a gap of less than 1%, indicating near-supervised quality without additional manual labels.
Prompt ablation studies reveal additive improvements:
- Word-level boxes only: F = 83.52%
- + Character boxes (CBR): F = 87.81%
- + Glyph-driven positive points: further gains from recovered strokes
- + Positive and negative points together: F = 92.15% (final)
Qualitative analysis (Figures 1 and 6 in the source) documents elimination of halo-fills inside glyph holes, recovery of thin strokes, and smooth consistent masks—whereas prior approaches under- or over-segment, especially on bold or low-contrast fonts.
7. Advantages, Limitations, and Prospective Extensions
Advantages:
- Training-free and compatible with large, weakly annotated corpora; reuses out-of-the-box SAM and CRAFT.
- Produces markedly improved pseudo-masks over methods such as COCO_TS and MLT_S.
- Achieves virtually SOTA zero-shot segmentation quality on TextSeg.
Limitations:
- Relies on CRAFT’s ability to generalize; outlier fonts or layouts can degrade bounding box refinement.
- The glyph library is restricted to Latin alphanumeric scripts; extension to multilingual settings requires new templates.
- Hyperparameters (e.g., vote-rate thresholds, font set size) must be tuned for new domains.
- Inherited failure cases from SAM, including pronounced blur and extreme text perspective.
Potential improvements:
- Integration of light-weight learned adapters or fine-tuning modules for SAM on text segmentation.
- Automatic font library expansion or online glyph extraction for unsupervised script extension.
- Enhanced CBR modules harnessing learned grouping, bypassing reliance on heuristics and watershed splits.
- Richer prompting with text structural cues (e.g., stroke or tangent points) to better support curved or stylized text.
Char-SAM establishes a practical paradigm for leveraging promptable foundation models to scale high-quality scene text segmentation annotation, bridging the annotation gap for downstream text manipulation and recognition applications without costly manual mask labeling or retraining (Xie et al., 27 Dec 2024).