GLASTE: Global-Local Aware Scene Text Editing
- The paper introduces GLASTE, a GAN-based framework that jointly uses global inpainting and local style encoding to maintain image coherence and textual clarity.
- It employs a size-independent style encoder and an affine fusion module to adaptively manage variations in text length without inducing visual distortions.
- Benchmarking shows that GLASTE outperforms prior methods like SRNet and MOSTEL in terms of accuracy, perceptual similarity, and artifact minimization.
Global-Local Aware Scene Text Editing (GLASTE) is a generative adversarial network (GAN)-based framework designed to address two core challenges of scene text editing (STE): maintaining stylistic and spatial consistency between the edited patch and the surrounding image, and adapting to variations in the length of the replaced text without compromising visual realism or text recognizability. GLASTE achieves these goals through an end-to-end architecture that captures high-level global context via inpainting, encodes local text style and content with size-independent representations, and fuses the results using an affine module that preserves aspect ratio regardless of the difference between source and target text lengths. Benchmarking demonstrates that GLASTE surpasses prior STE methods across several recognition and perceptual similarity metrics, while qualitative analysis confirms its ability to avoid artifacts such as unnatural character stretching or patch-boundary inconsistencies (Yang et al., 3 Dec 2025).
1. Problem Setting and Motivations
Scene text editing requires replacing or modifying textual content in natural images such that the result is indistinguishable from the original in style, font, and background integration. Two persistent issues in prior STE methods are:
- Inconsistency: Failing to harmonize the edited text patch with the background, leading to visible artifacts at patch boundaries or residual shadows from prior text.
- Length-Insensitivity: Inability to naturally accommodate target texts whose character counts or aspect ratios diverge significantly from the source, often resulting in squeezed or overstretched glyphs.
GLASTE targets both issues by modeling global scene layout and local text information in parallel, and by introducing mechanisms for size-independent style transfer and adaptive fusion (Yang et al., 3 Dec 2025).
2. GLASTE Architecture and Components
GLASTE adopts a hybrid global-local architecture comprising four principal modules:
- Inpainting Module (Global Branch): Operates on the full image with the text region masked, using down-/up-sampling convolutional layers and Fast Fourier Convolution (FFC) blocks to restore the masked area so that it matches the overall image distribution (removing shadows and ensuring coherent illumination); a spectral-convolution sketch follows Table 1.
- Style and Content Encoders (Local Branch):
- Style Encoder: ResNet34 backbone, applies Rotated RoIAlign and global pooling to derive a 512-dimensional vector representing the text style, independent of patch geometry.
- Content Encoder: ResNet34 backbone to encode target content, producing a multi-level feature hierarchy.
- Text Synthesizer: Mirrors the content encoder with upsample blocks; the style code is injected at each block via Adaptive Instance Normalization (AdaIN), allowing the target text to inherit the desired appearance properties (a minimal AdaIN sketch follows this list).
- Affine Fusion Module: Fuses the synthesized foreground (edited text) into the inpainted background. Uses an affine transformation to map the target text region into the corresponding location, explicitly preserving aspect ratio, followed by residual blending to smooth borders and correct for mismatched region sizes.
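The AdaIN injection referenced above can be sketched as follows. This is a minimal illustration under stated assumptions: the linear projection from the 512-dimensional style code to per-channel scale/shift and the (1 + gamma) parameterization are choices of this sketch, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: normalize content features, then
    re-scale and shift them with statistics predicted from a style code."""
    def __init__(self, style_dim: int, num_channels: int):
        super().__init__()
        # Hypothetical projection: 512-d style vector -> per-channel
        # scale (gamma) and shift (beta).
        self.to_params = nn.Linear(style_dim, num_channels * 2)
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_params(style).chunk(2, dim=1)   # (B, C) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)             # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(content) + beta

# Usage: the same 512-d style code is injected at every upsample block.
feats = torch.randn(2, 256, 16, 64)   # content features at one block
style = torch.randn(2, 512)           # size-independent style vector
out = AdaIN(style_dim=512, num_channels=256)(feats, style)
```

Because the style vector carries no spatial extent, the same code can condition feature maps of any resolution, which is what keeps the injection agnostic to text length.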
Table 1 summarizes the key components:
| Module | Primary Role | Notable Details |
|---|---|---|
| Inpainting | Global harmonization | FFC blocks, receptive field ≈ full image |
| Style Encoder | Size-independent style representation | ResNet34, Rotated RoIAlign, avg. pooling to 512-dim |
| Text Synthesizer | Local patch synthesis | AdaIN, skip connections, residual upsample blocks |
| Affine Fusion | Patch fusion with aspect ratio-aware warp | Affine mapping, residual Conv blending |
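The near image-wide receptive field attributed to the inpainting module comes from the spectral path of its FFC blocks. Below is a minimal sketch of that spectral transform; the full FFC also carries a parallel local convolutional branch and normalization, which are omitted here.

```python
import torch
import torch.nn as nn

class SpectralConv(nn.Module):
    """Sketch of the spectral path of an FFC block: a 1x1 convolution applied
    in the frequency domain mixes information from every spatial position at
    once, which is what gives the inpainting branch a near image-wide
    receptive field."""
    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis,
        # so the pointwise conv sees 2 * channels feature maps.
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ffted = torch.fft.rfft2(x, norm="ortho")             # (B, C, H, W//2+1), complex
        ffted = torch.cat([ffted.real, ffted.imag], dim=1)   # (B, 2C, H, W//2+1)
        ffted = self.act(self.conv(ffted))
        real, imag = ffted.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

# Usage: a single block already mixes features across the whole 64x64 grid.
y = SpectralConv(64)(torch.randn(1, 64, 64, 64))
```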
3. Loss Functions and Training Objectives
Global-Local Aware Scene Text Editing is trained using a joint loss framework:
- Global Loss $\mathcal{L}_{\text{global}}$: Enforces holistic visual realism across the entire image,
$\mathcal{L}_{\text{global}} = \mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{rec}} + \mathcal{L}_{\text{per}}$,
where $\mathcal{L}_{\text{adv}}$ is the adversarial loss from a PatchGAN discriminator, $\mathcal{L}_{\text{rec}}$ is the global reconstruction loss, and $\mathcal{L}_{\text{per}}$ is the perceptual loss using VGG19 features.
- Local Loss $\mathcal{L}_{\text{local}}$: Ensures fidelity and recognizability in the edited text patch; $\mathcal{L}_{\text{local}}$ includes a CTC-based recognition loss using a CRNN recognizer.
The total loss is a weighted sum,
$\mathcal{L} = \mathcal{L}_{\text{global}} + \lambda\,\mathcal{L}_{\text{local}}$,
with $\lambda$ balancing global and local contributions.
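A schematic of how this joint objective can be assembled, assuming L1 reconstruction terms, a hinge-style generator adversarial term, and PyTorch's standard CTC loss; the paper's exact per-term weights and adversarial formulation are not reproduced here, and all function and parameter names are illustrative.

```python
import torch.nn.functional as F

def global_loss(fake_img, real_img, disc_logits_fake, vgg_feats):
    """L_global = L_adv + L_rec + L_per over the full image."""
    l_adv = -disc_logits_fake.mean()             # generator-side PatchGAN term
    l_rec = F.l1_loss(fake_img, real_img)        # global reconstruction
    l_per = F.l1_loss(vgg_feats(fake_img), vgg_feats(real_img))  # VGG19 perceptual
    return l_adv + l_rec + l_per

def local_loss(fake_patch, real_patch, rec_logits, targets, input_lens, target_lens):
    """L_local: patch fidelity plus CTC recognizability via a CRNN head."""
    l_rec = F.l1_loss(fake_patch, real_patch)
    log_probs = rec_logits.log_softmax(-1).permute(1, 0, 2)  # (T, B, classes)
    l_ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    return l_rec + l_ctc

# total = global_loss(...) + lam * local_loss(...)  # lam weighs the local branch
```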
4. Key Algorithmic Innovations
Size-Independent Style Encoding
GLASTE’s style encoder yields a 512-dimensional representation extracted via Rotated RoIAlign and global pooling. By discarding explicit spatial dimensions, the style embedding is invariant to the size of the input text patch and can transfer style even when target and source lengths diverge.
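A minimal sketch of such an encoder, using torchvision's axis-aligned roi_align as a stand-in for Rotated RoIAlign (a rotated variant is not part of core torchvision); the pooled output size and box format are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34
from torchvision.ops import roi_align

class StyleEncoder(nn.Module):
    """Sketch of a size-independent style encoder: ResNet34 features are
    pooled over the text region, then averaged to a fixed 512-d vector."""
    def __init__(self):
        super().__init__()
        backbone = resnet34(weights=None)
        # Drop avgpool and fc; output is (B, 512, H/32, W/32).
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, image: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        feats = self.features(image)
        # boxes: (N, 5) = [batch_idx, x1, y1, x2, y2] in input-pixel coords.
        rois = roi_align(feats, boxes, output_size=(4, 16), spatial_scale=1 / 32)
        return rois.mean(dim=(2, 3))  # global average pooling -> (N, 512)

enc = StyleEncoder()
img = torch.randn(1, 3, 256, 512)
box = torch.tensor([[0.0, 32, 96, 480, 160]])  # hypothetical text region
style = enc(img, box)                          # (1, 512), independent of box size
```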
Affine Fusion for Length Adaptivity
An affine transformation is derived to place the synthesized foreground $F$ into the inpainted background $B$. Writing $r_F$ for the aspect ratio (width/height) of the synthesized text and $r_R$ for that of the target region, the mapping aligns the two as follows:
- For $r_F \le r_R$, height is matched; margins are left at the sides.
- For $r_F > r_R$, width is matched; text is wrapped or extended smoothly.
Residual convolution blocks then blend the two regions, preventing artifacts from misaligned or stretched patches.
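The aspect-ratio-preserving placement can be sketched as a deterministic resize-and-paste; the learned affine parameters and residual blending convolutions of the actual module are not reproduced here.

```python
import torch
import torch.nn.functional as F

def place_foreground(fg: torch.Tensor, region_hw: tuple) -> torch.Tensor:
    """Paste a synthesized text patch fg (B, C, h, w) into a target region of
    size region_hw = (rh, rw) without distorting its aspect ratio."""
    b, c, fh, fw = fg.shape
    rh, rw = region_hw
    if fw / fh <= rw / rh:
        scale = rh / fh  # narrower than the region: match height, leave side margins
    else:
        scale = rw / fw  # wider than the region: match width instead of stretching
    new_h, new_w = round(fh * scale), round(fw * scale)
    fg = F.interpolate(fg, size=(new_h, new_w), mode="bilinear", align_corners=False)
    canvas = torch.zeros(b, c, rh, rw, dtype=fg.dtype, device=fg.device)
    top, left = (rh - new_h) // 2, (rw - new_w) // 2
    canvas[:, :, top:top + new_h, left:left + new_w] = fg
    return canvas

patch = torch.randn(1, 3, 32, 64)            # short target text, 2:1 aspect
placed = place_foreground(patch, (32, 256))  # wide region: height matched, side margins
```

In the full model, this canvas and the inpainted background would then pass through the residual convolution blocks that smooth the seam.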
5. Experimental Results and Comparative Benchmarks
On real-world datasets (ICDAR2015, MLT’17/19, SROIE, ICDAR2019, ICDAR2017rctw) and synthetic data, GLASTE demonstrates superior quantitative performance compared to prior art (SRNet, TextStyleBrush, MOSTEL, DIFFSTE):
| Method | MSE ↓ | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ | Acc (%) ↑ |
|---|---|---|---|---|---|---|
| SRNet | 0.336 | 16.9 | 0.505 | 0.258 | 26.9 | 74.6 |
| TextStyleBrush | 0.227 | 17.8 | 0.563 | 0.250 | 27.3 | 58.7 |
| MOSTEL | 0.280 | 17.9 | 0.491 | 0.274 | 44.6 | 64.1 |
| DIFFSTE | 0.923 | 11.1 | 0.203 | 0.476 | 54.2 | 7.9 |
| GLASTE | 0.108 | 22.4 | 0.721 | 0.129 | 12.0 | 83.7 |
GLASTE maintains high accuracy and low character error rates even under extreme text-length variation (e.g., recognition accuracy of 84.3%/88.2% for target lengths 1/2, and CER of 0.073/0.086 for lengths 9/10). Qualitatively, the inpainting module reliably erases shadows of the original text, and the adaptive fusion prevents unnatural stretching or squeezing artifacts (Yang et al., 3 Dec 2025).
6. Relation to Contemporary and Prior Approaches
GLASTE contrasts with diffusion-based frameworks such as GlyphMastero (Wang et al., 8 May 2025), which employs a cross-level glyph attention module and a feature pyramid network for multi-scale OCR feature fusion. Key distinctions:
- Generator Backbone: GLASTE is GAN-based with convolutional generators; GlyphMastero uses latent diffusion with explicit Transformer-based attention.
- Style Fusion: GLASTE uses a global style encoder and AdaIN; GlyphMastero introduces cross-level glyph attention to align local strokes with global line context.
- Losses: GLASTE employs GAN and perceptual/recognition losses; GlyphMastero trains solely with a diffusion noise-prediction loss.
A plausible implication is that future hybrid frameworks might combine GLASTE’s adaptive affine fusion and global-local loss design with diffusion model benefits for further advances (Wang et al., 8 May 2025).
7. Limitations and Prospects
GLASTE’s reliance on GAN-based optimization, while advantageous for computational throughput and adaptive style transfer, imposes limitations on photorealism and fine-grained character-level control. Diffusion models have demonstrated improved stroke-level fidelity but are computationally expensive and can introduce character omissions or repetitions. Future directions include hybrid GAN–diffusion designs and explicit conditioning strategies to address overfitting and achieve even higher fidelity (Yang et al., 3 Dec 2025).