GLASTE: Global-Local Aware Scene Text Editing
- The paper introduces GLASTE, a GAN-based framework that jointly uses global inpainting and local style encoding to maintain image coherence and textual clarity.
- It employs a size-independent style encoder and an affine fusion module to adaptively manage variations in text length without inducing visual distortions.
- Benchmarking shows that GLASTE outperforms prior methods like SRNet and MOSTEL in terms of accuracy, perceptual similarity, and artifact minimization.
Global-Local Aware Scene Text Editing (GLASTE) is a generative adversarial network (GAN)-based framework designed to address two core challenges of scene text editing (STE): maintaining stylistic and spatial consistency between the edited patch and the surrounding image, and adapting to variations in the length of the replaced text without compromising visual realism or text recognizability. GLASTE achieves these goals through an end-to-end architecture that captures high-level global context via inpainting, encodes local text style and content with size-independent representations, and fuses the results using an affine module that preserves aspect ratio regardless of the difference between source and target text lengths. Benchmarking demonstrates that GLASTE surpasses prior STE methods across several recognition and perceptual similarity metrics, while qualitative analysis confirms its ability to avoid artifacts such as unnatural character stretching or patch-boundary inconsistencies (Yang et al., 3 Dec 2025).
1. Problem Setting and Motivations
Scene text editing requires replacing or modifying textual content in natural images such that the result is indistinguishable from the original in style, font, and background integration. Two persistent issues in prior STE methods are:
- Inconsistency: Failing to harmonize the edited text patch with the background, leading to visible artifacts at patch boundaries or residual shadows from prior text.
- Length-Insensitivity: Inability to naturally accommodate target texts whose character counts or aspect ratios diverge significantly from the source, often resulting in squeezed or overstretched glyphs.
GLASTE targets both issues by modeling global scene layout and local text information in parallel, and by introducing mechanisms for size-independent style transfer and adaptive fusion (Yang et al., 3 Dec 2025).
2. GLASTE Architecture and Components
GLASTE adopts a hybrid global-local architecture comprising four principal modules:
- Inpainting Module (Global Branch): Operates on the full image with the text region masked, using down-/up-sampling convolutional layers and Fast Fourier Convolution (FFC) blocks to restore the masked area so that it matches the overall image distribution (removing shadows and ensuring coherent illumination); a spectral-convolution sketch follows Table 1.
- Style and Content Encoders (Local Branch):
- Style Encoder: ResNet34 backbone, applies Rotated RoIAlign and global pooling to derive a 512-dimensional vector representing the text style, independent of patch geometry.
- Content Encoder: ResNet34 backbone to encode target content, producing a multi-level feature hierarchy.
- Text Synthesizer: Mirrors the content encoder with upsample blocks; the style code is injected at each block via Adaptive Instance Normalization (AdaIN), allowing the target text to inherit the desired appearance properties (a minimal AdaIN sketch follows this list).
- Affine Fusion Module: Fuses the synthesized foreground (edited text) into the inpainted background. Uses an affine transformation to map the target text region into the corresponding location, explicitly preserving aspect ratio, followed by residual blending to smooth borders and correct for mismatched region sizes.
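The AdaIN injection referenced above can be sketched as follows. This is a minimal illustration under stated assumptions: the linear projection from the 512-dimensional style code to per-channel scale/shift and the (1 + gamma) parameterization are choices of this sketch, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: normalize content features, then
    re-scale and shift them with statistics predicted from a style code."""
    def __init__(self, style_dim: int, num_channels: int):
        super().__init__()
        # Hypothetical projection: 512-d style vector -> per-channel
        # scale (gamma) and shift (beta).
        self.to_params = nn.Linear(style_dim, num_channels * 2)
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_params(style).chunk(2, dim=1)   # (B, C) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)             # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(content) + beta

# Usage: the same 512-d style code is injected at every upsample block.
feats = torch.randn(2, 256, 16, 64)   # content features at one block
style = torch.randn(2, 512)           # size-independent style vector
out = AdaIN(style_dim=512, num_channels=256)(feats, style)
```

Because the style vector carries no spatial extent, the same code can condition feature maps of any resolution, which is what keeps the injection agnostic to text length.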
Table 1 summarizes the key components:
| Module | Primary Role | Notable Details |
|---|---|---|
| Inpainting | Global harmonization | FFC blocks, receptive field ≈ full image |
| Style Encoder | Size-independent style representation | ResNet34, Rotated RoIAlign, avg. pooling to 512-dim |
| Text Synthesizer | Local patch synthesis | AdaIN, skip connections, residual upsample blocks |
| Affine Fusion | Patch fusion with aspect ratio-aware warp | Affine mapping, residual Conv blending |
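The near image-wide receptive field attributed to the inpainting module comes from the spectral path of its FFC blocks. Below is a minimal sketch of that spectral transform; the full FFC also carries a parallel local convolutional branch and normalization, which are omitted here.

```python
import torch
import torch.nn as nn

class SpectralConv(nn.Module):
    """Sketch of the spectral path of an FFC block: a 1x1 convolution applied
    in the frequency domain mixes information from every spatial position at
    once, which is what gives the inpainting branch a near image-wide
    receptive field."""
    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis,
        # so the pointwise conv sees 2 * channels feature maps.
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ffted = torch.fft.rfft2(x, norm="ortho")             # (B, C, H, W//2+1), complex
        ffted = torch.cat([ffted.real, ffted.imag], dim=1)   # (B, 2C, H, W//2+1)
        ffted = self.act(self.conv(ffted))
        real, imag = ffted.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

# Usage: a single block already mixes features across the whole 64x64 grid.
y = SpectralConv(64)(torch.randn(1, 64, 64, 64))
```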
3. Loss Functions and Training Objectives
Global-Local Aware Scene Text Editing is trained using a joint loss framework:
- Global Loss $\mathcal{L}_{\text{global}}$: Enforces holistic visual realism across the entire image,
$\mathcal{L}_{\text{global}} = \mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{rec}} + \mathcal{L}_{\text{per}}$,
where $\mathcal{L}_{\text{adv}}$ is the adversarial loss from a PatchGAN discriminator, $\mathcal{L}_{\text{rec}}$ is the global reconstruction loss, and $\mathcal{L}_{\text{per}}$ is the perceptual loss using VGG19 features.
- Local Loss $\mathcal{L}_{\text{local}}$: Ensures fidelity and recognizability in the edited text patch; $\mathcal{L}_{\text{local}}$ includes a CTC-based recognition loss using a CRNN recognizer.
The total loss is a weighted sum,
$\mathcal{L} = \mathcal{L}_{\text{global}} + \lambda\,\mathcal{L}_{\text{local}}$,
with $\lambda$ balancing global and local contributions.
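A schematic of how this joint objective can be assembled, assuming L1 reconstruction terms, a hinge-style generator adversarial term, and PyTorch's standard CTC loss; the paper's exact per-term weights and adversarial formulation are not reproduced here, and all function and parameter names are illustrative.

```python
import torch.nn.functional as F

def global_loss(fake_img, real_img, disc_logits_fake, vgg_feats):
    """L_global = L_adv + L_rec + L_per over the full image."""
    l_adv = -disc_logits_fake.mean()             # generator-side PatchGAN term
    l_rec = F.l1_loss(fake_img, real_img)        # global reconstruction
    l_per = F.l1_loss(vgg_feats(fake_img), vgg_feats(real_img))  # VGG19 perceptual
    return l_adv + l_rec + l_per

def local_loss(fake_patch, real_patch, rec_logits, targets, input_lens, target_lens):
    """L_local: patch fidelity plus CTC recognizability via a CRNN head."""
    l_rec = F.l1_loss(fake_patch, real_patch)
    log_probs = rec_logits.log_softmax(-1).permute(1, 0, 2)  # (T, B, classes)
    l_ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    return l_rec + l_ctc

# total = global_loss(...) + lam * local_loss(...)  # lam weighs the local branch
```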
4. Key Algorithmic Innovations
Size-Independent Style Encoding
GLASTE’s style encoder yields a 512-dimensional representation extracted via Rotated RoIAlign and global pooling. By discarding explicit spatial dimensions, the style embedding is invariant to the size of the input text patch and can transfer style even when target and source lengths diverge.
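A minimal sketch of such an encoder, using torchvision's axis-aligned roi_align as a stand-in for Rotated RoIAlign (a rotated variant is not part of core torchvision); the pooled output size and box format are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34
from torchvision.ops import roi_align

class StyleEncoder(nn.Module):
    """Sketch of a size-independent style encoder: ResNet34 features are
    pooled over the text region, then averaged to a fixed 512-d vector."""
    def __init__(self):
        super().__init__()
        backbone = resnet34(weights=None)
        # Drop avgpool and fc; output is (B, 512, H/32, W/32).
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, image: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        feats = self.features(image)
        # boxes: (N, 5) = [batch_idx, x1, y1, x2, y2] in input-pixel coords.
        rois = roi_align(feats, boxes, output_size=(4, 16), spatial_scale=1 / 32)
        return rois.mean(dim=(2, 3))  # global average pooling -> (N, 512)

enc = StyleEncoder()
img = torch.randn(1, 3, 256, 512)
box = torch.tensor([[0.0, 32, 96, 480, 160]])  # hypothetical text region
style = enc(img, box)                          # (1, 512), independent of box size
```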
Affine Fusion for Length Adaptivity
An affine transformation is derived to place the synthesized foreground $F$ into the inpainted background $B$. Writing $r_F$ for the aspect ratio (width/height) of the synthesized text and $r_R$ for that of the target region, the mapping aligns the two as follows:
- For $r_F \le r_R$, height is matched; margins are left at the sides.
- For $r_F > r_R$, width is matched; text is wrapped or extended smoothly.
Residual convolution blocks then blend the two regions, preventing artifacts from misaligned or stretched patches.
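The aspect-ratio-preserving placement can be sketched as a deterministic resize-and-paste; the learned affine parameters and residual blending convolutions of the actual module are not reproduced here.

```python
import torch
import torch.nn.functional as F

def place_foreground(fg: torch.Tensor, region_hw: tuple) -> torch.Tensor:
    """Paste a synthesized text patch fg (B, C, h, w) into a target region of
    size region_hw = (rh, rw) without distorting its aspect ratio."""
    b, c, fh, fw = fg.shape
    rh, rw = region_hw
    if fw / fh <= rw / rh:
        scale = rh / fh  # narrower than the region: match height, leave side margins
    else:
        scale = rw / fw  # wider than the region: match width instead of stretching
    new_h, new_w = round(fh * scale), round(fw * scale)
    fg = F.interpolate(fg, size=(new_h, new_w), mode="bilinear", align_corners=False)
    canvas = torch.zeros(b, c, rh, rw, dtype=fg.dtype, device=fg.device)
    top, left = (rh - new_h) // 2, (rw - new_w) // 2
    canvas[:, :, top:top + new_h, left:left + new_w] = fg
    return canvas

patch = torch.randn(1, 3, 32, 64)            # short target text, 2:1 aspect
placed = place_foreground(patch, (32, 256))  # wide region: height matched, side margins
```

In the full model, this canvas and the inpainted background would then pass through the residual convolution blocks that smooth the seam.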
5. Experimental Results and Comparative Benchmarks
On real-world datasets (ICDAR2015, MLT’17/19, SROIE, ICDAR2019, ICDAR2017rctw) and synthetic data, GLASTE demonstrates superior quantitative performance compared to prior art (SRNet, TextStyleBrush, MOSTEL, DIFFSTE):
| Method | MSE ↓ | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ | Acc (%) ↑ |
|---|---|---|---|---|---|---|
| SRNet | 0.336 | 16.9 | 0.505 | 0.258 | 26.9 | 74.6 |
| TextStyleBrush | 0.227 | 17.8 | 0.563 | 0.250 | 27.3 | 58.7 |
| MOSTEL | 0.280 | 17.9 | 0.491 | 0.274 | 44.6 | 64.1 |
| DIFFSTE | 0.923 | 11.1 | 0.203 | 0.476 | 54.2 | 7.9 |
| GLASTE | 0.108 | 22.4 | 0.721 | 0.129 | 12.0 | 83.7 |
GLASTE maintains high accuracy and low character error rates even under extreme text-length variation (e.g., recognition accuracy of 84.3%/88.2% for target lengths 1/2, and CER of 0.073/0.086 for lengths 9/10). Qualitatively, the inpainting module reliably erases shadows of the original text, and the adaptive fusion prevents unnatural stretching or squeezing artifacts (Yang et al., 3 Dec 2025).
6. Relation to Contemporary and Prior Approaches
GLASTE contrasts with diffusion-based frameworks such as GlyphMastero (Wang et al., 8 May 2025), which employs a cross-level glyph attention module and a feature pyramid network for multi-scale OCR feature fusion. Key distinctions:
- Generator Backbone: GLASTE is GAN-based with convolutional generators; GlyphMastero uses latent diffusion with explicit Transformer-based attention.
- Style Fusion: GLASTE uses a global style encoder and AdaIN; GlyphMastero introduces cross-level glyph attention to align local strokes with global line context.
- Losses: GLASTE employs GAN and perceptual/recognition losses; GlyphMastero trains solely with a diffusion noise-prediction loss.
A plausible implication is that future hybrid frameworks might combine GLASTE’s adaptive affine fusion and global-local loss design with diffusion model benefits for further advances (Wang et al., 8 May 2025).
7. Limitations and Prospects
GLASTE’s reliance on GAN-based optimization, while advantageous for computational throughput and adaptive style transfer, imposes limitations on photorealism and fine-grained character-level control. Diffusion models have demonstrated improved stroke-level fidelity but are computationally expensive and can introduce character omissions or repetitions. Future directions include hybrid GAN–diffusion designs and explicit conditioning strategies to address overfitting and achieve even higher fidelity (Yang et al., 3 Dec 2025).