Conditional VAE-GAN for BanglaWriting
- The paper introduces a two-stage Conditional VAE-GAN pipeline that combines latent diversity with adversarial refinement to synthesize Bangla writing images.
- The CVAE stage creates coarse sketches using text embeddings, while the CGAN stage upsamples and sharpens these sketches with conditioning augmentation.
- Evaluation using Inception Score and FID demonstrates the model's ability to produce diverse and photorealistic Bangla handwriting images.
Conditional VAE-GAN for BanglaWriting refers to a two-stage neural architecture adapted for synthesizing images of Bangla writing conditioned on textual input. The methodology combines the strengths of Conditional Variational Autoencoders (CVAE) and Conditional Generative Adversarial Networks (CGAN) in a sequential manner, enabling both diversity in generated stroke layouts and photorealistic sharpness. The architecture extends core principles introduced in the context of text-to-image synthesis for natural images to the domain of Bangla character or handwriting image generation (Tibebu et al., 2022).
1. Two-Stage Conditional Pipeline
The CVAE-CGAN framework decomposes BanglaWriting synthesis into two stages. Stage I leverages the CVAE to generate a coarse, low-resolution "sketch" reflecting global structural cues given by the textual embedding. Stage II employs the CGAN to upsample and refine the sketch, yielding high-resolution, realistic images aligned to the semantic content and stylistic variability of Bangla script. A plausible implication is that this modular design allows controlled diversity in morphological features (via latent variable sampling in the CVAE), while GAN-based refinement ensures crisp, visually plausible glyphs representative of handwritten or printed Bangla.
Stage I: Conditional VAE (Sketch Generator)
- Inputs: $x$, a low-resolution Bangla writing image; $\varphi_t$, the text embedding.
- Encoder fuses $\varphi_t$ and image features and outputs a Gaussian latent $z \sim \mathcal{N}(\mu, \sigma^2)$ conditioned on $\varphi_t$.
- Decoder reconstructs the image $\hat{x}$ given $(z, \varphi_t)$, with KL-divergence regularization enforcing output diversity.
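The latent sampling step above uses the standard reparameterization trick. A minimal NumPy sketch (batch size and the 2048-D latent width are illustrative; the encoder/decoder networks themselves are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the draw differentiable w.r.t. (mu, sigma)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Illustrative encoder outputs for a batch of 4 images, 2048-D latent.
mu = np.zeros((4, 2048))
log_var = np.zeros((4, 2048))   # log-variance 0 => sigma = 1
z = reparameterize(mu, log_var, rng)
print(z.shape)                  # (4, 2048)
```

Because the noise enters additively, gradients can flow through `mu` and `log_var` during training, which is what makes the KL-regularized latent usable inside Stage I.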
Stage II: Conditional GAN (Refinement Network)
- Inputs: the CVAE output sketch $\hat{x}$ and the text conditioning $\varphi_t$.
- Generator $G$ upsamples and sharpens the sketch to the target resolution.
- Discriminator $D$ evaluates real vs. generated images, concatenated with the text code $\varphi_t$.
- Residual blocks and "conditioning augmentation" modules in $G$ stabilize training and enhance details.
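Concatenating the text code with discriminator features requires replicating the code across the spatial grid. A small NumPy sketch of that operation (NHWC layout, and the 512/128 channel widths, are assumptions for illustration):

```python
import numpy as np

def concat_text_code(features, text_code):
    """Spatially replicate a per-image text code and concatenate it onto conv features (NHWC)."""
    n, h, w, c = features.shape
    tiled = np.broadcast_to(text_code[:, None, None, :],
                            (n, h, w, text_code.shape[-1]))
    return np.concatenate([features, tiled], axis=-1)

feats = np.zeros((2, 4, 4, 512))   # illustrative discriminator feature map
phi_t = np.zeros((2, 128))         # illustrative text embedding
out = concat_text_code(feats, phi_t)
print(out.shape)                   # (2, 4, 4, 640)
```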
2. Objective Functions and Training Losses
The system's objective functions reflect its dual purpose:
CVAE Loss (Stage I)
The Evidence Lower Bound is

$$\mathcal{L}_{\text{CVAE}} = \mathbb{E}_{q(z \mid x, \varphi_t)}\!\left[\log p(x \mid z, \varphi_t)\right] - D_{\mathrm{KL}}\!\left(q(z \mid x, \varphi_t) \,\|\, p(z)\right),$$

with the reconstruction and KL terms driving accurate sketch recovery and latent diversity, respectively.
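The two ELBO terms can be computed in closed form for a diagonal Gaussian posterior and a Gaussian likelihood. A NumPy sketch (the MSE reconstruction term is a common Gaussian-likelihood proxy, not a detail taken from the cited work):

```python
import numpy as np

def kl_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def cvae_loss(x, x_hat, mu, log_var):
    """Negative ELBO: MSE reconstruction plus KL regularizer."""
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    return recon + kl_standard_normal(mu, log_var)

x = np.ones((1, 8)); x_hat = np.ones((1, 8))      # perfect reconstruction
mu = np.zeros((1, 4)); log_var = np.zeros((1, 4)) # posterior equals the prior
print(cvae_loss(x, x_hat, mu, log_var))           # [0.]
```

At the optimum shown (exact reconstruction, posterior matching the prior) both terms vanish; pushing `mu` away from zero raises the KL term, which is what sustains latent diversity.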
CGAN Loss (Stage II)
The Discriminator $D$ aims to maximize

$$\mathcal{L}_D = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x, \varphi_t)\right] + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z, \hat{c}), \varphi_t)\right)\right],$$

while the Generator $G$ minimizes the adversarial loss plus a KL regularizer on the conditioning augmentation vector $\hat{c}$:

$$\mathcal{L}_G = \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z, \hat{c}), \varphi_t)\right)\right] + \lambda\, D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\right).$$

This enforces both semantic fidelity and distributional robustness for text conditioning.
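The Stage II losses reduce to scalar computations once the discriminator scores are known. A NumPy sketch (the generator uses the non-saturating $-\log D$ form, a common practical substitute for minimizing $\log(1-D)$; $\lambda = 1$ is an arbitrary choice here):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator maximizes log D(x) + log(1 - D(G(z))); we return the negative to minimize."""
    return float(-(np.log(d_real) + np.log(1.0 - d_fake)).mean())

def g_loss(d_fake, mu_c, log_var_c, lam=1.0):
    """Generator: non-saturating adversarial term plus KL on the CA distribution."""
    kl = 0.5 * np.sum(np.exp(log_var_c) + mu_c**2 - 1.0 - log_var_c, axis=-1)
    return float((-np.log(d_fake) + lam * kl).mean())

d_real = np.array([0.9])                 # D is fairly sure the real image is real
d_fake = np.array([0.1])                 # ...and that the fake is fake
mu_c = np.zeros((1, 4)); log_var_c = np.zeros((1, 4))
print(d_loss(d_real, d_fake))
print(g_loss(d_fake, mu_c, log_var_c))
```

With the CA distribution matching the prior, the KL term is zero and the generator loss is purely adversarial; a confident discriminator (low `d_fake`) yields a large generator gradient, which is the point of the non-saturating form.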
3. Conditioning Augmentation Mechanism
Conditioning Augmentation (CA) injects controlled stochasticity:
- The text embedding $\varphi_t$ (from a char-CNN+RNN; for Bangla, a BiLSTM or Transformer encoder is applicable) is mapped to $\mu(\varphi_t)$ and $\sigma(\varphi_t)$ by dense layers.
- The conditioning vector is sampled as $\hat{c} = \mu(\varphi_t) + \sigma(\varphi_t) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and injected into both CVAE and CGAN stages.
- Regularization via $D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\right)$ mitigates mode collapse and keeps $\hat{c}$ well-behaved.
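The CA module is a pair of dense maps followed by the same reparameterized sampling. A NumPy sketch (the 256-D embedding, 128-D conditioning vector, weight scales, and the omitted biases are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def conditioning_augmentation(phi_t, W_mu, W_sig, rng):
    """Map a text embedding to (mu, sigma) via dense layers and sample c_hat = mu + sigma * eps."""
    mu = phi_t @ W_mu                # dense layer (bias omitted for brevity)
    sigma = np.exp(phi_t @ W_sig)    # exp keeps the std-dev positive
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

phi_t = rng.standard_normal((2, 256))          # illustrative 256-D text embeddings
W_mu = 0.01 * rng.standard_normal((256, 128))
W_sig = 0.01 * rng.standard_normal((256, 128))
c_hat = conditioning_augmentation(phi_t, W_mu, W_sig, rng)
print(c_hat.shape)                             # (2, 128)
```

Each forward pass draws a different $\hat{c}$ for the same text, which is the source of the controlled stochasticity described above.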
A plausible implication is that incorporating CA for per-character or per-stroke embeddings enables the generation pipeline to model subtle variations and maintain rich output diversity in Bangla glyph structure.
4. Architectural Specifications
Both stages employ convolutional and residual blocks, batch normalization, and activation strategies optimized for stability and expressivity.
Stage I: CVAE
- Encoder: 5 convolutions (5×5, stride 2) downsample the image to a small spatial size; the text embedding $\varphi_t$ is reshaped and concatenated; the output is a 2048-D latent parameterization $(\mu, \sigma)$.
- Decoder: the concatenated latent $z$ and text code $\varphi_t$ are passed through successive Conv2DTranspose layers, upsampling back to the input resolution, with a final Tanh activation.
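Each stride-2 convolution roughly halves the spatial size, so the encoder's output resolution follows directly from the input size. A small sketch (the 64×64 input is an assumption; the cited resolution is not recoverable from this text):

```python
def downsampled_size(size, n_layers, stride=2):
    """Spatial size after n stride-2, 'same'-padded convolutions (ceil division per layer)."""
    for _ in range(n_layers):
        size = -(-size // stride)   # integer ceil(size / stride)
    return size

# Illustrative: a 64x64 input after the five stride-2 encoder convolutions.
print(downsampled_size(64, 5))   # 2
```

The decoder's Conv2DTranspose stack simply inverts this schedule, doubling the spatial size at each layer until the input resolution is restored.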
Stage II: CGAN
- Generator: downsampling convolutions, residual blocks at the bottleneck, then upsampling to the target resolution.
- Discriminator: 5 convolutional blocks reduce the image to compact features; concatenation with the text code $\varphi_t$; a dense layer with sigmoid output; LeakyReLU activations.
For BanglaWriting generation, per-character or per-stroke text embeddings are utilized in place of English-language descriptors, implying strong adaptability to complex script morphologies.
5. Training Protocols and Hyperparameters
Optimization details include:
- Noise prior $z \sim \mathcal{N}(0, I)$.
- Conditioning vector dimensionality $128$–$512$.
- Adam optimizer for both generator and discriminator, with learning rates decayed every $25$ epochs.
- Epochs: $150$ for each stage.
- Batch sizes of $64$–$128$ (not specified in the cited work).
- KL weight $\lambda$ for the CA regularizer set via cross-validation.
- ReLU activations in the generators, LeakyReLU in the discriminator.
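A step-decay schedule of the kind described reduces the learning rate at fixed epoch intervals. A minimal sketch (the base rate of $2 \times 10^{-4}$ and the halving factor are assumptions; only the 25-epoch interval comes from the text above):

```python
def decayed_lr(base_lr, epoch, decay_every=25, factor=0.5):
    """Step decay: multiply the learning rate by `factor` once per `decay_every` epochs."""
    return base_lr * factor ** (epoch // decay_every)

print(decayed_lr(2e-4, 0))    # 0.0002
print(decayed_lr(2e-4, 24))   # still 0.0002 (first interval not yet complete)
print(decayed_lr(2e-4, 50))   # two decays applied
```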
This protocol ensures robust convergence and regularization across both diversity and realism objectives.
6. Quantitative Evaluation Metrics
The system outputs are assessed as follows:
- Inception Score (IS): $\mathrm{IS} = \exp\!\left(\mathbb{E}_{x}\!\left[D_{\mathrm{KL}}\!\left(p(y \mid x) \,\|\, p(y)\right)\right]\right)$, quantifying both image diversity and classifier confidence.
- Fréchet Inception Distance (FID): $\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$, where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are Inception-feature statistics of real and generated images. Lower FID values indicate closer correspondence between synthesized and real Bangla writing image distributions.
These metrics enable direct empirical comparison with prior art (e.g., StackGAN, GAN-INT-CLS).
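Both metrics are straightforward to compute from classifier posteriors and feature statistics. A NumPy sketch (the FID version below assumes diagonal covariances, so the matrix square root reduces to an elementwise one; the toy posteriors are fabricated for illustration only):

```python
import numpy as np

def inception_score(p_yx):
    """IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ) from per-image class posteriors p(y|x)."""
    p_y = p_yx.mean(axis=0)                                   # marginal class distribution
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)  # per-image KL
    return float(np.exp(kl.mean()))

def fid_diag(mu_r, var_r, mu_g, var_g):
    """FID specialized to diagonal covariances (sqrtm becomes elementwise sqrt)."""
    diff = mu_r - mu_g
    return float(diff @ diff + np.sum(var_r + var_g - 2.0 * np.sqrt(var_r * var_g)))

# Confident, diverse posteriors push IS toward the number of classes actually used.
p_yx = np.array([[0.998, 0.001, 0.001],
                 [0.001, 0.998, 0.001]])
print(inception_score(p_yx))              # close to 2

# Identical feature statistics give FID = 0; shifting the generated mean raises it.
mu, var = np.zeros(4), np.ones(4)
print(fid_diag(mu, var, mu, var))         # 0.0
print(fid_diag(mu, var, mu + 1.0, var))   # 4.0
```

In practice both quantities are estimated from the pool features of a pretrained Inception network over thousands of samples; the toy inputs here only demonstrate the arithmetic.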
7. Significance and Extension to BanglaWriting
Stacking the CVAE with the CGAN leverages the coverage of latent mode-space by the VAE and adversarial sharpening by the GAN. Conditioning augmentation further moderates the mapping from Bangla text to images, thus facilitating output diversity and mitigating mode collapse. The architecture's modularity and explicitly conditioned latent structure, when adapted for Bangla embeddings and appropriate image corpora, can synthesize a broad array of glyph styles and layouts. This suggests strong applicability to digital handwriting synthesis, font generation, or automatic Bangla calligraphy rendering (Tibebu et al., 2022).
A plausible implication is that by training on a curated Bangla writing dataset and employing character or stroke-level embeddings within the CA framework, researchers can obtain image generators capturing the stylistic complexity of Bangla scripts with both high fidelity and expressive variability.