
Conditional VAE-GAN for BanglaWriting

Updated 8 January 2026
  • The paper introduces a two-stage Conditional VAE-GAN pipeline that combines latent diversity with adversarial refinement to synthesize Bangla writing images.
  • The CVAE stage creates coarse sketches using text embeddings, while the CGAN stage upsamples and sharpens these sketches with conditioning augmentation.
  • Evaluation using Inception Score and FID demonstrates the model's ability to produce diverse and photorealistic Bangla handwriting images.

Conditional VAE–GAN for BanglaWriting refers to a two-stage neural architecture adapted for synthesizing images of Bangla writing conditioned on textual input. The methodology combines the strengths of Conditional Variational Autoencoders (CVAE) and Conditional Generative Adversarial Networks (CGAN) in a sequential manner, enabling both diversity in generated stroke layouts and photorealistic sharpness. The architecture extends core principles introduced in the context of text-to-image synthesis for natural images to the domain of Bangla character or handwriting image generation (Tibebu et al., 2022).

1. Two-Stage Conditional Pipeline

The CVAE–CGAN framework decomposes BanglaWriting synthesis into two stages. Stage I leverages a CVAE to generate a coarse, low-resolution "sketch" reflecting global structural cues given by the textual embedding. Stage II employs a CGAN to upsample and refine the sketch, yielding high-resolution, realistic images aligned with the semantic content and stylistic variability of Bangla script. A plausible implication is that this modular design allows controlled diversity in morphological features (via latent-variable sampling in the CVAE), while GAN-based refinement ensures crisp, visually plausible glyphs representative of handwritten or printed Bangla.
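The two-stage data flow can be sketched at the shape level. The functions below are deterministic stand-ins for the real networks: all names and operations are illustrative, matching only the $64\times 64$ and $256\times 256$ resolutions stated in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

def cvae_sketch(text_embedding: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Stage I stand-in: map (text embedding, latent z) to a coarse 64x64x3 sketch.
    A real CVAE decoder is a deconvolutional network; here we broadcast a
    deterministic scalar function of the inputs to the right shape."""
    seed = np.tanh(text_embedding.sum() + z.sum())
    return np.full((64, 64, 3), seed)

def cgan_refine(sketch: np.ndarray, text_embedding: np.ndarray) -> np.ndarray:
    """Stage II stand-in: upsample the sketch 4x to 256x256x3 and nudge it
    with the text conditioning, clipping to a Tanh-style output range."""
    up = sketch.repeat(4, axis=0).repeat(4, axis=1)  # nearest-neighbour upsample
    return np.clip(up + 0.01 * np.tanh(text_embedding.mean()), -1.0, 1.0)

phi_t = rng.normal(size=128)        # text embedding (illustrative size)
z = rng.normal(size=100)            # latent noise, z ~ N(0, I)
sketch = cvae_sketch(phi_t, z)      # (64, 64, 3)
image = cgan_refine(sketch, phi_t)  # (256, 256, 3)
```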

Stage I: Conditional VAE (Sketch Generator)

  • Inputs: $x$, a low-resolution Bangla writing image ($64\times 64\times 3$), and $\phi_t$, the text embedding.
  • The encoder fuses $\phi_t$ with image features and outputs a Gaussian latent $z \sim q_\phi(z\mid x,c)$, where $c$ is the conditioning vector.
  • The decoder reconstructs the image given $(z, c)$, enforcing output diversity via the KL-divergence regularization.
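Sampling $z$ from the encoder's Gaussian is typically done with the reparameterization trick so that gradients can flow through the stochastic step; a minimal numpy sketch (the 2048-D latent size is taken from the architectural specifications later in this article):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the randomness
    outside the (mu, log_var) path so the sampling step stays differentiable."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros(2048)        # encoder mean output
log_var = np.zeros(2048)   # encoder log-variance output
z = reparameterize(mu, log_var, rng)  # here z ~ N(0, I), since mu=0, sigma=1
```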

Stage II: Conditional GAN (Refinement Network)

  • Inputs: the CVAE output sketch $\tilde x$ and the text conditioning $c$.
  • The generator $G$ upsamples and sharpens $\tilde x$ to $256\times 256$ resolution.
  • The discriminator $D$ evaluates real vs. generated images, each concatenated with the text code $c$.
  • Residual blocks and "conditioning augmentation" modules in $G$ stabilize training and enhance details.
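The residual blocks mentioned above follow the standard $y = x + F(x)$ pattern; a toy fully connected version in numpy, with weights that are illustrative stand-ins rather than anything from the cited work:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = x + F(x): the skip connection keeps gradients flowing even when
    the learned branch F contributes little early in training."""
    h = relu(x @ w1)
    return x + h @ w2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))            # batch of 4 feature vectors
w1 = rng.normal(size=(64, 64)) * 0.01
w2 = np.zeros((64, 64))                 # zero branch: block acts as the identity
y = residual_block(x, w1, w2)
```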

2. Objective Functions and Training Losses

The system’s objective functions reflect its dual purpose:

CVAE Loss (Stage I)

The Evidence Lower Bound is

$$\mathcal{L}_{\text{CVAE}} = -\mathbb{E}_{z\sim q_\phi(z\mid x,c)}\big[\log p_\theta(x\mid z,c)\big] + \mathrm{KL}\big(q_\phi(z\mid x,c)\,\big\|\,p(z\mid c)\big)$$

with reconstruction and KL terms driving accurate sketch recovery and latent diversity, respectively.
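With a diagonal-Gaussian posterior and a standard-normal prior (a common simplification of $p(z\mid c)$), both terms have simple closed forms. The numpy sketch below uses an MSE reconstruction term, which assumes a Gaussian likelihood up to constants; the actual likelihood model in the cited work may differ.

```python
import numpy as np

def kl_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def cvae_loss(x, x_recon, mu, log_var):
    """Negative ELBO: reconstruction error plus KL regularizer."""
    recon = np.sum((x - x_recon) ** 2)  # Gaussian log-likelihood up to constants
    return recon + kl_standard_normal(mu, log_var)

# Sanity check: perfect reconstruction with the posterior equal to the prior
x = np.ones((64, 64, 3))
loss = cvae_loss(x, x, np.zeros(2048), np.zeros(2048))  # -> 0.0
```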

CGAN Loss (Stage II)

The discriminator aims to maximize

$$\mathcal{L}_D = \mathbb{E}_{(x,t)\sim p_{\text{data}}}\big[\log D(x,c)\big] + \mathbb{E}_{z,t}\big[\log\big(1-D(G(z,c),c)\big)\big]$$

while the generator minimizes the adversarial loss plus a KL regularizer on the conditioning augmentation vector $c$:

$$\mathcal{L}_G = \mathbb{E}_{z,t}\big[\log\big(1-D(G(z,c),c)\big)\big] + \lambda_{\text{CA}}\,\mathrm{KL}\big(\mathcal{N}(\mu_c,\Sigma_c)\,\big\|\,\mathcal{N}(0,I)\big)$$

This enforces both semantic fidelity and distributional robustness for text conditioning.
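Both objectives can be evaluated directly from discriminator outputs; a numpy sketch, where the clamping epsilon is an added implementation detail for numerical safety and the conditioning statistics are placeholders:

```python
import numpy as np

EPS = 1e-8  # avoid log(0)

def d_objective(d_real, d_fake):
    """L_D = E[log D(x,c)] + E[log(1 - D(G(z,c),c))] -- maximized by D."""
    return np.mean(np.log(d_real + EPS)) + np.mean(np.log(1.0 - d_fake + EPS))

def g_objective(d_fake, mu_c, log_var_c, lam_ca=1.0):
    """L_G = E[log(1 - D(G(z,c),c))] + lambda_CA * KL(N(mu_c,Sigma_c) || N(0,I))."""
    kl = 0.5 * np.sum(np.exp(log_var_c) + mu_c**2 - 1.0 - log_var_c)
    return np.mean(np.log(1.0 - d_fake + EPS)) + lam_ca * kl

d_real = np.array([0.9, 0.8])   # D's scores on real images
d_fake = np.array([0.1, 0.2])   # D's scores on generated images
ld = d_objective(d_real, d_fake)
lg = g_objective(d_fake, np.zeros(128), np.zeros(128))
```

A perfect discriminator ($D$ outputs 1 on real, 0 on fake) drives $\mathcal{L}_D$ toward its maximum of 0, which is the sanity check used below.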

3. Conditioning Augmentation Mechanism

Conditioning Augmentation (CA) injects controlled stochasticity:

  • The text embedding $\phi_t$ (from a char-CNN+RNN; for Bangla, a BiLSTM or Transformer encoder is applicable) is mapped to $\mu_c(\phi_t)$ and $\log\Sigma_c(\phi_t)$ by dense layers.
  • The conditioning vector $c$ is sampled as $c \sim \mathcal{N}(\mu_c, \mathrm{diag}(\Sigma_c))$ and injected into both the CVAE and CGAN stages.
  • Regularization via $\mathrm{KL}\big(\mathcal{N}(\mu_c,\Sigma_c)\,\big\|\,\mathcal{N}(0,I)\big)$ mitigates mode collapse and keeps $c$ well behaved.
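The three steps above can be condensed into a single function; a numpy sketch in which the dense-layer weights and the embedding/conditioning dimensions (1024 and 128) are illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, C_DIM = 1024, 128   # text-embedding and conditioning dims (assumed)

W_mu = rng.normal(size=(EMB, C_DIM)) * 0.01   # stand-in for a learned dense layer
W_lv = rng.normal(size=(EMB, C_DIM)) * 0.01   # stand-in for a learned dense layer

def conditioning_augmentation(phi_t):
    """Map the text embedding to a Gaussian and sample the conditioning code c,
    returning both the sample and the KL regularizer toward N(0, I)."""
    mu_c = phi_t @ W_mu
    log_var_c = phi_t @ W_lv
    eps = rng.normal(size=C_DIM)
    c = mu_c + np.exp(0.5 * log_var_c) * eps   # reparameterized sample
    kl = 0.5 * np.sum(np.exp(log_var_c) + mu_c**2 - 1.0 - log_var_c)
    return c, kl

phi_t = rng.normal(size=EMB)
c, kl_reg = conditioning_augmentation(phi_t)
```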

A plausible implication is that incorporating CA for per-character or per-stroke embeddings enables the generation pipeline to model subtle variations and maintain rich output diversity in Bangla glyph structure.

4. Architectural Specifications

Both stages employ convolutional and residual blocks, batch normalization, and activation strategies optimized for stability and expressivity.

Stage I—CVAE

  • Encoder: 5 convolutions ($5\times 5$, stride 2) downsample to a $4\times 4$ spatial size; the text embedding is reshaped and concatenated; the outputs are 2048-D latent vectors $\mu_z$ and $\log\sigma_z^2$.
  • Decoder: the latent $z$ concatenated with the text code is passed through successive Conv2DTranspose layers, upsampling back to $64\times 64$, with a final Tanh activation.
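The Conv2DTranspose chain can be sanity-checked with the standard transposed-convolution output-size formula, out = (in − 1)·stride − 2·pad + kernel. The kernel/stride/padding values below (4, 2, 1) are a common choice that doubles resolution each layer, not figures from the cited work:

```python
def conv_transpose_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of a transposed convolution (no output_padding)."""
    return (size - 1) * stride - 2 * pad + kernel

# Doubling 4x4 bottleneck features back up to the 64x64 sketch resolution:
sizes = [4]
while sizes[-1] < 64:
    sizes.append(conv_transpose_out(sizes[-1]))
# sizes == [4, 8, 16, 32, 64]
```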

Stage II—CGAN

  • Generator: downsampling convolutions, residual blocks at the bottleneck, and upsampling to $256\times 256$.
  • Discriminator: 5 convolutional blocks down to $4\times 4$ features; concatenation with the text code $c$; a dense layer with sigmoid output; LeakyReLU activations.

For BanglaWriting generation, per-character or per-stroke text embeddings are utilized in place of English-language descriptors, implying strong adaptability to complex script morphologies.

5. Training Protocols and Hyperparameters

Optimization details include:

  • Noise $z \in \mathbb{R}^{100}$, drawn from $\mathcal{N}(0, I)$.
  • Conditioning vector $c$ dimensionality of 128–512.
  • Adam optimizer; CVAE learning rate $2\times 10^{-4}$, CGAN learning rate $2\times 10^{-3}$, decayed every 25 epochs.
  • Epochs: 150 for each stage.
  • Batch sizes of 64–128 (not specified in the cited work).
  • $\lambda_{\text{CA}}$ for the CA-KL regularization is set via cross-validation.
  • ReLU activations in the generators, LeakyReLU ($\alpha = 0.2$) in the discriminator.

This protocol ensures robust convergence and regularization across both diversity and realism objectives.
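The step decay implied by "decayed every 25 epochs" can be written as a one-liner; note the decay factor is not given in the source, so the 0.5 used here is an assumption:

```python
def lr_at(epoch, base_lr, decay=0.5, step=25):
    """Step decay: multiply the learning rate by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)

cvae_lr = [lr_at(e, 2e-4) for e in (0, 25, 50)]   # 2e-4, 1e-4, 5e-5
cgan_lr = lr_at(149, 2e-3)                        # last epoch of the 150
```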

6. Quantitative Evaluation Metrics

The system outputs are assessed as follows:

  • Inception Score (IS): $\mathrm{IS} = \exp\big(\mathbb{E}_x\,\mathrm{KL}\big[p(y\mid x)\,\big\|\,p(y)\big]\big)$, quantifying both image diversity and confidence.
  • FrĆ©chet Inception Distance (FID):

$$\mathrm{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}\big)$$

Lower FID values indicate closer correspondence between synthesized and real Bangla writing image distributions.
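Both metrics follow directly from these definitions; a numpy sketch that computes IS from per-image class-probability rows and FID from Gaussian feature statistics, obtaining $\mathrm{Tr}\big((\Sigma_r\Sigma_g)^{1/2}\big)$ from the eigenvalues of $\Sigma_r\Sigma_g$ (which are real and non-negative for PSD factors):

```python
import numpy as np

def inception_score(p_yx):
    """IS = exp( E_x KL[ p(y|x) || p(y) ] ); p_yx has one probability row per image."""
    p_y = p_yx.mean(axis=0)  # marginal class distribution
    kl = np.sum(p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12)), axis=1)
    return float(np.exp(kl.mean()))

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^{1/2})."""
    diff = mu_r - mu_g
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sum(np.sqrt(np.maximum(eigvals.real, 0.0)))
    return float(diff @ diff + np.trace(sigma_r + sigma_g) - 2.0 * tr_sqrt)

# Metric floors: uniform predictions give IS = 1; identical Gaussians give FID = 0.
p_uniform = np.full((10, 5), 0.2)
iden = np.eye(4)
```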

These metrics enable direct empirical comparison with prior art (e.g., StackGAN, GAN-INT-CLS).

7. Significance and Extension to BanglaWriting

Stacking CVAE with CGAN leverages the coverage of latent mode-space by the VAE and adversarial sharpening by the GAN. Conditioning augmentation further moderates the mapping from Bangla text to images, thus facilitating output diversity and mitigating mode collapse. The architecture’s modularity and explicitly conditioned latent structure—when adapted for Bangla embeddings and appropriate image corpora—can synthesize a broad array of glyph styles and layouts. This suggests strong applicability to digital handwriting synthesis, font generation, or automatic Bangla calligraphy rendering (Tibebu et al., 2022).

A plausible implication is that by training on a curated Bangla writing dataset and employing character or stroke-level embeddings within the CA framework, researchers can obtain image generators capturing the stylistic complexity of Bangla scripts with both high fidelity and expressive variability.
