Conditional VAE-GAN for BanglaWriting
- The paper introduces a two-stage Conditional VAE-GAN pipeline that combines latent diversity with adversarial refinement to synthesize Bangla writing images.
- The CVAE stage creates coarse sketches using text embeddings, while the CGAN stage upsamples and sharpens these sketches with conditioning augmentation.
- Evaluation using Inception Score and FID demonstrates the model's ability to produce diverse and photorealistic Bangla handwriting images.
Conditional VAE-GAN for BanglaWriting refers to a two-stage neural architecture adapted for synthesizing images of Bangla writing conditioned on textual input. The methodology combines the strengths of Conditional Variational Autoencoders (CVAE) and Conditional Generative Adversarial Networks (CGAN) in a sequential manner, enabling both diversity in generated stroke layouts and photorealistic sharpness. The architecture extends core principles introduced in the context of text-to-image synthesis for natural images to the domain of Bangla character or handwriting image generation (Tibebu et al., 2022).
1. Two-Stage Conditional Pipeline
The CVAE-CGAN framework decomposes BanglaWriting synthesis into two stages. Stage I leverages the CVAE to generate a coarse, low-resolution "sketch" reflecting global structural cues given by the textual embedding. Stage II employs the CGAN to upsample and refine the sketch, yielding high-resolution, realistic images aligned to the semantic content and stylistic variability of Bangla script. A plausible implication is that this modular design allows controlled diversity in morphological features (via latent variable sampling in the CVAE), while GAN-based refinement ensures crisp, visually plausible glyphs representative of handwritten or printed Bangla.
Stage I: Conditional VAE (Sketch Generator)
- Inputs: $x$, a low-resolution Bangla writing image; $\varphi_t$, the text embedding.
- Encoder fuses $\varphi_t$ and image features and outputs a Gaussian latent $z \sim \mathcal{N}(\mu, \sigma^2)$ conditioned on $\varphi_t$.
- Decoder reconstructs the image $\hat{x}$ given $(z, \varphi_t)$, with KL-divergence regularization enforcing output diversity.
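The latent sampling step above uses the standard reparameterization trick. A minimal NumPy sketch (batch size and the 2048-D latent width are illustrative; the encoder/decoder networks themselves are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the draw differentiable w.r.t. (mu, sigma)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Illustrative encoder outputs for a batch of 4 images, 2048-D latent.
mu = np.zeros((4, 2048))
log_var = np.zeros((4, 2048))   # log-variance 0 => sigma = 1
z = reparameterize(mu, log_var, rng)
print(z.shape)                  # (4, 2048)
```

Because the noise enters additively, gradients can flow through `mu` and `log_var` during training, which is what makes the KL-regularized latent usable inside Stage I.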
Stage II: Conditional GAN (Refinement Network)
- Inputs: the CVAE output sketch $\hat{x}$ and the text conditioning $\varphi_t$.
- Generator $G$ upsamples and sharpens the sketch to the target resolution.
- Discriminator $D$ evaluates real vs. generated images, concatenated with the text code $\varphi_t$.
- Residual blocks and "conditioning augmentation" modules in $G$ stabilize training and enhance details.
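Concatenating the text code with discriminator features requires replicating the code across the spatial grid. A small NumPy sketch of that operation (NHWC layout, and the 512/128 channel widths, are assumptions for illustration):

```python
import numpy as np

def concat_text_code(features, text_code):
    """Spatially replicate a per-image text code and concatenate it onto conv features (NHWC)."""
    n, h, w, c = features.shape
    tiled = np.broadcast_to(text_code[:, None, None, :],
                            (n, h, w, text_code.shape[-1]))
    return np.concatenate([features, tiled], axis=-1)

feats = np.zeros((2, 4, 4, 512))   # illustrative discriminator feature map
phi_t = np.zeros((2, 128))         # illustrative text embedding
out = concat_text_code(feats, phi_t)
print(out.shape)                   # (2, 4, 4, 640)
```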
2. Objective Functions and Training Losses
The system's objective functions reflect its dual purpose:
CVAE Loss (Stage I)
The Evidence Lower Bound is

$$\mathcal{L}_{\text{CVAE}} = \mathbb{E}_{q(z \mid x, \varphi_t)}\!\left[\log p(x \mid z, \varphi_t)\right] - D_{\mathrm{KL}}\!\left(q(z \mid x, \varphi_t) \,\|\, p(z)\right),$$

with the reconstruction and KL terms driving accurate sketch recovery and latent diversity, respectively.
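The two ELBO terms can be computed in closed form for a diagonal Gaussian posterior and a Gaussian likelihood. A NumPy sketch (the MSE reconstruction term is a common Gaussian-likelihood proxy, not a detail taken from the cited work):

```python
import numpy as np

def kl_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def cvae_loss(x, x_hat, mu, log_var):
    """Negative ELBO: MSE reconstruction plus KL regularizer."""
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    return recon + kl_standard_normal(mu, log_var)

x = np.ones((1, 8)); x_hat = np.ones((1, 8))      # perfect reconstruction
mu = np.zeros((1, 4)); log_var = np.zeros((1, 4)) # posterior equals the prior
print(cvae_loss(x, x_hat, mu, log_var))           # [0.]
```

At the optimum shown (exact reconstruction, posterior matching the prior) both terms vanish; pushing `mu` away from zero raises the KL term, which is what sustains latent diversity.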
CGAN Loss (Stage II)
The Discriminator $D$ aims to maximize

$$\mathcal{L}_D = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x, \varphi_t)\right] + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z, \hat{c}), \varphi_t)\right)\right],$$

while the Generator $G$ minimizes the adversarial loss plus a KL regularizer on the conditioning augmentation vector $\hat{c}$:

$$\mathcal{L}_G = \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z, \hat{c}), \varphi_t)\right)\right] + \lambda\, D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\right).$$

This enforces both semantic fidelity and distributional robustness for text conditioning.
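The Stage II losses reduce to scalar computations once the discriminator scores are known. A NumPy sketch (the generator uses the non-saturating $-\log D$ form, a common practical substitute for minimizing $\log(1-D)$; $\lambda = 1$ is an arbitrary choice here):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator maximizes log D(x) + log(1 - D(G(z))); we return the negative to minimize."""
    return float(-(np.log(d_real) + np.log(1.0 - d_fake)).mean())

def g_loss(d_fake, mu_c, log_var_c, lam=1.0):
    """Generator: non-saturating adversarial term plus KL on the CA distribution."""
    kl = 0.5 * np.sum(np.exp(log_var_c) + mu_c**2 - 1.0 - log_var_c, axis=-1)
    return float((-np.log(d_fake) + lam * kl).mean())

d_real = np.array([0.9])                 # D is fairly sure the real image is real
d_fake = np.array([0.1])                 # ...and that the fake is fake
mu_c = np.zeros((1, 4)); log_var_c = np.zeros((1, 4))
print(d_loss(d_real, d_fake))
print(g_loss(d_fake, mu_c, log_var_c))
```

With the CA distribution matching the prior, the KL term is zero and the generator loss is purely adversarial; a confident discriminator (low `d_fake`) yields a large generator gradient, which is the point of the non-saturating form.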
3. Conditioning Augmentation Mechanism
Conditioning Augmentation (CA) injects controlled stochasticity:
- The text embedding $\varphi_t$ (from a char-CNN+RNN; for Bangla, a BiLSTM or Transformer encoder is applicable) is mapped to $\mu(\varphi_t)$ and $\sigma(\varphi_t)$ by dense layers.
- The conditioning vector is sampled as $\hat{c} = \mu(\varphi_t) + \sigma(\varphi_t) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and injected into both CVAE and CGAN stages.
- Regularization via $D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\right)$ mitigates mode collapse and keeps $\hat{c}$ well-behaved.
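The CA module is a pair of dense maps followed by the same reparameterized sampling. A NumPy sketch (the 256-D embedding, 128-D conditioning vector, weight scales, and the omitted biases are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def conditioning_augmentation(phi_t, W_mu, W_sig, rng):
    """Map a text embedding to (mu, sigma) via dense layers and sample c_hat = mu + sigma * eps."""
    mu = phi_t @ W_mu                # dense layer (bias omitted for brevity)
    sigma = np.exp(phi_t @ W_sig)    # exp keeps the std-dev positive
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

phi_t = rng.standard_normal((2, 256))          # illustrative 256-D text embeddings
W_mu = 0.01 * rng.standard_normal((256, 128))
W_sig = 0.01 * rng.standard_normal((256, 128))
c_hat = conditioning_augmentation(phi_t, W_mu, W_sig, rng)
print(c_hat.shape)                             # (2, 128)
```

Each forward pass draws a different $\hat{c}$ for the same text, which is the source of the controlled stochasticity described above.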
A plausible implication is that incorporating CA for per-character or per-stroke embeddings enables the generation pipeline to model subtle variations and maintain rich output diversity in Bangla glyph structure.
4. Architectural Specifications
Both stages employ convolutional and residual blocks, batch normalization, and activation strategies optimized for stability and expressivity.
Stage I: CVAE
- Encoder: 5 convolutions (5×5, stride 2) downsample the image to a small spatial size; the text embedding $\varphi_t$ is reshaped and concatenated; the output is a 2048-D latent parameterization $(\mu, \sigma)$.
- Decoder: the concatenated latent $z$ and text code $\varphi_t$ are passed through successive Conv2DTranspose layers, upsampling back to the input resolution, with a final Tanh activation.
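Each stride-2 convolution roughly halves the spatial size, so the encoder's output resolution follows directly from the input size. A small sketch (the 64×64 input is an assumption; the cited resolution is not recoverable from this text):

```python
def downsampled_size(size, n_layers, stride=2):
    """Spatial size after n stride-2, 'same'-padded convolutions (ceil division per layer)."""
    for _ in range(n_layers):
        size = -(-size // stride)   # integer ceil(size / stride)
    return size

# Illustrative: a 64x64 input after the five stride-2 encoder convolutions.
print(downsampled_size(64, 5))   # 2
```

The decoder's Conv2DTranspose stack simply inverts this schedule, doubling the spatial size at each layer until the input resolution is restored.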
Stage II: CGAN
- Generator: downsampling convolutions, residual blocks at the bottleneck, then upsampling to the target resolution.
- Discriminator: 5 convolutional blocks reduce the image to compact features; concatenation with the text code $\varphi_t$; a dense layer with sigmoid output; LeakyReLU activations.
For BanglaWriting generation, per-character or per-stroke text embeddings are utilized in place of English-language descriptors, implying strong adaptability to complex script morphologies.
5. Training Protocols and Hyperparameters
Optimization details include:
- Noise prior $z \sim \mathcal{N}(0, I)$.
- Conditioning vector dimensionality $128$–$512$.
- Adam optimizer for both generator and discriminator, with learning rates decayed every $25$ epochs.
- Epochs: $150$ for each stage.
- Batch sizes of $64$–$128$ (not specified in the cited work).
- KL weight $\lambda$ for the CA regularizer set via cross-validation.
- ReLU activations in the generators, LeakyReLU in the discriminator.
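A step-decay schedule of the kind described reduces the learning rate at fixed epoch intervals. A minimal sketch (the base rate of $2 \times 10^{-4}$ and the halving factor are assumptions; only the 25-epoch interval comes from the text above):

```python
def decayed_lr(base_lr, epoch, decay_every=25, factor=0.5):
    """Step decay: multiply the learning rate by `factor` once per `decay_every` epochs."""
    return base_lr * factor ** (epoch // decay_every)

print(decayed_lr(2e-4, 0))    # 0.0002
print(decayed_lr(2e-4, 24))   # still 0.0002 (first interval not yet complete)
print(decayed_lr(2e-4, 50))   # two decays applied
```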
This protocol ensures robust convergence and regularization across both diversity and realism objectives.
6. Quantitative Evaluation Metrics
The system outputs are assessed as follows:
- Inception Score (IS): $\mathrm{IS} = \exp\!\left(\mathbb{E}_{x}\!\left[D_{\mathrm{KL}}\!\left(p(y \mid x) \,\|\, p(y)\right)\right]\right)$, quantifying both image diversity and classifier confidence.
- Fréchet Inception Distance (FID): $\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$, where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are Inception-feature statistics of real and generated images. Lower FID values indicate closer correspondence between synthesized and real Bangla writing image distributions.
These metrics enable direct empirical comparison with prior art (e.g., StackGAN, GAN-INT-CLS).
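Both metrics are straightforward to compute from classifier posteriors and feature statistics. A NumPy sketch (the FID version below assumes diagonal covariances, so the matrix square root reduces to an elementwise one; the toy posteriors are fabricated for illustration only):

```python
import numpy as np

def inception_score(p_yx):
    """IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ) from per-image class posteriors p(y|x)."""
    p_y = p_yx.mean(axis=0)                                   # marginal class distribution
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)  # per-image KL
    return float(np.exp(kl.mean()))

def fid_diag(mu_r, var_r, mu_g, var_g):
    """FID specialized to diagonal covariances (sqrtm becomes elementwise sqrt)."""
    diff = mu_r - mu_g
    return float(diff @ diff + np.sum(var_r + var_g - 2.0 * np.sqrt(var_r * var_g)))

# Confident, diverse posteriors push IS toward the number of classes actually used.
p_yx = np.array([[0.998, 0.001, 0.001],
                 [0.001, 0.998, 0.001]])
print(inception_score(p_yx))              # close to 2

# Identical feature statistics give FID = 0; shifting the generated mean raises it.
mu, var = np.zeros(4), np.ones(4)
print(fid_diag(mu, var, mu, var))         # 0.0
print(fid_diag(mu, var, mu + 1.0, var))   # 4.0
```

In practice both quantities are estimated from the pool features of a pretrained Inception network over thousands of samples; the toy inputs here only demonstrate the arithmetic.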
7. Significance and Extension to BanglaWriting
Stacking the CVAE with the CGAN leverages the coverage of latent mode-space by the VAE and adversarial sharpening by the GAN. Conditioning augmentation further moderates the mapping from Bangla text to images, thus facilitating output diversity and mitigating mode collapse. The architecture's modularity and explicitly conditioned latent structure, when adapted for Bangla embeddings and appropriate image corpora, can synthesize a broad array of glyph styles and layouts. This suggests strong applicability to digital handwriting synthesis, font generation, or automatic Bangla calligraphy rendering (Tibebu et al., 2022).
A plausible implication is that by training on a curated Bangla writing dataset and employing character or stroke-level embeddings within the CA framework, researchers can obtain image generators capturing the stylistic complexity of Bangla scripts with both high fidelity and expressive variability.