
DeepGen 1.0: Innovations in Depth, Image & Text

Updated 18 February 2026
  • DeepGen 1.0 is a suite of frameworks that leverage diffusion-based methods to excel in depth estimation, multimodal image creation, and dynamic text generation.
  • It employs novel techniques such as step-unrolled denoising, stacked channel bridging with 'think tokens', and k-DPP sampling to optimize performance in diverse tasks.
  • The systems achieve significant gains in practical metrics—from enhanced depth precision and improved CTR to competitive quality with reduced parameter counts.

DeepGen 1.0 designates several distinct, high-impact frameworks that have advanced deep learning in depth estimation, multimodal image generation and editing, and large-scale real-time text generation. Notably, the term “DeepGen 1.0” refers to (i) a denoising diffusion-based monocular depth estimator, (ii) a state-of-the-art lightweight unified multimodal image creation and editing model, and (iii) a production-scale system for sponsored search advertisement generation. Despite domain differences, each system introduces significant architectural innovations, training protocols, and empirical results, and has spurred methodological developments in its area.

1. DeepGen 1.0 for Monocular Depth Estimation

The DeepGen 1.0 system (Saxena et al., 2023) for monocular depth estimation formulates the task as conditional denoising diffusion, adapting generative denoising approaches for pixel-wise depth prediction from single RGB images. It employs an Efficient U-Net backbone inherited from Imagen/Palette and is formulated as follows:

Given a clean image $x \in \mathbb{R}^{H \times W \times 3}$, a noisy depth map $y_t \in \mathbb{R}^{H \times W \times 1}$, and a time index $t \in [0,1]$, training proceeds by infilling missing pixels in the ground-truth $y_0$ using nearest-neighbor interpolation (indoors) or sky masking (outdoors). Artificial noise, sampled as $\epsilon \sim \mathcal{N}(0, I)$, is injected:

$$y_t = \sqrt{\gamma_t}\, y_0 + \sqrt{1-\gamma_t}\, \epsilon$$

with $\gamma_t = \prod_{u=1}^{t} \alpha_u$. The U-Net $f_\theta$ predicts either $\epsilon$ (standard DDPM parameterization) or the denoised depth. At inference, samples are drawn iteratively along learned reverse-time transitions:

$$p_\theta(y_{t-1} \mid y_t, x) = \mathcal{N}\big(y_{t-1};\ \mu_\theta(y_t, x, t),\ \Sigma_\theta(t)\big)$$

for $t_K = 1 > t_{K-1} > \dots > t_0 = 0$.
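Concretely, the forward corruption can be sketched in a few lines of NumPy; the linear `alphas` schedule and the depth-map shape below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def noisy_depth(y0, t, alphas, rng=np.random.default_rng(0)):
    """Corrupt a clean depth map y0 at discrete step t.

    gamma_t = prod_{u<=t} alpha_u, then
    y_t = sqrt(gamma_t) * y0 + sqrt(1 - gamma_t) * eps.
    """
    gamma_t = np.prod(alphas[: t + 1])
    eps = rng.standard_normal(y0.shape)
    y_t = np.sqrt(gamma_t) * y0 + np.sqrt(1.0 - gamma_t) * eps
    return y_t, eps, gamma_t

# Illustrative linear schedule over 1000 steps (an assumption).
alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)
y0 = np.zeros((64, 64, 1))          # stand-in for a clean depth map
y_t, eps, gamma_t = noisy_depth(y0, t=500, alphas=alphas)
```

At training time the network would receive `(x, y_t, t)` and regress `eps`; at inference the reverse chain starts from pure noise and applies the learned transitions step by step.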

Several methodological advances are central:

  • Step-unrolled denoising diffusion (SUD): One forward denoising/re-noising step is unrolled in training to resolve the train-inference mismatch arising from noise compounding, reducing latent shift.
  • $L_1$ loss: The objective replaces the $L_2$ loss with an $L_1$ loss, computed only over originally valid pixels:

$$\mathcal{L}_{L_1}(\theta) = \mathbb{E}\big[\,\mathit{valid\_mask} \cdot \left| f_\theta(x, y_t, t) - \epsilon \right|\,\big]$$

This mitigates the effects of depth sensor noise and outlier pixels.

  • Self-supervised pre-training: Pre-training on multi-task image-to-image translation improves data efficiency. Results on NYU Depth v2 indicate REL improvement from 0.081 to 0.075.
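The masked $L_1$ objective reduces to averaging absolute errors over valid pixels only. A minimal sketch, where the mask construction and shapes are illustrative:

```python
import numpy as np

def masked_l1_loss(pred_eps, true_eps, valid_mask):
    """L1 loss restricted to originally valid depth pixels.

    valid_mask is 1 where the sensor returned a depth reading
    and 0 for infilled (interpolated / sky-masked) pixels.
    """
    abs_err = np.abs(pred_eps - true_eps) * valid_mask
    return abs_err.sum() / np.maximum(valid_mask.sum(), 1)

pred = np.ones((4, 4, 1)) * 0.5
true = np.zeros((4, 4, 1))
mask = np.ones((4, 4, 1)); mask[0, 0, 0] = 0   # one infilled pixel
loss = masked_l1_loss(pred, true, mask)         # mean |0.5| over 15 valid pixels
```

Normalizing by the valid-pixel count keeps the loss scale independent of how much of each frame was infilled.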

The approach models the full conditional $p(y_0 \mid x)$, yielding multimodal plausible depth samples, especially in ambiguous regions (e.g., glass, reflective surfaces). This formulation allows zero-shot depth estimation and enables a text-to-3D pipeline by leveraging RGB samples from text-to-image diffusion and infilling missing views via joint generation.

Performance details:

| Dataset | REL | δ < 1.25 | RMS |
|---|---|---|---|
| NYU Depth v2 | 0.074 | 0.946 | 0.315 |
| KITTI (outdoor) | 0.064 | 0.953 | 2.985 |

DeepGen 1.0 achieves SOTA or near-SOTA indoor/outdoor results using only 4–8 posterior samples per image.

2. DeepGen 1.0: Unified Multimodal Generation and Editing

The 2026 DeepGen 1.0 system (Wang et al., 12 Feb 2026) is a unified 5B-parameter model pairing a vision-language model (VLM) with a Diffusion Transformer (DiT), capable of robust image generation, editing, and reasoning at scale, and competitive with models carrying 4×–16× more parameters.

Architecture and Conditioning:

  • VLM: Qwen-2.5-VL (3B) with ViT-based vision encoder, text encoder, and a VAE encoder for mapping images into diffusion latents.
  • Diffusion Backbone: SD3.5-Medium DiT (2B).
  • Stacked Channel Bridging (SCB): Fuses multimodal features extracted from $n = 6$ VLM layers via channel-wise concatenation. Learnable “think tokens” $T \in \mathbb{R}^{L_T \times d}$ are injected and used as a chain-of-thought buffer, yielding richer semantic and reasoning context.
  • Shallow Transformer-encoder: Aggregates channel-stacked features into DiT conditioning tokens.
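At the shape level, the SCB fusion can be sketched as follows; the dimensions, the broadcast of think tokens to the stacked width, and the linear projection standing in for the shallow Transformer encoder are all simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, seq_len, d = 6, 32, 64      # n=6 tapped VLM layers (dims illustrative)
L_T, d_cond = 8, 128                  # think-token count / DiT conditioning dim

# Hidden states tapped from n VLM layers: n arrays of shape (seq_len, d).
vlm_feats = [rng.standard_normal((seq_len, d)) for _ in range(n_layers)]

# Stacked Channel Bridging: channel-wise concatenation -> (seq_len, n*d).
stacked = np.concatenate(vlm_feats, axis=-1)

# Learnable "think tokens" T in R^{L_T x d}, tiled to the stacked width
# and prepended as a chain-of-thought buffer (a simplification).
think = np.tile(rng.standard_normal((L_T, d)), (1, n_layers))
fused = np.concatenate([think, stacked], axis=0)      # (L_T + seq_len, n*d)

# Stand-in for the shallow Transformer encoder: project to DiT tokens.
W = rng.standard_normal((n_layers * d, d_cond)) / np.sqrt(n_layers * d)
cond_tokens = fused @ W                               # (L_T + seq_len, d_cond)
```

The point of the sketch is the data flow: multiple VLM depths contribute channels to every token position, and the think tokens ride along into the DiT conditioning sequence.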

Training Protocol:

  • Stage 1: Alignment Pre-training. Train only the SCB connector and think tokens, with the VLM and DiT backbones frozen, on large-scale data (35M image–text pairs, 6.6M editing triplets) using flow-matching losses.
  • Stage 2: Joint Supervised Fine-tuning. Unfreeze the connector, the DiT, and LoRA adapters in the VLM. Data comprises general generation/editing, reasoning, and text rendering; the flow-matching loss remains primary.
  • Stage 3: RL with MR-GRPO. Optimize generation/editing via group sampling, multi-reward advantage normalization, and a KL-anchored loss to stabilize policy deviation, supplemented by auxiliary SFT loss.
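One plausible reading of the multi-reward advantage computation in MR-GRPO is per-reward standardization within each sampled group, followed by averaging across reward signals; the exact weighting and the KL anchor are omitted in this illustrative sketch:

```python
import numpy as np

def group_advantages(rewards):
    """Multi-reward advantage normalization within one sampled group.

    rewards: (G, R) array -- G generations for one prompt, R reward
    signals (e.g., alignment, quality, fidelity). Each reward column
    is standardized across the group, then columns are averaged into
    a single per-sample advantage.
    """
    mu = rewards.mean(axis=0, keepdims=True)
    sigma = rewards.std(axis=0, keepdims=True) + 1e-8
    return ((rewards - mu) / sigma).mean(axis=1)      # shape (G,)

rewards = np.array([[0.9, 0.2],
                    [0.5, 0.8],
                    [0.1, 0.5]])
adv = group_advantages(rewards)
```

Because each column is centered within the group, the advantages sum to approximately zero, so the policy gradient rewards samples only relative to their group, mirroring GRPO-style baselines.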

Empirical Results:

  • Surpasses 80B HunyuanImage on WISE T2I by 28% (0.73 vs 0.57) and 27B Qwen-Image-Edit on UniREditBench by 37% (77.5 vs 56.5).
  • Demonstrates complex multi-object reasoning and causal/temporal editing capabilities, e.g., correctly depicting temporal evolution (face aging), or causal structure (directional lighting changes).

| Model | Params | WISE Overall | UniREditBench |
|---|---|---|---|
| HunyuanImage 3.0 | 80 B | 0.57 | — |
| Qwen-Image-Edit | 27 B | — | 56.5 |
| DeepGen 1.0 (SFT) | 5 B | 0.72 | 77.5 |
| DeepGen 1.0 (RL) | 5 B | 0.73 | 75.7 |

Notable methodological features:

  • “Think tokens” and SCB maximize the utility of compact models, enabling competitive performance via deep hierarchical feature fusion and chain-of-thought style conditioning.
  • RL stage (MR-GRPO) fuses multiple reward signals to improve alignment, quality, and visual fidelity.

Limitations include slower RL convergence and diminished performance on extremely abstract reasoning and complex text rendering.

3. DeepGen 1.0 for Web-scale Text Generation and Ad Customization

DeepGen 1.0 (Golobokov et al., 2022) denotes a deployed system for generating sponsored search ads at scale for Bing. This system is built on large-scale NLG workflows combining offline parallel abstractive asset generation, rigorous factuality checks, diversity-sampling, and online query customization.

Pipeline:

  • Offline Stage: Crawl advertiser sites; extract content; generate candidate “assets” via (i) extraction models, (ii) abstractive AdCopy transformers (UniLMv2), (iii) controllable guided NLG models. Post-process via keyword, brand, and domain cross-checks for factuality.
  • Diversity Selection: CDSSM-embedded assets are subsampled using k-DPP (Determinantal Point Process) to maximize mutual diversity.
  • Online Stage: At query time, logistic regression rankers and contextual bandits select and stitch assets into final ads, producing position-optimized, query-customized ad units in real time.
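The diversity-selection step can be approximated greedily: instead of exact k-DPP sampling, a farthest-point heuristic over cosine similarities of the (here synthetic) asset embeddings conveys the idea:

```python
import numpy as np

def greedy_diverse_subset(embeddings, k):
    """Greedy farthest-point stand-in for k-DPP selection.

    Repeatedly picks the asset whose embedding has the smallest
    maximum cosine similarity to the already-selected set.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T
    selected = [0]                      # seed with the first asset
    while len(selected) < k:
        max_sim = sim[:, selected].max(axis=1)
        max_sim[selected] = np.inf      # never re-pick a chosen asset
        selected.append(int(max_sim.argmin()))
    return selected

emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
picks = greedy_diverse_subset(emb, k=2)   # picks near-orthogonal assets
```

Exact k-DPP sampling weights subsets by the determinant of their kernel submatrix; the greedy heuristic above captures the same intent — penalizing near-duplicate assets — at a fraction of the cost.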

Neural Model Details:

  • Guided Asset Model: Adds control codes to landing-page input, enabling controllable style/intent.
  • AdCopy Model: Transformer encoder–decoder trained with maximum likelihood; decoding uses beam search with FastSeq optimizations, yielding a 5× speedup.
  • Diversity Metrics: Ensemble + DPP achieves highest distinct-n and lowest redundancy for generated titles.

| Technique | Overall Good | Text Quality | Factuality | Relevance |
|---|---|---|---|---|
| Advertiser-written | 90.7% | 97.9% | 92.7% | 99.0% |
| DeepGen (full x-check) | 96.3% | 100% | 97.0% | 99.7% |

A/B tests on Bing's ad traffic:

  • DeepGen delivered 13.28% CTR uplift and ∼25% RPM improvement over extraction-only baselines.
  • Query-specific online stitching increased Impression Yield (+14.43%) and RPM (+10.65%), confirming the ad auction benefit.

4. Technical Innovations and Methodological Advances

Across the DeepGen 1.0 lineage, central technical contributions include:

  • Diffusion-based conditional generative models adapted to structured pixelwise regression (depth) and high-dimensional image synthesis using denoising score matching frameworks.
  • Stacked Channel Bridging (SCB) and “think tokens” as a novel VLM→diffusion alignment protocol, enabling compact model utility for reasoning-augmented multimodal generation (Wang et al., 12 Feb 2026).
  • k-DPP sampling for scalable diversity maximization in discrete sequence generation (Golobokov et al., 2022).
  • Step-unrolled diffusion and infilling strategies for handling sensor-impaired labels and missing data in joint generative training (Saxena et al., 2023).
  • Multi-reward RL with MR-GRPO to directly optimize alignment with human preference and multi-faceted quality signals in image diffusion.

These techniques have enabled both domain-agnostic, generalist models (image, text, multimodal) and fine-tuned systems (depth, search ads) to achieve SOTA results with reduced parameters or compute.

5. Limitations and Prospective Directions

Known limitations are:

  • DeepGen for depth estimation: uncertainty persists in inherently ambiguous cases despite multimodal posteriors; handling dynamic occlusion and transparent media remains challenging (Saxena et al., 2023).
  • Unified multimodal model: Performance deficit remains on highly abstract world knowledge, advanced scientific reasoning, and exotic layout rendering (Wang et al., 12 Feb 2026).
  • Industrial ad generation: While factuality improves with cross-checkers, highly dynamic site content or nonstandard web syntax can hinder automatic extraction (Golobokov et al., 2022).

Prospective advances include:

  • Automatic/learned feature selection within SCB to improve parameter efficiency.
  • Retrieval-augmented VLM backbones integrating external factual knowledge for narrowing performance gaps on out-of-distribution queries or prompts.
  • Multi-stage chain-of-thought mechanisms spanning diffusion denoising steps for finer reasoning.
  • Lightweight RL algorithms (e.g., DPO) to further reduce alignment and compute costs.

6. Context, Impact, and Research Trajectory

The DeepGen 1.0 name captures a methodological convergence. Across modalities, DeepGen systems have:

  • Elevated the efficiency frontier in depth estimation, generation, and customization.
  • Demonstrated that targeted data-centric curriculum design and architectural alignment (SCB, think tokens, SUD, k-DPP) can compensate for scale, delivering top-tier empirical results with 5B–7B parameters.
  • Established new deployment paradigms for real-time, large-scale content generation in both research and production.

The frameworks introduced in the corresponding papers (Saxena et al., 2023; Wang et al., 12 Feb 2026; Golobokov et al., 2022) have directly influenced downstream research agendas, including efficient multimodal Transformer-diffusion hybrids, data-efficient pretraining strategies, and robust factuality and diversity controls for responsible generation at scale.
