Vision Banana: Unified Generative Vision

Updated 24 April 2026

Vision Banana is a unified vision model that reframes tasks—such as segmentation, depth, and normal estimation—as image-to-image generation using invertible RGB mappings.
It leverages instruction-tuning on Nano Banana Pro by mixing generative data with vision task triplets, creating a single interface for perception and generation.
Empirical results demonstrate state-of-the-art performance across multiple benchmarks, while also highlighting challenges in iterative editing and occluded object detection.

Vision Banana is a generalist vision model obtained by instruction-tuning the Nano Banana Pro (NBP) latent diffusion image generator on a joint mixture of its original generative data and a compact corpus of vision task supervision. This approach reframes classical perception problems as image-to-image generation tasks, exporting vision outputs—such as segmentation masks, depth maps, and surface normals—as strictly invertible RGB visualizations. Vision Banana demonstrates that state-of-the-art vision understanding can emerge from a large-scale generative model with minimal changes, bridging image creation and perception within a unified interface and establishing new paradigms for foundational vision models (Gabeur et al., 22 Apr 2026).

1. Model Lineage and Theoretical Foundations

Nano Banana Pro was introduced by Google DeepMind as a flagship text-to-image and image-editing latent diffusion model, structurally derived from Gemini 3 Pro. NBP operates in a learned latent space, utilizing a U-Net denoiser with hierarchical cross-attention layers for text conditioning. No internal architectural parameters, such as exact depth, width, or attention mechanisms, are publicly disclosed: the model is accessed exclusively as a black-box API (Tang et al., 3 Apr 2026). NBP’s large-scale generative pretraining on web-scale image–text pairs enables the emergence of general visual representations analogous to the implicit language understanding seen in LLMs.

The hypothesis driving Vision Banana is that generation pretraining, by forcing high-fidelity visual synthesis conditioned on textual prompts, incidentally induces powerful visual abstractions. Instruction-tuning then exposes these representations, parameterizing vision tasks as generation problems, and eliminating the need for dedicated task-specific heads or regression objectives (Gabeur et al., 22 Apr 2026).

2. Vision Task Parameterization and Instruction-Tuning

Vision Banana is engineered by lightly instruction-tuning NBP with a dataset mixing classic image-generation trajectories and a small fraction (≈0.5%) of vision-task triplets, each consisting of an image, a textual prompt, and an RGB visualization target. All outputs, including segmentation, depth, and surface normals, are encoded as RGB images via strictly invertible mappings:

Semantic/Instance Segmentation: Classes map to unique colors. Decoding assigns each pixel by nearest color. For instance segmentation, unique hues/color clusters denote individual instances, extracted via connectivity clustering.
Metric Depth Estimation: Real-valued depth is “curved” using a power transform ($\alpha = -3$, $c = 10/3$, per Barron 2025), followed by bijective traversal of the RGB cube edges. Depths are decoded by projecting generated RGB pixels onto the cube edges and applying the analytical inverse transformation.
Surface Normal Estimation: Normals $n = (n_x, n_y, n_z)$ are mapped linearly: $\mathrm{RGB} = \frac{1}{2}(n + (1,1,1))$.
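
The three encodings above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the palette, the function names, and the particular path along the RGB cube edges are assumptions made for the example, and the depth helper takes an already-curved scalar t in [0, 1] rather than reproducing the power transform.

```python
import math

# Hypothetical palette (class name -> RGB in 0..255); the source does not list one.
PALETTE = {"road": (128, 64, 128), "car": (0, 0, 142), "sky": (70, 130, 180)}

def decode_class(pixel, palette=PALETTE):
    """Assign a generated pixel to the class with the nearest palette color."""
    return min(palette, key=lambda name: math.dist(pixel, palette[name]))

def encode_normal(n):
    """RGB = (n + (1, 1, 1)) / 2, mapping [-1, 1]^3 onto [0, 1]^3."""
    return tuple(0.5 * (c + 1.0) for c in n)

def decode_normal(rgb):
    """Invert the linear map, then renormalize to absorb small color errors."""
    v = [2.0 * c - 1.0 for c in rgb]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return tuple(c / norm for c in v)

# Assumed path along cube edges: consecutive vertices differ in one channel.
PATH = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
        (1, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]
EDGES = len(PATH) - 1

def encode_scalar(t):
    """Map t in [0, 1] to a point on the edge path (RGB in [0, 1]^3)."""
    s = min(max(t, 0.0), 1.0) * EDGES
    i = min(int(s), EDGES - 1)
    f = s - i
    a, b = PATH[i], PATH[i + 1]
    return tuple(x + f * (y - x) for x, y in zip(a, b))

def decode_scalar(rgb):
    """Project rgb onto the nearest edge of the path and return its t."""
    best_dist, best_t = float("inf"), 0.0
    for i in range(EDGES):
        a, b = PATH[i], PATH[i + 1]
        d = [y - x for x, y in zip(a, b)]  # unit-length edge direction
        f = sum((p - x) * dx for p, x, dx in zip(rgb, a, d))
        f = min(max(f, 0.0), 1.0)  # clamp the projection to the edge
        proj = [x + f * dx for x, dx in zip(a, d)]
        dist = math.dist(rgb, proj)
        if dist < best_dist:
            best_dist, best_t = dist, (i + f) / EDGES
    return best_t
```

Because decoding snaps each pixel to the nearest palette color or nearest cube edge, small color errors in the generated image do not change the decoded class and perturb the decoded scalar only slightly; for example, `decode_scalar` applied to a slightly off-edge color returns the value of the closest on-edge point.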

No additional task-specific heads or multi-objective optimization is introduced; a single joint denoising diffusion loss is retained throughout instruction-tuning. Fine-tuning is conducted for approximately 200,000 steps with a batch size of 1,024 on TPU v4 pods, keeping generative priors intact (Gabeur et al., 22 Apr 2026).
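
As a rough illustration of the data mixture, the sketch below draws training examples so that about 0.5% are vision-task triplets and the rest come from the original generative data. Sampling independently per example, and the names `VISION_FRACTION` and `sample_source`, are assumptions for this example; the source does not describe the sampling mechanics.

```python
import random

VISION_FRACTION = 0.005  # about 0.5% vision-task triplets, per the text

def sample_source(rng):
    """Pick the data source for one training example (assumed i.i.d. sampling)."""
    return "vision_triplet" if rng.random() < VISION_FRACTION else "generative"

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(200_000)]
share = draws.count("vision_triplet") / len(draws)
print(f"vision-task share: {share:.4%}")
```

Both kinds of example would be fed to the same denoising objective; only the prompt and the target image differ.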

3. Empirical Performance and Task Coverage

Vision Banana achieves state-of-the-art or competitive performance across a diverse portfolio of tasks, notably in zero-shot and low-data learning settings:

Task	Benchmark	Vision Banana	Best Specialist
Referring Segmentation	RefCOCOg UMD (cIoU)	0.738	0.734 (SAM 3)
Reasoning Segmentation	ReasonSeg val (gIoU)	0.793	0.770 (SAM 3)
Semantic Segmentation	Cityscapes (mIoU)	0.699	0.652 (SAM 3)
Instance Segmentation	SA-Co/Gold (pmF₁)	0.540*	0.552 (DINO-X)
Metric Depth Estimation	Avg of 4 sets (δ₁)	0.929	0.918 (DepthAnything3)
Surface Normals	Avg of 4 sets (MAE)	18.93°	19.64° (Lotus-2)
T2I Generation	GenAI-Bench (win %)	53.5	46.5 (NBP)
Editing	ImgEdit (win %)	47.8	52.2 (NBP)

*On 500 sampled queries (Gabeur et al., 22 Apr 2026).
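
For readers unfamiliar with the metrics in the table, the following sketch computes them on toy inputs. It follows the definitions commonly used in the depth, surface-normal, and referring-segmentation literature (δ₁ with a 1.25 threshold, mean angular error in degrees, gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union); the exact evaluation protocol of the source is not specified here, so treat the function names and input layout as assumptions.

```python
import math

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold (depths > 0)."""
    ok = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return ok / len(gt)

def mean_angular_error_deg(pred, gt):
    """Mean angle in degrees between predicted and reference unit normals."""
    total = 0.0
    for p, g in zip(pred, gt):
        cos = sum(a * b for a, b in zip(p, g))
        cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
        total += math.degrees(math.acos(cos))
    return total / len(gt)

def giou_and_ciou(pairs):
    """pairs: (intersection_pixels, union_pixels) per image, with union > 0."""
    giou = sum(i / u for i, u in pairs) / len(pairs)
    ciou = sum(i for i, _ in pairs) / sum(u for _, u in pairs)
    return giou, ciou
```

Note that gIoU weights every image equally, whereas cIoU is dominated by images with large masks, which is why the two can differ noticeably on the same predictions.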

Key findings:

Vision Banana surpasses strong “Segment Anything Model 3” and “Depth Anything 3” baselines in corresponding domains.
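The model largely retains its generative abilities: in human-preference comparisons against NBP it wins 53.5% of text-to-image pairs (GenAI-Bench) and 47.8% of editing pairs (ImgEdit), i.e., slightly ahead on generation and slightly behind on editing.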
Qualitative analysis shows robust multi-color segmentations, invertible depth reconstructions, and high-fidelity normal maps.

Vision Banana’s primary weaknesses are in highly cluttered instance segmentation, extremely small or occluded objects, and long-range monocular depth, with observed error increases for distant pixels due to non-linear quantization in the RGB mapping.
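
The depth-quantization effect can be illustrated with a toy calculation. The sketch below uses a simple compressive curve t = d / (d + scale) as a stand-in (an assumption; it is not the power transform used by the model) and a hypothetical number of distinguishable color levels, then measures how much depth a single level covers at near and far range.

```python
def curve(d, scale=5.0):
    """Hypothetical compressive map from depth d >= 0 (meters) to t in [0, 1)."""
    return d / (d + scale)

def inverse_curve(t, scale=5.0):
    """Analytical inverse of curve()."""
    return scale * t / (1.0 - t)

def depth_step(d, levels=1024, scale=5.0):
    """Depth change needed to move one quantization level at depth d."""
    t_next = min(curve(d, scale) + 1.0 / levels, 1.0 - 1e-9)
    return inverse_curve(t_next, scale) - d

near, far = depth_step(2.0), depth_step(50.0)
print(f"one level spans {near:.4f} m at 2 m and {far:.4f} m at 50 m")
```

Under any such compressive mapping, a fixed color error translates into a much larger depth error for distant pixels than for nearby ones, which is consistent with the long-range behavior described above.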

4. Zero-Shot Low-Level Vision Capabilities

Nano Banana Pro, the precursor to Vision Banana, exhibits substantial capabilities as a zero-shot “vision all-rounder” across 14 low-level restoration, enhancement, and fusion tasks spanning 40 datasets. Driven solely by natural language prompts, NBP shows a split profile:

Superior perceptual visual quality, often hallucinating high-frequency detail and plausible new textures not present in ground truth.
Significant deficits in reference-based quantitative metrics (e.g., PSNR, SSIM, LPIPS), trailing best specialists by 5–12 dB (PSNR) and 0.2–0.5 (SSIM).

This dichotomy is attributed to the stochastic, distributional nature of generative outputs (which prioritize plausible visual reconstructions) contrasted with the deterministic regression targets of specialist models (Zuo et al., 17 Dec 2025). Traditional metrics penalize NBP’s mode-seeking or creative outputs, leading to consistent gaps between subjective quality and quantitative fidelity. Vision Banana inherits this prior but, via instruction-tuning, harmonizes perception and generation more tightly.
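
A toy example makes the gap between perceptual plausibility and reference-based fidelity concrete. In the sketch below, which is an illustration under assumed synthetic data and not a reproduction of the cited experiments, a detail-free "regression-style" output scores a higher PSNR than a "generative-style" output that contains realistic but differently sampled detail.

```python
import math
import random

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

rng = random.Random(0)
n = 10_000
# Ground-truth "texture": mid-gray plus fine random detail.
truth = [0.5 + rng.gauss(0, 0.1) for _ in range(n)]
# Regression-style output: the smooth average, with no detail at all.
smooth = [0.5] * n
# Generative-style output: statistically similar detail, but a different draw.
sampled = [0.5 + rng.gauss(0, 0.1) for _ in range(n)]

psnr_smooth = psnr(truth, smooth)
psnr_sampled = psnr(truth, sampled)
print(f"smooth: {psnr_smooth:.1f} dB, sampled: {psnr_sampled:.1f} dB")
```

The sampled output looks more like the ground-truth texture, yet its independent detail roughly doubles the mean squared error, costing about 3 dB in this toy setting.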

5. Robustness, Failure Modes, and Iterative Editing

While Vision Banana demonstrates “emergent” vision understanding, its NBP base shows significant limitations when deployed in multi-turn, agentic settings:

In iterative editing (100-step replications), minor artifacts from each generative pass (e.g., quantization noise, color drifts, simplifications) accumulate, resulting in dramatic degradation (salt-and-pepper noise, tints, static) by 20+ iterations (Tang et al., 3 Apr 2026).
Instruction-following fidelity collapses over long horizons, with fine textures degrading in a few rounds, object addition/counts failing, and global aspects, such as cropping and denoising, exhibiting irreversible drift.
No-reference IQA (NR-IQA) metrics (21 classical variants, including BRISQUE, NIQE, and MUSIQ) are blind to this progressive collapse: none consistently assigns lower scores to degraded outputs than to clean ones. Only recent large VLM-based metrics (RALI, VisualQuality-R1) demonstrate adequate sensitivity.
The Banana100 dataset was specifically constructed to benchmark and expose these iterative-editing vulnerabilities (Tang et al., 3 Apr 2026).
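
The compounding nature of these errors can be shown with a toy simulation. The sketch below is not the Banana100 protocol: `edit_pass` is a hypothetical stand-in that adds a small random perturbation and requantizes to 8 bits, which is enough to show how individually negligible per-pass errors accumulate over 100 rounds.

```python
import math
import random

def edit_pass(img, rng, noise=0.01):
    """One hypothetical edit round: small perturbation, then 8-bit requantization."""
    out = []
    for v in img:
        v = min(1.0, max(0.0, v + rng.gauss(0, noise)))
        out.append(round(v * 255) / 255)
    return out

def rmse(a, b):
    """Root-mean-square error between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

rng = random.Random(0)
original = [rng.random() for _ in range(2_000)]
img = list(original)
errors = []
for step in range(100):
    img = edit_pass(img, rng)  # each round edits the previous output
    errors.append(rmse(img, original))

print(f"RMSE after 1, 20, 100 passes: "
      f"{errors[0]:.4f}, {errors[19]:.4f}, {errors[99]:.4f}")
```

Each pass is compared against the original rather than the previous output, which is the comparison a per-step quality gate would miss if it only looked at consecutive images.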

A plausible implication is that long-horizon agentic systems built atop generative models are at substantial risk of silent, compounding failures undetectable to current automated quality gates.

6. Safety, Compliance, and Deployment Considerations

Vision Banana, as a direct descendant of NBP, inherits its safety characteristics. Benchmark, adversarial robustness, and regulatory compliance data indicate:

NBP refuses 6–21% of potentially unsafe prompts across the benchmark and regulatory-compliance evaluations, with unsafe generation rates of roughly 25–33% on adversarial inputs.
In adversarial settings (PGJ, GenBreak attacks), safe output rates in the “Hate” category degrade severely (to 24%), reflecting that “implicit sanitization” sometimes allows residual content (e.g., coded hate, unsettling gore) to evade hard filters (Ma et al., 15 Jan 2026).
Regulatory compliance, especially in IP and privacy categories, is variable, with NBP favoring steerable redirection over refusals. This yields competitive overall safety ratings versus other T2I models but suggests the necessity for downstream filtering and auditing in production deployment.

7. Paradigm Shifts and Future Directions

The success of Vision Banana substantiates the proposal that latent diffusion generative pretraining can serve as the foundation for universal, multi-task vision models. Major implications and open problems include:

Reframing of virtually all vision tasks—including 2D, 3D, and potentially 4D spatio-temporal inference—as image generation.
The emergence of a unified, instruction-tuned generative interface analogous to text decoding in NLP foundational models.
Persistent challenges in cost (high-compute inference, training), edge-case controllability, and the expressiveness of RGB parameterizations for fine-grained or structured predictions.
The demand for novel no-reference metrics and multi-reference datasets that reflect the multimodality of generative perception rather than single ground-truth targets.
A plausible implication is that integration with multimodal LLMs, improvements in distillation, and the expansion of RGB-based invertible encodings could further extend generalist vision models’ reach and reliability.

Vision Banana exemplifies a paradigm shift: the convergence of generative and discriminative vision within a tractable, instruction-driven architecture, expanding the scope and flexibility of foundational vision models for both academic and practical domains (Gabeur et al., 22 Apr 2026, Tang et al., 3 Apr 2026, Zuo et al., 17 Dec 2025, Ma et al., 15 Jan 2026).