Visual Programming for Text-to-Image Generation and Evaluation (2305.15328v2)

Published 24 May 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: As LLMs have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation), by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that our VPGen has improved control in counts/spatial relations/scales of objects than state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations with a single scoring model that is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and also provides visual+textual explanations of the evaluation results. Our analysis shows that VPEval provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single model-based evaluation. We hope that our work encourages future progress on interpretable/explainable generation and evaluation for T2I models.

This paper introduces two visual programming frameworks, VPGen and VPEval, aimed at improving the interpretability, controllability, and evaluation of text-to-image (T2I) generation models.

VPGen: Visual Programming for Step-by-Step Text-to-Image Generation

VPGen is an interpretable T2I generation framework that decomposes the process into three distinct steps:

  1. Object/Count Generation: An LM (Vicuna 13B fine-tuned on text-layout pairs) identifies objects mentioned in the text prompt and their counts, outputting them as a textual list (e.g., "dog (1) cat (2)").
  2. Layout Generation: The same LM generates spatial layouts (bounding boxes) for the identified objects, represented textually using quantized coordinates (e.g., "dog (10,20,50,60) cat (40,50,80,90)"). This text-based representation allows leveraging the LM's world knowledge, enabling the placement of objects unseen during layout fine-tuning.
  3. Image Generation: An off-the-shelf layout-to-image model (like GLIGEN) takes the original text prompt and the generated layout (bounding boxes + object descriptions) to synthesize the final image.
  • Implementation: VPGen uses a fine-tuned Vicuna 13B LM with LoRA for layout generation and GLIGEN (built on Stable Diffusion) for image synthesis. The LM is trained on text-layout pairs from Flickr30k Entities, MS COCO, and PaintSkills. Bounding boxes are normalized and quantized into 100 bins (a minimal sketch of this pipeline follows the list below).
  • Advantages: Experiments show VPGen provides significantly better control over object counts, spatial relationships, and object scales compared to end-to-end T2I models like Stable Diffusion, while maintaining interpretability through its step-by-step process. It can handle unseen object types due to the LM's pre-existing knowledge.
  • Limitations: Performance on text rendering is poor, likely due to insufficient text-rendering examples in the layout training data. The final image quality is dependent on the capabilities of the layout-to-image model (GLIGEN).
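
For concreteness, below is a minimal Python sketch of this three-step pipeline and of the coordinate quantization described above. The helper names (quantize_box, layout_to_text, vpgen) and the layout_lm / layout_to_image interfaces are hypothetical placeholders standing in for the fine-tuned Vicuna 13B layout generator and GLIGEN, not the paper's actual code; the prompt formats are illustrative.

```python
# Minimal sketch of VPGen's three-step pipeline; not the authors' implementation.
# `layout_lm` stands in for the LoRA-finetuned Vicuna 13B layout generator and
# `layout_to_image` for GLIGEN; both are hypothetical placeholders here.

from typing import List, Tuple

NUM_BINS = 100  # boxes are normalized and quantized into 100 bins per axis


def quantize_box(box: Tuple[float, float, float, float],
                 width: int, height: int) -> Tuple[int, ...]:
    """Normalize pixel coordinates to [0, 1] and quantize into NUM_BINS bins.

    Used when converting ground-truth boxes into the textual layout format
    for fine-tuning on text-layout pairs.
    """
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    return tuple(min(NUM_BINS - 1, int(v * NUM_BINS)) for v in normalized)


def layout_to_text(objects: List[str],
                   boxes: List[Tuple[int, ...]]) -> str:
    """Render a layout as text, e.g. 'dog (10,20,50,60) cat (40,50,80,90)'."""
    return " ".join(f"{name} ({x0},{y0},{x1},{y1})"
                    for name, (x0, y0, x1, y1) in zip(objects, boxes))


def vpgen(prompt: str, layout_lm, layout_to_image):
    # Step 1: object/count generation, e.g. "dog (1) cat (2)"
    objects_and_counts = layout_lm.generate(f"Objects and counts: {prompt}")
    # Step 2: layout generation as quantized text coordinates
    layout_text = layout_lm.generate(
        f"Layout: {prompt}\nObjects: {objects_and_counts}")
    # Step 3: layout-guided image synthesis (GLIGEN on Stable Diffusion)
    return layout_to_image(prompt=prompt, layout=layout_text)
```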

VPEval: Visual Programming for Explainable Evaluation of Text-to-Image Generation

VPEval offers an interpretable and explainable framework for evaluating T2I models, addressing the limitations of single-model metrics (like CLIP score or FID) which lack interpretability and may not reliably assess specific skills.

  • Approach: Instead of a single score, VPEval generates or uses predefined evaluation programs. These programs call a suite of specialized visual evaluation modules (experts in different skills) to assess specific aspects of the generated image against the prompt.
  • Evaluation Skills: VPEval focuses on five key skills: Object presence, Count accuracy, Spatial relationships, Scale comparison, and Text rendering quality.
  • Visual Modules: It employs modules like object detectors (Grounding DINO + DPT for depth), OCR (EasyOCR), VQA (BLIP-2), and specialized functions (objectEval, countEval, spatialEval, scaleEval, textEval) built upon these. Each module outputs a binary score (pass/fail) for the aspect it evaluates, along with visual (e.g., bounding boxes) and textual explanations for its decision.
  • Evaluation Modes:
    • Skill-based: Uses predefined prompts and corresponding evaluation programs targeting one specific skill at a time.
    • Open-ended: Handles complex, user-written prompts potentially requiring multiple skills. An LM (like ChatGPT) dynamically generates an evaluation program based on the prompt and the available visual modules. The final score is the average of the individual module scores (see the example program after this list).
  • Advantages: VPEval provides fine-grained, interpretable scores with explanations. Experiments demonstrate that VPEval scores correlate significantly better with human judgments than traditional metrics like CLIP similarity or captioning-based scores, especially for specific skills like counting and text rendering. The program generation via in-context learning with ChatGPT avoids the need for expensive annotation.
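
To make the open-ended mode concrete, here is a hedged Python sketch of the kind of evaluation program an LM might generate. The module names (objectEval, countEval, spatialEval, textEval) follow the paper's terminology, but their signatures and the modules object are illustrative assumptions; in VPEval each module wraps an expert model (e.g., Grounding DINO + DPT, EasyOCR, or BLIP-2) and returns a binary pass/fail score with a visual and textual explanation.

```python
# Sketch of an evaluation program an LM might generate in VPEval's open-ended
# mode for the prompt: "two dogs to the left of a stop sign that says STOP".
# Module signatures are illustrative assumptions, not the paper's exact API.

def evaluate(image, modules) -> float:
    scores = [
        modules.objectEval(image, "dog"),                        # object presence
        modules.countEval(image, "dog", "==", 2),                # count accuracy
        modules.objectEval(image, "stop sign"),
        modules.spatialEval(image, "dog", "stop sign", "left"),  # spatial relation
        modules.textEval(image, "STOP"),                         # text rendering
    ]
    # Each module returns a binary (0/1) score; the final VPEval score is
    # the average of the individual module scores.
    return sum(scores) / len(scores)
```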

Key Experimental Findings

  • VPGen significantly outperformed baseline T2I models (Stable Diffusion v1.4/v2.1, Karlo, minDALL-E, DALL-E Mega) on skill-based evaluations, particularly for Count, Spatial, and Scale skills.
  • On open-ended prompts (TIFA160 dataset), VPGen performed competitively with Stable Diffusion. The authors note these prompts often focus less on complex layouts where VPGen excels.
  • Error analysis on VPGen showed that the layout generation step (Vicuna) was generally accurate, but the layout-to-image generation step (GLIGEN) often introduced errors, suggesting potential for improvement with better image synthesis backbones.
  • VPEval demonstrated higher Spearman's ρ correlation with human judgments than CLIP similarity, BLIP-2 captioning, and BLIP-2 VQA metrics on both skill-based and open-ended evaluations.
  • Analysis of VPEval's generated programs showed they achieved high coverage (94%) of prompt elements and high module accuracy (83%).

In conclusion, the paper presents VPGen and VPEval as novel frameworks leveraging visual programming principles to enhance the controllability, interpretability, and evaluation reliability of T2I systems, paving the way for more explainable and accurate T2I models and benchmarks.

Authors (3)
  1. Jaemin Cho (36 papers)
  2. Abhay Zala (10 papers)
  3. Mohit Bansal (304 papers)
Citations (44)