This paper introduces two visual programming frameworks, VPGen and VPEval, aimed at improving the interpretability, controllability, and evaluation of text-to-image (T2I) generation models.
VPGen: Visual Programming for Step-by-Step Text-to-Image Generation
VPGen is an interpretable T2I generation framework that decomposes the process into three distinct steps:
- Object/Count Generation: An LM (Vicuna 13B fine-tuned on text-layout pairs) identifies objects mentioned in the text prompt and their counts, outputting them as a textual list (e.g., "dog (1) cat (2)").
- Layout Generation: The same LM generates spatial layouts (bounding boxes) for the identified objects, represented textually using quantized coordinates (e.g., "dog (10,20,50,60) cat (40,50,80,90)"). This text-based representation allows leveraging the LM's world knowledge, enabling the placement of objects unseen during layout fine-tuning.
- Image Generation: An off-the-shelf layout-to-image model (e.g., GLIGEN) takes the original text prompt and the generated layout (bounding boxes plus object descriptions) and synthesizes the final image (a hedged example call is sketched after this list).
- Implementation: VPGen uses a Vicuna 13B LM fine-tuned with LoRA for object/count and layout generation, and GLIGEN (built on Stable Diffusion) for image synthesis. The LM is trained on text-layout pairs from Flickr30k Entities, MS COCO, and PaintSkills. Bounding boxes are normalized and quantized into 100 bins (a parsing and dequantization sketch follows this list).
- Advantages: Experiments show VPGen provides significantly better control over object counts, spatial relationships, and object scales compared to end-to-end T2I models like Stable Diffusion, while maintaining interpretability through its step-by-step process. It can handle unseen object types due to the LM's pre-existing knowledge.
- Limitations: Performance on text rendering is poor, likely due to insufficient text-rendering examples in the layout training data. The final image quality is dependent on the capabilities of the layout-to-image model (GLIGEN).
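As an illustration of the textual layout representation and the 100-bin quantization described above, here is a minimal parsing and dequantization sketch; the regular expression, function name, and 512-pixel image size are assumptions for illustration, not the paper's implementation.

```python
import re

# Layout strings look like "dog (10,20,50,60) cat (40,50,80,90)",
# where each box is (x0, y0, x1, y1) quantized into 100 bins per axis.
LAYOUT_PATTERN = re.compile(r"([\w\s]+?)\s*\((\d+),(\d+),(\d+),(\d+)\)")

def parse_layout(layout_text, image_size=512, num_bins=100):
    """Return (phrase, box) pairs with boxes mapped back to pixel coordinates."""
    objects = []
    for match in LAYOUT_PATTERN.finditer(layout_text):
        phrase = match.group(1).strip()
        # Dequantize: bin index -> normalized coordinate -> pixel coordinate.
        box = [int(match.group(i)) / num_bins * image_size for i in range(2, 6)]
        objects.append((phrase, box))
    return objects

print(parse_layout("dog (10,20,50,60) cat (40,50,80,90)"))
# [('dog', [51.2, 102.4, 256.0, 307.2]), ('cat', [204.8, 256.0, 409.6, 460.8])]
```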
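For the final image-generation step, a hedged sketch using the StableDiffusionGLIGENPipeline from Hugging Face diffusers is shown below; the checkpoint name, normalized-box convention, and sampling arguments follow the diffusers documentation and may differ from the paper's actual GLIGEN setup.

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline

# Load a GLIGEN text-box checkpoint (built on Stable Diffusion).
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

prompt = "a dog to the left of a cat"
# Boxes are (x0, y0, x1, y1) in [0, 1], e.g. the dequantized LM layout above.
phrases = ["dog", "cat"]
boxes = [[0.10, 0.20, 0.50, 0.60], [0.40, 0.50, 0.80, 0.90]]

image = pipe(
    prompt=prompt,
    gligen_phrases=phrases,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]
image.save("vpgen_output.png")
```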
VPEval: Visual Programming for Explainable Evaluation of Text-to-Image Generation
VPEval offers an interpretable and explainable framework for evaluating T2I models, addressing the limitations of single-model metrics (like CLIP score or FID) which lack interpretability and may not reliably assess specific skills.
- Approach: Instead of producing a single score, VPEval generates or uses predefined evaluation programs. These programs call a suite of specialized visual evaluation modules (experts in different skills) to assess specific aspects of the generated image against the prompt (example modules are sketched after this list).
- Evaluation Skills: VPEval focuses on five key skills: Object presence, Count accuracy, Spatial relationships, Scale comparison, and Text rendering quality.
- Visual Modules: It employs modules such as object detectors (Grounding DINO, plus DPT for depth), OCR (EasyOCR), and VQA (BLIP-2), along with specialized functions (objectEval, countEval, spatialEval, scaleEval, textEval) built on top of them. Each module outputs a binary pass/fail score for the aspect it evaluates, along with visual (e.g., bounding boxes) and textual explanations for its decision.
- Evaluation Modes:
- Skill-based: Uses predefined prompts and corresponding evaluation programs targeting one specific skill at a time.
- Open-ended: Handles complex, user-written prompts that may require multiple skills. An LM (such as ChatGPT) dynamically generates an evaluation program based on the prompt and the available visual modules. The final score is the average of the individual module scores (see the execution sketch after this list).
- Advantages: VPEval provides fine-grained, interpretable scores with explanations. Experiments demonstrate that VPEval scores correlate significantly better with human judgments than traditional metrics like CLIP similarity or captioning-based scores, especially for specific skills like counting and text rendering. The program generation via in-context learning with ChatGPT avoids the need for expensive annotation.
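To make the module interface concrete, here is a minimal sketch of two binary-scoring modules in the spirit of objectEval and countEval; the detect helper, the result dataclass, and the snake_case names are illustrative assumptions rather than VPEval's actual code.

```python
from dataclasses import dataclass

@dataclass
class ModuleResult:
    score: int        # binary pass/fail (1 or 0)
    explanation: str  # textual explanation of the decision
    boxes: list       # bounding boxes used for the visual explanation

def detect(image, obj):
    """Assumed wrapper around an open-vocabulary detector (e.g., Grounding DINO)
    that returns a list of bounding boxes for `obj` in `image`."""
    raise NotImplementedError("plug in a real detector here")

def object_eval(image, obj):
    """Pass if at least one instance of `obj` is detected."""
    boxes = detect(image, obj)
    passed = int(len(boxes) > 0)
    return ModuleResult(passed, f"'{obj}' {'found' if passed else 'not found'}", boxes)

def count_eval(image, obj, target):
    """Pass if exactly `target` instances of `obj` are detected."""
    boxes = detect(image, obj)
    passed = int(len(boxes) == target)
    return ModuleResult(passed, f"expected {target} '{obj}', detected {len(boxes)}", boxes)
```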
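For the open-ended mode, the sketch below shows how a short LM-generated program (hard-coded here rather than produced by ChatGPT) could be executed against the modules above, with the final score taken as the average of the binary module scores; the line-per-call program format and eval-based dispatch are assumptions for illustration.

```python
# Assumed format: one module call per line, referencing the modules sketched above.
generated_program = """
object_eval(image, 'dog')
count_eval(image, 'cat', 2)
"""

def run_program(program_text, image, modules):
    """Run each module call and average the binary scores into a final score."""
    results = []
    env = {"image": image, **modules}
    for line in program_text.strip().splitlines():
        # Sketch only: a real system would parse the calls rather than eval() them.
        results.append(eval(line, {"__builtins__": {}}, env))
    final_score = sum(r.score for r in results) / len(results)
    return final_score, results

# Usage (assuming `image` is a loaded PIL image and the modules defined above):
# score, results = run_program(
#     generated_program, image,
#     {"object_eval": object_eval, "count_eval": count_eval},
# )
```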
Key Experimental Findings
- VPGen significantly outperformed baseline T2I models (Stable Diffusion v1.4/v2.1, Karlo, minDALL-E, DALL-E Mega) on skill-based evaluations, particularly for Count, Spatial, and Scale skills.
- On open-ended prompts (TIFA160 dataset), VPGen performed competitively with Stable Diffusion. The authors note these prompts often focus less on complex layouts where VPGen excels.
- Error analysis on VPGen showed that the layout generation step (Vicuna) was generally accurate, but the layout-to-image generation step (GLIGEN) often introduced errors, suggesting potential for improvement with better image synthesis backbones.
- VPEval demonstrated higher Spearman's correlation with human judgments than CLIP, BLIP-2 captioning metrics, and BLIP-2 VQA on both skill-based and open-ended evaluations (a minimal correlation computation is sketched after this list).
- Analysis of VPEval's generated programs showed they achieved high coverage (94%) of prompt elements and high module accuracy (83%).
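Since the human-correlation result above reduces to a rank correlation between metric scores and human ratings of the same images, here is a minimal sketch using scipy; the numbers are placeholders, not the paper's data.

```python
from scipy.stats import spearmanr

# Placeholder ratings/scores for the same set of generated images (not real data).
human_ratings = [5, 3, 4, 1, 2, 5, 4, 2]                    # e.g., human Likert judgments
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.3, 0.8, 0.6, 0.1]    # e.g., VPEval scores

rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3g})")
```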
In conclusion, the paper presents VPGen and VPEval as novel frameworks leveraging visual programming principles to enhance the controllability, interpretability, and evaluation reliability of T2I systems, paving the way for more explainable and accurate T2I models and benchmarks.