PaperBanana Framework
- PaperBanana is an agentic framework for generating publication-ready academic illustrations, eliminating manual bottlenecks in creating methodological diagrams and statistical plots.
- It integrates five specialized agents—Retriever, Planner, Stylist, Visualizer, and Critic—using state-of-the-art vision-language models for iterative self-critique and plan refinement.
- Empirical evaluations on PaperBananaBench demonstrate significant improvements in faithfulness, conciseness, readability, and aesthetics over existing baselines.
PaperBanana is an agentic framework for automated generation of publication-ready academic illustrations, designed to eliminate the manual bottleneck in creating methodology diagrams and statistical plots for AI research. It formalizes illustration generation as the mapping of a source context (e.g., a method description) and communicative intent (e.g., figure caption) into a scholarly-quality figure, optionally leveraging reference examples in either a zero-shot or retrieval-augmented setting. The system orchestrates five specialized agents, each operating via state-of-the-art vision-LLMs (VLMs) and image generation modules, with iterative self-critique to ensure faithfulness and visual quality. Rigorous empirical evaluation using the PaperBananaBench demonstrates significant enhancements over prior baselines in faithfulness, conciseness, readability, and aesthetics, both for methodological diagrams and statistical plots (Zhu et al., 30 Jan 2026).
1. Task Formalization and System Pipeline
PaperBanana aims to convert a source context $S$ (textual method description) and communicative intent $C$ (figure caption) into an academic illustration $I$, potentially conditioned on a set of reference triplets $\mathcal{E} = \{(S_k, C_k, I_k)\}_{k=1}^{K}$. The formalization is $I = \mathcal{F}(S, C, \mathcal{E})$, enabling both zero-shot generation ($\mathcal{E} = \varnothing$) and retrieval-augmented generation. The architecture comprises five sequential agents: Retriever, Planner, Stylist, Visualizer, and Critic. The orchestration follows a linear planning phase and a three-round iterative self-critique loop.
| Agent | Inputs | Outputs |
|---|---|---|
| Retriever | $S$, $C$, reference corpus | $\mathcal{E}$ (top-matched reference triplets) |
| Planner | $S$, $C$, $\mathcal{E}$ | $P$ (textual diagram plan) |
| Stylist | $P$, $G$ (style guide) | $P'$ (styled plan) |
| Visualizer | $P'$ | $I$ (rendered illustration) |
| Critic | $I$, $S$, $C$, $P'$ | refined plan |
The pipeline first retrieves similar prior diagrams based on visual structure, plans and styles a diagram, then iteratively visualizes and critiques, aiming for a final, publication-ready output.
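The following Python sketch illustrates this orchestration under the assumption that each agent is exposed as a simple callable; the function name and signatures are illustrative, not the paper's API.

```python
from typing import Callable, Sequence

def run_pipeline(
    source: str,                                   # S: textual method description
    caption: str,                                  # C: communicative intent / figure caption
    retriever: Callable[[str, str], Sequence],     # -> top-matched reference triplets E
    planner: Callable[[str, str, Sequence], str],  # -> textual diagram plan P
    stylist: Callable[[str, Sequence], str],       # -> styled plan P'
    visualizer: Callable[[str], bytes],            # -> rendered illustration I
    critic: Callable[[bytes, str, str, str], str], # -> refined plan
    rounds: int = 3,
) -> bytes:
    """Linear planning phase followed by a fixed number of Visualizer/Critic rounds."""
    refs = retriever(source, caption)
    plan = planner(source, caption, refs)
    styled = stylist(plan, refs)
    image = visualizer(styled)
    for _ in range(rounds):                        # three-round self-critique loop in the paper
        styled = critic(image, source, caption, styled)
        image = visualizer(styled)
    return image
```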
2. Agent Specialization and Workflow
Retriever
The Retriever agent employs a generative retrieval strategy with a dedicated vision–LLM. It computes a learned matching score for each candidate reference triplet $(S_k, C_k, I_k)$ in the corpus, emphasizing visual structural similarity over topical correlation, and selects the top-ranked set $\mathcal{E}$.
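A minimal sketch of such a retrieval step, assuming the VLM is wrapped as a scoring callable; the `score_fn` signature and the cutoff `k` are illustrative assumptions:

```python
from typing import Callable, Sequence, TypeVar

Ref = TypeVar("Ref")  # a reference triplet (source, caption, image) in whatever representation is used

def retrieve_references(
    source: str,
    caption: str,
    corpus: Sequence[Ref],
    score_fn: Callable[[str, str, Ref], float],  # e.g. a VLM prompted to rate visual-structural similarity
    k: int = 3,
) -> list[Ref]:
    """Rank candidate reference triplets by a matching score and keep the top k."""
    ranked = sorted(corpus, key=lambda ref: score_fn(source, caption, ref), reverse=True)
    return list(ranked[:k])
```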
Planner
The Planner agent uses a large VLM to synthesize a detailed textual plan of the prospective diagram. It performs few-shot prompting with selected examples:
```
function PLAN_DIAGRAM(S, C, E):
    prompt ← few-shot examples from E plus (S, C)
    P ← VLM_generate(prompt)
    return P
```
Stylist
The Stylist agent constructs an auto-summarized style guide $G$ from the reference set, incorporating conventions for color palettes, typography, shapes, and layout (the “NeurIPS look”). It stylizes the plan as $P' = \mathrm{Stylist}(P, G)$.
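A minimal sketch of this stage, assuming the style guide is a free-text summary and the stylization is a single VLM rewrite; the prompt wording, and the simplification of summarizing from captions only, are illustrative rather than the paper's implementation:

```python
from typing import Callable, Sequence

def build_style_guide(reference_captions: Sequence[str], generate: Callable[[str], str]) -> str:
    """Summarize shared visual conventions (palette, typography, shapes, layout) from the references."""
    listing = "\n".join(f"- {c}" for c in reference_captions)
    prompt = (
        "Summarize the shared visual style of the following published figures "
        "(color palette, typography, shapes, layout conventions):\n" + listing
    )
    return generate(prompt)

def stylize_plan(plan: str, style_guide: str, generate: Callable[[str], str]) -> str:
    """Rewrite the textual diagram plan so it conforms to the inferred style guide."""
    prompt = (
        "Rewrite the diagram plan below so it follows the style guide.\n"
        f"Style guide:\n{style_guide}\n\nPlan:\n{plan}"
    )
    return generate(prompt)
```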
Visualizer
The Visualizer translates the styled plan into a pixel-based illustration using either Nano-Banana-Pro (v2025) or GPT-Image-1.5, both bespoke image generation models with high diagrammatic fidelity. For statistical plots, the Visualizer can instead emit executable Matplotlib code.
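A sketch of how such a two-path Visualizer could be wired, assuming the image model and the code-emitting VLM are plain callables; only the Matplotlib rendering path uses real library calls, and the routing logic is an assumption:

```python
import io
from typing import Callable

import matplotlib
matplotlib.use("Agg")                      # headless rendering
import matplotlib.pyplot as plt

def render_matplotlib(code: str) -> bytes:
    """Execute generated Matplotlib code and rasterize the resulting figure to PNG bytes."""
    exec(code, {"plt": plt})               # generated code is assumed trusted here; sandbox in practice
    buf = io.BytesIO()
    plt.gcf().savefig(buf, format="png", dpi=300)
    plt.close("all")
    return buf.getvalue()

def visualize(styled_plan: str, figure_type: str,
              image_model: Callable[[str], bytes],
              code_model: Callable[[str], str]) -> bytes:
    """Route diagrams to an image-generation model and statistical plots to generated Matplotlib code."""
    if figure_type == "statistical_plot":
        return render_matplotlib(code_model(styled_plan))
    return image_model(styled_plan)        # diagram path: prompt the image-generation backend
```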
Critic
The Critic agent evaluates the generated image against the source context $S$, the caption $C$, and the styled plan, producing plan refinements via VLM-based feedback. A fixed number of iterative rounds ($T = 3$) balances faithfulness with aesthetics.
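A minimal sketch of one critique round, with the review prompt wording and the `vlm` callable as illustrative assumptions:

```python
from typing import Callable

def critique(image_png: bytes, source: str, caption: str, plan: str,
             vlm: Callable[[str, bytes], str]) -> str:
    """Ask a VLM to compare the rendered draft with the source context and return a refined plan."""
    prompt = (
        "You are reviewing a draft academic figure.\n"
        f"Source context:\n{source}\n\nIntended caption:\n{caption}\n\nCurrent plan:\n{plan}\n\n"
        "Point out faithfulness, conciseness, and readability problems visible in the attached image, "
        "then output a revised plan that fixes them."
    )
    return vlm(prompt, image_png)   # the VLM call is abstracted as a (text, image) -> text callable
```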
3. Model Selection, Prompting, and Evaluation Metrics
The framework’s backbone is Gemini-3-Pro (VLM) for retrieval, planning, styling, and judgment; Nano-Banana-Pro and GPT-Image-1.5 serve as the Visualizer. System prompts are engineered for each agent; no fine-tuning is performed, and the framework relies solely on in-context learning.
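For concreteness, per-agent system prompts might be organized as a simple configuration map; the wording below is hypothetical and not reproduced from the paper:

```python
# Illustrative per-agent system prompts; the wording is hypothetical and not taken from the paper.
AGENT_SYSTEM_PROMPTS = {
    "retriever": "Rate (0-10) how structurally similar this candidate figure is to the requested diagram.",
    "planner":   "Draft a detailed textual plan: components, groupings, arrows, annotations, and labels.",
    "stylist":   "Rewrite the plan to follow the style guide: palette, typography, shapes, and layout.",
    "critic":    "Compare the rendered figure with the method description and propose concrete plan revisions.",
}
```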
Evaluation employs the PaperBananaBench: 292 test cases of methodology diagrams from NeurIPS 2025, stratified by domain and style. Key metrics include:
- Faithfulness: correspondence of the generated illustration to the source context and caption
- Conciseness: signal-to-noise ratio of the figure content
- Readability: visual clarity and non-overlapping elements
- Aesthetics: compliance with the domain style guide
The hierarchical overall score prioritizes faithfulness and readability, with tie-breaking based on conciseness and aesthetic adherence.
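One plausible implementation of such a hierarchical criterion is a lexicographic ranking key, sketched below; the paper's exact aggregation is not reproduced here:

```python
def overall_key(scores: dict) -> tuple:
    """Lexicographic ranking key: faithfulness and readability dominate;
    conciseness and aesthetics act as tie-breakers (one plausible reading of the hierarchy)."""
    return (
        scores["faithfulness"],
        scores["readability"],
        scores["conciseness"],
        scores["aesthetics"],
    )

# Example: pick the better of two judged candidates under this criterion.
candidates = [
    {"faithfulness": 4, "readability": 3, "conciseness": 5, "aesthetics": 4},
    {"faithfulness": 4, "readability": 3, "conciseness": 4, "aesthetics": 5},
]
best = max(candidates, key=overall_key)   # first candidate wins on the conciseness tie-breaker
```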
4. Empirical Performance and Ablation Analysis
On PaperBananaBench, PaperBanana achieves improvements over existing baselines:
| Method | Faithfulness | Conciseness | Readability | Aesthetics | Overall |
|---|---|---|---|---|---|
| GPT-Image-1.5 | 4.5 | 37.5 | 30.0 | 37.0 | 11.5 |
| Nano-Banana-Pro | 43.0 | 43.5 | 38.5 | 65.5 | 43.2 |
| Few-shot Nano-Banana | 41.6 | 49.6 | 37.6 | 60.5 | 41.8 |
| Paper2Any (Nano-Banana) | 6.5 | 44.0 | 20.5 | 40.0 | 8.5 |
| PaperBanana (Ours) | 45.8 | 80.7 | 51.4 | 72.1 | 60.2 |
| Human Reference | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
All improvements are reported as statistically significant. Removal of the Stylist component reduces conciseness by 17.5 points; omission of the Critic component decreases faithfulness by 15.1 points. Qualitative assessment indicates better preservation of logical structure and more domain-appropriate palettes in PaperBanana outputs (Zhu et al., 30 Jan 2026).
5. Extension to Statistical Plots
PaperBanana directly extends to statistical plot synthesis by generating Matplotlib code in the Visualizer phase. The system applies a dedicated plot-style guide and retains the Retriever and Planner stages. On ChartMimic subset benchmarks (240 test cases), PaperBanana demonstrates an overall 4.1-point improvement over vanilla Gemini-3-Pro, with some instances surpassing human references in clarity and visual discipline.
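For intuition, the snippet below shows the kind of executable Matplotlib code the Visualizer might emit for a simple grouped bar chart; the data values and styling choices are placeholders, not results from the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

metrics = ["Faithfulness", "Conciseness", "Readability", "Aesthetics"]
scores = {"Baseline": [3.1, 3.4, 3.0, 3.2], "Ours": [4.2, 4.0, 3.9, 4.3]}  # placeholder values

x = np.arange(len(metrics))
width = 0.35
fig, ax = plt.subplots(figsize=(5, 3))
for i, (name, vals) in enumerate(scores.items()):
    ax.bar(x + (i - 0.5) * width, vals, width, label=name)   # offset bars for grouped layout

ax.set_xticks(x)
ax.set_xticklabels(metrics, rotation=15)
ax.set_ylabel("Judge score")
ax.set_ylim(0, 5)
ax.legend(frameon=False)
ax.spines[["top", "right"]].set_visible(False)               # cleaner, publication-style axes
fig.tight_layout()
fig.savefig("comparison.png", dpi=300)
```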
6. Limitations and Prospects
Identified limitations include:
- Raster versus vector output: PaperBanana’s bitmap illustrations are less amenable to post-generation editing. Potential advances involve deploying autonomous agents that operate vector-graphics tools.
- Style rigidity versus creative diversity: The reliance on a fixed, auto-inferred style guide can diminish variety. Parameterizable or user-driven style selection is a target for further work.
- Fine-grained faithfulness: Certain details, such as precise arrow endpoints, remain challenging. Enhanced VLM perception or graph-based validation is proposed.
- Evaluation paradigms: "VLM-as-Judge" is susceptible to subjective bias. Structure-based metrics and reward learning are suggested alternatives.
- Test-time preference adaptation: Present design generates a single illustration; future iterations may incorporate candidate generation with preference-ranking.
- Applicability beyond academic illustrations: Potential applications extend to patent schematics, UI/UX mockups, and industrial diagrams.
PaperBanana offers a unified, modular, and empirically validated approach for automating academic illustration, setting a foundation for agentic scientific workflows (Zhu et al., 30 Jan 2026).