Problem Vision Board Framework
- The Problem Vision Board is an interactive framework that uses large language and vision models alongside computational creativity to address poorly defined problems.
- It integrates exploratory, combinational, and transformational creativity methodologies to generate and refine visual and textual artifacts.
- Empirical results show improvements in creative substitution tasks and multi-step reasoning through user-guided iterative refinement and latent space exploration.
A Problem Vision Board is an interactive ideation and problem-solving framework that leverages large language and vision models (LLVMs) and principles from computational creativity (CC) to generate, evaluate, and refine visual and textual artifacts for open-ended or poorly defined problems. Its design integrates model-driven exploration of solution spaces, combinatorial and transformational approaches to concept blending, and human-in-the-loop curation to actively create novel, context-aware solutions rather than retrieve existing ones (Nair et al., 2 May 2024, Akter et al., 16 Jan 2024).
1. Foundations: Computational Creativity with LLVMs
The theoretical basis for Problem Vision Boards utilizes Margaret Boden’s taxonomy of creativity within the context of LLVMs:
- Exploratory Creativity: Navigation of a model’s continuous embedding space (not merely its output space) to reveal novel or latent solution regions. Output-level search techniques (e.g., beam search, tree-of-thoughts) do not, by themselves, achieve this kind of embedding-level creativity.
- Combinational Creativity: Cross-attending and blending within embedding manifolds to generate functionally novel concepts (e.g., merging “coin” and “pliers” to obtain a makeshift screwdriver), going beyond simple aesthetic style fusion.
- Transformational Creativity: Dynamic restructuring of the conceptual or affordance space of the problem, either via fine-tuning or prompt-based re-representation. For instance, reframing a missing-tool scenario by specifying affordances (via OROC-style prompts) enables creative substitutions.
This CC-compositional approach underpins the generative and evaluative workflow of a Problem Vision Board and is intended to transcend the limitations of standard LLVM capabilities in creative problem solving (Nair et al., 2 May 2024).
2. Model Architectures and Enabling Mechanisms
Problem Vision Boards are realized by orchestrating off-the-shelf components within a CC-aware system architecture:
- Vision–Language Backbones: Multimodal transformers with joint embedding spaces such as CLIP, Flamingo, or ViLT. For example, CLIP’s contrastive training objective (shown here in its image-to-text direction) is

  $$\mathcal{L}_{\text{img}\rightarrow\text{txt}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(\mathrm{sim}(f(x_i), g(t_i))/\tau\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{sim}(f(x_i), g(t_j))/\tau\big)},$$

  where $f$ and $g$ are the vision and text encoders, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity in the joint embedding space, and $\tau$ is a learned temperature; the full loss symmetrizes over the text-to-image direction.
- Cross-Attention Modules (enabling combinational creativity): Given embedding sequences $X$ and $Y$, attention is applied via

  $$\mathrm{Attn}(X, Y) = \mathrm{softmax}\!\left(\frac{(XW_Q)(YW_K)^{\top}}{\sqrt{d_k}}\right)(YW_V),$$

  facilitating in-model concept blending between the two inputs; a minimal code sketch of this blending step follows the list below.
- Prompt-Based Re-Representation (enabling transformational creativity): Problems are re-encoded in terms of affordances and functionality. For example: “Hammers must be heavy and have a handle attached to a cylinder. Can this object be used as a hammer?”
- Latent Search Modules: Ideally, learnable or gradient-guided policies for sampling and exploring the joint embedding manifold, going beyond output-level search.
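As a concrete illustration of the cross-attention blending bullet above, the following is a minimal single-head sketch in PyTorch; the class name `CrossAttentionBlend`, the embedding dimension of 512, and the random dummy tensors standing in for “coin” and “pliers” token embeddings are illustrative assumptions, not components of any cited system.

```python
import torch
import torch.nn.functional as F
from torch import nn

class CrossAttentionBlend(nn.Module):
    """Blend a query concept with a context concept via single-head cross-attention.

    Hypothetical sketch: tokens of X attend over tokens of Y, so the output mixes
    features of both embedding sequences (e.g., "coin" attending over "pliers").
    """

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(x), self.w_k(y), self.w_v(y)           # X as queries, Y as keys/values
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                            # blended output, same shape as x

# Usage with dummy concept embeddings (stand-ins for "coin" and "pliers" token embeddings)
coin = torch.randn(1, 4, 512)     # (batch, tokens, dim)
pliers = torch.randn(1, 6, 512)
blended = CrossAttentionBlend(512)(coin, pliers)
print(blended.shape)              # torch.Size([1, 4, 512])
```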
This framework supports modules for transforming, exploring, combining, visualizing, and refining solution proposals, as detailed below (Nair et al., 2 May 2024, Akter et al., 16 Jan 2024).
3. Operational Workflow and Pipeline Architecture
The canonical Problem Vision Board is structured as a sequence of interactive modules:
- Problem Specification: User introduces an under-defined problem (e.g., “I have no ladle to serve soup”).
- Transform Module: An LLM re-represents the problem in affordance-centric language, producing multiple prompt variants.
- Explore Module: A latent explorer samples the joint embedding space, retrieving nearest-neighbor object embeddings and optionally employing planning heuristics (e.g., tree-of-thoughts).
- Combine Module: Cross-attention heads blend candidate object embeddings to synthesize hybrid props or design elements.
- Visualize & Refine: A grid of candidate visualizations (images, sketches, or “concept cards”) with textual affordance rationales is presented to the user, who iteratively “rejects” or “refines” candidates, narrowing the latent search region.
- Iteration: Steps 2–5 are repeated, concentrating the search until a feasible, novel solution emerges.
Throughout this pipeline, all three pillars of computational creativity are instantiated: transformational (problem reframing), exploratory (embedding space search), and combinational (embedding fusion) (Nair et al., 2 May 2024, Akter et al., 16 Jan 2024).
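The loop above can be summarized in a schematic sketch; every callable (`transform`, `explore`, `combine`, `visualize`, `get_user_feedback`) is a hypothetical placeholder for the corresponding module, not an existing API.

```python
from typing import Callable, Dict, List, Optional

def vision_board_loop(
    problem: str,
    transform: Callable[[str], List[str]],            # LLM affordance re-representation -> prompt variants
    explore: Callable[[List[str]], List[object]],      # latent / nearest-neighbor search -> candidate embeddings
    combine: Callable[[List[object]], List[object]],   # cross-attention blending -> hybrid candidates
    visualize: Callable[[List[object]], List[str]],    # render candidates to images / concept cards
    get_user_feedback: Callable[[List[str]], Dict],    # user rejects, refines, or accepts candidates
    max_rounds: int = 5,
) -> Optional[str]:
    """Schematic transform-explore-combine-visualize-refine loop; all callables are placeholders."""
    prompts = transform(problem)                        # Transform: affordance-centric prompt variants
    for _ in range(max_rounds):
        candidates = combine(explore(prompts))          # Explore the latent space, then blend candidates
        cards = visualize(candidates)                   # Visualize: grid of concept cards
        feedback = get_user_feedback(cards)             # Refine: user-in-the-loop selection
        if feedback.get("accepted") is not None:
            return feedback["accepted"]                 # a feasible, novel solution emerged
        # Otherwise, narrow the search by re-transforming the (possibly refined) problem statement
        prompts = transform(feedback.get("refined_problem", problem))
    return None
```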
4. Empirical Evidence and Benchmarking
Preliminary experiments establishing the potential and limitations of Problem Vision Board elements include:
- Affordance-Prompted Tool Substitution: On makeshift-tool tasks, CLIP and ViLT were evaluated on sets containing either the genuine tool or a plausible creative substitute. Classification accuracy in the “replacement + regular prompt” setting was near random (25–35%). Introducing functional affordance information (“concave and hollow” for scoop) improved overall accuracy by 10–20 percentage points, with, e.g., CLIP-B-32 performance on “bowl as scoop” rising from 0.28 to 0.42. Task prompts or combined prompts had inconsistent effects, indicating the critical role of appropriately grounded affordance information (Nair et al., 2 May 2024). A minimal CLIP-based sketch of this prompt comparison appears after this list.
- Self-Imagination for Mathematical Reasoning: The Self-Imagine method operationalizes a Problem Vision Board pipeline using a single VLM. The process consists of (a) few-shot prompting for HTML representation, (b) HTML generation and rendering to image, (c) feeding both text and image back to the VLM to solve the problem. This approach yielded systematic performance gains on multiple math benchmarks (e.g., GSM8K: +3.1%), demonstrating the effectiveness of structured visual augmentation for complex reasoning (Akter et al., 16 Jan 2024).
- Spatial Planning Failures in VLMs: The VSP benchmark quantifies model deficits in spatial planning, perception, and reasoning. Even state-of-the-art models (e.g., GPT-4o) perform at only 68% accuracy on 3×3 mazes (dropping to 23% at 8×8) and below 10% on multi-step tasks for open-source VLMs. Perception errors (e.g., misclassifying empty map regions) and reasoning bottlenecks (e.g., inability to simulate dynamic state over plan steps) remain dominant limitations (Wu et al., 2 Jul 2024).
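As referenced in the tool-substitution entry above, the following is a minimal sketch of scoring an image against regular versus affordance-grounded prompts with an off-the-shelf CLIP-B/32 checkpoint via Hugging Face Transformers; the image path `bowl.jpg` and the exact prompt wordings are illustrative assumptions, not the evaluation protocol of the cited study.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # CLIP-B/32 checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("bowl.jpg")  # hypothetical photo of a bowl standing in for the missing scoop

# "Regular" object prompts vs. affordance-grounded prompts (wording is illustrative).
regular_prompts = ["a photo of a scoop", "a photo of a hammer", "a photo of a ladder", "a photo of a rope"]
affordance_prompts = [
    "an object that is concave and hollow, so it can scoop",
    "an object that is heavy with a handle, so it can hammer",
    "an object with rungs that can be climbed",
    "an object that is long and flexible, so it can tie things together",
]

with torch.no_grad():
    for name, prompts in (("regular", regular_prompts), ("affordance", affordance_prompts)):
        inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
        probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-text similarities -> probabilities
        print(name, probs.squeeze().tolist())  # affordance prompts should raise the scoop-class probability
```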
5. Implementation Strategies and Example Artifacts
Technical realization of a Problem Vision Board can follow a Self-Imagine-style pipeline (a minimal code sketch follows this list):
- Few-shot prompt construction with diverse HTML renderings of problems.
- VLM-based HTML generation (using a dummy input image).
- Conversion of HTML to PNG/SVG using headless browsers or HTML-to-image libraries (e.g., imgkit, wkhtmltoimage).
- VLM answer generation using both the original question and the rendered diagram/image.
- Example prompts and artifacts include sequential “box and arrow” financial calculations, tabular aggregations in reading comprehension, and custom SVGs for geometric or physics problems.
- Consistency in HTML/CSS templates and explicit tagging of unknowns (“?”) aid in correct model interpretation (Akter et al., 16 Jan 2024).
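The rendering and answering steps above can be sketched as follows; `imgkit.from_string` is the real imgkit call wrapping wkhtmltoimage, while `vlm_generate` is a hypothetical stand-in for whichever VLM interface is available, and the prompt texts are illustrative rather than the paper’s exact few-shot prompts.

```python
import imgkit  # pip install imgkit; requires the wkhtmltoimage binary on PATH

def render_html_to_png(html: str, out_path: str = "diagram.png") -> str:
    """Render VLM-generated HTML to a PNG via wkhtmltoimage (wrapped by imgkit)."""
    imgkit.from_string(html, out_path, options={"format": "png", "quiet": ""})
    return out_path

def self_imagine_style_answer(question: str, vlm_generate) -> str:
    """Schematic Self-Imagine-style flow.

    `vlm_generate(prompt, image_path=None)` is a hypothetical placeholder for the
    available VLM interface; few-shot HTML examples are omitted for brevity.
    """
    # (a) Ask the VLM for a structured HTML rendering of the problem, tagging unknowns with "?".
    html = vlm_generate(
        "Represent the following problem as a self-contained HTML page; "
        "mark every unknown quantity with '?'.\n" + question
    )
    # (b) Render the generated HTML to an image.
    image_path = render_html_to_png(html)
    # (c) Answer using both the original question text and the rendered diagram.
    return vlm_generate("Use the attached diagram to answer:\n" + question, image_path=image_path)
```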
The following table summarizes principal modules and their representative mechanisms:
| Module | Core Technique | Example Mechanism |
|---|---|---|
| Transform | Prompt-based re-representation | OROC-style affordance prompts |
| Explore | Embedding space sampling | Latent search, tree-of-thoughts |
| Combine | In-model attention blending | Cross-attention, “blend” heads |
| Visualize | HTML/SVG rendering to images | imgkit, headless browser |
| Refine | User-in-the-loop grid selection | Iterative re-weighting |
6. Limitations, Open Challenges, and Future Directions
Significant challenges and research avenues remain:
- Scale and Generalizability: Most current experiments are single-object, human-centric, or toy-scale. Impactful deployment requires large, agent-diverse datasets and robust prompt engineering or learned re-representation modules (Nair et al., 2 May 2024).
- Embedding-Space Search: Existing methods inadequately sample the full conceptual manifold. Gradient-guided or learnable trajectory policies in embedding space are required for nontrivial exploratory creativity.
- Combinational Blending: Reliable in-model blending (e.g., learned “blend” heads) at the embedding level for functional rather than cosmetic fusion is an unresolved technical problem.
- Perception–Reasoning Bottlenecks: Current VLMs exhibit significant perceptual and memory deficits, leading to compounding errors over multi-step reasoning tasks, as revealed by the VSP benchmark (Wu et al., 2 Jul 2024).
- Agent-Aware Affordances: Affordance-based approaches must account for varying agent capabilities (e.g., robotic vs. human hand morphology) to generalize beyond anthropocentric assumptions.
- Standardized Benchmarks: There is a need for vision–language benchmarks that explicitly target creative problem representation, solution novelty, and affordance reasoning, going beyond datasets such as MacGyver and Only-Connect-Wall (Nair et al., 2 May 2024).
This suggests that the Problem Vision Board paradigm will serve not only as a practical interface for hard creative problem solving, but also as a foundational research scaffold for investigating and advancing the integration of multi-modal perception, reasoning, and creativity in artificial agents.