Transductive Visual Programming (TVP)
- Transductive Visual Programming (TVP) is a novel approach that evolves a visual toolbox by abstracting recurring computational motifs from solved 3D spatial problems.
- TVP alternates between problem-solving and tool discovery phases, using an example library and clustering algorithms to generate and validate specialized tools.
- Empirical evaluations show that TVP improves accuracy and reduces program complexity, outperforming static toolsets on benchmark spatial reasoning tasks.
Transductive Visual Programming (TVP) is a self-evolving agent architecture for spatial reasoning in 3D scenes, characterized by its experiential tool creation process: solving problems with a basic arsenal and abstracting new tools from successful solution patterns. Rather than relying on fixed toolsets or speculative induction, TVP grounds its tool library in validated, recurring motifs derived from actual solved programs. This paradigm yields compositional clarity and precision in geometric calculations that challenge monolithic vision-LLMs.
1. Formal Framework and Objectives
TVP is defined over three primary entities:
- 𝒟: a dataset of images Iᵢ paired with spatial questions qᵢ.
- ℰ: Example Library, storing solved programs.
- 𝒯: Tool Library, containing callable functions (initially basic vision and geometry tools).
The TVP procedure alternates between: (A) Problem solving: using the current 𝒯 to generate and execute programs for queries in 𝒟, storing validated outcomes in ℰ. (B) Tool discovery: mining ℰ for recurring program motifs, abstracting them as parameterized tools, and evolving 𝒯.
This process directly targets spatial reasoning tasks: size, distance, ratio, and positional relationships in 3D, rendered tractable through decompositional programming steps mapped to specialized functions. TVP's transductive approach is motivated by shortcomings of static toolsets (inadequate adaptability) and of speculative inductive tool creation (low empirical utility: >94% of inductively created tools go unused).
2. System Architecture and Algorithms
TVP operates a closed-loop “program → experience → tool → program” cycle, formalized in Algorithm 1:
```text
Input: Dataset 𝒟, tools 𝒯, empty ℰ, parameter thresholds, iterations T
for t in 1…T:
    for each (Iᵢ, qᵢ) in 𝒟:
        E_sim ← RetrieveSimilar(ℰ, qᵢ; τ_sim, k_max)
        Candidates ← ∅
        for m = 1…M:
            p ← LLM_prog.generate(qᵢ, E_sim, 𝒯)
            Candidates.add(p)
        Valid ← ExecuteAndFilter(Candidates, Iᵢ, 𝒯)
        (p*, score*) ← JudgeAndSelect(Valid; τ_q)
        if score* ≥ τ_q:
            ℰ.insert_or_replace(qᵢ, p*, score*)
        if |ℰ| mod n_a = 0:
            𝒯 ← AbstractTools(ℰ, 𝒯; τ_cluster, τ_potential)
        if |ℰ| mod n_d = 0:
            𝒯 ← MergeTools(𝒯; similarity=0.95)
Output: ℰ, 𝒯
```
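To illustrate the retrieval step in Algorithm 1, the following is a minimal sketch of RetrieveSimilar, assuming that questions are compared by cosine similarity of embedding vectors; the `embed` function and the (question, program) entry format are stand-ins, not the paper's exact implementation.

```python
import numpy as np
from typing import Callable, List, Tuple

def retrieve_similar(library: List[Tuple[str, str]],     # (question, program) entries in ℰ
                     query: str,
                     embed: Callable[[str], np.ndarray],  # stand-in text embedder
                     tau_sim: float = 0.7,
                     k_max: int = 4) -> List[Tuple[str, str]]:
    """Return up to k_max stored examples whose questions are at least
    tau_sim cosine-similar to the query; they serve as few-shot exemplars."""
    q = embed(query)
    q = q / (np.linalg.norm(q) + 1e-12)
    scored = []
    for question, program in library:
        e = embed(question)
        e = e / (np.linalg.norm(e) + 1e-12)
        sim = float(q @ e)
        if sim >= tau_sim:
            scored.append((sim, (question, program)))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [entry for _, entry in scored[:k_max]]
```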
Key components:
- Example Library ℰ entries: (question, program, score, tools_used, trace); a minimal sketch of these entry structures appears below.
- Tool Library 𝒯 entries: (id, signature, docstring, code, level, deprecated).
- Quality and clustering thresholds (τ_q, τ_sim, τ_cluster, τ_potential); abstraction and deduplication intervals (n_a, n_d).
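The sketch below shows one plausible way to represent the two entry types listed above; the field names follow the tuples given in the text, while the dataclass containers themselves are illustrative assumptions rather than the paper's data model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ExampleEntry:
    # One validated solution stored in the Example Library ℰ.
    question: str            # the spatial query q_i
    program: str             # source of the validated program p*
    score: float             # judge score, must satisfy score >= tau_q
    tools_used: List[str]    # ids of tools the program depends on
    trace: str               # execution trace recorded during validation

@dataclass
class ToolEntry:
    # One callable function stored in the Tool Library 𝒯.
    id: str
    signature: str           # e.g. "compute_3d_ratio(boxes, ref_box) -> float"
    docstring: str
    code: str                # Python source of the tool body
    level: int               # 0 for basic tools, >0 for abstracted tools
    deprecated: bool = False # set when superseded by a merged tool
```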
Tool Abstraction
Every n_a solved examples (a sketch of this clustering step appears below):
- Cluster the programs in ℰ by embedding similarity (threshold τ_cluster).
- For clusters of sufficient size, an LLM identifies the shared computational motif and rates its abstraction potential.
- If the potential meets or exceeds τ_potential, a new tool is generated.
Objective function: High-signal, low-complexity abstractions are prioritized.
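The following is a minimal sketch of the clustering step, under the assumption that programs are embedded with an off-the-shelf text embedder and grouped by a simple greedy cosine-similarity rule; the helper `embed` and the greedy scheme are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np
from typing import Callable, List

def cluster_programs(programs: List[str],
                     embed: Callable[[str], np.ndarray],
                     tau_cluster: float = 0.8,
                     min_cluster_size: int = 3) -> List[List[str]]:
    """Greedily cluster solved programs: a program joins the first cluster whose
    centroid is at least tau_cluster cosine-similar, otherwise it starts a new one."""
    clusters: List[List[str]] = []
    centroids: List[np.ndarray] = []
    for prog in programs:
        v = embed(prog)
        v = v / (np.linalg.norm(v) + 1e-12)
        best, best_sim = None, -1.0
        for idx, c in enumerate(centroids):
            sim = float(v @ c)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= tau_cluster:
            clusters[best].append(prog)
            # update the centroid as a renormalized running mean of members
            n = len(clusters[best])
            centroids[best] = (centroids[best] * (n - 1) + v) / n
            centroids[best] /= np.linalg.norm(centroids[best]) + 1e-12
        else:
            clusters.append([prog])
            centroids.append(v)
    # only sufficiently large clusters are handed to the LLM for motif mining
    return [c for c in clusters if len(c) >= min_cluster_size]
```

Clusters that survive the size filter are then passed to an LLM, which names the shared motif and scores its abstraction potential against τ_potential; that LLM call is not sketched here.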
Validation Protocol
Two-stage tool validation ensures correctness:
- Stage 1: All cluster examples are rewritten to use the candidate tool, requiring 100% execution success.
- Stage 2: The rewritten programs' outputs must match the original results within floating-point tolerance before the tool is accepted (a minimal sketch of this check follows below).
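Here is a minimal sketch of the two-stage check, assuming a hypothetical `rewrite_with_tool` helper that asks the LLM to re-express a stored program in terms of the candidate tool and a sandboxed `run_program` executor; both names, and the numeric-answer assumption, are illustrative.

```python
import math
from typing import Callable, List, Tuple

def validate_tool(cluster: List[Tuple[str, float]],          # (program_source, original_answer)
                  rewrite_with_tool: Callable[[str], str],   # hypothetical LLM rewrite step
                  run_program: Callable[[str], float],       # hypothetical sandboxed executor
                  rel_tol: float = 1e-6) -> bool:
    """Accept the candidate tool only if every rewritten program executes
    (Stage 1) and reproduces the original answer within tolerance (Stage 2)."""
    for source, original_answer in cluster:
        rewritten = rewrite_with_tool(source)
        try:
            new_answer = run_program(rewritten)   # Stage 1: must execute
        except Exception:
            return False
        if not math.isclose(new_answer, original_answer, rel_tol=rel_tol):
            return False                          # Stage 2: outputs must agree
    return True
```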
3. Illustrative Example: 3D Ratio Computation
Given a query: “How many cabinet-height units equal the combined height of the sofa and the TV?” the TVP process unfolds in two phases.
Phase I – Basic tools:
- loc(object) → bounding box
- depth(box) → estimated depth
- get_2d_object_size(box) → (w,h)
- Arithmetic
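For readability of the programs below, plausible signatures for these basic tools might look like the stubs in this sketch; the `BoundingBox` fields and return types are assumptions, and the bodies (which wrap the underlying vision and depth models) are omitted.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class BoundingBox:
    # 2D box in image coordinates; the exact representation is an assumption
    x0: float
    y0: float
    x1: float
    y1: float

def loc(object_name: str) -> BoundingBox:
    """Locate the named object and return its bounding box (vision model call)."""
    ...

def depth(box: BoundingBox) -> float:
    """Estimate the depth of the region covered by box (depth model call)."""
    ...

def get_2d_object_size(box: BoundingBox) -> Tuple[float, float]:
    """Return the (width, height) of box in image space."""
    ...
```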
Initial solution:
```python
sofa_bb = loc("sofa")
tv_bb = loc("tv")
cab_bb = loc("cabinet")

sofa_depth = depth(sofa_bb)
tv_depth = depth(tv_bb)
cab_depth = depth(cab_bb)

h_sofa = get_2d_object_size(sofa_bb)[1] * sofa_depth
h_tv = get_2d_object_size(tv_bb)[1] * tv_depth
h_cab = get_2d_object_size(cab_bb)[1] * cab_depth

combined = h_sofa + h_tv
answer = combined / h_cab
print(int(answer))  # e.g. 3
```
Phase II – Tool abstraction via clustering:
If similar patterns (summing 3D object heights and dividing by a reference object's 3D height) recur in ℰ, clustering triggers a new parameterized tool:
```python
def compute_3d_ratio(boxes: List[BoundingBox], ref_box: BoundingBox) -> float:
    """Compute (sum of 3D heights of boxes) ÷ (3D height of ref_box)."""
    return (
        sum(get_2d_object_size(b)[1] * depth(b) for b in boxes)
        / (get_2d_object_size(ref_box)[1] * depth(ref_box))
    )
```
```python
answer = compute_3d_ratio([sofa_bb, tv_bb], cab_bb)
print(int(answer))
```
4. Empirical Results and Evaluation
TVP is benchmarked on Omni3D-Bench (501 spatial queries) and SpatialScore-Hard (256 samples):
Omni3D-Bench Performance
| Method | Yes/No | MCQ | Count | MRA | ±10% | Overall |
|---|---|---|---|---|---|---|
| GPT-4o | 65.3 | 60.5 | 18.6 | 26.7 | 8.2 | 27.2 |
| VADAR | 56.0 | 57.6 | 21.7 | 35.5 | 15.9 | 29.9 |
| TVP | 60.0 | 61.6 | 24.3 | 36.5 | 19.3 | 33.3 |
TVP demonstrates a 22% relative improvement over GPT-4o (33.3 vs. 27.2 overall) and an 11% relative improvement over VADAR, the previous best visual programming system (33.3 vs. 29.9).
Few-Shot and Scaling
Restricting TVP to its basic tools plus Example Library few-shot prompting yields 31.7% accuracy, versus 33.3% for full TVP. Swapping in an open-source backbone (Qwen2.5-32B) achieves 30.7% in a single iteration, close to the GPT-4o-backed result and surpassing VADAR.
Zero-Shot Generalization
On SpatialScore-Hard (no test-set tuning), TVP achieves:
| Method | 3DSR-B | SpatialSense | VG-B | Overall |
|---|---|---|---|---|
| GPT-4o | 52.1 | 46.5 | 20.3 | 42.6 |
| VADAR | 24.8 | 40.8 | 39.1 | 32.8 |
| TVP (ZS) | 52.9 | 59.2 | 43.8 | 52.3 |
TVP leads overall, especially on 3D positional relations and depth estimation.
5. Tool Utilization, Improvement, and Generalization
Empirically, transductively learned tools in 𝒯 are invoked as program dependencies 5× more often than tools produced by speculative induction, reflecting their grounded utility within actual solution contexts rather than hypothetical patterns.
TVP exhibits continuous improvement over three Omni3D-Bench iterations:
- Median cyclomatic complexity (CCN) drops from 3.0 to 1.0.
- Accuracy on programs using learned tools increases by +3.4 pp.
- Learned-tool performance improves +38% relative across iterations.
Merged and validated tools encapsulate robust 3D geometric abstractions (ratios, nearest-object, dimension matching), which generalize effectively across datasets without retraining.
6. Limitations and Prospects
Key limitations include runtime overhead from repeated LLM calls, most of which is amortized during library construction, and sensitivity to the judgment quality of the correctness-judging LLM, with potential drift if misjudgments accumulate. Proposed future directions include on-device tool caching, human-in-the-loop curation, and extension to dynamic environments (e.g., video).
A plausible implication is that TVP’s experience-driven tool evolution paradigm offers long-term scalability in domains where problem substructure is discoverable but not known a priori, bridging the gap between brittle static toolsets and low-utility speculative induction. This mechanism supports the development of self-evolving agents for increasingly complex spatial reasoning tasks (Wu et al., 24 Dec 2025).