
Transductive Visual Programming (TVP)

Updated 31 December 2025
  • Transductive Visual Programming (TVP) is a novel approach that evolves a visual toolbox by abstracting recurring computational motifs from solved 3D spatial problems.
  • TVP alternates between problem-solving and tool discovery phases, using an example library and clustering algorithms to generate and validate specialized tools.
  • Empirical evaluations show that TVP improves accuracy and reduces complexity, outperforming static toolsets in benchmark spatial reasoning tasks.

Transductive Visual Programming (TVP) is a self-evolving agent architecture for spatial reasoning in 3D scenes, characterized by its experiential tool creation process: solving problems with a basic arsenal and abstracting new tools from successful solution patterns. Rather than relying on fixed toolsets or speculative induction, TVP grounds its tool library in validated, recurring motifs derived from actual solved programs. This paradigm yields compositional clarity and precision in geometric calculations that challenge monolithic vision-LLMs.

1. Formal Framework and Objectives

TVP is defined over three primary entities:

  • 𝒟 = {(Iᵢ, qᵢ)}, i = 1…N: a dataset of images Iᵢ paired with spatial questions qᵢ.
  • ℰ: the Example Library, storing solved programs.
  • 𝒯: the Tool Library, containing callable functions (initially basic vision and geometry tools).

The TVP procedure alternates between: (A) Problem solving: using the current 𝒯 to generate and execute programs for queries in 𝒟, storing validated outcomes in ℰ. (B) Tool discovery: mining ℰ for recurring program motifs, abstracting them as parameterized tools, and evolving 𝒯.

This process directly targets spatial reasoning tasks: size, distance, ratio, and positional relationships in 3D, rendered tractable through decompositional programming steps mapped to specialized functions. TVP's transductive approach is motivated by shortcomings of static toolsets (inadequate adaptability) and of speculative inductive tool creation (low empirical utility: >94% of inductively created tools go unused).

2. System Architecture and Algorithms

TVP operates a closed-loop “program → experience → tool → program” cycle, formalized in Algorithm 1:

Input: dataset 𝒟, tools 𝒯, empty ℰ, parameter thresholds, iterations T
for t in 1…T:
  for each (Iᵢ, qᵢ) in 𝒟:
    E_sim ← RetrieveSimilar(ℰ, qᵢ; τ_sim, k_max)
    Candidates ← ∅
    for m in 1…M:
      p ← LLM_prog.generate(qᵢ, E_sim, 𝒯)
      Candidates.add(p)
    Valid ← ExecuteAndFilter(Candidates, Iᵢ, 𝒯)
    (p*, score*) ← JudgeAndSelect(Valid; τ_q)
    if score* ≥ τ_q:
      ℰ.insert_or_replace(qᵢ, p*, score*)
    if |ℰ| mod n_a = 0:
      𝒯 ← AbstractTools(ℰ, 𝒯; τ_cluster, τ_potential)
    if |ℰ| mod n_d = 0:
      𝒯 ← MergeTools(𝒯; similarity=0.95)
Output: ℰ, 𝒯
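
The RetrieveSimilar step in Algorithm 1 can be sketched as a cosine-similarity lookup over question embeddings. This is a minimal illustration, assuming each stored entry carries a precomputed embedding; the function names and entry layout are placeholders, not the paper's API:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_similar(library, query_vec, tau_sim=0.7, k_max=4):
    """Return up to k_max stored examples whose question embedding is
    at least tau_sim-similar to the query embedding, most similar first."""
    scored = [(cosine(entry["embedding"], query_vec), entry) for entry in library]
    scored = [(s, e) for s, e in scored if s >= tau_sim]
    scored.sort(key=lambda se: se[0], reverse=True)
    return [e for _, e in scored[:k_max]]
```

The retrieved entries are then injected into the program-generation prompt as few-shot exemplars.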

Key components:

  • Example Library entries: (question, program, score, tools_used, trace).
  • Tool Library entries: (id, signature, docstring, code, level, deprecated).
  • Quality and clustering thresholds (τ_q, τ_sim, τ_cluster, τ_potential); abstraction and deduplication intervals (n_a, n_d).
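
The two record types above map naturally onto dataclasses. A minimal sketch: the field names follow the bullets, while the Python types are assumptions:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExampleEntry:
    """One validated solution stored in the Example Library ℰ."""
    question: str
    program: str            # executable program text
    score: float            # judge quality score (kept only if >= tau_q)
    tools_used: List[str] = field(default_factory=list)
    trace: str = ""         # execution trace, mined later for motifs

@dataclass
class ToolEntry:
    """One callable function stored in the Tool Library 𝒯."""
    id: str
    signature: str          # e.g. "compute_3d_ratio(boxes, ref_box) -> float"
    docstring: str
    code: str
    level: int = 0          # 0 = basic tool; higher = abstracted from examples
    deprecated: bool = False  # set when MergeTools folds this into another tool
```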

Tool Abstraction

Every n_a solved examples:

  1. Cluster programs in ℰ by embedding similarity (τ_sim).
  2. For clusters G of sufficient size (≥ τ_cluster), an LLM identifies computational motifs and rates abstraction potential.
  3. If potential ≥ τ_potential, a new tool t is generated.
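
Steps 1–2 can be approximated with a simple greedy pass over program embeddings, keeping only clusters large enough to justify abstraction. The similarity function is left abstract and the thresholds are illustrative:

```python
def greedy_clusters(embeddings, similarity, tau_sim=0.85, tau_cluster=3):
    """Group programs whose embeddings are similar to a cluster's seed;
    return only clusters of size >= tau_cluster, as candidates for abstraction."""
    clusters = []
    for idx, vec in enumerate(embeddings):
        for cluster in clusters:
            seed = embeddings[cluster[0]]
            if similarity(seed, vec) >= tau_sim:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])  # no similar cluster found: start a new one
    return [c for c in clusters if len(c) >= tau_cluster]
```

Each surviving cluster is then handed to the LLM for motif identification and potential scoring.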

Objective function: ℒ_pattern = Σ_{i,j ∈ G} 1{pᵢ ∼ pⱼ} − λ · Complexity(t). High-signal, low-complexity abstractions are prioritized.

Validation Protocol

Two-stage tool validation ensures correctness:

  • Stage 1: All cluster examples are rewritten using tt, requiring 100% execution success.
  • Stage 2: Outputs must match the original results within floating-point tolerance; the tool is accepted if ≥ 85% of examples match.
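
A minimal sketch of this two-stage check, assuming a `run_program` helper that executes a rewritten program and returns a (success, output) pair; all names here are illustrative:

```python
def validate_tool(rewritten, originals, run_program, tol=1e-6, pass_rate=0.85):
    """Stage 1: every rewritten program must execute successfully.
    Stage 2: at least pass_rate of the outputs must match the original
    results within floating-point tolerance tol."""
    outputs = []
    for prog in rewritten:
        ok, out = run_program(prog)
        if not ok:
            return False  # Stage 1: any execution failure rejects the tool
        outputs.append(out)
    matches = sum(abs(o - ref) <= tol for o, ref in zip(outputs, originals))
    return matches / len(originals) >= pass_rate
```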

3. Illustrative Example: 3D Ratio Computation

Given a query: “How many cabinet-height units equal the combined height of the sofa and the TV?” the TVP process unfolds in two phases.

Phase I – Basic tools:

  • loc(object) → bounding box
  • depth(box) → estimated depth
  • get_2d_object_size(box) → (w,h)
  • Arithmetic

Initial solution:

sofa_bb = loc("sofa")
tv_bb   = loc("tv")
cab_bb  = loc("cabinet")

sofa_depth = depth(sofa_bb)
tv_depth   = depth(tv_bb)
cab_depth  = depth(cab_bb)

h_sofa = get_2d_object_size(sofa_bb)[1] * sofa_depth
h_tv   = get_2d_object_size(tv_bb)[1]   * tv_depth
h_cab  = get_2d_object_size(cab_bb)[1]  * cab_depth

combined = h_sofa + h_tv
answer   = combined / h_cab
print(int(answer)) # e.g. 3
The solution is stored in ℰ.

Phase II – Tool abstraction via clustering:

If the pattern "(h₁ + h₂)/h₃" recurs in ℰ, clustering triggers a new parameterized tool:

from typing import List

def compute_3d_ratio(boxes: List[BoundingBox], ref_box: BoundingBox) -> float:
    """Compute (sum of 3D heights of boxes) ÷ (3D height of ref_box)."""
    return (
        sum(get_2d_object_size(b)[1] * depth(b) for b in boxes)
        / (get_2d_object_size(ref_box)[1] * depth(ref_box))
    )
Clustered examples are rewritten:
answer = compute_3d_ratio([sofa_bb, tv_bb], cab_bb)
print(int(answer))

4. Empirical Results and Evaluation

TVP is benchmarked on Omni3D-Bench (501 spatial queries) and SpatialScore-Hard (256 samples):

Omni3D-Bench Performance

| Method | Yes/No | MCQ  | Count | MRA  | ±10% | Overall |
|--------|--------|------|-------|------|------|---------|
| GPT-4o | 65.3   | 60.5 | 18.6  | 26.7 | 8.2  | 27.2    |
| VADAR  | 56.0   | 57.6 | 21.7  | 35.5 | 15.9 | 29.9    |
| TVP    | 60.0   | 61.6 | 24.3  | 36.5 | 19.3 | 33.3    |

TVP demonstrates a 22% relative improvement over GPT-4o and an 11% relative improvement over VADAR, the previous best visual programming system.
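
These relative gains follow directly from the Overall column; a quick arithmetic check:

```python
overall = {"GPT-4o": 27.2, "VADAR": 29.9, "TVP": 33.3}

def rel_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

print(f"vs GPT-4o: {rel_gain(overall['TVP'], overall['GPT-4o']):.1f}%")  # ≈ 22.4%
print(f"vs VADAR:  {rel_gain(overall['TVP'], overall['VADAR']):.1f}%")   # ≈ 11.4%
```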

Few-Shot and Scaling

Restricting TVP to its basic tools plus Example-Library few-shot prompting yields 31.7% accuracy, versus 33.3% for full TVP. Scaling backbone models (e.g., open-source Qwen2.5 32B) achieves 30.7% in a single iteration, close to GPT-4o and surpassing VADAR.

Zero-Shot Generalization

On SpatialScore-Hard (no test-set tuning), TVP achieves:

| Method   | 3DSR-B | SpatialSense | VG-B | Overall |
|----------|--------|--------------|------|---------|
| GPT-4o   | 52.1   | 46.5         | 20.3 | 42.6    |
| VADAR    | 24.8   | 40.8         | 39.1 | 32.8    |
| TVP (ZS) | 52.9   | 59.2         | 43.8 | 52.3    |

TVP leads overall, especially on 3D positional relations and depth estimation.

5. Tool Utilization, Improvement, and Generalization

Empirically, transductively learned tools in 𝒯 appear 5× more frequently as program dependencies than tools produced by speculative induction. This reflects their grounded utility within actual solution contexts rather than hypothetical patterns.

TVP exhibits continuous improvement over three Omni3D-Bench iterations:

  • Median cyclomatic complexity (CCN) drops from 3.0 to 1.0.
  • Accuracy on programs using learned tools increases by +3.4 pp.
  • Learned-tool performance improves +38% relative across iterations.

Merged and validated tools encapsulate robust 3D geometric abstractions (ratios, nearest-object, dimension matching), which generalize effectively across datasets without retraining.

6. Limitations and Prospects

Key limitations include runtime overhead from repeated LLM calls, most of which is amortized during library construction, and sensitivity to the judgment quality of the correctness-judging LLM, with potential drift if mis-judgments accumulate. Proposed future directions include on-device tool caching, human-in-the-loop curation, and extension to dynamic environments (e.g., video).

A plausible implication is that TVP’s experience-driven tool evolution paradigm offers long-term scalability in domains where problem substructure is discoverable but not known a priori, bridging the gap between brittle static toolsets and low-utility speculative induction. This mechanism supports the development of self-evolving agents for increasingly complex spatial reasoning tasks (Wu et al., 24 Dec 2025).

References (1)
