Text-Based Automatic Differentiation
- The paper introduces text-based automatic differentiation, which replaces numeric gradients with structured natural language feedback to optimize complex, black-box AI components.
- It models heterogeneous AI systems as directed acyclic graphs where paired forward and backward operators use LLM-generated critiques to guide system updates.
- Empirical results demonstrate significant improvements in tasks such as code optimization, multiple-choice question answering, and molecule design compared to baselines such as Reflexion, DSPy, and zero-shot prompting.
Automatic Differentiation via Text extends the foundational principles of automatic differentiation (AD) to the domain of complex, heterogeneous AI systems composed of LLMs, simulators, and other non-differentiable modules. Classical AD systematically propagates numerical derivatives through computational graphs by mechanically applying the chain rule at the operator level, enabling machine-precision computation of gradients, Jacobians, and higher-order derivatives with modest computational overhead. In contrast, "differentiation via text" employs LLM-generated natural language feedback as “textual gradients”—structured critiques or suggestions—enabling optimization when direct numerical differentiation is impossible. This approach generalizes the spirit of backpropagation, situating it within orchestrated AI systems where components are black-box or non-differentiable, and surfaces a new paradigm for end-to-end system improvement and tuning via the propagation and aggregation of actionable text-based guidance (Yuksekgonul et al., 2024).
1. Foundations: From Numeric to Textual Differentiation
Classical automatic differentiation computes derivatives by augmenting the original computation with systematic derivative propagation, commonly via forward- or reverse-mode algorithms. A central insight is the equivalence of a suite of mathematical formulations—including matrix–vector propagation, lifting to dual numbers, pushforward operators, and truncated Taylor expansions—all producing identical computational traces for the required derivatives (Hoffmann, 2014). The operator-level chain rule ensures that every function in a computation graph can be differentiated by accumulating local derivative information.
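To make the operator-level chain rule concrete, the following minimal Python sketch implements forward-mode AD with dual numbers, one of the equivalent formulations noted above. The `Dual` class and `sin` wrapper are illustrative names for this sketch, not drawn from any particular library.

```python
from dataclasses import dataclass
import math

@dataclass
class Dual:
    """Dual number: val carries the primal value, dot carries the derivative."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule applied locally at the operator level.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(x: Dual) -> Dual:
    # Chain rule for an elementary function: (sin x)' = cos(x) * x'.
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# Differentiate f(x) = x * sin(x) + x at x = 2 by seeding dx/dx = 1.
x = Dual(2.0, 1.0)
f = x * sin(x) + x
print(f.val, f.dot)  # f(2) and f'(2) = sin(2) + 2*cos(2) + 1
```

Every elementary operation carries its own local derivative rule, and the final derivative falls out of the accumulated trace; this is exactly the mechanism that text-based differentiation re-implements with critiques instead of numbers.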
In heterogeneous, compound AI systems, this mechanism cannot operate as-is, since crucial nodes—such as LLM prompt executions, code interpreters, or molecular docking simulators—are neither differentiable nor admit a tractable chain-rule decomposition. Text-based automatic differentiation replaces numeric gradients with structured textual feedback, leveraging LLMs to emulate the role of derivative propagation by supplying component-level improvement suggestions, which then govern subsequent updates (Yuksekgonul et al., 2024). This approach maintains the global optimization loop but reformulates the intermediate representation of gradients.
2. Architecture of Text-Based Differentiation Frameworks
Text-based AD represents the composed AI system as a directed acyclic graph (DAG), where nodes encapsulate unstructured textual variables (e.g., prompts, code, SMILES strings, or hyperparameter specifications) and edges correspond to diverse forward operators, including LLM invocations, simulators, or explicit computation. Each node is annotated with metadata indicating whether it is optimizable (requires_grad), its role within the system, and a log of accumulated textual gradients. Each forward operator is paired with a backward operator, implemented as an LLM prompt capable of digesting the node's text, its role, the downstream feedback it has received (acting as a pseudo-gradient), and the global objective function.
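The node annotations described above can be captured in a small data structure. The sketch below is purely illustrative; names such as `TextVariable` and `ForwardOp` are ours and do not reproduce the TextGrad implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TextVariable:
    """A node in the DAG: unstructured text plus optimization metadata."""
    value: str                          # prompt, code, SMILES string, ...
    role: str                           # e.g., "system prompt for the solver LLM"
    requires_grad: bool = False         # is this node optimizable?
    gradients: List[str] = field(default_factory=list)   # accumulated textual feedback
    predecessors: List["TextVariable"] = field(default_factory=list)

@dataclass
class ForwardOp:
    """An edge of the DAG: an LLM invocation, a simulator call, or explicit computation."""
    fn: Callable[..., str]              # maps predecessor texts to an output text
    name: str = "forward"
```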
The architectural correspondence to numerical AD frameworks is explicit: computational graphs are topologically sorted, variables are forward-executed to compute outputs and losses, and backward passes gather and propagate text-based gradients from outputs to inputs, guiding updates through natural-language optimization steps (Yuksekgonul et al., 2024).
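Continuing that sketch, a minimal backward pass over such a graph could be organized as follows; `topological_sort` is defined inline, while `llm_backward` stands in for the LLM prompt that plays the role of derivative propagation and is an assumed placeholder, not a TextGrad API.

```python
from typing import Callable, List

def topological_sort(root: "TextVariable") -> List["TextVariable"]:
    """Depth-first topological ordering of the subgraph feeding into `root`."""
    order, seen = [], set()
    def visit(v: "TextVariable") -> None:
        if id(v) in seen:
            return
        seen.add(id(v))
        for p in v.predecessors:
            visit(p)
        order.append(v)          # predecessors first, root last
    visit(root)
    return order

def backward(loss: "TextVariable", objective: str,
             llm_backward: Callable[["TextVariable", "TextVariable", List[str], str], str]) -> None:
    """Propagate textual gradients from the loss node back to its ancestors,
    mirroring reverse-mode accumulation."""
    for node in reversed(topological_sort(loss)):
        for pred in node.predecessors:
            # Critique `pred` given its role, the feedback already attached to
            # `node` (the pseudo-gradient), and the global objective.
            feedback = llm_backward(pred, node, node.gradients, objective)
            pred.gradients.append(feedback)
```

Because nodes are visited in reverse topological order, every node's feedback log is complete before it is used to criticize its predecessors, matching the reverse-mode discipline of numerical AD.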
3. Algorithmic Workflow and Update Dynamics
The optimization loop in text-based AD mirrors classical gradient-based methods but transposes each critical operation—gradient computation, aggregation, and parameter update—into the space of natural language:
- Forward Pass: All variables are initialized. For each node, the forward operator computes its value based on its predecessors, ultimately evaluating a (possibly LLM-based) scalar or structured loss function that drives optimization.
- Backward Pass: Proceeding through the nodes in reverse topological order, backward operators are invoked for each predecessor of the current node, producing textual feedback that is stored as gradients in the predecessor's metadata. This recursion propagates human-readable, actionable critiques referencing the variable's content and its role in the context of the system objective.
- Textual Gradient Descent (TGD) Step: For each optimizable variable, an LLM prompt aggregates all accumulated feedback to generate an updated variable value, incorporating constraints, role-specific considerations, and optional concepts akin to momentum. The update can be summarized as $x_{t+1} = \mathrm{TGD.step}\big(x_t,\ \partial\mathcal{L}/\partial x_t\big)$, where the "gradient" $\partial\mathcal{L}/\partial x_t$ is the aggregated textual feedback, momentum corresponds to a buffer of feedback retained from earlier iterations in textual form, and learning-rate-like control is realized through prompt-level hyperparameters (Yuksekgonul et al., 2024). A library-level sketch of this loop appears after the list.
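In the released TextGrad library this loop is exposed through a PyTorch-like interface. The snippet below follows the project's published quick-start pattern, though exact signatures, defaults, and model names may differ across versions.

```python
import textgrad as tg

# Engine used to generate textual gradients during the backward pass.
tg.set_backward_engine("gpt-4o", override=True)
model = tg.BlackboxLLM("gpt-4o")

question = tg.Variable(
    "If a train travels 60 km in 45 minutes, what is its average speed in km/h?",
    role_description="question posed to the LLM",
    requires_grad=False,
)
answer = model(question)                 # forward pass through the LLM node
answer.set_role_description("concise and accurate answer to the question")

# The objective is itself an LLM-evaluated critique, not a numeric score.
loss_fn = tg.TextLoss("Evaluate the answer for correctness and clarity; "
                      "point out any errors concisely.")
optimizer = tg.TGD(parameters=[answer])

loss = loss_fn(answer)                   # forward evaluation of the objective
loss.backward()                          # backward pass: textual gradients
optimizer.step()                         # TGD update rewrites the answer text
```

Optimizing a prompt rather than an answer follows the same pattern, with the prompt variable marked requires_grad=True and reused across a batch of questions.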
4. Mathematical Structure and Analogy to Numerical AD
The mathematical underpinning maintains the AD principle of decomposing system-level optimization into a series of local updates. For optimizable variables $x_i$ and a global objective $\mathcal{L}$ dependent on all components, the "textual gradient" $\partial\mathcal{L}/\partial x_i$ is computed by an LLM-implemented backward operator $\nabla_{\mathrm{LLM}}$ and used to revise $x_i$ through a prompt-engineered update operator $\mathrm{TGD.step}$, which integrates all available feedback. The update can be structured to incorporate momentum and constraints, extending the analogy to classical optimization mechanics.
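Spelled out for a simple two-step chain and for a node with several successors, the formalism reads as follows; the LaTeX rendering follows the paper's notation up to minor typographical choices.

```latex
% Textual chain rule for a two-step chain x -> y -> L (cf. Yuksekgonul et al., 2024):
\[
  \frac{\partial \mathcal{L}}{\partial y} = \nabla_{\mathrm{LLM}}\big(y,\ \mathcal{L}\big),
  \qquad
  \frac{\partial \mathcal{L}}{\partial x} = \nabla_{\mathrm{LLM}}\!\Big(x,\ y,\ \frac{\partial \mathcal{L}}{\partial y}\Big).
\]
% A node v with several successors aggregates (rather than numerically sums) feedback,
% and the update is a prompt-driven rewrite of the variable:
\[
  \frac{\partial \mathcal{L}}{\partial v}
    = \bigcup_{w \in \mathrm{Succ}(v)} \nabla_{\mathrm{LLM}}\!\Big(v,\ w,\ \frac{\partial \mathcal{L}}{\partial w}\Big),
  \qquad
  v_{\mathrm{new}} = \mathrm{TGD.step}\!\Big(v,\ \frac{\partial \mathcal{L}}{\partial v}\Big).
\]
```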
This formalism directly parallels forward- and reverse-mode AD from both an architectural and an algebraic perspective, but the computational objects propagated through the system are critiques rather than numeric derivatives (Yuksekgonul et al., 2024).
5. Feedback Processing and Aggregation
Prompt engineering is central: backward operator prompts explicitly delineate variables, roles, objective functions, and local context through carefully crafted tokens (e.g., <VARIABLE>, <ROLE>, <OBJECTIVE_FUNCTION>, <CONTEXT>). Textual gradients are appended verbatim, without numeric scoring, and aggregated for each variable—concatenation methods such as tg.sum merge multi-branch feedback into coherent guidance. Constraints and batch updates are enforced through prompt tags and instructions, while historical information and example contexts can be included to emulate momentum or provide demonstrations for few-shot improvement (Yuksekgonul et al., 2024).
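As an illustration, a backward-operator prompt and a simple aggregation helper might be structured as below; the template wording, tag layout, and helper names are hypothetical and are not copied from the TextGrad codebase.

```python
BACKWARD_PROMPT_TEMPLATE = """You are giving feedback on one variable inside a larger AI system.
<ROLE> {role} </ROLE>
<VARIABLE> {variable} </VARIABLE>
<CONTEXT> {context} </CONTEXT>
<OBJECTIVE_FUNCTION> {objective} </OBJECTIVE_FUNCTION>
Feedback already received by the downstream output:
{downstream_feedback}
Give specific, actionable criticism of the variable with respect to the objective."""

def aggregate_feedback(gradients: list[str]) -> str:
    """Merge multi-branch feedback into one guidance string (in the spirit of tg.sum):
    textual gradients are concatenated verbatim rather than numerically summed."""
    return "\n\n".join(f"<FEEDBACK>{g}</FEEDBACK>" for g in gradients)
```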
A plausible implication is that qualitative properties of LLM feedback, such as specificity, relevance, and precision, become the main factors influencing optimization convergence and quality, supplanting the numerical properties (size, direction) central to classical AD.
6. Applications and Empirical Validation
Text-based AD has been empirically validated across a range of domains where numerical gradients are inaccessible:
- Code Optimization: For LeetCode-Hard problems, TextGrad achieves 36% ± 1.8% test pass rate under zero-shot, five-iteration regimes, outperforming Reflexion (31% ± 1.2%) and zero-shot GPT-4o (23%) (Yuksekgonul et al., 2024).
- Multiple-Choice Question Answering: On GPQA, MMLU–ML, and MMLU–Physics, TextGrad self-refinement with majority voting raises accuracy to 55.0% (from ~51%), 88.4% (from 85.7%), and 95.1% (from 91.2%), respectively.
- Prompt Optimization for Reasoning: For Object Counting, BigBench Hard Word Sorting, and GSM8k, TextGrad matches or exceeds baselines such as DSPy, raising Object Counting performance from 77.8% to 91.9%.
- Molecule Design: Using LLM-mediated scoring of druglikeness and docking, 95% of generated molecules are novel (no ChEMBL Tanimoto > 0.8), with candidates for 29 protein targets matching or exceeding clinical drugs in binding and QED (Yuksekgonul et al., 2024).
- Radiotherapy Plan Optimization: On prostate cancer cases, text-optimized importance weights improve PTV coverage and organ-at-risk sparing relative to clinical plans.
These results collectively indicate that text-based AD frameworks can deliver substantial gains in zero-shot or few-shot scenarios, even when the ground-truth objective is either non-differentiable or not explicitly computable.
7. Limitations, Hyperparameters, and Future Outlook
Text-based AD exhibits certain limitations. Success is contingent on LLMs' capability to deliver precise, context-aware feedback; if critiques are too generic or fail to address key compositional aspects (e.g., spatial geometry in molecular structures), convergence deteriorates. Hyperparameters analogous to learning rate, batch size, and momentum (window size or inclusion of historical context) modulate update efficiency; ablation experiments indicate slowed convergence without momentum and increased constraint violations when format tags are omitted (Yuksekgonul et al., 2024).
Computational cost is governed by the number of LLM calls: for code/Q&A settings, three calls per instance per iteration are required, while prompt optimization tasks typically require two calls per batch item per iteration.
Proposed extensions include variance reduction (ensemble critics, self-verification), adaptive learning-rate scheduling, and incorporation of multimodal or retrieval-augmented primitives. Deployment in scientific applications with wet-lab or clinical feedback is highlighted as a future direction, positioning text-based AD as an enabling layer for optimized coordination among heterogeneous AI components (Yuksekgonul et al., 2024).
Selected References
| Paper Title | Reference | Contribution |
|---|---|---|
| TextGrad: Automatic "Differentiation" via Text | (Yuksekgonul et al., 2024) | Introduces, formalizes, and evaluates text-based AD |
| DiffSharp: Automatic Differentiation Library | (Baydin et al., 2015) | Describes numerical AD library, classical workflow |
| A Hitchhiker's Guide to Automatic Differentiation | (Hoffmann, 2014) | Mathematical foundations of numerical AD |
These works establish the conceptual and operational advances from classical to text-mediated automatic differentiation frameworks.