Program-Aided Distillation
- Program-Aided Distillation (PaD) is a framework that transfers reasoning skills from large models to smaller ones through verifiable, executable programs.
- It replaces noisy chain-of-thought data with structured program outputs, leveraging filtering, error injection, and stepwise decoding to improve performance.
- Empirical results show significant gains in arithmetic, symbolic, and vision-language tasks compared to traditional fine-tuning and chain-of-thought methods.
Program-Aided Distillation (PaD) is a distillation paradigm in which reasoning capabilities from large models are transferred to smaller models through the use of executable programs, rather than natural-language chains of thought. By leveraging programmatic representations and verification via an execution oracle, PaD substantially improves the distillation of multi-step reasoning skills, resulting in small models with robust reasoning performance. PaD has been applied to both language (arithmetic and symbolic reasoning) and vision-language tasks, demonstrating significant performance gains over baseline fine-tuning and chain-of-thought (CoT) methods (Zhu et al., 2023, Hu et al., 2023).
1. Motivation and Conceptual Overview
Chain-of-thought prompting in LLMs elicits explicit multi-step natural-language rationales that enhance reasoning. However, synthetic CoT data, when used to train smaller "student" models, often contains errors in intermediate steps, even if the final answer is correct. This noise impedes the student's ability to robustly learn reasoning processes. PaD remedies these deficiencies by substituting free-form CoT with structured, executable programs—enabling automatic correctness verification, sample filtering, and the scaffolding of fine-grained reasoning via step-wise decoding and error feedback.
In visual domains, an analogous approach decomposes high-level visual reasoning tasks into explicit sequences of tool-invoking program steps, which are then executed and verified before being distilled as language explanations into vision-LLMs (VLMs) (Hu et al., 2023).
2. Formalization and Algorithmic Framework
PaD operates in a teacher–student setting, where a large model (the "teacher") proposes multi-step solutions in programmatic form, and a small model (the "student") is trained to generate equivalent programmatic rationales.
For the language domain (Zhu et al., 2023), let $x$ be an input (e.g., a math word problem), $y$ the ground-truth answer, and $p$ the corresponding program (a sequence of code lines). The execution oracle $\mathcal{E}$ attempts to run $p$ under Python, returning $\mathcal{E}(p)$ if successful or raising an error otherwise. The student model $q_\theta(p \mid x)$ generates a distribution over programs.
The distillation loss per example is

$$\mathcal{L}(\theta) = -\log q_\theta(p \mid x) + \lambda \, c(p, y),$$

where $c(p, y) = 0$ if $\mathcal{E}(p) = y$ and large otherwise. In practice, only programs with valid execution $\mathcal{E}(p) = y$ are retained.
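The execution-oracle filtering described above can be sketched in Python. The helper names and the convention that programs store their result in an `answer` variable are illustrative assumptions, not the authors' exact implementation:

```python
def execute_program(program: str):
    """Run a candidate program under Python and return its final `answer`
    value, or None if execution raises any error (the oracle E)."""
    namespace = {}
    try:
        exec(program, namespace)          # run the generated code
        return namespace.get("answer")    # convention: result stored in `answer`
    except Exception:                     # SyntaxError, NameError, ZeroDivisionError, ...
        return None

def is_valid(program: str, gold_answer) -> bool:
    """Keep a program only if it executes and reproduces the gold answer."""
    result = execute_program(program)
    return result is not None and result == gold_answer
```

Only samples passing `is_valid` would enter the student's training set, which is what makes the penalty term vanish in practice.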
For visual program distillation (Hu et al., 2023), the input is a triple $(v, q, a)$ (image, question, optional ground-truth answer), and the teacher LLM samples $K$ candidate programs $z_1, \ldots, z_K$, each executed by an external engine to yield a predicted answer $\hat{a}_k$. Programs yielding $\hat{a}_k = a$ are retained and converted into natural-language rationales via a CoT converter $g$.
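This sample-execute-filter-convert loop can be sketched with a mocked execution engine and a trivial rationale converter standing in for the external tools and the CoT converter; every name here is an illustrative assumption:

```python
def run_engine(program, tools):
    """Mock execution engine: each program line names a tool; the engine
    chains their outputs and returns the final value as the answer."""
    value = None
    for line in program:
        value = tools[line](value)
    return value

def distill_example(candidates, gold_answer, tools):
    """Keep candidate programs whose executed answer matches the gold one,
    and convert each survivor into a step-by-step rationale string."""
    kept = [z for z in candidates if run_engine(z, tools) == gold_answer]
    rationales = [
        " ".join(f"Step {i + 1}: call {step}." for i, step in enumerate(z))
        for z in kept
    ]
    return kept, rationales
```

In the real pipeline the tools are vision modules (detectors, counters, OCR) and the rationale conversion is done by an LLM rather than a template.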
The student VLM is trained with the combined loss

$$\mathcal{L} = \mathcal{L}_{\text{answer}} + \mathcal{L}_{\text{rationale}},$$

with cross-entropy supervision for both the answer $a$ and the program-derived rationale $g(z_k)$.
3. Key Distillation Techniques: Program Filtering, Error Injection, and Stepwise Decoding
Program Filtering and Verification
PaD automatically verifies synthesized reasoning by filtering program samples through execution. Only programs that compile and return the correct answer are retained for student training, sharply reducing the amount of faulty supervision relative to CoT data.
Error Injection and Self-Refinement
To enable robustness and self-debugging, PaD injects syntactic and semantic errors into ground-truth programs (e.g., renaming a variable to induce a NameError, inserting invalid statements to trigger a SyntaxError) and collects the corresponding error messages. The student model is then multi-task-trained to map from the pair (corrupted program $\tilde{p}$, error message $e$) to the corrected program $p$, directly teaching it to leverage interpreter feedback.
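Constructing such a refinement example can be sketched as follows; the string-level variable renaming is a simplification for illustration (a real implementation would likely rename at the AST level):

```python
import re

def inject_name_error(program: str) -> str:
    """Corrupt a correct program by renaming the last use of its first
    assigned variable, so execution raises a NameError."""
    var = re.match(r"(\w+)\s*=", program).group(1)
    # Rename only the last occurrence, leaving the original assignment intact.
    head, _, tail = program.rpartition(var)
    return head + var + "_typo" + tail

def capture_error(program: str):
    """Run the program and return the interpreter's error message, if any."""
    try:
        exec(program, {})
        return None
    except Exception as e:
        return f"{type(e).__name__}: {e}"

def make_refinement_example(program: str):
    """Build one (corrupted program, error message) -> corrected program
    triple for the self-refinement training task."""
    corrupted = inject_name_error(program)
    return corrupted, capture_error(corrupted), program
```

Each triple teaches the student to read the interpreter's feedback and emit the repaired program.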
Stepwise Beam Search with Faithfulness Scoring
During decoding, PaD expands programs line-by-line with beam search, verifying each extension via type-checking and scoring candidate steps with a "faithfulness" measure. In the language setting, this is implemented as

$$s(p_{1:t}, x) = \cos\big(f(x),\, f(p_{1:t})\big),$$

where $f$ is a sentence-embedding function, ensuring that program steps are semantically aligned with the source problem. This process further improves the quality of the distillation data and the reliability of the resulting student.
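Faithfulness scoring can be sketched with a toy bag-of-words embedding standing in for the sentence encoder; a real system would use a pretrained sentence-embedding model:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding, an illustrative stand-in for the
    pretrained sentence encoder f used in practice."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def faithfulness(problem: str, program_prefix: list) -> float:
    """Score how well the decoded program lines so far align with the problem."""
    return cosine(embed(problem), embed(" ".join(program_prefix)))

def rank_beam(problem: str, beams: list) -> list:
    """Order beam candidates (partial programs) by faithfulness, best first."""
    return sorted(beams, key=lambda b: faithfulness(problem, b), reverse=True)
```

Partial programs that drift away from the problem's vocabulary score lower and are pruned from the beam.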
4. Empirical Results and Quantitative Performance
Extensive experiments across both language and vision domains demonstrate the efficacy of PaD.
Language Reasoning Tasks (Zhu et al., 2023)
Arithmetic Reasoning (GSM8K):
| Model | Params | CoT Fine-tune | PaD |
|---|---|---|---|
| CodeT5_large | 0.77B | 7.5% | 44.9% |
| CodeT5_base | 0.22B | 6.3% | 39.4% |
| CodeT5_small | 0.06B | 3.8% | 32.2% |
| LLaMA-1 (few-shot CoT) | 13B | 17.8% | — |
PaD-trained models with as few as 0.77B parameters outperform LLaMA-1 13B, and the smallest CodeT5_small model achieves a >28% absolute gain over standard fine-tuning.
On symbolic reasoning tasks (Coin Flip, Last-Letter Concatenation), PaD achieves 100% accuracy, whereas CoT baselines frequently fail. A trade-off with general ability is observed: performance on BIG-Bench Hard (BBH) drops by 10–20% as arithmetic specialization intensifies.
Ablation:
- PaD vs fine-tuning: +29% absolute gain (GSM8K, CodeT5_small)
- +3–5% by adding self-refinement
- +4–6% by adding stepwise beam search
Visual Program Distillation (Hu et al., 2023)
Generalist and specialist PaLI-X (55B) models trained with VPD outperform Instruct baselines and set new SOTA on targeted benchmarks:
| Model | GQA | OK-VQA | TallyQA | POPE | MMBench |
|---|---|---|---|---|---|
| PaLI-X–Instruct (55B) | 63.3 | 64.3 | 75.4 | 65.0 | 75.0 |
| PaLI-X–VPD (55B) | 64.9 | 64.6 | 76.6 | 65.4 | 76.2 |
On A-OKVQA, specialist VPD achieves 68.2%, substantially exceeding prior SOTA.
Sampling $K$ candidate programs per question, as opposed to a single one, increases the fraction of correct program exemplars from 20% to 65%, directly improving distilled model accuracy.
Although the visual programs themselves attain only ~50–55% accuracy on GQA/OK-VQA/TallyQA, the distilled VLMs generalize beyond their supervision, reaching ~65%.
5. Distillation Pipeline in Practice
The distillation workflow comprises the following stages (instantiated in both the language and vision settings):
- Data Synthesis and Filtering: The teacher LLM generates candidate programmatic rationales per question; only successfully executed programs yielding correct results are retained.
- Supervised Fine-tuning: The small student model is fine-tuned to output correct programs conditioned on input instances.
- Error Injection and Self-Refinement: A dataset of corrupted programs paired with error messages augments training, facilitating error correction capabilities.
- Stepwise Decoding and Verification: Decoding leverages beam search, iterative step verification, and faithfulness scoring to optimize intermediate-step fidelity.
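The stepwise decoding stage in the list above can be sketched as a beam loop with a pluggable verifier and scorer; the compile-based verifier and toy scorer here are illustrative assumptions:

```python
def verifies(prefix: list) -> bool:
    """Step verifier: the partial program must at least compile."""
    try:
        compile("\n".join(prefix), "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def stepwise_beam_search(steps_per_position, score, beam_width=2):
    """Decode a program line-by-line: extend each beam with every candidate
    next line, discard extensions that fail verification, and keep the
    `beam_width` highest-scoring partial programs at each position."""
    beams = [[]]
    for candidates in steps_per_position:
        extended = [b + [line] for b in beams for line in candidates]
        extended = [b for b in extended if verifies(b)]
        beams = sorted(extended, key=score, reverse=True)[:beam_width]
    return beams[0] if beams else []
```

In PaD the scorer would be the embedding-based faithfulness measure and the verifier would include type-checking; here both are reduced to minimal stand-ins.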
In the vision domain, programs are translated into stepwise natural-language explanations before distillation, providing both final answers and detailed rationales for instruction tuning of VLMs.
6. Limitations, Trade-offs, and Future Directions
Programmatic distillation restricts the model's output space compared to free-form CoT, expediting learning and reducing data requirements. Automated filtering via execution oracle eliminates many failure modes inherent in CoT synthesis. Error injection provides robustness, and semantic scoring in stepwise decoding enhances intermediate fidelity.
However, models distilled via PaD often exhibit domain specialization. Significant gains in arithmetic and symbolic reasoning can result in modest declines in general or open-ended tasks (e.g., BBH). Program scaffolding is most effective in formal domains where execution oracles are available, and less suited to semantically broad, free-text, or commonsense problems.
Extensions include tool-use with richer interfaces (symbolic algebra, APIs), logic-coherence and backtracking for combinatorial tasks, and application to non-numeric domains (e.g., semantic parsing, database queries).
In the vision-language space, further advances could arise from agentic program generation, improved correctness estimators for unlabeled data, integration of dense labeling tools, and modal expansion (audio, 3D, multi-turn dialog).
7. Broader Implications
PaD demonstrates that distillation anchored in formal, verifiable reasoning substantially reduces data noise and enhances the ability of small models to acquire complex, multi-step reasoning skills formerly accessible only to large LLMs or ensembles. In both language and vision-language domains, program-aided distillation defines a new methodology for model compression, enabling practical deployment of resource-efficient systems with high-fidelity reasoning grounded in executable logic (Zhu et al., 2023, Hu et al., 2023).