Dynamic Canonical Trace Divergence (DCTD)
- Dynamic Canonical Trace Divergence (DCTD) is a metric that quantifies runtime behavioral differences in functionally correct LLM-generated solutions using probability mass functions over opcode events.
- It utilizes methods like Jensen–Shannon divergence and normalized trace variance to assess both structural variations and computational cost differences among candidate solutions.
- Empirical studies show that higher LLM sampling temperatures increase DCTD values, underscoring a critical trade-off between solution diversity and stable runtime performance.
Dynamic Canonical Trace Divergence (DCTD) is a metric introduced to quantify the behavioral variance of multiple functionally correct code generations produced by LLMs when executed on a suite of test cases. Unlike traditional correctness evaluation, which only determines whether generated code yields correct outputs, DCTD directly characterizes how much the actual runtime behaviors of these solutions can diverge, illuminating distinctions in algorithmic efficiency and control flow that are consequential in production environments. DCTD operates by capturing and comparing the distribution of Python bytecode opcodes executed by each candidate solution under dynamic tracing, allowing measurement of both structural code differences and the computational costs they incur.
1. Formal Definition of DCTD
DCTD models each of the $m$ functionally correct solutions as a probability mass function (PMF) over opcode event counts, tracked dynamically at runtime for each private test case. For solutions passing all public unit tests, executed on $r$ private test cases, the count $c_{s,j}[i]$ of each opcode $i$ for solution $s$ on test $j$ is recorded. Two PMFs are derived per solution/test pair:
- Structural PMF: $p_{s,j}[i] = c_{s,j}[i] \,/\, \sum_{k=1}^{d} c_{s,j}[k]$
- Cost-weighted PMF: $q_{s,j}[i] = w_i\, c_{s,j}[i] \,/\, \sum_{k=1}^{d} w_k\, c_{s,j}[k]$
where $w$ assigns a computational cost to each opcode by type, and $d$ is the opcode vocabulary size.
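A minimal sketch of the two normalizations, assuming a raw opcode count vector and per-opcode cost weights (function names are illustrative):

```python
def structural_pmf(counts):
    """Normalize raw opcode counts into the structural PMF p."""
    total = sum(counts)
    return [c / total for c in counts]

def cost_weighted_pmf(counts, weights):
    """Weight each opcode count by its cost, then normalize into q."""
    weighted = [w * c for w, c in zip(weights, counts)]
    total = sum(weighted)
    return [wc / total for wc in weighted]
```

With counts `[120, 30, 50]` and weights `[1.0, 4.0, 2.0]`, the cost weighting shifts probability mass toward the expensive second opcode relative to the structural PMF.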
Two versions of DCTD are defined:
- Jensen–Shannon-based DCTD (DCTD_JSD):
For each test $j$, compute the average pairwise Jensen–Shannon divergence (JSD) among the $m$ solutions' PMFs:
$$\mathrm{JSD}^{\mathrm{struc}}_j = \binom{m}{2}^{-1} \sum_{s<t} \mathrm{JSD}\big(p_{s,j},\, p_{t,j}\big), \qquad \mathrm{JSD}^{\mathrm{cost}}_j = \binom{m}{2}^{-1} \sum_{s<t} \mathrm{JSD}\big(q_{s,j},\, q_{t,j}\big)$$
The aggregate metric is:
$$\mathrm{DCTD_{JSD}} = \alpha \cdot \frac{1}{r} \sum_{j=1}^{r} \mathrm{JSD}^{\mathrm{struc}}_j + (1-\alpha) \cdot \frac{1}{r} \sum_{j=1}^{r} \mathrm{JSD}^{\mathrm{cost}}_j$$
with a mixing coefficient $\alpha \in [0, 1]$.
- Variance-trace-based DCTD (DCTD$_{\mathrm{Var}}$):
For each test $j$, treat $\{p_{s,j}\}_{s=1}^{m}$ as samples drawn from a random vector $P_j$; likewise $\{q_{s,j}\}_{s=1}^{m}$ for $Q_j$. The normalized total variance,
$$V^{\mathrm{struc}}_j = \operatorname{tr}\!\big(\operatorname{Cov}(P_j)\big) \,/\, V_{\max},$$
where $V_{\max}$ denotes the maximal attainable total variance, and analogously $V^{\mathrm{cost}}_j$, are averaged over tests and combined with the same mixing coefficient $\alpha$. Both metrics lie in $[0, 1]$ by construction.
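The pairwise JSD term can be computed with base-2 logarithms so that each divergence, and hence the average, lies in $[0, 1]$; a self-contained sketch:

```python
from math import log2

def jsd(p, q):
    """Jensen-Shannon divergence between two PMFs, base 2, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence; zero-probability terms contribute nothing
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def avg_pairwise_jsd(pmfs):
    """Average JSD over all unordered pairs of solution PMFs."""
    pairs = [(s, t) for s in range(len(pmfs)) for t in range(s + 1, len(pmfs))]
    return sum(jsd(pmfs[s], pmfs[t]) for s, t in pairs) / len(pairs)
```

Identical PMFs give 0, disjoint-support PMFs give 1, so the average over pairs is automatically normalized.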
2. Data Collection and Processing Pipeline
The computation of DCTD is supported by a standardized data preparation workflow:
| Step | Description | Note |
|---|---|---|
| Generation | Sample candidate Python functions at chosen temperature | LLM code generation |
| Correctness filtering | Retain solutions that pass all public unit tests | Ensures functional correctness |
| Static analysis (SCTD) | Compile solutions with `dis`, count bytecode opcodes | For static divergence |
| Dynamic tracing (DCTD) | Run each solution on private tests under `sys.settrace` | Record opcode event counts |
| Normalization | Transform event counts to PMFs $p_{s,j}$, $q_{s,j}$ | Uses shared opcode vocab |
This pipeline ensures precise isolation of both structural (static) and behavioral (dynamic) code differences for subsequent divergence quantification.
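The dynamic-tracing step can be sketched with CPython's per-opcode tracing hook (available since Python 3.7); this is an illustrative minimal tracer, not the paper's instrumentation:

```python
import sys
import dis
from collections import Counter

def trace_opcodes(func, *args):
    """Run func and count the names of every opcode it executes (Python 3.7+)."""
    counts = Counter()

    def tracer(frame, event, arg):
        # Request per-opcode events for every frame entered under tracing
        frame.f_trace_opcodes = True
        if event == "opcode":
            op = frame.f_code.co_code[frame.f_lasti]
            counts[dis.opname[op]] += 1
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return counts
```

For example, `trace_opcodes(lambda n: sum(range(n)), 10)` returns a `Counter` over opcode names; the shared opcode vocabulary for the PMFs can then be the union of keys across all solutions.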
3. Algorithmic Recipe and Pseudocode
The computation of DCTD_JSD proceeds via the following pseudocode, which operationalizes the above definition:
```
inputs: list of m correct solutions sol[1..m], private tests test[1..r],
        opcode weights w[1..d], mixing coefficient alpha

for j in 1..r:
    # collect PMFs for each solution on test j
    for s in 1..m:
        counts = run_and_trace(sol[s], test[j])   # opcode count vector c[1..d]
        total = sum(counts)
        for i in 1..d:
            p[s][i] = counts[i] / total
        weighted_total = sum over i of (w[i] * counts[i])
        for i in 1..d:
            q[s][i] = (w[i] * counts[i]) / weighted_total

    # average pairwise JSD among the m PMFs
    pair_struc = 0; pair_cost = 0; pairs = 0
    for s in 1..m:
        for t in s+1..m:
            pair_struc += JSD(p[s], p[t])
            pair_cost  += JSD(q[s], q[t])
            pairs += 1
    avg_struc[j] = pair_struc / pairs
    avg_cost[j]  = pair_cost / pairs

DCTD_JSD = alpha * (sum_j avg_struc[j] / r) + (1 - alpha) * (sum_j avg_cost[j] / r)
output DCTD_JSD
```
This procedure outputs a scalar in $[0, 1]$ representing the average dynamic divergence of functionally correct solutions.
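A compact, self-contained Python rendering of this procedure, operating on precomputed opcode count vectors rather than live traces, with a standard base-2 JSD helper (all names are illustrative):

```python
from math import log2
from itertools import combinations

def _jsd(p, q):
    """Base-2 Jensen-Shannon divergence between two PMFs, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(xi * log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dctd_jsd(counts, weights, alpha):
    """counts[s][j] is the opcode count vector of solution s on test j.

    Returns the aggregate DCTD_JSD in [0, 1]."""
    m, r = len(counts), len(counts[0])
    struc_terms, cost_terms = [], []
    for j in range(r):
        p, q = [], []
        for s in range(m):
            c = counts[s][j]
            total = sum(c)
            p.append([ci / total for ci in c])
            wc = [w * ci for w, ci in zip(weights, c)]
            wtotal = sum(wc)
            q.append([x / wtotal for x in wc])
        pairs = list(combinations(range(m), 2))
        struc_terms.append(sum(_jsd(p[s], p[t]) for s, t in pairs) / len(pairs))
        cost_terms.append(sum(_jsd(q[s], q[t]) for s, t in pairs) / len(pairs))
    return alpha * sum(struc_terms) / r + (1 - alpha) * sum(cost_terms) / r
```

Two identical solutions yield 0; two solutions whose traces use disjoint opcode sets yield 1, matching the metric's bounds.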
4. Relationship with SCTD and Behavioral Expression Factor (BEF)
DCTD is closely linked to Static Canonical Trace Divergence (SCTD) and the Behavioral Expression Factor (BEF):
- SCTD: Derived identically to DCTD but uses opcode PMFs obtained from static compilation (via `dis`) rather than dynamic execution. It quantifies diversity in algorithmic structure regardless of runtime path coverage.
- DCTD: Measures true behavioral divergence under actual inputs as executed in an instrumented environment.
- BEF: The ratio of dynamic to static divergence, $\mathrm{BEF} = \mathrm{DCTD} / \mathrm{SCTD}$.

Interpretive thresholds:
- $\mathrm{BEF} \ll 1$: High static diversity, low dynamic divergence (suggesting redundant differences not exercised by tests).
- $\mathrm{BEF} \approx 1$: Structure and behavior are aligned.
- $\mathrm{BEF} \gg 1$: Small structural differences yield large runtime effects, flagging instability.
This triad—SCTD, DCTD, BEF—forms a diagnostic suite for code generation evaluation surpassing nominal correctness.
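Reading BEF as the ratio of dynamic to static divergence, the three interpretive regimes can be encoded as a simple check (the tolerance band `tol` is an assumed parameter, not from the source):

```python
def classify_bef(sctd, dctd, tol=0.25, eps=1e-9):
    """Classify the Behavioral Expression Factor BEF = DCTD / SCTD."""
    bef = dctd / max(sctd, eps)
    if bef < 1 - tol:
        return bef, "latent"     # static diversity not exercised by the tests
    if bef > 1 + tol:
        return bef, "amplified"  # small structural deltas, large runtime effects
    return bef, "aligned"        # structure and behavior agree
```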
5. Empirical Characterization and Observed Trends
Empirical evaluation on BigOBench and CodeContests reveals characteristic DCTD behaviors:
- On BigOBench (temperature=0.7), DCTD_JSD typically has a median of ~0.01 (1% of max), with IQR 0.5–1.5%. CodeContests records slightly higher medians (~1–2%) due to more varied control flow.
- DCTD_JSD increases monotonically with sampling temperature:
| Temperature | DCTD_JSD Range |
|-------------|----------------|
| 0.0 | 0.005 – 0.008 |
| 0.7 | 0.010 – 0.015 |
| 0.95 | 0.015 – 0.020 |
This trend denotes a "penalty of instability"—the pursuit of higher solution diversity and correctness rates comes at the cost of increased behavioral inconsistency.
- Occasional outlier runs yield DCTD_JSD up to 10–15%, indicating that certain generation parameters or prompts can produce solutions with widely divergent or potentially catastrophic runtime profiles.
A plausible implication is that relying solely on functional correctness metrics can obscure latent instability risks in LLM-generated code, especially when sampling-driven diversity is emphasized.
6. Practical Guidelines for Interpretation and Evaluation Integration
Practical use of DCTD involves threshold-based interpretation and automatic workflow adaptation:
| DCTD Value Range | Meaning |
|---|---|
| $\approx 0$ | Perfect runtime stability |
| $< 0.02$ | Low to acceptable instability; suitable for production |
| $0.02$ to $0.05$ | Moderate concern; recommend detailed profiling or more rigorous tests |
| $> 0.05$ | Critical risk; possible severe performance divergence |
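These bands can be encoded as a straightforward policy check (band boundaries of 0.02 and 0.05 follow the section's thresholds; the function name is illustrative):

```python
def interpret_dctd(value, warn=0.02, critical=0.05, eps=1e-6):
    """Map a DCTD score in [0, 1] to a qualitative risk band."""
    if value < eps:
        return "stable"        # effectively perfect runtime stability
    if value < warn:
        return "acceptable"    # low instability; production-suitable
    if value < critical:
        return "review"        # profile further or add rigorous tests
    return "critical"          # severe divergence risk
```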
Recommended actions when DCTD exceeds policy thresholds (e.g., 0.05):
- Increase private-test coverage to better probe behavioral diversity.
- Lower LLM sampling temperature to encourage more consistent algorithms.
- Rerank or filter solutions using dynamic cost-based PMF minima.
- Incorporate DCTD directly as an objective during fine-tuning or reinforcement learning to bias toward behavioral stability.
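One plausible reading of the cost-based reranking action above: order solutions by their mean dynamically weighted opcode cost across tests and prefer the cheapest (all names here are illustrative):

```python
def rerank_by_cost(counts, weights):
    """counts[s][j]: opcode count vector of solution s on test j.

    Returns solution indices sorted by mean weighted opcode cost, cheapest first."""
    def mean_cost(per_test):
        costs = [sum(w * c for w, c in zip(weights, cvec)) for cvec in per_test]
        return sum(costs) / len(costs)
    return sorted(range(len(counts)), key=lambda s: mean_cost(counts[s]))
```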
Integrating DCTD in the code-generation pipeline extends evaluation beyond binary pass/fail outcomes, capturing whether correct code will behave predictably and efficiently in real-world deployments. Combining SCTD, DCTD, and BEF enables comprehensive assessment of both structural and dynamic dimensions of LLM-generated code.
7. Significance and Context in LLM Code Generation Evaluation
DCTD addresses a critical limitation in prevailing LLM code-generation benchmarks and methodologies, which focus on correctness but neglect the considerable variability in runtime behavior and computational cost latent within correct solutions. With empirical evidence showing that elevated DCTD accompanies popular strategies for improving accuracy (such as higher sampling temperatures), these findings advocate for stability-aware objectives, updated benchmarks incorporating asymptotic test cases, and explicit measurement of behavioral divergence as prerequisites for robust, production-oriented evaluation. By providing a systematic, principled approach to behavioral analysis, DCTD constitutes a core advancement in the quantitative assessment of code generation by LLMs (Rajput et al., 7 Nov 2025).