Dynamic Canonical Trace Divergence (DCTD)
- Dynamic Canonical Trace Divergence (DCTD) is a metric that quantifies runtime behavioral differences in functionally correct LLM-generated solutions using probability mass functions over opcode events.
- It utilizes methods like Jensen–Shannon divergence and normalized trace variance to assess both structural variations and computational cost differences among candidate solutions.
- Empirical studies show that higher LLM sampling temperatures increase DCTD values, underscoring a critical trade-off between solution diversity and stable runtime performance.
Dynamic Canonical Trace Divergence (DCTD) is a metric introduced to quantify the behavioral variance of multiple functionally correct code generations produced by LLMs when executed on a suite of test cases. Unlike traditional correctness evaluation, which only determines whether generated code yields correct outputs, DCTD directly characterizes how much the actual runtime behaviors of these solutions can diverge, illuminating distinctions in algorithmic efficiency and control flow that are consequential in production environments. DCTD operates by capturing and comparing the distribution of Python bytecode opcodes executed by each candidate solution under dynamic tracing, allowing measurement of both structural code differences and the computational costs they incur.
1. Formal Definition of DCTD
DCTD models each of the $m$ functionally correct solutions as a probability mass function (PMF) over opcode event counts, tracked dynamically at runtime for each private test case. For solutions passing all public unit tests, executed on $r$ private test cases, the count $c_{s,j}[i]$ of each opcode $i$ for solution $s$ on test $j$ is recorded. Two PMFs are derived per solution/test pair:
- Structural PMF: $p_{s,j}[i] = c_{s,j}[i] \,/\, \sum_{k=1}^{d} c_{s,j}[k]$
- Cost-weighted PMF: $q_{s,j}[i] = w_i\, c_{s,j}[i] \,/\, \sum_{k=1}^{d} w_k\, c_{s,j}[k]$
where $w$ assigns a computational cost to each opcode by type, and $d$ is the opcode vocabulary size.
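A minimal sketch of the two normalizations, assuming a raw opcode count vector and per-opcode cost weights (function names are illustrative):

```python
def structural_pmf(counts):
    """Normalize raw opcode counts into the structural PMF p."""
    total = sum(counts)
    return [c / total for c in counts]

def cost_weighted_pmf(counts, weights):
    """Weight each opcode count by its cost, then normalize into q."""
    weighted = [w * c for w, c in zip(weights, counts)]
    total = sum(weighted)
    return [wc / total for wc in weighted]
```

With counts `[120, 30, 50]` and weights `[1.0, 4.0, 2.0]`, the cost weighting shifts probability mass toward the expensive second opcode relative to the structural PMF.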
Two versions of DCTD are defined:
- Jensen–Shannon-based DCTD (DCTD_JSD):
For each test $j$, compute the average pairwise Jensen–Shannon divergence (JSD) among the $m$ solutions' PMFs:
$$\mathrm{JSD}^{\mathrm{struc}}_j = \binom{m}{2}^{-1} \sum_{s<t} \mathrm{JSD}\big(p_{s,j},\, p_{t,j}\big), \qquad \mathrm{JSD}^{\mathrm{cost}}_j = \binom{m}{2}^{-1} \sum_{s<t} \mathrm{JSD}\big(q_{s,j},\, q_{t,j}\big)$$
The aggregate metric is:
$$\mathrm{DCTD_{JSD}} = \alpha \cdot \frac{1}{r} \sum_{j=1}^{r} \mathrm{JSD}^{\mathrm{struc}}_j + (1-\alpha) \cdot \frac{1}{r} \sum_{j=1}^{r} \mathrm{JSD}^{\mathrm{cost}}_j$$
with a mixing coefficient $\alpha \in [0, 1]$.
- Variance-trace-based DCTD (DCTD$_{\mathrm{Var}}$):
For each test $j$, treat $\{p_{s,j}\}_{s=1}^{m}$ as samples drawn from a random vector $P_j$; likewise $\{q_{s,j}\}_{s=1}^{m}$ for $Q_j$. The normalized total variance,
$$V^{\mathrm{struc}}_j = \operatorname{tr}\!\big(\operatorname{Cov}(P_j)\big) \,/\, V_{\max},$$
where $V_{\max}$ denotes the maximal attainable total variance, and analogously $V^{\mathrm{cost}}_j$, are averaged over tests and combined with the same mixing coefficient $\alpha$. Both metrics lie in $[0, 1]$ by construction.
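The pairwise JSD term can be computed with base-2 logarithms so that each divergence, and hence the average, lies in $[0, 1]$; a self-contained sketch:

```python
from math import log2

def jsd(p, q):
    """Jensen-Shannon divergence between two PMFs, base 2, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence; zero-probability terms contribute nothing
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def avg_pairwise_jsd(pmfs):
    """Average JSD over all unordered pairs of solution PMFs."""
    pairs = [(s, t) for s in range(len(pmfs)) for t in range(s + 1, len(pmfs))]
    return sum(jsd(pmfs[s], pmfs[t]) for s, t in pairs) / len(pairs)
```

Identical PMFs give 0, disjoint-support PMFs give 1, so the average over pairs is automatically normalized.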
2. Data Collection and Processing Pipeline
The computation of DCTD is supported by a standardized data preparation workflow:
| Step | Description | Note |
|---|---|---|
| Generation | Sample candidate Python functions at chosen temperature | LLM code generation |
| Correctness filtering | Retain solutions that pass all public unit tests | Ensures functional correctness |
| Static analysis (SCTD) | Compile solutions with `dis`, count bytecode opcodes | For static divergence |
| Dynamic tracing (DCTD) | Run each solution on private tests under `sys.settrace` | Record opcode event counts |
| Normalization | Transform event counts to PMFs $p_{s,j}$, $q_{s,j}$ | Uses shared opcode vocab |
This pipeline ensures precise isolation of both structural (static) and behavioral (dynamic) code differences for subsequent divergence quantification.
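The dynamic-tracing step can be sketched with CPython's per-opcode tracing hook (available since Python 3.7); this is an illustrative minimal tracer, not the paper's instrumentation:

```python
import sys
import dis
from collections import Counter

def trace_opcodes(func, *args):
    """Run func and count the names of every opcode it executes (Python 3.7+)."""
    counts = Counter()

    def tracer(frame, event, arg):
        # Request per-opcode events for every frame entered under tracing
        frame.f_trace_opcodes = True
        if event == "opcode":
            op = frame.f_code.co_code[frame.f_lasti]
            counts[dis.opname[op]] += 1
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return counts
```

For example, `trace_opcodes(lambda n: sum(range(n)), 10)` returns a `Counter` over opcode names; the shared opcode vocabulary for the PMFs can then be the union of keys across all solutions.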
3. Algorithmic Recipe and Pseudocode
The computation of DCTD_JSD proceeds via the following pseudocode, which operationalizes the above definition:
```
inputs: list of m correct solutions sol[1..m], private tests test[1..r],
        opcode weights w[1..d], mixing coefficient alpha

for j in 1..r:
    # collect PMFs for each solution on test j
    for s in 1..m:
        counts = run_and_trace(sol[s], test[j])   # opcode count vector c[1..d]
        total = sum(counts)
        for i in 1..d:
            p[s][i] = counts[i] / total
        weighted_total = sum over i of (w[i] * counts[i])
        for i in 1..d:
            q[s][i] = (w[i] * counts[i]) / weighted_total

    # average pairwise JSD among the m PMFs
    pair_struc = 0; pair_cost = 0; pairs = 0
    for s in 1..m:
        for t in s+1..m:
            pair_struc += JSD(p[s], p[t])
            pair_cost  += JSD(q[s], q[t])
            pairs += 1
    avg_struc[j] = pair_struc / pairs
    avg_cost[j]  = pair_cost / pairs

DCTD_JSD = alpha * (sum_j avg_struc[j] / r) + (1 - alpha) * (sum_j avg_cost[j] / r)
output DCTD_JSD
```
This procedure outputs a scalar in $[0, 1]$ representing the average dynamic divergence of functionally correct solutions.
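A compact, self-contained Python rendering of this procedure, operating on precomputed opcode count vectors rather than live traces, with a standard base-2 JSD helper (all names are illustrative):

```python
from math import log2
from itertools import combinations

def _jsd(p, q):
    """Base-2 Jensen-Shannon divergence between two PMFs, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(xi * log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dctd_jsd(counts, weights, alpha):
    """counts[s][j] is the opcode count vector of solution s on test j.

    Returns the aggregate DCTD_JSD in [0, 1]."""
    m, r = len(counts), len(counts[0])
    struc_terms, cost_terms = [], []
    for j in range(r):
        p, q = [], []
        for s in range(m):
            c = counts[s][j]
            total = sum(c)
            p.append([ci / total for ci in c])
            wc = [w * ci for w, ci in zip(weights, c)]
            wtotal = sum(wc)
            q.append([x / wtotal for x in wc])
        pairs = list(combinations(range(m), 2))
        struc_terms.append(sum(_jsd(p[s], p[t]) for s, t in pairs) / len(pairs))
        cost_terms.append(sum(_jsd(q[s], q[t]) for s, t in pairs) / len(pairs))
    return alpha * sum(struc_terms) / r + (1 - alpha) * sum(cost_terms) / r
```

Two identical solutions yield 0; two solutions whose traces use disjoint opcode sets yield 1, matching the metric's bounds.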
4. Relationship with SCTD and Behavioral Expression Factor (BEF)
DCTD is closely linked to Static Canonical Trace Divergence (SCTD) and the Behavioral Expression Factor (BEF):
- SCTD: Derived identically to DCTD but uses opcode PMFs obtained from static compilation (via `dis`) rather than dynamic execution. It quantifies diversity in algorithmic structure regardless of runtime path coverage.
- DCTD: Measures true behavioral divergence under actual inputs as executed in an instrumented environment.
- BEF: The ratio of dynamic to static divergence, $\mathrm{BEF} = \mathrm{DCTD} / \mathrm{SCTD}$.

Interpretive thresholds:
- $\mathrm{BEF} \ll 1$: High static diversity, low dynamic divergence (suggesting redundant differences not exercised by tests).
- $\mathrm{BEF} \approx 1$: Structure and behavior are aligned.
- $\mathrm{BEF} \gg 1$: Small structural differences yield large runtime effects, flagging instability.
This triad—SCTD, DCTD, BEF—forms a diagnostic suite for code generation evaluation surpassing nominal correctness.
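Reading BEF as the ratio of dynamic to static divergence, the three interpretive regimes can be encoded as a simple check (the tolerance band `tol` is an assumed parameter, not from the source):

```python
def classify_bef(sctd, dctd, tol=0.25, eps=1e-9):
    """Classify the Behavioral Expression Factor BEF = DCTD / SCTD."""
    bef = dctd / max(sctd, eps)
    if bef < 1 - tol:
        return bef, "latent"     # static diversity not exercised by the tests
    if bef > 1 + tol:
        return bef, "amplified"  # small structural deltas, large runtime effects
    return bef, "aligned"        # structure and behavior agree
```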
5. Empirical Characterization and Observed Trends
Empirical evaluation on BigOBench and CodeContests reveals characteristic DCTD behaviors:
- On BigOBench (temperature=0.7), DCTD_JSD typically has a median of ~0.01 (1% of max), with IQR 0.5–1.5%. CodeContests records slightly higher medians (~1–2%) due to more varied control flow.
- DCTD_JSD increases monotonically with sampling temperature:
| Temperature | DCTD_JSD Range |
|-------------|----------------|
| 0.0 | 0.005 – 0.008 |
| 0.7 | 0.010 – 0.015 |
| 0.95 | 0.015 – 0.020 |
This trend denotes a "penalty of instability"—the pursuit of higher solution diversity and correctness rates comes at the cost of increased behavioral inconsistency.
- Occasional outlier runs yield DCTD_JSD up to 10–15%, indicating that certain generation parameters or prompts can produce solutions with widely divergent or potentially catastrophic runtime profiles.
A plausible implication is that relying solely on functional correctness metrics can obscure latent instability risks in LLM-generated code, especially when sampling-driven diversity is emphasized.
6. Practical Guidelines for Interpretation and Evaluation Integration
Practical use of DCTD involves threshold-based interpretation and automatic workflow adaptation:
| DCTD Value Range | Meaning |
|---|---|
| $\approx 0$ | Perfect runtime stability |
| $< 0.02$ | Low to acceptable instability; suitable for production |
| $0.02$ to $0.05$ | Moderate concern; recommend detailed profiling or more rigorous tests |
| $> 0.05$ | Critical risk; possible severe performance divergence |
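These bands can be encoded as a straightforward policy check (band boundaries of 0.02 and 0.05 follow the section's thresholds; the function name is illustrative):

```python
def interpret_dctd(value, warn=0.02, critical=0.05, eps=1e-6):
    """Map a DCTD score in [0, 1] to a qualitative risk band."""
    if value < eps:
        return "stable"        # effectively perfect runtime stability
    if value < warn:
        return "acceptable"    # low instability; production-suitable
    if value < critical:
        return "review"        # profile further or add rigorous tests
    return "critical"          # severe divergence risk
```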
Recommended actions when DCTD exceeds policy thresholds (e.g., 0.05):
- Increase private-test coverage to better probe behavioral diversity.
- Lower LLM sampling temperature to encourage more consistent algorithms.
- Rerank or filter solutions using dynamic cost-based PMF minima.
- Incorporate DCTD directly as an objective during fine-tuning or reinforcement learning to bias toward behavioral stability.
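One plausible reading of the cost-based reranking action above: order solutions by their mean dynamically weighted opcode cost across tests and prefer the cheapest (all names here are illustrative):

```python
def rerank_by_cost(counts, weights):
    """counts[s][j]: opcode count vector of solution s on test j.

    Returns solution indices sorted by mean weighted opcode cost, cheapest first."""
    def mean_cost(per_test):
        costs = [sum(w * c for w, c in zip(weights, cvec)) for cvec in per_test]
        return sum(costs) / len(costs)
    return sorted(range(len(counts)), key=lambda s: mean_cost(counts[s]))
```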
Integrating DCTD in the code-generation pipeline extends evaluation beyond binary pass/fail outcomes, capturing whether correct code will behave predictably and efficiently in real-world deployments. Combining SCTD, DCTD, and BEF enables comprehensive assessment of both structural and dynamic dimensions of LLM-generated code.
7. Significance and Context in LLM Code Generation Evaluation
DCTD addresses a critical limitation in prevailing LLM code-generation benchmarks and methodologies, which focus on correctness but neglect the considerable variability in runtime behavior and computational cost latent within correct solutions. With empirical evidence showing that elevated DCTD accompanies popular strategies for improving accuracy (such as higher sampling temperatures), these findings advocate for stability-aware objectives, updated benchmarks incorporating asymptotic test cases, and explicit measurement of behavioral divergence as prerequisites for robust, production-oriented evaluation. By providing a systematic, principled approach to behavioral analysis, DCTD constitutes a core advancement in the quantitative assessment of code generation by LLMs (Rajput et al., 7 Nov 2025).