Semi-Structured DSL Traces
- Semi-structured DSL traces are a specialized format with explicit step-type labeling and input-output annotations to ensure clear, auditable execution logs.
- They employ a constrained grammar using elements like <think>, <partial_program>, and <program_trace> to support both symbolic and statistical auditing methods.
- Empirical results show that these traces improve both task accuracy and audit reliability in systems such as SSRMs and tools like Walrus.
Semi-structured DSL traces are a class of program traces designed for maximum analyzability and auditability, balancing machine-readability with the needs of complex or human-in-the-loop reasoning. They provide restricted, task-specialized structure—eschewing the full generality (and executability) of a traditional programming language in favor of explicit step-type labeling, input-output annotations, and uniform, parseable formatting. These traces play a central role in the transparency and debuggability of systems ranging from LLMs with internalized reasoning (“Semi-Structured Reasoning Models,” SSRMs) (Leng et al., 30 May 2025) to relational programming environments such as Walrus (Cuéllar et al., 2 Oct 2025).
1. Formal Structure and Syntax of Semi-Structured DSL Traces
Semi-structured DSL traces are defined by a constrained grammar that exposes explicit: (a) reasoning step types (e.g., function names), (b) corresponding inputs/outputs, and (c) logical groupings. For SSRMs (Leng et al., 30 May 2025), the syntax is formally specified with extended Backus-Naur Form (EBNF) as follows:
- Session block: encapsulated by `<think> … </think>` and partitioned into:
  - `<partial_program>`: declarations of @traced functions, with explicit type signatures and brief docstrings describing their semantics. Function names encode the atomic reasoning step.
  - `<program_trace>`: a list of paired step invocations, each “Calling FnName(args)...” followed by “…FnName returned result”, forming a flat but fully sequential trace of reasoning.
  - `<answer>`: the model’s final answer, outside the trace, for task assessment.
- Restrictive vocabulary: only the declared @traced functions are admissible in the `<program_trace>`. Each invocation explicitly lists its arguments and result, typically as string or tuple types.
- Compositionality: traces can encode hierarchical dependencies and data-flow through consistent calling/returning of step results, preserving the provenance of each intermediate value.
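The call/return convention can be sketched with a minimal Python decorator. The `@traced` name follows the paper; this logging implementation is an illustrative assumption, not the SSRM internals (SSRM traces are emitted by the model, not executed):

```python
import functools

TRACE_LOG: list[str] = []  # accumulates "Calling …" / "… returned" lines

def traced(fn):
    """Log each call and return in the paired format used inside <program_trace>."""
    @functools.wraps(fn)
    def wrapper(*args):
        TRACE_LOG.append(f"Calling {fn.__name__}({', '.join(map(repr, args))})...")
        result = fn(*args)
        TRACE_LOG.append(f"...{fn.__name__} returned {result!r}")
        return result
    return wrapper

@traced
def convert_units(value_with_unit: str) -> str:
    # Illustrative conversion rule only: mg/dL -> g/dL.
    value, unit = value_with_unit.split()
    if unit == "mg/dL":
        return f"{float(value) / 1000} g/dL"
    return value_with_unit

convert_units("12 mg/dL")
print("\n".join(TRACE_LOG))
```

Running the decorated function appends the paired "Calling …" / "… returned" lines to the log, mirroring the flat sequential trace format.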
For relational programming (as in Walrus (Cuéllar et al., 2 Oct 2025)), semi-structured traces are realized as a ledger of tagged events, each carrying a “Path” (a list of branch indices per disjunction) and event type (function call, unification, disjunction entry, or user-tag). The flat ledger is then converted into a rose tree that mirrors the logical call structure.
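The ledger-to-tree conversion can be sketched in a few lines of Python (Walrus itself is Haskell; the event shapes and field names below are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    events: list = field(default_factory=list)
    children: dict = field(default_factory=dict)  # branch index -> child TreeNode

def ledger_to_tree(ledger):
    """Rebuild the nesting structure from a flat list of (path, event) pairs,
    where path is the list of disjunction branch indices for that event."""
    root = TreeNode()
    for path, event in ledger:
        node = root
        for branch in path:
            node = node.children.setdefault(branch, TreeNode())
        node.events.append(event)
    return root

# Hypothetical ledger: a call at the root, then events under branches 0 and 1.
ledger = [
    ([], ("call", "binLists")),
    ([0], ("unify", "_0", "[]")),
    ([1], ("disj", 2)),
]
tree = ledger_to_tree(ledger)
```

Each event's path prefix determines where it lands in the tree, so the flat ledger and the rose tree carry the same information.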
2. Internalization and Generation in Model-Based Reasoners
In SSRMs (Leng et al., 30 May 2025), the semi-structured trace DSL is internalized by the LLM via tokenization (ensuring all structural tags and function names are reserved in the vocabulary), supervised fine-tuning (enforcing the tag-delineated format with cross-entropy loss), and reinforcement learning with verifiable rewards (RLVR). RLVR constrains the output format by rewarding only traces that:
- Balance all structural tags.
- Declare at least three @traced functions.
- Restrict all trace-invoked functions to those declared in the partial program.
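These three constraints amount to a simple format check. The sketch below is a hypothetical reward function; the regexes and the binary reward value are assumptions, not the paper's RLVR implementation:

```python
import re

def format_reward(output: str) -> float:
    """Return 1.0 iff the trace satisfies the three structural constraints, else 0.0."""
    # (1) Every structural tag appears exactly once and is balanced.
    tags = ["think", "partial_program", "program_trace", "answer"]
    if any(output.count(f"<{t}>") != 1 or output.count(f"</{t}>") != 1 for t in tags):
        return 0.0
    # (2) At least three @traced functions are declared.
    declared = set(re.findall(r"@traced\s+def\s+(\w+)", output))
    if len(declared) < 3:
        return 0.0
    # (3) Every invoked function was declared in the partial program.
    invoked = set(re.findall(r"Calling (\w+)\(", output))
    if not invoked or not invoked <= declared:
        return 0.0
    return 1.0
```

A trace that drops a declaration or invokes an undeclared function receives zero reward, which is what pushes the model toward the constrained grammar.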
Empirically, trained models tend to attend to the most recent function call when emitting the corresponding return, yielding strong local coupling of calls and returns and supporting reliable trace parsing and auditing.
In Walrus (Cuéllar et al., 2 Oct 2025), trace entries are generated dynamically as the engine explores the relational search space. Events like tagged calls, unifications, and disjunctions are logged along with their path prefix, reconstructing the nesting structure post-hoc for tree-based navigation and filtering.
3. Vocabulary and Step Labeling
The power of the semi-structured approach derives from the explicit, task-aligned vocabulary of reasoning steps. In SSRMs (Leng et al., 30 May 2025), typical examples include:
| Function name | Signature | Purpose |
|---|---|---|
| analyze_input | (input_str: str) → tuple[str,…] | Extracts logical “rules” and datapoints |
| extract_patient_data | (input_str: str) → tuple[str,…] | Domain-specific data extraction |
| convert_units | (value_with_unit: str) → str | Standardizes measurement units |
| evaluate_rule | (rule_str: str, data_str: str) → float | Numeric evaluation of symbolic rules |
| accumulate_score | (current: float, delta: float) → float | Score tracking across multi-step reasoning |
Each function is decorated with @traced and must be accompanied by a docstring summarizing its transformation semantics. Only these, and no others, appear in the trace for that reasoning session. This restriction ensures both full transparency of the reasoning process and uniformity for downstream auditing.
In Walrus (Cuéllar et al., 2 Oct 2025), step types are not restricted to function calls but also include unification and disjunction choice events, further generalizing the form of traceable computations.
4. Auditing Methods: Structured and Statistical Analysis
The distinctive feature of semi-structured DSL traces is their auditability:
- Hand-crafted structured audits (Leng et al., 30 May 2025): Traces are converted into tabular representations (e.g., pandas DataFrame with columns for step function, inputs, output). Logical predicates are then applied, e.g., ensuring each rule is evaluated once, checking that unit conversion precedes rule evaluation, or summing all score deltas to match the final answer. Failing these predicates flags the trace as suspect.
- Learned typicality audits: each trace is treated as a sequence of step names. Probabilistic models (n-gram multinomial models or HMMs with state transitions and emissions) are fit to a corpus of training traces, and an anomaly score is computed for each test trace; traces with high anomaly scores are deemed “atypical” and likely erroneous. Thresholding or tertile splits yield binary or three-way audit pass/fail classifications.
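As an illustration of the n-gram variant, the sketch below fits a Laplace-smoothed bigram model over step-name sequences and scores traces by mean negative log-likelihood; the smoothing and scoring details are assumptions, not the paper's exact models:

```python
import math
from collections import Counter

def fit_bigrams(traces):
    """Count bigrams and context unigrams over step-name sequences, with a <s> start token."""
    bi, uni = Counter(), Counter()
    for t in traces:
        seq = ["<s>"] + t
        uni.update(seq[:-1])
        bi.update(zip(seq[:-1], seq[1:]))
    return bi, uni

def anomaly_score(trace, bi, uni, vocab_size, alpha=1.0):
    """Mean negative log-probability under the smoothed bigram model; higher = more atypical."""
    seq = ["<s>"] + trace
    nll = 0.0
    for a, b in zip(seq[:-1], seq[1:]):
        p = (bi[(a, b)] + alpha) / (uni[a] + alpha * vocab_size)
        nll -= math.log(p)
    return nll / len(trace)

# Hypothetical corpus of step-name sequences from training traces.
train = [["analyze_input", "convert_units", "evaluate_rule"]] * 5
bi, uni = fit_bigrams(train)
vocab = {s for t in train for s in t} | {"<s>"}
typical = anomaly_score(["analyze_input", "convert_units", "evaluate_rule"], bi, uni, len(vocab))
atypical = anomaly_score(["evaluate_rule", "analyze_input", "convert_units"], bi, uni, len(vocab))
```

A trace that reorders the usual steps receives a markedly higher score than one matching the training distribution, which is the signal the tertile split thresholds.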
In Walrus (Cuéllar et al., 2 Oct 2025), the tree-structured traces admit recursive pattern queries (e.g., filtering for specific function calls or examining all unification events beneath a subgoal), supporting both manual and automated trace audits.
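Such a recursive query can be sketched over a dict-based rose tree in a few lines of Python (the node shape mirrors, but does not reproduce, Walrus's Haskell TreeNode):

```python
def find_nodes(node, predicate):
    """Depth-first collection of every subtree node satisfying the predicate."""
    hits = [node] if predicate(node) else []
    for child in node.get("children", []):
        hits.extend(find_nodes(child, predicate))
    return hits

# Hypothetical trace tree with one unification event beneath a disjunction node.
tree = {
    "label": "binLists", "events": [],
    "children": [
        {"label": "disj", "events": [("disj", 2)],
         "children": [{"label": None, "events": [("unify", "_0", "[]")], "children": []}]},
    ],
}
unifications = find_nodes(tree, lambda n: any(e[0] == "unify" for e in n["events"]))
```

The same traversal supports filtering by function-call name or by event type, which is what makes tree-shaped traces convenient for both manual and automated audits.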
5. Example Traces and Data-Flow
A representative SSRM trace for a medical-rule task demonstrates compositional data-flow:
```
<think>
<partial_program>
@traced
def analyze_input(note: str) -> tuple[list[str], list[str]]:
    """Extracts (rules, patient_measurements)."""
    ...
@traced
def get_data(item: str) -> str: ...
@traced
def convert_units(value_str: str) -> str: ...
@traced
def evaluate_rule(rule: str, value: str) -> float: ...
@traced
def accumulate_score(score: float, delta: float) -> float: ...
</partial_program>
<program_trace>
Calling analyze_input("…patient note…")…
…analyze_input returned (["RuleA","RuleB","RuleC"],["12 mg/dL","7 mg/dL"])
Calling get_data("RuleA")…
…get_data returned "threshold=10 mg/dL"
Calling convert_units("12 mg/dL")…
…convert_units returned "0.012 g/dL"
Calling evaluate_rule("threshold=10 mg/dL","0.012 g/dL")…
…evaluate_rule returned 1.0
Calling accumulate_score(0.0,1.0)…
…accumulate_score returned 1.0
…(similar for RuleB and RuleC)…
</program_trace>
</think>
<answer>
2.0
</answer>
```
The explicit step-type labeling and flat structure preserve all intermediate results, making omitted or logically misconstrued steps readily identifiable both to humans and auditing algorithms.
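The call/return pairing can be recovered mechanically. The minimal parser below is a sketch; its regexes are assumptions about the surface form, accepting either ASCII "..." or the ellipsis character:

```python
import re

CALL = re.compile(r"Calling (\w+)\((.*)\)(?:\.\.\.|…)")
RET = re.compile(r"(?:\.\.\.|…)(\w+) returned (.*)")

def parse_trace(lines):
    """Pair call/return lines into (fn, args, result) steps; reject mismatched names."""
    steps = []
    pending = None  # the call awaiting its return
    for line in lines:
        if m := CALL.match(line.strip()):
            pending = (m[1], m[2])
        elif m := RET.match(line.strip()):
            if pending is None or pending[0] != m[1]:
                raise ValueError(f"return without matching call: {line!r}")
            steps.append((pending[0], pending[1], m[2]))
            pending = None
    return steps
```

A trace with an omitted or misordered return fails the pairing check immediately, which is exactly the kind of structural salience the format is designed for.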
In Walrus, an example trace appears as a rose tree with nodes labeled by function call names, disjunctions, and unifications, e.g.,
```haskell
TreeNode { nodeLabel = Just "binLists", nodeEvents = [],
           nodeChildren =
             [ TreeNode { nodeLabel = Just "Σ₂", nodeEvents = [EDisj 2],
                          nodeChildren =
                            [ TreeNode { nodeLabel = Nothing, nodeEvents = [EUnify "_0" "[]"], nodeChildren = [] },
                              TreeNode { nodeLabel = Just "Σ₂", nodeEvents = [EDisj 2], ... }
                            ]
                        }
             ]
         }
```
6. Expressiveness, Limitations, and Trade-offs
Semi-structured DSL traces offer multiple advantages (Leng et al., 30 May 2025, Cuéllar et al., 2 Oct 2025):
- Generality: Reasoning steps can consume/produce arbitrary strings, supporting both symbolic and numeric computations.
- Parseability: The call/return pattern and structural keying enable robust parsing and audit tooling.
- Debuggability: Omitted or misplaced steps are visually and algorithmically salient due to explicit labeling.
However, limitations are inherent:
- Not executable: steps are pseudocode rather than runnable code objects, so mechanized re-execution is not possible.
- Partial verification: Auditing focuses on structure (e.g., which steps occurred and in what order), not on the semantic correctness of arbitrary inner computations.
- Coverage gap: Handwritten audits may not anticipate all potential reasoning flaws, though statistical anomaly detection mitigates some of this gap.
The overall design reflects a trade-off: sacrificing arbitrary executability in favor of a highly analyzable, lightweight format that is tightly aligned with model- or search-engine output and suitable for both symbolic and data-driven assessment.
7. Empirical Results and Practical Impact
Key empirical results for SSRMs (Leng et al., 30 May 2025) demonstrate substantial gains in both performance and auditability:
- Performance: SSRM (7B model) with SFT+RL achieves 75.9% on MedCalcV2 formulas (vs. unstructured CoT 52.4%), and 38.9% on MedCalcV2 rules (vs. CoT 27.4%).
- Audit outcomes: Structured audits on MedCalcV2 Rules yield ≈ 20% failure rate; accuracy drops by Δ≈ 0.24–0.26 when an audit fails (p < 0.05).
- Statistical auditing: HMM-based typicality audits yield Kendall τ ≈ 0.23 correlation with correctness on Formulas and τ ≈ 0.15 on Rules; split into tertiles, the top vs. bottom tertile shows a Δ≈ 0.25 accuracy gap.
- Audit-guided generation: Audit-guided self-consistency reduces generated samples by ≈ 45% while matching or slightly improving aggregate accuracy.
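One way to realize audit-guided self-consistency is to stop sampling once enough audit-passing candidates have been collected and majority-vote over their answers; the loop below is an illustrative sketch, not the paper's procedure:

```python
from collections import Counter

def audit_guided_self_consistency(generate, audit, k=5, max_samples=20):
    """Sample until k audit-passing traces are collected (or the budget runs out),
    then majority-vote over the surviving answers.
    `generate` returns a (trace, answer) pair; `audit` returns True on a pass."""
    passing = []
    for _ in range(max_samples):
        trace, answer = generate()
        if audit(trace):
            passing.append(answer)
            if len(passing) >= k:
                break  # early stop: this is where the sample savings come from
    votes = Counter(passing)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical generator cycling through canned (trace, answer) samples.
_samples = iter([("bad", 1), ("good", 2), ("good", 2), ("bad", 3),
                 ("good", 2), ("good", 4), ("good", 2)])
answer = audit_guided_self_consistency(lambda: next(_samples),
                                       lambda trace: trace == "good", k=5)
```

Because audit-failing samples are discarded before voting, fewer total generations are needed to reach a stable majority, matching the reported ≈ 45% reduction in samples.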
A plausible implication is that semi-structured DSL traces serve as a critical substrate for robust, scalable auditing and debugging of both symbolic programming environments and LLM-based reasoners, supporting both deterministic and probabilistic validation and surfacing errors that would be concealed in unstructured text traces (Leng et al., 30 May 2025, Cuéllar et al., 2 Oct 2025).