Secure Transpiler & Executor (STELP)

Updated 16 January 2026

STELP takes a formally grounded approach to safe execution: it blocks known vulnerability patterns and enforces strict resource controls on LLM-generated code.
STELP employs a modular pipeline with AST validation, sandboxed execution, and feedback-driven auto-repair to maintain code integrity.
Empirical results show perfect blocking of unsafe code, near-perfect safe code allowance, and minimal latency overhead in automated software development.

The Secure Transpiler and Executor of LLM-Generated Program (STELP) is a formally grounded system for transpiling and securely executing code produced by LLMs, combining static analysis, dynamic instrumentation, and feedback-driven repair. STELP addresses fundamental safety, correctness, and resource control problems inherent in direct execution of LLM-generated code, particularly in the context of automated software development, headless code execution, and multi-agent systems (Shinde et al., 9 Jan 2026). Its architecture generalizes and extends approaches originally developed for automated, enclave-based code partitioning and protection, notably those used in AutoTEE (Han et al., 19 Feb 2025).

1. Motivation, Threat Model, and Formal Guarantees

LLMs frequently generate code containing correctness defects, resource abuse vectors, and vulnerabilities drawn from known attack surfaces (e.g., CWE classes). Direct execution risks include unauthorized file/network access, arbitrary code execution via reflective constructs (such as exec or __import__), privilege escalation, and denial-of-service via infinite loops or excessive recursion. Traditional mitigations such as manual review and conventional fuzz testing are often impractical or untrustworthy in AI-driven, fully automated workflows (Shinde et al., 9 Jan 2026).

STELP adopts the threat model of an adversarial code generator. Given a user prompt $p$, an LLM may produce a code snippet $c \sim LLM(p)$, possibly containing fragments from a known vulnerability set $V$. STELP enforces a safe grammar $\mathcal{G}_{safe} \subseteq \mathcal{G}_{Python}$, excluding constructs with known vulnerabilities. A snippet $c$ is declared “unsafe” if $c \notin L(\mathcal{G}_{safe})$ or if it contains statically detected vulnerability patterns (Shinde et al., 9 Jan 2026).

Safe typing is formalized as

$\frac{\Gamma\vdash e:T\;\;\;\;\mathit{safe}(T)}{\Gamma\vdash_{safe}e}$

where $\Gamma\vdash e:T$ is a standard typing judgment and $\mathit{safe}(T)$ holds when the type or effect $T$ is permitted.
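
As an illustration of how the rule is read (the policy here is hypothetical, not taken from the paper), suppose $\mathit{safe}(\mathtt{int})$ holds and $\Gamma$ assigns $xs$ a list type, so that $\Gamma\vdash \mathtt{len}(xs):\mathtt{int}$. The rule then derives

$\frac{\Gamma\vdash \mathtt{len}(xs):\mathtt{int}\;\;\;\;\mathit{safe}(\mathtt{int})}{\Gamma\vdash_{safe}\mathtt{len}(xs)}$

By contrast, if an expression can only be typed at some $T'$ for which $\mathit{safe}(T')$ does not hold, the second premise fails and no safe judgment is derivable for it.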

2. System Architecture and Pipeline

STELP is structured as a modular pipeline consisting of:

Abstract Syntax Tree (AST) Processor: Input code is parsed into an AST and validated node by node against $\mathcal{G}_{safe}$. All nodes outside the permitted grammar are rejected. Configuration covers allowed built-ins and imports, tool invocations, loop and recursion bounds, and tool-call timeouts and retry policy.
Safe Code Generator & Executor: A validated AST undergoes transformation for runtime protection: loops are instrumented with iteration bounds, external calls are wrapped with proxy stubs (with timeouts/auto-retries), and I/O/system interactions are redirected through a sandboxed microservice. The resulting AST is compiled to bytecode and executed statement-wise with enforcement of bounds and policies. Any rule violation causes execution halt via a custom exception (STELPException).
Feedback Generator: If execution is blocked, exception metadata (type, violated rule, stack trace) is collected. An LLM-powered critic (e.g., Llama 3 70B) produces human-readable repair suggestions, which can be automatically cycled back into the candidate snippet to facilitate self-repair, closing the loop.
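
The feedback stage can be pictured as a bounded retry loop. The sketch below is illustrative only: `BlockedError`, `run_with_repair`, and the three callables (`execute`, `critic`, `regenerate`) are hypothetical names standing in for the executor, the LLM critic, and the code generator, and the default budget of two repairs is an assumption.

```python
from typing import Any, Callable


class BlockedError(Exception):
    """Hypothetical stand-in for the STELPException raised on a rule violation."""

    def __init__(self, rule: str, detail: str) -> None:
        super().__init__(f"{rule}: {detail}")
        self.rule = rule
        self.detail = detail


def run_with_repair(
    snippet: str,
    execute: Callable[[str], Any],
    critic: Callable[[str, BlockedError], str],
    regenerate: Callable[[str, str], str],
    max_repairs: int = 2,
) -> Any:
    """Execute `snippet`; on a block, ask the critic for feedback and retry."""
    for attempt in range(max_repairs + 1):
        try:
            return execute(snippet)
        except BlockedError as err:
            if attempt == max_repairs:
                raise  # repair budget exhausted; surface the last violation
            # The critic turns exception metadata into a repair suggestion,
            # which the generator uses to produce the next candidate.
            feedback = critic(snippet, err)
            snippet = regenerate(snippet, feedback)
```

Keeping the critic and the generator as separate callables mirrors the split between the Feedback Generator and the code-producing LLM, and makes the loop easy to exercise with stubs.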

A formal grammar underpinning STELP includes production rules for safe statements (if, for, assignments), restricted function calls, and expressions, explicitly omitting constructs such as eval, dynamic imports, or metaprogramming (Shinde et al., 9 Jan 2026).
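
Grammar enforcement of this kind can be approximated with an allowlist walk over Python's standard `ast` module. This is a minimal sketch, not the paper's grammar: `ALLOWED_NODES`, `ALLOWED_CALLS`, `UnsafeCodeError`, and `validate` are hypothetical, and the allowlist is deliberately tiny.

```python
import ast

# Hypothetical allowlists; the paper's actual safe grammar is not reproduced here.
ALLOWED_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.If, ast.For,
    ast.Call, ast.Name, ast.Load, ast.Store, ast.Constant,
    ast.BinOp, ast.Add, ast.Sub, ast.Mult,
    ast.Compare, ast.Lt, ast.Gt, ast.Eq, ast.List, ast.Tuple,
)
ALLOWED_CALLS = {"len", "range", "sum", "print"}


class UnsafeCodeError(Exception):
    """Raised when a snippet falls outside the illustrative safe grammar."""


def validate(source: str) -> ast.Module:
    """Parse `source` and reject any node or call target not on the allowlist."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise UnsafeCodeError(f"node type not permitted: {type(node).__name__}")
        if isinstance(node, ast.Call):
            # Only direct calls to allowlisted names pass; attribute and
            # computed call targets are rejected outright.
            if not (isinstance(node.func, ast.Name) and node.func.id in ALLOWED_CALLS):
                raise UnsafeCodeError("call target not permitted")
    return tree
```

Under these assumptions `validate("total = sum(range(10))")` returns a tree, while `import os`, `exec('1')`, and `__import__('os').system('ls')` each raise `UnsafeCodeError`. An allowlist fails closed: any construct the policy author did not anticipate is rejected by default.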

3. Secure Execution, Sandboxing, and Instrumentation

STELP enforces sandboxing and resource control at execution:

All external tool and function calls are proxied through an isolated microservice container. Network and filesystem access are sandboxed, and system-call filtering enforces policy.
For and while loops are instrumented to enforce a maximum iteration count, and a global wall-clock time budget applies.
On violation (e.g., exceeding loop bounds, triggering a forbidden call or timeout), execution is forcibly terminated with informative diagnostics.
Memory safety and resource bounding are each enforced through a formal condition stated in the paper (Shinde et al., 9 Jan 2026); those equations are not reproduced here.

STELP's instrumentation is designed to preserve the semantics of pure computations, modifying only the execution context and resource usage.
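
Loop instrumentation of this sort can be sketched with `ast.NodeTransformer`: a counter call is inserted at the top of every loop body, leaving the rest of the body untouched. The names `LoopBoundExceeded`, `LoopBounder`, `run_bounded`, and `_stelp_tick` are hypothetical, and the sketch assumes a single iteration budget shared by all loops. It runs the code with plain `exec` and full built-ins, so it shows only the bounding step, not sandboxing.

```python
import ast


class LoopBoundExceeded(Exception):
    """Hypothetical stand-in for the STELPException raised on a bound violation."""


def _tick(counter: list, limit: int) -> None:
    counter[0] += 1
    if counter[0] > limit:
        raise LoopBoundExceeded(f"loop budget of {limit} iterations exceeded")


class LoopBounder(ast.NodeTransformer):
    """Insert a call to `_stelp_tick()` as the first statement of each loop body."""

    def _guard(self, node):
        self.generic_visit(node)  # instrument nested loops first
        tick_stmt = ast.parse("_stelp_tick()").body[0]
        node.body.insert(0, tick_stmt)
        return node

    visit_For = _guard
    visit_While = _guard


def run_bounded(source: str, max_iterations: int = 1000) -> dict:
    """Instrument, compile, and run `source`; return its resulting namespace."""
    tree = LoopBounder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    counter = [0]
    env = {"_stelp_tick": lambda: _tick(counter, max_iterations)}
    exec(compile(tree, "<snippet>", "exec"), env)
    return env
```

Because the inserted statement only touches the counter, a loop that stays within budget computes the same values as the uninstrumented code, while `while True: pass` is halted once the budget is spent.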

Empirical execution overhead is minimal: a mean increase of 4.93 ms per program with a sub-millisecond median, and low statement-level instrumentation overheads for assignments, loops, and calls (Shinde et al., 9 Jan 2026).

4. Dataset, Benchmarking, and Empirical Results

Evaluation leverages the InjectedHumanEval (IHE) benchmark, based on an extended HumanEval dataset (634 samples: 164 safe, 470 unsafe). Unsafe samples cover CWEs including code injection, unsafe reflection, deserialization, path traversal, resource exhaustion, and untrusted code execution—all LLM-injected and human-validated.

STELP is assessed using:

True Block Rate (TBR):

$\mathrm{TBR} = \frac{|\{\text{unsafe samples blocked}\}|}{|\{\text{unsafe samples}\}|}$

True Allow Rate (TAR):

$\mathrm{TAR} = \frac{|\{\text{safe samples allowed}\}|}{|\{\text{safe samples}\}|}$

Correctness (Python execution outputs):

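$\mathrm{Correctness} = \frac{|\{\text{executed samples whose output matches native Python execution}\}|}{|\{\text{executed samples}\}|}$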

Latency Overhead: Aggregate (mean, median, SD, IQR).
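
Reading the metric names literally, the rates are simple ratios over the unsafe and safe subsets, and the latency aggregate is a standard summary of per-program overheads. The helper names below are hypothetical, and the quartile method is the default of Python's `statistics.quantiles`, which may differ from the one used in the paper.

```python
import statistics
from typing import Dict, Sequence


def true_block_rate(unsafe_blocked: int, unsafe_total: int) -> float:
    """Fraction of unsafe samples that were blocked."""
    return unsafe_blocked / unsafe_total


def true_allow_rate(safe_allowed: int, safe_total: int) -> float:
    """Fraction of safe samples that were allowed to run."""
    return safe_allowed / safe_total


def latency_summary(deltas_ms: Sequence[float]) -> Dict[str, float]:
    """Mean, median, sample standard deviation, and IQR of overheads in ms."""
    q1, _, q3 = statistics.quantiles(deltas_ms, n=4)
    return {
        "mean": statistics.mean(deltas_ms),
        "median": statistics.median(deltas_ms),
        "sd": statistics.stdev(deltas_ms),
        "iqr": q3 - q1,
    }
```

For instance, blocking all 470 unsafe IHE samples gives `true_block_rate(470, 470) == 1.0`. Reporting the median and IQR alongside the mean matters because a few slow programs can pull the mean well above the typical overhead.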

Results:

Method	TBR	TAR	Correctness	Mean ΔT (ms)
CodeShield	0.68	0.93	-	-
STELP	1.00	0.981	1.00	4.93

On IHE, STELP achieved perfect blocking (TBR=1.00), near-perfect allowance of safe code (TAR=0.981), zero functional regression (correctness = 1.0 on 361 evaluation samples), and sub-millisecond median overhead. Feedback-driven auto-repair successfully fixed 90.2% of blocked samples within two iterations (Shinde et al., 9 Jan 2026).

5. Design Principles from AutoTEE to STELP

STELP generalizes core principles pioneered in AutoTEE (Han et al., 19 Feb 2025):

Sensitive-Code Identification: STELP employs LLM-based sensitive-function detection, retaining a ReAct-style, multi-prompt approach originally developed for enclave partitioning (cryptography, serialization, etc.).
Partitioning: Moving beyond leaf-function identification, STELP uses closure over the call graph to recursively inline or port all helper routines that participate in sensitive computations.
Transformation: Adopts AutoTEE's iterative, LLM-assisted transformation, but with a plugin architecture supporting Rust, WebAssembly (WASM), and WASI targets. The pipeline includes control-flow/type conversion, security boundary annotation, and automatic marshaling code generation.
Portability and Execution: STELP automates enclave creation for SGX (via Fortanix/rust-sgx), SEV (via micro-VM/image), and WASI (via Wasmtime), supporting dynamic linking to enclave APIs.

STELP evaluates this workflow with the F₁ and transformation-success metrics inherited from AutoTEE, which reports F₁ = 0.91 and transformation success of 90% (Java) and 83% (Python) (Han et al., 19 Feb 2025).

6. Comparison with Related Approaches

A technical capabilities comparison with Meta’s CodeShield highlights:

Capability	CodeShield	STELP
Enforce allowed grammar	✗	✓
Restrict built-ins & imports	✗	✓
Loop/stack-depth bounds	✗	✓
Tool-call timeouts & retries	✗	✓
Feedback + auto-repair	✗	✓

STELP’s configuration architecture enables fine-grained policy definition, while CodeShield lacks enforceable grammars and resource policy.
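
A policy covering the configurable items named in the pipeline description (built-ins, imports, tools, loop and recursion bounds, tool-call timeouts and retries) might be modelled as an immutable record. This is a sketch under assumptions: the class name `ExecutionPolicy`, every field name, and every default value are hypothetical and do not reflect STELP's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionPolicy:
    """Illustrative policy; every field name and default here is an assumption."""

    allowed_builtins: frozenset = frozenset({"len", "range", "sum", "min", "max"})
    allowed_imports: frozenset = frozenset({"math", "json"})
    allowed_tools: frozenset = frozenset()
    max_loop_iterations: int = 10_000
    max_recursion_depth: int = 50
    tool_timeout_seconds: float = 5.0
    tool_max_retries: int = 2

    def allows_import(self, module: str) -> bool:
        # Match on the top-level package so "json.decoder" follows "json".
        return module.split(".")[0] in self.allowed_imports

    def allows_builtin(self, name: str) -> bool:
        return name in self.allowed_builtins
```

A stricter deployment could then be expressed as `ExecutionPolicy(allowed_imports=frozenset(), max_loop_iterations=100)`. Freezing the dataclass keeps a policy from being mutated by code that runs under it.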

7. Limitations and Future Directions

STELP presently supports only Python; support for Java, SQL, and other languages is forthcoming. Vulnerability repair relies on iterative feedback rather than deep AST rewriting. Deeper formal verification (e.g., concurrency safety) and static analysis integration remain open areas. Anticipated future work includes code optimization, concurrent function-scheduling, richer type enforcement, and multi-language transpilation (Shinde et al., 9 Jan 2026).

A plausible implication is that STELP’s checker–transformer–porter pattern, combined with a feedback-driven repair loop and sandboxed enforcement, constitutes a standardized pathway for integrating LLM-driven code generation into high-assurance systems.

References

"STELP: Secure Transpilation and Execution of LLM-Generated Programs" (Shinde et al., 9 Jan 2026)
"AutoTEE: Automated Migration and Protection of Programs in Trusted Execution Environments" (Han et al., 19 Feb 2025)
