PolyglotPiranha: Scalable Code Transformation DSL
- PolyglotPiranha is a language-agnostic, declarative DSL designed for high-performance, human-readable code transformations across multiple programming languages.
- It features a modular architecture with a parsing front end, AST matcher, graph-structured rule engine, and code rewriter to efficiently integrate into CI/CD pipelines.
- LLM-powered agents like SPELL synthesize reusable migration scripts, achieving high validation rates and outperforming previous systems in automated API migrations.
PolyglotPiranha is a language-agnostic, declarative domain-specific language (DSL) and transformation engine developed to enable large-scale, automated source-to-source program modifications. Originally engineered at Uber, PolyglotPiranha’s primary goal is to facilitate high-performance, human-readable code transformations—particularly for software refactoring and API migration—across multiple programming languages. In the context of automated API migration, it is the transformation target for synthesized migration logic produced by agents such as SPELL, which leverage LLMs to extract behavioral equivalence and generalize code rewrites into reusable scripts. The engine is architected to operate efficiently in CI/CD pipelines, processing thousands of source files within seconds, and supports maintainability, extensibility, and verifiable transformation outcomes (Ramos et al., 1 Feb 2026).
1. Architecture and Design Principles
PolyglotPiranha’s architecture is modular, supporting composability and performance within heterogeneous codebases. Its primary pipeline components are:
- Parsing Front End: Utilizes a language-agnostic, parser-combinator framework (built atop comby) to produce concrete-syntax trees (CSTs) or abstract-syntax trees (ASTs) with token and location annotations. Parsers for different languages (e.g., Python, Java) are interchangeable, granting the system cross-language (“polyglot”) adaptability.
- AST Matcher / Pattern Engine: Transformation rules are expressed as concrete-syntax patterns containing named template variables (“holes”). The matcher scans CSTs for unifications with these patterns, enabling robust, syntax-aware identification of rewrite targets.
- Rule Graph / Control-flow Engine: Rules are structured as nodes within a directed, labeled graph. Edges denote sequential and scoped application, such as instructing a rewrite only after an import substitution within a file, followed by targeted rewrites in function or class scopes.
- Code Rewriter / Emitter: Matches trigger the replacement of code segments using “after” patterns, typically preserving original formatting and comments unless explicitly altered.
- Runtime / Fixpoint Loop: Rule application follows a depth-first traversal over the rule graph, enqueuing follow-up rules via labeled edges and iterating until a steady state is reached (no further changes). This enables high-throughput, in-memory, multi-file rewriting well suited for CI/CD workflows.
Design goals emphasize declarative transformations, modularity via graph-composed rules, polyglot operation, rapid execution at enterprise scale, and scripts that are readable, testable, and maintainable.
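The pipeline above can be pictured with a toy sketch: rules carry comby-style holes (`:[name]`), labeled edges enqueue follow-up rules, and application iterates to a fixpoint. This is a minimal illustration only; the real engine matches against concrete-syntax trees rather than regexes on plain text, and the rule names and patterns in the usage example are invented for demonstration.

```python
import re
from dataclasses import dataclass, field

HOLE = re.compile(r":\[(\w+)\]")

def to_regex(pattern: str) -> re.Pattern:
    """Compile a comby-style pattern with :[name] holes into a regex."""
    parts, last = [], 0
    for m in HOLE.finditer(pattern):
        parts.append(re.escape(pattern[last:m.start()]))
        parts.append(f"(?P<{m.group(1)}>.+?)")  # each hole becomes a named group
        last = m.end()
    parts.append(re.escape(pattern[last:]))
    return re.compile("".join(parts))

@dataclass
class Rule:
    name: str
    match: str                                  # "before" pattern with holes
    replace: str                                # "after" template reusing the holes
    after: list = field(default_factory=list)   # labeled edges: follow-up rule names

def apply_rule(rule: Rule, source: str) -> str:
    def substitute(m: re.Match) -> str:
        # Fill each hole in the replacement template with its captured text.
        return HOLE.sub(lambda h: m.group(h.group(1)), rule.replace)
    return to_regex(rule.match).sub(substitute, source)

def run(rules: dict, source: str, entry: str) -> str:
    """Depth-first walk over the rule graph, iterated until a fixpoint."""
    changed = True
    while changed:                      # fixpoint loop: stop when no rule fires
        changed = False
        stack = [entry]
        while stack:                    # depth-first traversal of labeled edges
            rule = rules[stack.pop()]
            rewritten = apply_rule(rule, source)
            if rewritten != source:
                source, changed = rewritten, True
                stack.extend(rule.after)   # enqueue follow-up rules
    return source

# Hypothetical two-rule graph: swap an import, then rewrite call sites.
rules = {
    "swap_import": Rule("swap_import", "import oldlib", "import newlib",
                        after=["swap_call"]),
    "swap_call": Rule("swap_call", "oldlib.run(:[arg])",
                      "newlib.execute(:[arg])"),
}
print(run(rules, "import oldlib\nresult = oldlib.run(data)", "swap_import"))
# prints:
# import newlib
# result = newlib.execute(data)
```

Note how `swap_call` only becomes reachable through the edge out of `swap_import`, mirroring the scoped, ordered application described above.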
2. Automated Migration Data Distillation Using LLMs
SPELL integrates LLMs in a structured workflow to collect migration data suitable for synthesis into PolyglotPiranha scripts. The workflow proceeds as follows:
- Generation of Raw Migration Triples: the pipeline emits triples consisting of an LLM-generated program that uses the source library, a migrated implementation targeting the new library, and a test suite expected to validate their behavioral equivalence.
- Validation and Filtering: each triple is compiled and executed, retaining only those in which both the source and migrated programs compile, run, and pass the test suite with ≥ 60% line coverage.
Prompts are staged to generate (1) abstract migration scenarios, (2) diverse implementations, (3) test suites, and (4) migrated versions, resulting in validated migration data representing behavioral equivalence between source and target APIs.
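The filtering step amounts to a simple predicate over per-triple validation results. The record fields below are illustrative stand-ins for the actual compile/test/coverage harness, which the paper does not specify in detail; only the ≥ 60% line-coverage threshold comes from the source.

```python
from dataclasses import dataclass

@dataclass
class TripleResult:
    """Hypothetical validation record for one raw migration triple."""
    source_compiles: bool   # source-library program compiles and runs
    target_compiles: bool   # migrated program compiles and runs
    tests_pass: bool        # shared test suite passes on both versions
    line_coverage: float    # fraction of lines the test suite executes

def is_valid(r: TripleResult, min_coverage: float = 0.60) -> bool:
    """Keep a triple only if both versions work and coverage meets the threshold."""
    return (r.source_compiles and r.target_compiles
            and r.tests_pass and r.line_coverage >= min_coverage)

def filter_triples(results):
    return [r for r in results if is_valid(r)]
```

For example, a triple whose tests pass with 75% coverage is retained, while one at 55% coverage is discarded even if everything else succeeds.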
3. Synthesis of Transformation Scripts: Anti-Unification and Agentic Coordination
Validated code pairs are converted into PolyglotPiranha scripts in a two-phase process:
- Atomic Rule Inference (Anti-Unification):
Diff hunks are computed between each validated code pair, and each hunk is processed by an anti-unification algorithm (adapted from the MELT system): the hunk's removed lines supply the match pattern and its added lines the replace pattern, with differing subterms abstracted into template variables, yielding candidate rewrite rules.
- Agent-based Script Synthesis:
A small LLM (e.g., GPT-4.1), receiving the PolyglotPiranha DSL specification and debugging context, iteratively refines and composes rules into a rule graph:
1. Proposes a candidate graph-structured Piranha script.
2. Executes the script and receives diagnostic feedback.
3. Refines rules, adjusts scoping or ordering, and revalidates using the migration's test suite.
4. Iterates up to ten times or until behavioral equivalence is confirmed.
This orchestrated process yields concise, scoped, and reusable PolyglotPiranha scripts capable of generalizing beyond individual migration examples.
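The atomic-rule inference phase can be illustrated with a toy syntactic anti-unifier over two concrete code lines: tokens that agree are kept literally, tokens that differ become shared named holes (repeated mismatches reuse the same hole). This sketch assumes token-aligned inputs and joins output tokens with spaces; the actual algorithm, adapted from MELT, operates over AST nodes within diff hunks.

```python
import re

TOKEN = re.compile(r"\w+|\S")  # words or single punctuation characters

def anti_unify(a: str, b: str) -> str:
    """Least general generalization of two token-aligned code lines."""
    ta, tb = TOKEN.findall(a), TOKEN.findall(b)
    if len(ta) != len(tb):
        raise ValueError("token-aligned inputs assumed in this sketch")
    out, holes = [], {}
    for x, y in zip(ta, tb):
        if x == y:
            out.append(x)                       # agreeing token: keep literally
        else:
            if (x, y) not in holes:             # same mismatch -> same hole
                holes[(x, y)] = f":[v{len(holes)}]"
            out.append(holes[(x, y)])
    return " ".join(out)

print(anti_unify("cipher = Fernet(key)", "c = Fernet(k)"))
# prints: :[v0] = Fernet ( :[v1] )
```

Because identical mismatches map to the same hole, `anti_unify("x + x", "y + y")` yields `:[v0] + :[v0]`, capturing the repeated-variable constraint that makes inferred patterns safe to generalize.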
4. Illustrative Example: API Migration Script
A concrete instance is provided for migrating from cryptography.fernet to pycryptodome:
- Original Code (cryptography.fernet):
```python
from cryptography.fernet import Fernet

def encrypt_document(document: str, key: bytes) -> bytes:
    cipher = Fernet(key)
    encrypted = cipher.encrypt(document.encode())
    return encrypted
```
- Migrated Code (pycryptodome):
```python
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad

def encrypt_document(document: str, key: bytes) -> bytes:
    cipher = AES.new(key, AES.MODE_CBC)
    iv = cipher.iv  # AES.new generates a random IV in CBC mode
    padded_data = pad(document.encode(), AES.block_size)
    encrypted = iv + cipher.encrypt(padded_data)
    return encrypted
```
- Synthesized PolyglotPiranha Script (excerpt):
```yaml
- name: replace_import
  match: from cryptography.fernet import Fernet
  replace: |
    from Crypto.Cipher import AES
    from Crypto.Util.Padding import pad
  after:
    - scope: File
      rule: replace_decl

- name: replace_decl
  match: *:[var] = Fernet(key)
  replace: cipher = AES.new(key, AES.MODE_CBC)
  after:
    - scope: Function
      rule: replace_encrypt

- name: replace_encrypt
  match: *:[var] = *:[var2].encrypt(*:[data])
  replace: |
    padded_data = pad(*:[data], AES.block_size)
    *:[var] = iv + *:[var2].encrypt(padded_data)
```
5. Empirical Evaluation and Comparative Performance
SPELL’s approach, leveraging PolyglotPiranha scripts, was benchmarked across ten popular Python library migrations against MELT—a prior anti-unification-based system. For each use case, success metrics included the number of validated test-triples, first-try script synthesis rates, and “sibling success” (the proportion of alternative test cases solved by the same script).
| Migration | Valid Triples | SPELL Success | MELT Success | Sibling % |
|---|---|---|---|---|
| argparse → click | 215 | 44.2% | 17.2% | 53.5% |
| json → orjson | 269 | 96.7% | 57.6% | 88.6% |
| logging → loguru | 114 | 85.1% | 72.8% | 68.8% |
| cryptography → pycryptodome | 79 | 48.1% | 6.3% | 6.5% |
| Average (all ten migrations) | — | 61.6% | 22.9% | 63.3% |
SPELL with PolyglotPiranha consistently outperformed MELT in one-shot synthesis and generalization to sibling use cases. In real-world applicability studies across 18 open-source repositories, scripts produced anywhere from a handful to hundreds of rewrites per project, preserving ≥ 90% of existing test coverage in many instances (Ramos et al., 1 Feb 2026).
6. Limitations and Prospective Directions
Identified limitations of PolyglotPiranha and its automated synthesis workflow include:
- Coverage and Bias: LLM-distilled migration examples privilege common idioms, potentially neglecting corner cases and atypical error handling paths.
- DSL Expressivity: PolyglotPiranha cannot rewrite code embedded inside string literals, templates, or arbitrary embedded DSL fragments (e.g., Jinja), resulting in lower success on code using such constructs.
- Validation Granularity: Reliance on test suite pass rates and ≥ 60% line coverage as proxies for behavioral equivalence can allow semantically insufficient or under-specified transformations, with more rigorous oracle mechanisms (e.g., mutation testing) suggested as avenues for future work.
- Partial Migration: Some synthesized scripts address only frequently observed patterns and may not cover edge variants; integration of test-failure feedback or further LLM-based rule generalization is proposed (e.g., via Pycraft techniques).
A plausible implication is that extending PolyglotPiranha with multi-language embedded DSL parsing and integrating richer behavioral validation could further broaden its applicability in practical migration workflows.
7. Significance in Program Transformation Ecosystems
PolyglotPiranha exemplifies the shift toward modular, maintainable, and high-throughput transformation engines in software engineering. By serving as a target language for LLM-driven, agentic code migration frameworks like SPELL, it bridges the gap between data-driven synthesis and scalable, enterprise-grade deployment. Compared to systems that couple anti-unification directly to code rewriting (e.g., MELT), PolyglotPiranha’s graph-structured, declarative approach and runtime efficiency facilitate broader adoption in heterogeneous and rapidly evolving software ecosystems (Ramos et al., 1 Feb 2026).