Rewrite-Based Detection Algorithm
- A rewrite-based detection algorithm applies systematic rewrite rules to identify patterns and classify data transformations across code, text, and symbolic representations.
- It integrates pattern matching and transformation by using automata and rule-based edits to uncover structural properties and improve analyses in program optimization and adversarial detection.
- Empirical validations show its robustness, scalability, and high precision in applications ranging from static code analysis to the detection of LLM-generated content.
A rewrite-based detection algorithm leverages the concept of syntactic or semantic transformation—via application of rewrite rules or automatically induced edits—to determine the presence or classification of structures, behaviors, or provenance in data. Empirically and theoretically, these algorithms underpin a diverse array of detection, classification, and transformation tasks spanning program analysis, textual provenance, adversarial pattern identification, and static/dynamic code analysis. Their defining characteristic lies in using rewriting (as transformation or probing) as the primary mechanism to realize detection or discrimination.
1. Foundational Principles of Rewrite-Based Detection
Rewrite-based detection fundamentally synthesizes two archetypal computational techniques: pattern matching (to find rewrite triggers, such as "redexes" in symbolic rewriting or structural isomorphisms in code graphs) and transformation (the explicit act of rewriting or simulating transformation to expose latent attributes). The classical architecture comprises:
- A set of rewrite rules or patterns, which may be supplied by domain experts (as in term rewriting systems, program transformation, or compiler idiom raising), or learned automatically (as in neural or probabilistic rewriters).
- A matching or search phase that detects candidate locations for rewrite application—a "detection by rewriteability" paradigm.
- An (optional) transformation or evaluation phase, applying rewrite(s) to observe side effects, measure edit distances, or establish canonical forms.
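This three-phase pipeline can be illustrated with a minimal Python sketch over first-order terms; the term encoding, rule format, and function names below are illustrative assumptions, not the API of any cited system.

```python
# Minimal sketch of rewrite-based detection over first-order terms.
# A term is a nested tuple ("f", arg1, arg2, ...); variables start with "?".

def match(pattern, term, binding=None):
    """Try to match `pattern` against `term`; return a binding dict or None."""
    binding = dict(binding or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in binding and binding[pattern] != term:
            return None
        return {**binding, pattern: term}
    if isinstance(pattern, str) or isinstance(term, str):
        return binding if pattern == term else None
    if pattern[0] != term[0] or len(pattern) != len(term):
        return None
    for p, t in zip(pattern[1:], term[1:]):
        binding = match(p, t, binding)
        if binding is None:
            return None
    return binding

def substitute(template, binding):
    """Instantiate a rule right-hand side under a binding."""
    if isinstance(template, str):
        return binding.get(template, template)
    return (template[0],) + tuple(substitute(a, binding) for a in template[1:])

def detect_redexes(term, rules, path=()):
    """Search phase: list every (path, lhs, rhs) where a rule applies."""
    hits = [(path, lhs, rhs) for lhs, rhs in rules if match(lhs, term) is not None]
    if not isinstance(term, str):
        for i, arg in enumerate(term[1:], start=1):
            hits += detect_redexes(arg, rules, path + (i,))
    return hits

# Detect the idiom f(g(x)) anywhere in a term; the rewrite maps it to h(x).
rules = [(("f", ("g", "?x")), ("h", "?x"))]
term = ("top", ("f", ("g", "a")), ("f", "b"))
redexes = detect_redexes(term, rules)  # one redex, at child path (1,)
```

Detection then reduces to asking whether `detect_redexes` returns any hits ("detection by rewriteability"), while the optional evaluation phase applies `substitute` to observe the effect of the rewrite.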
This framework supports both direct detection (e.g., rewriting exact matches for static code analysis or malware detection) and indirect detection (e.g., using measured rewrite sensitivity as a proxy for class membership, such as LLM-generated text detection via minimal edits under a rewriting model) (Mao et al., 2024, Li et al., 2024, Couto et al., 2022, Bouwman et al., 2022).
2. Algorithmic Instantiations Across Domains
Rewrite-based detection algorithms manifest in several distinct application domains:
(a) Program Analysis and Optimization
- Pattern-Based Rewriting: Systems such as Source Matching and Rewriting (SMR) employ automaton-based matching—first filtering by control structure (control-dependency graph, CDG), then confirming by precise data and flow patterns (data-dependency graph, DDG)—to detect, extract, and replace idiomatic constructs by rewrites, e.g., BLAS call insertion in loop nests (Couto et al., 2022).
- Term Rewriting Systems (TRS): Set-automaton-based rewriting, as in SABRE, detects all left-hand-side pattern redexes in symbolic terms using an efficient automaton, integrating matching and rewriting under a unified, strategy-sensitive configuration tree (Bouwman et al., 2022).
- Compression-Aware and DAG-Based Matching: Redex detection under singleton tree grammar (STG) compression enables polynomial-time matching for left-linear rules, exploiting periodicity in encoded subcontext matches for efficient and scalable detection in compressed term spaces (Schmidt-Schauss, 2013).
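The filter-then-confirm structure of SMR-style matchers can be caricatured in a few lines; the flat statement encoding and the `"_"` wildcard convention below are invented for illustration (real systems match control- and data-dependency graphs, not statement lists).

```python
from collections import Counter

# Phase 1: cheap skeleton key (stands in for control-structure filtering).
def skeleton(region):
    return frozenset(Counter(kind for kind, _ in region).items())

# Phase 2: precise check (stands in for data/flow confirmation);
# "_" in the pattern is a wildcard operand.
def matches_exactly(pattern, region):
    return len(pattern) == len(region) and all(
        pk == rk and pv in ("_", rv)
        for (pk, pv), (rk, rv) in zip(pattern, region)
    )

# A region is a list of (kind, operand) statements; the idiom is a dot product.
idiom = [("loop", "_"), ("mul", "_"), ("add", "_")]
regions = {
    "hot_loop": [("loop", "i"), ("mul", "a[i]*b[i]"), ("add", "acc")],
    "io_block": [("call", "read"), ("add", "n")],
}
hits = [name for name, r in regions.items()
        if skeleton(r) == skeleton(idiom)   # coarse filter first
        and matches_exactly(idiom, r)]      # then exact confirmation
```

The cheap first phase prunes most candidate regions before the expensive structural check runs, which is the same cost argument the CDG-then-DDG ordering makes.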
(b) Security and Static Code Modification
- Privacy-Harming Code Detection: The Unbundle-Rewrite-Rebundle (URR) system unbundles JavaScript modules, matches sub-ASTs via Merkle-style hash fingerprints against a database of known privacy-harming libraries, and rewrites matched subtrees to benign stubs that preserve API signatures. Detection quality is measured via precision and recall, reaching 1.00 and 0.95 respectively for key targets (Ali et al., 2024).
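A toy version of the fingerprint-and-stub idea, with hand-rolled `(label, children)` trees standing in for real JavaScript ASTs; all node names here are illustrative.

```python
import hashlib

def fingerprint(node):
    """Merkle-style hash: a node's hash covers its label and child hashes."""
    label, children = node
    h = hashlib.sha256(label.encode())
    for child in children:
        h.update(fingerprint(child).encode())
    return h.hexdigest()

def rewrite(node, harmful, stub):
    """Replace any subtree whose fingerprint is in `harmful` with `stub`."""
    if fingerprint(node) in harmful:
        return stub
    label, children = node
    return (label, [rewrite(c, harmful, stub) for c in children])

tracker = ("call_tracker", [("send_beacon", [])])  # known-bad module
bundle = ("bundle", [("app_code", []), tracker])
harmful = {fingerprint(tracker)}
stub = ("noop_stub", [])                           # API-preserving stand-in

clean = rewrite(bundle, harmful, stub)
print(clean)  # ('bundle', [('app_code', []), ('noop_stub', [])])
```

Because the hash of a parent depends on all descendants, a database lookup on subtree fingerprints suffices to locate known-harmful modules without re-parsing them.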
(c) Textual Provenance and LLM-Generated Content Detection
- Rewrite-Based Text Provenance: Methods such as Raidar and Learning2Rewrite use the observation that when asked to rewrite inputs, LLMs minimally modify their own outputs but make substantially more changes to human-written text. Detection is thus cast as a rewrite-minimality test, quantifying per-instance similarity (e.g., inverse normalized Levenshtein distance) between input and rewritten output (Mao et al., 2024, Li et al., 2024, Zhou et al., 29 Jan 2026).
- Distance Learning Approaches: Learn-to-Distance introduces adaptive learning of the rewrite distance itself, parameterizing it with an LLM and fine-tuning it under margin-based objectives to maximize the human-to-AI rewrite-distance gap, further improving detection robustness and generalization (Zhou et al., 29 Jan 2026).
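The rewrite-minimality test itself is small. In the sketch below, the `rewriter` argument is a placeholder for an actual "rewrite this text" call to the model under test, and the 0.9 threshold is an arbitrary illustrative value.

```python
def levenshtein(a, b):
    """Standard two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rewrite_similarity(text, rewriter):
    """Inverse normalized edit distance between a text and its rewrite."""
    out = rewriter(text)
    return 1.0 - levenshtein(text, out) / max(len(text), len(out), 1)

def classify(text, rewriter, threshold=0.9):
    """Few edits under rewriting -> the model 'accepts' the text as its own."""
    return "llm" if rewrite_similarity(text, rewriter) >= threshold else "human"
```

In a real detector the threshold (or the distance function itself) would be fit on held-out human and model samples rather than fixed by hand.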
3. Core Algorithmic Components and Strategies
(a) Pattern Matching Engines
Automata (tree/set automata, Aho-Corasick automata) are used to efficiently traverse terms, ASTs, IR graphs, or textual representations and identify substructures matching rewrite targets. Dependency and position relations (e.g., outermost-/innermost-preserving, parallel) control which matches are considered based on the overarching rewriting strategy (innermost, outermost, context-sensitive, parallel) (Bouwman et al., 2022, Couto et al., 2022, Schmidt-Schauss, 2013).
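For flat (string) representations, the multi-pattern matching idea reduces to the classic Aho-Corasick construction; the compact sketch below is a simplification and omits the tree/set-automaton generalizations the cited systems use for terms.

```python
from collections import deque

def build(patterns):
    """Construct goto/fail/output tables for an Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto[s][c] = len(goto)
                goto.append({}); fail.append(0); out.append(set())
            s = goto[s][c]
        out[s].add(p)
    q = deque(goto[0].values())
    while q:                       # BFS to fill failure links
        s = q.popleft()
        for c, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][c] if c in goto[f] and goto[f][c] != t else 0
            out[t] |= out[fail[t]]
    return goto, fail, out

def scan(text, automaton):
    """Single left-to-right pass reporting (end_index, pattern) matches."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        hits += [(i, p) for p in out[s]]
    return hits

ac = build(["he", "she", "his", "hers"])
print(sorted(scan("ushers", ac)))  # [(3, 'he'), (3, 'she'), (5, 'hers')]
```

The property worth noticing is the one the text attributes to set automata: every input symbol is scanned exactly once, regardless of how many patterns are loaded.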
(b) Interleaving Matching and Rewriting
Algorithms such as that in SABRE preserve partial match search trees across rewrites, updating only the necessary subtree(s) after a rewrite to amortize matching effort (Bouwman et al., 2022). For compressed representations, prefix-tables and periodicity compaction (in STG-based detection) ensure deterministic polynomial complexity (Schmidt-Schauss, 2013).
(c) Rewrite Metrics and Detection Features
LLM-based detection algorithms measure edit distances post-rewriting to serve as discriminative features. These metrics may be naive (a fixed Levenshtein distance), margin-based (the difference between human and AI samples), or adaptive (learned via gradient-based objectives over model likelihoods or proxy distances) (Mao et al., 2024, Li et al., 2024, Zhou et al., 29 Jan 2026).
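A toy illustration of the naive-versus-margin distinction; the distance values are fabricated, and a learned variant would replace the fixed distance with a parameterized one and optimize the gap directly.

```python
def best_threshold(human_dists, ai_dists):
    """Pick the cut maximizing the margin between the two distance populations:
    human rewrites tend to have large distances, AI rewrites small ones."""
    lo, hi = max(ai_dists), min(human_dists)
    if lo < hi:                      # separable: midpoint maximizes the margin
        return (lo + hi) / 2, hi - lo
    # overlap: fall back to the cut minimizing classification error
    cands = sorted(set(ai_dists) | set(human_dists))
    best = min(cands, key=lambda t: sum(d < t for d in human_dists)
                                    + sum(d >= t for d in ai_dists))
    return best, 0.0

human = [0.42, 0.55, 0.61, 0.48]     # normalized edit distances after rewrite
ai = [0.05, 0.12, 0.09, 0.15]
thr, margin = best_threshold(human, ai)
print(round(thr, 3), round(margin, 3))  # 0.285 0.27
```

Maximizing this margin is, in miniature, what the gradient-based objectives do over a learned distance instead of a fixed Levenshtein one.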
(d) Strategy-Dependent Loop and Non-Termination Detection
Detection algorithms for dis/proving termination (or non-termination) under concrete strategies (leftmost, innermost, outermost, parallel, forbidden patterns) systematically reduce the problem to the nonexistence of "blocking" redexes, codified as matching or extended matching problems over terms and contexts (Thiemann et al., 2010).
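The flavor of this reduction can be conveyed with a toy string-rewriting loop finder: a loop is witnessed when some reachable string contains an earlier one as a substring, i.e. t rewrites in one or more steps to a string embedding t. The string setting is a deliberate simplification of term-level, strategy-aware detection.

```python
from collections import deque

def find_loop(start, rules, max_steps=1000):
    """BFS forward from `start`; report a derivation whose last string
    contains an ancestor as a substring (a non-termination witness)."""
    queue = deque([(start, [start])])
    steps = 0
    while queue and steps < max_steps:
        s, hist = queue.popleft()
        steps += 1
        for lhs, rhs in rules:
            i = s.find(lhs)
            if i < 0:
                continue
            t = s[:i] + rhs + s[i + len(lhs):]
            if any(h in t for h in hist):   # ancestor embedded in descendant
                return hist + [t]
            queue.append((t, hist + [t]))
    return None                             # no loop found within the budget

print(find_loop("b", [("b", "ab")]))        # ['b', 'ab'] -- 'b' reappears in 'ab'
print(find_loop("aaaa", [("aa", "a")]))     # None -- this derivation terminates
```

Strategy-sensitive detection additionally has to check that the looping redex is actually selectable under the given strategy, which is where the "blocking redex" matching problems of the cited work come in.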
4. Theoretical Properties and Correctness
Rewrite-based detection algorithms are often accompanied by formal guarantees regarding soundness, completeness, and polynomiality:
- Polynomial-Time Redex Detection (STG Compression): For left-linear rules, submatching and detection are proven to be polynomial in the grammar size due to bounded prefix-table sizes and periodicity compaction (Schmidt-Schauss, 2013).
- Automaton Completeness: The set-automaton construction ensures that every function symbol is scanned exactly once and that no redex is missed, yielding all and only correct matches (Bouwman et al., 2022).
- Strategy Completeness for Loop Detection: Systematic construction of matching problems for loops under various evaluation strategies guarantees that loops accepted by the procedure truly demonstrate non-termination w.r.t. the chosen strategy (Thiemann et al., 2010).
- Detection Gap Maximization: The Learn-to-Distance framework theoretically shows that the optimal detection distance is a binary metric separating the LLM manifold from its complement, and that adaptively learned distances approach this ideal (Zhou et al., 29 Jan 2026).
- Edit-Distance Robustness: Empirical and theoretical justifications demonstrate that rewrite-minimality is robust to prompt perturbations as long as rewrite-induced noise does not swamp the human-vs-LLM manifold gap (Zhou et al., 29 Jan 2026, Mao et al., 2024).
5. Experimental Validation and Performance
Rewrite-based detectors are empirically evaluated via application-specific accuracy, robustness, and efficiency metrics:
| System/Domain | Key Metric(s) | Representative Results |
|---|---|---|
| Raidar (LLM provenance) (Mao et al., 2024) | F1, OOD robustness | +29 F1 improvement; F1 > 80 |
| URR (privacy code) (Ali et al., 2024) | Precision/Recall | 1.00 / 0.95 on JS bundles |
| L2R (Learning-to-Rewrite) (Li et al., 2024) | AUROC/F1 | +20% AUROC, +9.2 F1 |
| Learn-to-Distance (Zhou et al., 29 Jan 2026) | AUC gain (rel.) | 57–81% vs. baselines |
| SABRE (term rewriting) (Bouwman et al., 2022) | Benchmarks solved | 98.7% solved vs. ~88%–92% |
Additional performance characteristics:
- Rewrite-based detectors for LLM-generated text maintain detection accuracy under out-of-domain and adversarial (prompt-bypassing) settings, as multi-prompt or adaptive learning strategies generalize beyond memorized cues (Mao et al., 2024, Li et al., 2024, Zhou et al., 29 Jan 2026).
- Automaton-based rewrite matching scales efficiently with input size and the number of patterns, with construction and traversal complexity guarantees parameterized by symbol, state, and transition counts (Bouwman et al., 2022, Couto et al., 2022).
6. Extensions, Generalizations, and Limitations
Extensions
- Generalization Across Strategies: The core interleaving machinery (configuration trees, automata, matching-problem reduction) extends from term rewriting to parallel, forbidden-pattern, and context-sensitive strategies (Thiemann et al., 2010, Bouwman et al., 2022).
- Rewrite-Based Metrics: Adaptive and contrastive metrics can replace heuristic distances, improving detection in neural rewriting applications (Zhou et al., 29 Jan 2026).
- Hybrid Approaches: Algorithmic frameworks can combine rewriting with watermarking, probabilistic certificate generation, and API-only black-box queries for increased robustness (Mao et al., 2024, Ali et al., 2024).
Limitations
- Rewrite-based provenance detection may be bypassed if an adversary crafts content to force high edit distances under the employed prompts, though multi-prompt approaches can mitigate this (Mao et al., 2024, Li et al., 2024).
- Computational overhead arises for generating multiple rewrites per instance and for fine-tuning adaptive models, though asynchronous and low-rank adaptation pipelines can mitigate these costs (Zhou et al., 29 Jan 2026, Li et al., 2024).
- In compressed or DAG representations, polynomiality may fail for non-left-linear rules or elaborate context-sharing, constraining the application of efficient techniques to certain grammars (Schmidt-Schauss, 2013).
7. Theoretical and Practical Impact
Rewrite-based detection algorithms have established themselves as core methodologies in systems requiring high-assurance structural matching, robust content provenance detection, and efficient large-scale symbolic rewriting. Their principled combination of pattern matching, structural or semantic rewriting, and feature extraction offers interpretability, strong theoretical guarantees, and empirically validated performance across domains from compilers and program analysis to modern LLM-powered systems and security (Couto et al., 2022, Bouwman et al., 2022, Mao et al., 2024, Ali et al., 2024, Schmidt-Schauss, 2013, Zhou et al., 29 Jan 2026, Li et al., 2024, Thiemann et al., 2010). The adaptive and data-driven extensions now emerging further strengthen their utility for future open-world AI detection and secure automated reasoning tasks.