Resiliparse Parser: Robust PEG Error Recovery
- Resiliparse Parser is a PEG engine enhanced with labeled failures and recovery expressions, enabling robust syntax error recovery and partial AST construction.
- It incorporates error logging, token skipping, and placeholder insertion to maintain AST integrity in interactive development environments.
- In the evaluation reported by Medeiros et al. (2018) on Lua programs, the approach ran roughly six to seven times faster than an ANTLR-generated parser and emitted fewer spurious error reports.
A Resiliparse parser is defined as a parsing engine for Parsing Expression Grammars (PEGs) that augments traditional deterministic top-down parsing with labeled failures and per-label recovery expressions, enabling robust syntax error recovery and partial Abstract Syntax Tree (AST) construction suitable for integrated development environments (IDEs) and other interactive tooling (Medeiros et al., 2018). The approach extends the PEG formalism to support error reporting and fine-grained resynchronization, overcoming an intrinsic limitation of conventional PEG-based parsers, which typically abort or rely on ad-hoc extensions when they encounter invalid or incomplete input.
1. Formal Definition of Parsing Expression Grammars
Parsing Expression Grammars are formally defined as a tuple $G = (V, T, P, p_S)$, where:
- $V$ is the finite set of non-terminals,
- $T$ is the finite set of terminals,
- $P$ maps each $A \in V$ to a parsing expression $P(A)$,
- $p_S$ is the start expression.
The repertoire of parsing expressions consists of $\varepsilon$ (empty), $a$ (terminal), $A$ (non-terminal), $p_1\,p_2$ (sequence), $p_1 / p_2$ (ordered choice), $p^*$ (zero-or-more), and $!p$ (negative predicate). PEGs define a deterministic, top-down, backtracking parsing process. Match failures conventionally trigger backtracking, not immediate syntax errors. In the absence of explicit recovery, PEG-based parsers halt upon unresolvable failure, rendering them inadequate for scenarios requiring partial ASTs (e.g., during code editing or completion in IDEs) (Medeiros et al., 2018).
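The seven expression forms can be illustrated with a small combinator sketch. This is not the implementation from the cited work; the function names (`empty`, `term`, `nonterm`, `seq`, `choice`, `star`, `not_`, `match`) and the convention of returning `None` on failure are assumptions made for this example.

```python
from typing import Callable, Optional

# A parser maps (grammar, text, position) to a new position, or None on failure.
Parser = Callable[[dict, str, int], Optional[int]]

def empty() -> Parser:                       # empty expression: succeeds, consumes nothing
    return lambda g, s, i: i

def term(a: str) -> Parser:                  # terminal
    return lambda g, s, i: i + len(a) if s.startswith(a, i) else None

def nonterm(name: str) -> Parser:            # non-terminal: look up its rule in the grammar
    return lambda g, s, i: g[name](g, s, i)

def seq(p1: Parser, p2: Parser) -> Parser:   # sequence p1 p2
    def run(g, s, i):
        j = p1(g, s, i)
        return None if j is None else p2(g, s, j)
    return run

def choice(p1: Parser, p2: Parser) -> Parser:  # ordered choice p1 / p2
    def run(g, s, i):
        j = p1(g, s, i)
        # On failure, backtrack to position i and try the second alternative.
        return j if j is not None else p2(g, s, i)
    return run

def star(p: Parser) -> Parser:               # zero-or-more p*
    def run(g, s, i):
        while True:
            j = p(g, s, i)
            if j is None or j == i:          # stop on failure or when no input is consumed
                return i
            i = j
    return run

def not_(p: Parser) -> Parser:               # negative predicate !p (never consumes input)
    return lambda g, s, i: i if p(g, s, i) is None else None

def match(grammar: dict, start: str, text: str) -> Optional[int]:
    return grammar[start](grammar, text, 0)

# S <- 'a' S 'b' / empty
grammar = {"S": choice(seq(term("a"), seq(nonterm("S"), term("b"))), empty())}
print(match(grammar, "S", "aabb"))  # 4: the whole input is consumed
print(match(grammar, "S", "aab"))   # 0: the first alternative fails, so the empty alternative matches
```

The second call shows the behavior discussed above: the unbalanced input does not produce a syntax error, only a silent backtrack to the empty alternative.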
2. Labeled Failures and Recovery Expressions
To equip PEGs with structured error recovery, the Resiliparse methodology introduces a throw operator $\,^{\wedge}l\,$, where $l$ is an error label. The resulting grammar is extended as $G = (V, T, P, L, \mathtt{fail}, p_S)$, with $L$ a finite set of labels disjoint from the distinguished label $\mathtt{fail}$ used for “ordinary” failure.
Semantically, $\,^{\wedge}l\,$ immediately signals label $l$. If $l \neq \mathtt{fail}$, the label is propagated: ordered choice ($/$) does not intercept it, thereby denoting a true syntax error. A recovery map $R$ assigns to labels $l \in L$ an auxiliary parsing expression $R(l)$ (the recovery expression). When $\,^{\wedge}l\,$ is thrown and $R(l)$ exists, the parser:
- Logs $l$ together with the current input position,
- Invokes $R(l)$ to skip tokens and resynchronize,
- Resumes parsing the remainder of the grammar.
For example, a block-ending label in a Java-like grammar:
```
BlockStmt ← LCUR (Stmt)* [RCUR]^{rcblk}
recovery(rcblk) = SkipToRCUR
SkipToRCUR ← (!RCUR (LCUR SkipToRCUR / .))* RCUR
```
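The `SkipToRCUR` rule can be read operationally as a brace-aware skip. The following sketch mirrors it on a plain character string, assuming `LCUR` is `{` and `RCUR` is `}`; the function name `skip_to_rcur` is hypothetical and the code is not taken from the cited implementation.

```python
from typing import Optional

def skip_to_rcur(text: str, pos: int) -> Optional[int]:
    """Mirror SkipToRCUR <- (!RCUR (LCUR SkipToRCUR / .))* RCUR on characters.

    Returns the position just past the matching '}', or None if the input
    ends before a closing brace is found.
    """
    while pos < len(text) and text[pos] != "}":       # !RCUR
        if text[pos] == "{":                           # LCUR SkipToRCUR
            inner = skip_to_rcur(text, pos + 1)
            if inner is not None:
                pos = inner                            # nested block skipped as a unit
                continue
            # Nested skip failed: fall back to the '.' alternative below.
        pos += 1                                       # '.': consume one character
    if pos < len(text):                                # final RCUR
        return pos + 1
    return None

source = "x = 1; { y } z } rest"
end = skip_to_rcur(source, 0)
print(repr(source[end:]))  # ' rest': the inner { y } did not end the skip early
```

Because the nested block is consumed recursively, the skip resynchronizes on the brace that actually closes the enclosing block, which is the point of respecting nesting in a recovery expression.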
3. Operational Semantics and Inference
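Resiliparse parsers formalize error recovery with operational semantics. Parsing an expression $p$ against input $xy$ under recovery map $R$ is denoted as
$G[p]\,R\,xy \Longrightarrow (y, f, L)$
or $\mathrm{error}(l)$ if unrecoverable. The result triple carries the unconsumed input $y$, a failure component $f$, and the list of logged errors (written $L$ in these rules, distinct from the label set). Key rules:
- Throw with no recovery:
If $l \notin \mathrm{dom}(R)$,
$G[\,^{\wedge}l\,]\,R\,x \Longrightarrow \mathrm{error}(l)$
- Throw with recovery:
If $l \in \mathrm{dom}(R)$ and the recovery expression $R(l)$ parses $xy$ to $y$,
$G[\,^{\wedge}l\,]\,R\, x y \Longrightarrow (y, f, (l,x)::L)$
- Ordered choice:
Only failures with label $\mathtt{fail}$ trigger the alternate; non-$\mathtt{fail}$ labels propagate as errors.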
This framework preserves deterministic parsing while tracking error positions and labels, facilitating downstream error reporting and AST placeholder insertion (Medeiros et al., 2018).
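The three rules can be sketched in executable form. The code below is an illustrative approximation, not the formal semantics: ordinary failure is modeled as `None`, an unrecoverable label as a Python exception, and the log stores (label, position) pairs. All names (`State`, `Unrecoverable`, `throw`, `expect`, `skip_until`) are assumptions, as is the choice to treat a failing recovery expression as unrecoverable.

```python
class Unrecoverable(Exception):
    """A label was thrown and no usable recovery expression exists."""

class State:
    def __init__(self, text, recovery):
        self.text = text
        self.recovery = recovery      # recovery map: label -> parser
        self.errors = []              # logged (label, position) pairs

# A parser maps (state, position) to a new position, or None on ordinary failure.
def lit(a):
    return lambda st, i: i + len(a) if st.text.startswith(a, i) else None

def seq(*parsers):
    def run(st, i):
        for p in parsers:
            i = p(st, i)
            if i is None:
                return None
        return i
    return run

def choice(p1, p2):
    def run(st, i):
        j = p1(st, i)                 # an Unrecoverable raised here bypasses p2
        return j if j is not None else p2(st, i)
    return run

def throw(label):
    def run(st, i):
        rec = st.recovery.get(label)
        if rec is None:               # throw with no recovery
            raise Unrecoverable(label)
        st.errors.append((label, i))  # throw with recovery: log, then resynchronize
        j = rec(st, i)
        if j is None:                 # assumption: a failed recovery is unrecoverable
            raise Unrecoverable(label)
        return j
    return run

def expect(p, label):                 # [p]^label, read as p / throw(label)
    return choice(p, throw(label))

def skip_until(stop):                 # (!stop .)*
    def run(st, i):
        while i < len(st.text) and not st.text.startswith(stop, i):
            i += 1
        return i
    return run

stmt = seq(lit("a"), expect(lit(";"), "semi"), lit("b"))
st = State("a??b", {"semi": skip_until("b")})
end = stmt(st, 0)
print(end, st.errors)  # 4 [('semi', 1)]
```

With an empty recovery map, the same input raises `Unrecoverable("semi")`, and wrapping `stmt` in an ordered choice does not change that, since only ordinary failure selects the alternative.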
4. Design of Recovery Expressions
Recovery expressions are designed to advance parsing to a synchronization point, typically determined via FOLLOW-set tokens (e.g., semicolon, closing brace). Nesting must be respected to avoid misalignment in structured constructs. Whenever possible, attempts are made to salvage partial parses of subexpressions, increasing AST fidelity.
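Typical patterns include:
- Skipping input until the next semicolon.
- Skipping to the matching ‘}’ while handling nested blocks, as in the `SkipToRCUR` rule shown earlier.
- For subexpression failures with a known FOLLOW set, skipping until a token from that set appears.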
A plausible implication is that precise recovery expressions reduce loss of valid AST subtrees and mitigate error cascades, especially in nested or recursive syntactic constructs (Medeiros et al., 2018).
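A FOLLOW-based skip is simple to express over a token list. The sketch below assumes tokens are plain strings and that the FOLLOW set is supplied by the grammar author; the helper names `skip_to_follow` and `skip_past` are hypothetical.

```python
from typing import AbstractSet, Sequence

def skip_to_follow(tokens: Sequence[str], pos: int, follow: AbstractSet[str]) -> int:
    """(!FOLLOW .)* : advance to the next token in `follow` without consuming it."""
    while pos < len(tokens) and tokens[pos] not in follow:
        pos += 1
    return pos

def skip_past(tokens: Sequence[str], pos: int, stop: str) -> int:
    """(!stop .)* stop : advance to `stop` and consume it, if it is present."""
    pos = skip_to_follow(tokens, pos, {stop})
    return pos + 1 if pos < len(tokens) else pos

tokens = ["x", "=", "+", "*", ";", "y", "=", "1", ";"]
# The expression after '=' is malformed, so recovery starts at index 2.
print(skip_to_follow(tokens, 2, {";", "end"}))  # 4: stops at ';' so the statement rule can match it
print(skip_past(tokens, 2, ";"))                # 5: resumes at the next statement
```

Both helpers advance `pos` on every iteration and stop at the end of the input, so they terminate on all inputs. The non-consuming variant leaves the synchronization token for the enclosing rule, which keeps the skip close to the error site.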
5. Implementation in the Lua Parser
A case study in Medeiros et al. (2018) applied these principles to a Lua parser written with LPegLabel, an extension of the LPeg library that adds labeled failures. The grammar, based on the Lua reference manual, employed approximately 75 labels, each annotated following the heuristic “every symbol whose failure cannot sensibly backtrack.” The parser architecture uses a packrat-style engine that tracks the farthest failure, manages an R-map of recovery expressions, and logs error-label-position pairs.
AST construction continues even after recovery, inserting placeholder nodes (such as “MissingSemicolon”) to ensure downstream static analyses obtain structurally valid trees. Error recovery and reporting for IDE scenarios benefit from this method, yielding immediate and localized user feedback on syntax errors (Medeiros et al., 2018).
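Placeholder insertion can be sketched on a toy statement grammar. This example is not the Lua parser from the case study: the grammar (`Stmt <- NAME ';'`), the `Node` type, and the label `semi` are assumptions chosen to keep the sketch short, and only the placeholder name `MissingSemicolon` comes from the text above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    kind: str
    children: list = field(default_factory=list)

def parse_stmts(tokens: List[str]) -> Tuple[Node, List[Tuple[str, int]]]:
    """Parse Block <- Stmt*, Stmt <- NAME ';', recovering from a missing ';'."""
    errors: List[Tuple[str, int]] = []
    block = Node("Block")
    pos = 0
    while pos < len(tokens):
        stmt = Node("Stmt", [tokens[pos]])     # toy assumption: any token is a NAME
        pos += 1
        if pos < len(tokens) and tokens[pos] == ";":
            pos += 1
        else:
            errors.append(("semi", pos))       # log the label and position
            stmt.children.append(Node("MissingSemicolon"))  # keep the tree well-formed
        block.children.append(stmt)
    return block, errors

block, errors = parse_stmts(["a", ";", "b", "c", ";"])
print(len(block.children), errors)  # 3 [('semi', 3)]
```

All three statements appear in the tree even though the second one is malformed, so a later analysis pass can walk the block without special-casing the error.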
6. Empirical Evaluation and Comparative Performance
Recovery quality on 180 invalid Lua programs was rated as follows:
| Rating | Number of Programs | Percentage |
|---|---|---|
| Excellent | 100 | 56% |
| Good | 63 | 35% |
| Poor | 17 | 9% |
| Failed | 0 | 0% |
A direct comparison with an ANTLR-generated Lua parser on the same input corpus:
| Condition | Number of Files | Percentage |
|---|---|---|
| ANTLR reports more errors | 56 | 31% |
| PEG reports more errors | 14 | 8% |
| Same number of errors | 110 | 61% |
The ANTLR parser was observed to emit more spurious (irrelevant) errors, while the PEG-based approach produced more targeted recoveries. Regarding performance (measured in ms, averaged over 20 runs):
| File | LPegLabel | ANTLR |
|---|---|---|
| broke.lua | 14 | 89 |
| Lua test suite | 94 | 647 |
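The PEG-based parser demonstrated an approximately $6\times$ to $7\times$ speedup ($89/14 \approx 6.4$ on broke.lua and $647/94 \approx 6.9$ on the Lua test suite) and avoided costly re-parsing on syntax errors (Medeiros et al., 2018).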
7. Guidelines and Best Practices for Resiliparse Parser Construction
Key guidelines for building a Resiliparse parser include:
- Labeling strategy: Annotate every grammar symbol where failure represents an actionable syntax error, not a point for ordinary backtracking. Consistent naming of labels (e.g., “semia”, “rcblk”, “condw”) supports precise, user-friendly error messages.
- Recovery expression definition: Base “default” recovery on FIRST/FOLLOW analysis (consume until a FOLLOW token). Employ custom recovery for block boundaries and complex nested constructs, preferring conservative skips close to the error site to maintain AST coverage.
- AST consistency: On recovery, insert placeholder nodes. This ensures that the AST remains structurally complete for static analysis or code tooling.
- Tooling: Integrate label/position with farthest-failure information to provide IDE hooks and fallback messaging. Ensure that all recovery expressions are total (terminate on all inputs), avoiding left recursion and infinite skips.
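The tooling guideline can be illustrated by a small reporting helper that turns logged (label, position) pairs into editor-style diagnostics and falls back to the farthest-failure position when nothing was logged. The message table, its wording, and the function names are hypothetical; only the label `rcblk` appears in the text above.

```python
from typing import List, Optional, Tuple

# Hypothetical message table: the wording is illustrative, not taken from any real parser.
MESSAGES = {
    "rcblk": "expected '}' to close the block",
    "semi": "expected ';'",
}

def line_col(text: str, pos: int) -> Tuple[int, int]:
    """Convert a 0-based offset into 1-based (line, column)."""
    line = text.count("\n", 0, pos) + 1
    col = pos - (text.rfind("\n", 0, pos) + 1) + 1
    return line, col

def diagnostics(text: str, errors: List[Tuple[str, int]],
                farthest: Optional[int] = None) -> List[str]:
    out = []
    for label, pos in errors:
        line, col = line_col(text, pos)
        out.append(f"{line}:{col}: {MESSAGES.get(label, f'syntax error ({label})')}")
    if not out and farthest is not None:
        # No labeled error was logged: fall back to the farthest failure position.
        line, col = line_col(text, farthest)
        out.append(f"{line}:{col}: syntax error")
    return out

source = "a {\n b\n"
print(diagnostics(source, [("rcblk", 7)]))   # ["3:1: expected '}' to close the block"]
print(diagnostics(source, [], farthest=4))   # ['2:1: syntax error']
```

Keeping the label-to-message mapping in one table makes it straightforward to keep label names consistent and to audit which labels still lack a user-facing message.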
By adhering to these principles—labeled failures, expressive and context-sensitive recoveries, and careful synchronization—a Resiliparse parser can achieve actionable syntax error reporting, near-complete AST generation for incomplete/invalid code, and high parsing performance, robustly addressing limitations of naïve PEG-based tools (Medeiros et al., 2018).