LLM-as-Parser: Structured Parsing with LLMs

Updated 16 March 2026

LLM-as-Parser is a paradigm that repurposes large language models to emit structured outputs like parse trees, semantic graphs, and formal logic rules.
The approach employs in-context learning, LoRA/QLoRA fine-tuning, and few-shot strategies to enforce formal constraints and improve performance.
Empirical results show significant gains (e.g., up to 20.3 F1 improvement in syntactic parsing) and reduced manual engineering across various domains.

LLM-as-Parser refers to the use of LLMs as direct parsers for structured representations spanning syntactic, discourse, semantic, formal-logic, and semi-structured data domains. Rather than serving as mere sequence generators, LLMs are repurposed to emit parse trees, graphs, logic rules, or templates, leveraging prompt engineering, retrieval, fine-tuning, and correction mechanisms to enforce formal constraints and maximize structural validity. This paradigm is evaluated across constituent parsing, discourse parsing, meaning representation, domain-specific formal parsing, and log parsing, where it delivers competitive or state-of-the-art results with reduced engineering overhead and promising generalizability.

1. Core Paradigm: LLM-as-Parser Across Domains

LLM-as-Parser describes workflows where a pretrained or minimally adapted LLM serves as the central module for parsing input text into structured symbolic outputs. Unlike classical pipeline architectures, LLMs are pushed beyond unstructured generation to outputs exactly matching formal grammars, semantic graph notations, treebanks, or domain-specific rule languages. Domains and parsing tasks explored include:

Constituency and syntactic parsing (e.g., Penn Treebank): LLMs generate bracketed trees consistent with context-free grammars, with posthoc self-correction via treebank-derived constraints (Zhang et al., 19 Apr 2025).
Semantic parsing: Autoregressive generation of linearized AMR graphs via instruction-tuned LLMs (Ho, 7 Aug 2025).
Formal logic/rule induction: Prompt-based generation of first-order logic rules and executable predicates from domain-specific grammars (He et al., 3 Nov 2025).
Discourse parsing: Sequence-to-sequence LLMs, such as LLaMIPa, generating updates to an evolving discourse graph in SDRT style (Thompson et al., 2024).
Log parsing: Extraction of templates and variable fields from semi-structured logs using prompt engineering, few-shot, and refinement strategies (Beck et al., 7 Apr 2025).

The unifying insight is the framing of parsing as high-fidelity structured prediction, orchestrated through prompt design, error correction, and feedback.

2. Methodologies and Architectures

2.1 Prompting and In-Context Learning

LLMs are configured with an initial system prompt (role specification, instructions, grammar fragments) and, depending on the domain, one or more few-shot exemplars, context explanations, and output constraints. For AMR and log parsing, prompts are systematically constructed to delimit demonstration pairs and enforce output constraints (e.g., JSON blocks, bracketed templates) (Ho, 7 Aug 2025, Beck et al., 7 Apr 2025). For formal logic, the prompt includes the exact ANTLR grammar and available predicate APIs (He et al., 3 Nov 2025).

2.2 Fine-Tuning and Adapter Methods

Performance gains are realized via parameter-efficient fine-tuning (LoRA, QLoRA) where necessary. In AMR parsing, all architectures are trained using LoRA with fine-grained optimization targeting projection matrices (Ho, 7 Aug 2025). LLaMIPa is fine-tuned via QLoRA adapters, leaving the LLaMA backbone frozen, to enable autoregressive decoding of relation structures in discourse (Thompson et al., 2024).

2.3 Self-Correction and Iterative Revision

Self-correction leverages treebank-extracted production rules to identify and repair invalid parses automatically (Zhang et al., 19 Apr 2025). The pipeline involves:

Detecting non-matching leaf sequences (unmatch correction).
For subtrees with productions not found in the treebank, error-type specific candidate rule retrieval.
Hint selection via longest common subsequence and corpus frequency ranking.
Prompt-driven correction calls to guide the LLM in revising erroneous substructures.

This iterative process systematically enforces structural consistency, with empirical ablations confirming large F₁ improvements from both unmatch and structure correction stages.

3. Evaluation Metrics and Benchmarks

LLM-as-Parser evaluations consistently adopt domain-standard structural and semantic metrics:

Constituency Parsing: Labeled bracket F₁ as computed by EVALB, reflecting span and label agreement (Zhang et al., 19 Apr 2025).
Semantic Parsing (AMR): SMATCH F₁, measuring triple overlap between predicted and gold graphs after optimal variable alignment (Ho, 7 Aug 2025).
Log Parsing: Group accuracy (GA), template-level precision (PTA), recall (RTA), F1-scores, parsing accuracy, edit distance (ED), and normalized edit distance (NED) (Beck et al., 7 Apr 2025).
Discourse Parsing: Link prediction F1 (micro-averaged over parent assignments), and weighted Link + Relation F1 over typed links (Thompson et al., 2024).
Domain-specific Logic: Precision, recall, and runtime grammar compliance for logic rule and predicate generation (He et al., 3 Nov 2025).

Datasets include PTB, CTB5, MCTB (syntactic), LDC2020T02 AMR3.0 (semantic), SDRT-annotated dialog corpora (discourse), LogHub variants (semi-structured logs), and synthetic evaluation suites for formal logic in specific verticals.

4. Empirical Results and Error Analysis

LLM-as-Parser frameworks have demonstrated:

Constituency Parsing: Self-correction yields up to 20.3 F₁ improvement on CTB5; GPT-4 achieves 83.5 F₁ on PTB after correction (baseline 73.4). Open-source models (LLaMA-8B, Qwen-72B, DeepSeek-v3) exhibit 7–22 F₁ gains with the same pipeline. Error breakdowns show span, label, flatness, and deepness errors are all reduced by >50% (Zhang et al., 19 Apr 2025).
AMR Parsing: Fine-tuned LLaMA-3.2 attains SMATCH F₁=0.804, matching advanced pipeline parsers (APT+Silver, IBM) (Ho, 7 Aug 2025). Structural validity is highest in Phi-3.5 (0.3 errors/example) and semantic F₁ highest in LLaMA-3.2.
Log Parsing: Dynamic retrieval-based few-shot and CoT prompting enhance template extraction accuracy; group accuracy and FTA scores exceed those of classical statistical and neural approaches on LogHub datasets (Beck et al., 7 Apr 2025).
Discourse Parsing: LLaMIPa3 achieves increases of 9–10 F₁ points over encoder-only baselines for both unlabeled and labeled link prediction on MSDC and STAC-Sit, with especially superior performance for long-range or multi-parent discourse relations (Thompson et al., 2024).
Logic Rule Parsing: LLM-generated rules achieved 100% precision and recall in defect detection (30/30 synthetic cases), with human-in-the-loop integration time reduced by ≈80% (He et al., 3 Nov 2025).

Ablation studies consistently show that structurally guided prompting, candidate retrieval from gold data, and error-specific correction yield substantive improvements over random or unstructured prompting.

5. Representative Workflows and Design Patterns

Below is a comparative overview of several paradigmatic LLM-as-Parser solutions:

Domain	Input/Prompt Structure	Output Structure	Key Technique/Correction
Constituency Syntactic	Few-shot + error hints	Bracketed parse trees	Treebank-driven self-correction
Semantic Graph	Chat template + user msg	Linearized AMR graphs	LoRA fine-tuning, chain-of-thought
Log Parsing	Instruction + demos	Templates, variables	Retrieval-augmented few-shot, CoT
Formal Logic/Verification	Grammar prompt + spec	FOL rules + predicates	Grammar embedding, human-in-review
Discourse Parsing	Partial graph + window	Relation seq. updates	QLoRA adapters, rolling context

Scalable implementations rely on batching, API cost mitigation, caching for template lookup, prompt constraints, and modular pipelines supporting revision and human-in-the-loop review.

6. Practical Considerations, Limitations, and Future Directions

Key limitations and open questions for LLM-as-Parser include:

Grammar and Data Coverage: Self-correction and retrieval depend on treebank or benchmark coverage; out-of-domain shifts can degrade repair efficacy (Zhang et al., 19 Apr 2025).
API and Runtime Cost: Multiple LLM calls per parse instance (especially for deep trees) increase computation and latency; batching and caching are partial mitigations (Zhang et al., 19 Apr 2025, Beck et al., 7 Apr 2025).
Semantic vs. Structural Trade-offs: Models may favor structural validity at the expense of fine-grained semantic accuracy or vice versa, as observed in AMR experiments (Ho, 7 Aug 2025).
Reproducibility and Generalization: Proprietary model reliance and non-standardized evaluation threaten reproducibility; future work is needed in standardizing prompt templates, metrics, and open-source baselines (Beck et al., 7 Apr 2025, He et al., 3 Nov 2025).
Scaling and Adaptation: Adapting workflow and prompt design to handle evolving grammars, very deep or multi-turn structures, or novel application domains remains a target for research (Beck et al., 7 Apr 2025, He et al., 3 Nov 2025).
Integration of Subgraph and Soft-Matching: Prospective directions include soft rule embeddings, subgraph-matching metrics, and explicit chain-of-thought explanations for structural transparency (Zhang et al., 19 Apr 2025).

In summary, LLM-as-Parser constitutes a convergent paradigm where LLMs, with minimal task-specific tuning and orchestrated via robust prompting, serve as grammar-constrained, domain-adapted, and structure-aware generative parsers across broad classes of symbolic tasks. This approach achieves strong empirical results across domains while reducing pipeline complexity and manual engineering effort.