Inferring Input Grammars from Code with Symbolic Parsing (2503.08486v1)

Published 11 Mar 2025 in cs.SE and cs.FL

Abstract: Generating effective test inputs for a software system requires that these inputs be valid, as they will otherwise be rejected without reaching actual functionality. In the absence of a specification of the input language, common test generation techniques rely on sample inputs, which are abstracted into matching grammars and/or evolved guided by test coverage. However, if sample inputs miss features of the input language, the chances of generating these features randomly are slim. In this work, we present the first technique for symbolically and automatically mining input grammars from the code of recursive descent parsers. So far, the complexity of parsers has made such a symbolic analysis challenging, if not impossible. Our realization of the symbolic parsing technique overcomes these challenges by (1) associating each parser function parse_ELEM() with a nonterminal <ELEM>; (2) limiting recursive calls and loop iterations, such that a symbolic analysis of parse_ELEM() needs to consider only a finite number of paths; and (3) creating, for each path, an expansion alternative for <ELEM>. Being purely static, symbolic parsing does not require seed inputs; and since it mitigates path explosion, it scales to complex parsers. Our evaluation shows symbolic parsing to be highly accurate. Applied to parsers for complex languages such as TINY-C or JSON, our STALAGMITE implementation extracts grammars with an accuracy of 99--100%, widely improving over the state of the art while requiring only the program code and no input samples. The resulting grammars cover the entire input space, allowing for comprehensive and effective test generation, reverse engineering, and documentation.
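
To make the path-to-alternative idea concrete, below is a minimal, hypothetical Python sketch; it is not the authors' STALAGMITE implementation, which performs an actual symbolic analysis of parser code. Here the symbolic exploration is hand-simulated: branch points in a toy recursive-descent parser are explicit choice() calls whose alternatives are enumerated by replaying decision scripts, a call to another parse function is recorded as its nonterminal <ELEM> rather than inlined, and each completed path becomes one expansion alternative. The names SymbolicInput and mine, the toy arithmetic grammar, and the decision-script exploration are all illustrative assumptions.

class SymbolicInput:
    """Replays one decision script through a parse function and records
    the terminals and nonterminals encountered on that path."""

    def __init__(self, script):
        self.script = list(script)  # branch decisions to replay
        self.pos = 0                # index of the next decision
        self.tokens = []            # expansion alternative for this path
        self.pending = []           # sibling scripts found along the way

    def choice(self, n):
        """A branch point with n alternatives (an if/elif chain or a
        bounded loop test). Unexplored siblings are queued for later."""
        if self.pos < len(self.script):
            decision = self.script[self.pos]
        else:
            decision = 0
            for other in range(1, n):  # enumerate the other branches
                self.pending.append(self.script + [other])
            self.script.append(0)
        self.pos += 1
        return decision

    def match(self, terminal):
        """The parser consumes a literal token: record it as a terminal."""
        self.tokens.append(terminal)

    def call(self, name):
        """A call to parse_NAME() is recorded as the nonterminal <NAME>;
        the callee is analyzed separately, never inlined, which keeps the
        number of paths per function finite."""
        self.tokens.append(f"<{name}>")


def mine(parse_fn, max_paths=100):
    """Enumerate the paths through parse_fn, breadth-first over decision
    scripts; each path yields one expansion alternative."""
    alternatives, queue = [], [[]]
    while queue and len(alternatives) < max_paths:
        sym = SymbolicInput(queue.pop(0))
        parse_fn(sym)
        alternatives.append(" ".join(sym.tokens))
        queue.extend(sym.pending)
    return alternatives


# Toy recursive-descent parser under analysis (a hypothetical example).
def parse_term(sym):                  # <term> ::= DIGIT | '(' <expr> ')'
    if sym.choice(2) == 0:
        sym.match("DIGIT")
    else:
        sym.match("'('")
        sym.call("expr")
        sym.match("')'")

def parse_expr(sym):                  # <expr> ::= <term> ('+' <expr>)?
    sym.call("term")
    if sym.choice(2) == 1:            # recursion appears only as <expr>
        sym.match("'+'")
        sym.call("expr")

if __name__ == "__main__":
    for name, fn in [("expr", parse_expr), ("term", parse_term)]:
        print(f"<{name}> ::= " + " | ".join(mine(fn)))

Running this prints <expr> ::= <term> | <term> '+' <expr> and <term> ::= DIGIT | '(' <expr> ')'. In the technique the abstract describes, the branch alternatives come from a symbolic analysis of the parser code rather than explicit choice() calls, and loop iterations are bounded just as recursive calls are, so that each parse function contributes only finitely many expansion alternatives.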
