Synthesizing Program Input Grammars (1608.01723v2)

Published 5 Aug 2016 in cs.PL

Abstract: We present an algorithm for synthesizing a context-free grammar encoding the language of valid program inputs from a set of input examples and blackbox access to the program. Our algorithm addresses shortcomings of existing grammar inference algorithms, which both severely overgeneralize and are prohibitively slow. Our implementation, GLADE, leverages the grammar synthesized by our algorithm to fuzz test programs with structured inputs. We show that GLADE substantially increases the incremental coverage on valid inputs compared to two baseline fuzzers.

Summary

The paper presents a practical algorithm for synthesizing context-free grammars from program inputs using examples and blackbox access, addressing limitations of existing grammar inference techniques.
The algorithm operates in two phases, first synthesizing a regular expression from examples and then inducing recursive properties to form a context-free grammar, using a membership oracle for precision.
Empirical results show the synthesized grammar significantly increases fuzzer coverage (up to six times) compared to baselines, demonstrating utility for fuzzing, reverse engineering, and input whitelisting.

Synthesizing Program Input Grammars: A Detailed Overview

This paper presents an algorithm for synthesizing a context-free grammar that encodes the language of valid program inputs. The algorithm takes a set of input examples together with blackbox access to the program. It targets two shortcomings of existing grammar inference algorithms, which both severely overgeneralize and are prohibitively slow.

Summary of the Paper

The primary contribution of the paper is a practical algorithm for synthesizing a context-free grammar from example inputs. Starting from a set of seed inputs, the algorithm constructs a series of increasingly general languages. At each step it queries a membership oracle, that is, the program itself run as a blackbox, and keeps a generalization only if it preserves precision. In essence, the approach addresses the challenge of synthesizing a grammar that is both sufficiently general and precise in a blackbox setting.
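To make the notion of a membership oracle concrete, here is a minimal sketch in Python. It is not taken from the paper: the function name `oracle` is hypothetical, and the standard-library XML parser stands in for the blackbox program. The only assumption is that the program signals invalid input in an observable way (here, by raising an exception).

```python
import xml.etree.ElementTree as ET


def oracle(text: str) -> bool:
    """Return True if the stand-in blackbox program accepts `text`.

    Assumption for illustration: "the program" is Python's XML parser,
    and rejection is observable as a ParseError.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(oracle("<a>hi</a>"))  # a well-formed document
print(oracle("<a>hi"))      # missing closing tag
```

For a real target, the same interface could wrap a subprocess call and inspect the exit code or error output; how validity is detected depends on the program.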

Methodology

The algorithm proceeds through two distinct phases:

Regular Expression Synthesis: The first phase transforms the seed input into a regular expression that encodes repetitive and alternative constructs. This transformation leverages a meta-grammar approach that systematically applies generalization steps anchored to the presence of repetitions and alternations in the example inputs.
Recursive Property Induction: In the second phase, the algorithm elevates the synthesized regular expression to a context-free grammar by identifying and merging nonterminals that correspond to repeated subexpressions. This merging process effectively captures recursive structures often present in program input languages like XML or programming language syntax.
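The first phase can be illustrated with a small sketch of a single repetition step. This is a simplification under stated assumptions, not the paper's procedure: the function `first_repetition`, the toy oracle, the candidate ordering (shorter prefix first, then longer repeated body), and the two checks (zero and two copies of the repeated body) are all chosen here for illustration.

```python
import re
from typing import Callable, Optional, Tuple


def toy_oracle(text: str) -> bool:
    # Stand-in for the blackbox program: accepts <a>, any number of "hi", </a>.
    return re.fullmatch(r"<a>(hi)*</a>", text) is not None


def first_repetition(
    seed: str, oracle: Callable[[str], bool]
) -> Optional[Tuple[str, str, str]]:
    """Split `seed` into (a1, a2, a3) so that a1 (a2)* a3 passes the checks.

    Candidates are tried in a fixed order and the first one whose checks
    are all accepted by the oracle is returned.
    """
    n = len(seed)
    for i in range(n):                # shorter prefix a1 first
        for j in range(n, i, -1):     # longer repeated body a2 first
            a1, a2, a3 = seed[:i], seed[i:j], seed[j:]
            # Checks: the body repeated zero times and two times.
            if oracle(a1 + a3) and oracle(a1 + a2 + a2 + a3):
                return a1, a2, a3
    return None


print(first_repetition("<a>hi</a>", toy_oracle))  # ('<a>', 'hi', '</a>')
```

On the seed `<a>hi</a>` the sketch generalizes to `<a>(hi)*</a>`. A fuller version would recurse into each of the three parts and also consider alternations.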

The paper also details how candidates are constructed at each generalization step and how checks are built to test whether a candidate preserves precision. Candidates are ranked, and the algorithm selects the first one whose checks the oracle accepts, which lets it move through the search space without exploring every alternative.

Results

Empirical evaluations show that the synthesized grammar substantially increases incremental coverage on valid inputs when used for fuzz testing. Compared to two baseline fuzzers, it improves line coverage on valid inputs by up to a factor of six. This indicates that the grammar is effective at generating valid program inputs that reach deeper execution paths.

Implications and Future Directions

The practical implications of this research span several domains:

Fuzz Testing: The synthesized input grammars can significantly enhance fuzz testing routines for programs requiring structured inputs, thus facilitating more comprehensive bug discovery.
Input Format Reverse Engineering: By automatically producing grammars, analysts can reverse engineer undocumented input specifications, potentially revealing security vulnerabilities.
Whitelisting Inputs: The approach could serve as a foundation for input whitelisting mechanisms, alleviating risks associated with certain exploits.
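As an illustration of the fuzz-testing use case, the sketch below samples inputs from a small context-free grammar. The grammar, the `sample` function, and the depth bound are assumptions made for this example; they are not GLADE's grammar representation or its sampling strategy.

```python
import random

# Hypothetical grammar for a tiny XML-like language:
#   A -> <a> A </a> A | h A | i A | (empty)
GRAMMAR = {
    "A": [["<a>", "A", "</a>", "A"], ["h", "A"], ["i", "A"], [""]],
}


def sample(symbol: str, rng: random.Random,
           depth: int = 0, max_depth: int = 6) -> str:
    """Expand `symbol` into a random string of the grammar's language."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    productions = GRAMMAR[symbol]
    # Past the depth bound, force the last (non-recursive) production
    # so that expansion always terminates.
    if depth >= max_depth:
        production = productions[-1]
    else:
        production = rng.choice(productions)
    return "".join(sample(s, rng, depth + 1, max_depth) for s in production)


rng = random.Random(0)
for _ in range(5):
    print(repr(sample("A", rng)))
```

Every sampled string has balanced `<a>` tags by construction, so a fuzzer built this way spends its budget on inputs that get past the parser. The same grammar could in principle be used in the other direction, as a recognizer for whitelisting inputs.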

The paper further speculates that the algorithm could be extended to learn larger classes of grammars, potentially incorporating elements from Bayesian learning frameworks to refine the precision and recall of synthesized languages. These developments may represent promising directions for future research, particularly in enhancing the adaptability and scalability of the approach to more complex program input languages.

In summary, the algorithm provides a practical framework for synthesizing program input grammars. It stands out for its active learning strategy, in which it queries the program as an oracle, and for its ability to synthesize recursive, non-regular language constructs. Its applications in fuzzing, reverse engineering, and input whitelisting suggest room for further work on automatic grammar synthesis for programs.