Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating strategies from grammars #170

Open
DRMacIver opened this issue Sep 25, 2015 · 9 comments
Open

Creating strategies from grammars #170

DRMacIver opened this issue Sep 25, 2015 · 9 comments
Labels
new-feature entirely novel capabilities or strategies

Comments

@DRMacIver
Copy link
Member

I’d like to be able to generate strings matching some grammar. I’ve previously implemented this in a very early version of Hypothesis for regular grammars but never finished it off.

@DRMacIver DRMacIver added this to the Hypothesis 2.0 milestone Sep 25, 2015
@DRMacIver DRMacIver added the enhancement it's not broken, but we want it to be better label Sep 25, 2015
@Zac-HD Zac-HD added new-feature entirely novel capabilities or strategies and removed enhancement it's not broken, but we want it to be better labels Jun 10, 2017
@Zac-HD
Copy link
Member

Zac-HD commented Aug 28, 2017

@DRMacIver, is this closed by from_regex() in #792? If not, what additional capabilities do you want and what would the arguments to such a strategy be?

@DRMacIver
Copy link
Member Author

It's definitely not closed by #792.

I don't think this is necessarily a single strategy function. But an example of a capability that we don't currently support and that I'd like to is to be able to take a grammar definition in EBNF and turn it into a strategy.

@Zac-HD Zac-HD removed this from the Hypothesis 2.0 milestone Mar 1, 2018
@Zac-HD
Copy link
Member

Zac-HD commented Dec 14, 2018

Having looked into grammar definition languages, the clear winner for context-free grammars is ABNF. This is equivalent to the many, many EBNF description syntaxes, but has actually been standardised by RFC 5234 (and case-sensitivity added in RFC 7405).

ABNF is easy to write, and even easier to copy from an RFC if you're implementing a standard protocol!

Sadly I couldn't find a finished ABNF definition parser in Python, but it's easy enough to write one as the structure is pretty simple.

@DRMacIver
Copy link
Member Author

Another thing to consider is ANTLR grammar definitions, mostly on grounds that there are a lot of them! picireny uses these from Python to do test-case reduction, although I haven't looked into the details.

We probably don't want people writing ANTLR grammar definitions by hand, but it might be a nice starting point for getting a lot of strategies almost for free.

@Zac-HD
Copy link
Member

Zac-HD commented Dec 14, 2018

We could support both pretty easily; I think the ANTLR and ABNF syntax is totally disjoint so we can even autodetect which one has been passed to a single from_grammar function. For bonus points convenience I'd add a function to fetch grammars from antlr/grammars-v4 by name - caching the .g4 files in .hypothesis/grammars/ for obvious performance reasons.

def from_grammar(
    production: Union[str, tuple[str, ...]],  # Name or names of the production(s) to produce
    grammar: str,  # The grammar definition
) -> SearchStrategy[str]: ...

def get_grammar_definition(
    name: str,  # A name from the ANTLR grammars repo
) -> str: ...

Heck, we can probably just generate the parsers for ANTLR and ABNF grammar definitions with antlr4 and stuff them in the hypothesis.vendor namespace!

Thoughts on that as an API? I've deliberately gone minimalist for a first pass as it's easier to add arguments than to take them away...

@Zac-HD
Copy link
Member

Zac-HD commented Dec 28, 2018

Aaaand, ugh, I couldn't find an ABNF or Antlr parser that exposes a nice structured representation of the grammar 😭

On the upside, Lark looks very nice and does. A lark.lark.Lark (...lark) object can be constructed from an EBNF grammar, and has a tree representation thereof along with the name of the start node. So I'd be all in favor of accepting one of those, and making all the validation happen before the problem gets passed off to Hypothesis.

That would make the function hypothesis.extra.grammar.from_lark, I think.

@Zac-HD
Copy link
Member

Zac-HD commented Jan 8, 2019

As a follow-up to close this issue, it should be fairly straightforward to write a Lark EBNF grammar for rfc5234 ABNF grammars, and use Lark's parse tree transformer functions to go translate directly to a strategy tree. We could use the same trick for ANTLR definitions too!

Translating from RFC 5234 to Lark gives me the following, which works in the prototype from_lark to generate (syntactically, but usually not semantically) valid ABNF grammars:

r"""
// The production rules for ABNF grammar descriptions
rulelist        : ( rule | (c_wsp* c_nl) )+
rule            : rulename defined_as elements c_nl
rulename        : ALPHA (ALPHA | DIGIT | "-")*
defined_as      : c_wsp* ("=" | "=/") c_wsp*
elements        : alternation c_wsp*

c_wsp           : c_nl? WSP
c_nl            : comment | CRLF
comment         : ";" (WSP | VCHAR)* CRLF

alternation     : concatenation (c_wsp* "/" c_wsp* concatenation)*
concatenation   : repetition (c_wsp+ repetition)*
repetition      : repeat? element
repeat          : DIGIT+ | DIGIT* "*" DIGIT*
element         : rulename | group | option | char_val | num_val | prose_val
group           : "(" c_wsp* alternation c_wsp* ")"
option          : "[" c_wsp* alternation c_wsp* "]"

char_val        : DQUOTE (SP | VCHAR)* DQUOTE
num_val         : "%" (bin_val | dec_val | hex_val)
bin_val         : "b" BIT+ ( ("." | "-") BIT+ )?
dec_val         : "d" DIGIT+ ( ("." | "-") DIGIT+ )?
hex_val         : "x" HEXDIG+ ( ("." | "-") HEXDIG+ )?
prose_val       : /<[\x20-\x3d\x3f-\x7e]*>/

// Standard terminals defined for all ABNF grammars
ALPHA           : /[A-Za-z]/
BIT             : "0" | "1" 
CHAR            : /[\x01-\x7f]/
CR              : "\r"
CRLF            : "\r\n"
CTL             : /[\x00-\x1f\x7f]/
DIGIT           : /[0-9]/
DQUOTE          : "\""
HEXDIG          : /[0-9A-Fa-f]/
HTAB            : "\t"
LF              : "\n"
LWSP            : /((\r\n)?[ \t])+/
OCTET           : /[\x00-\xFF]/
SP              : " "
VCHAR           : /[\x21-\x7e]/
WSP             : " " | "\t"
"""

@Zac-HD Zac-HD removed their assignment Jan 26, 2019
@Zac-HD
Copy link
Member

Zac-HD commented Feb 4, 2019

On further investigation, I think our best option would actually be to handle this upstream by adding a Lark.from_abnf constructor method. I'm therefore closing this as implemented by #1740.

@DRMacIver
Copy link
Member Author

On further investigation, I think our best option would actually be to handle this upstream by adding a Lark.from_abnf constructor method. I'm therefore closing this as implemented by #1740.

I think it's worth having a wrapper from_abnf function on our side as well. It can just be a thin shim over lark, but it's nice to have an API that doesn't tie us to too tightly to the lark implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new-feature entirely novel capabilities or strategies
Projects
None yet
Development

No branches or pull requests

2 participants