Full support for all grammars in Lark examples #1804
I would like to add that ignored tokens are usually needed to actually generate valid (parseable) source. The following little grammar is extracted from a bigger one (resembling a Haskell-like syntax):

>>> from hypothesis.extra.lark import from_lark
>>> from lark import Lark
>>> source = r"""
... PADDING : " "+
... %ignore PADDING
...
... KEYWORD_FORALL.130 : /\bforall\b/
... LOWER_IDENTIFIER.80 : /[a-z][a-z_]*/
...
... type_expr : KEYWORD_FORALL LOWER_IDENTIFIER+ "." LOWER_IDENTIFIER+
... """
>>> grammar = Lark(source, lexer="standard", start="type_expr")
>>> grammar.parse("forall a. a")
Tree(type_expr, [Token(KEYWORD_FORALL, 'forall'), Token(LOWER_IDENTIFIER, 'a'), Token(LOWER_IDENTIFIER, 'a')])
>>> example = from_lark(grammar, start="type_expr").example()
>>> example
'forallaffhbc.in'

The generated example doesn't parse because there's no space between the keyword 'forall' and the identifier(s) generated -- I wouldn't know if hypothesis generated more than one identifier. Notice that the keyword "forall" is enclosed by \b word boundaries in its regexp.
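For illustration, continuing the session above (a sketch, not verified output): the standard lexer swallows everything before the dot as a single LOWER_IDENTIFIER, because /\bforall\b/ cannot match when another word character follows, so no KEYWORD_FORALL token is ever produced.

# Continuing the session above; Lark exposes the standard lexer via Lark.lex().
for tok in grammar.lex(example):
    print(tok.type, repr(tok.value))
# Expected: one LOWER_IDENTIFIER covering 'forallaffhbc', then the anonymous
# "." terminal, then LOWER_IDENTIFIER 'in' -- and no KEYWORD_FORALL anywhere.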
Hmm, I see what you mean, though 'usually' will depend on your application area. Our current token generation is pretty well tuned for common use-cases. Unclear what we should do here - an option to force a specific thing to be drawn between every other terminal seems awkward but might be the best we can do.
I was trying to come up with an "easy" solution; there are a couple of things that we'd need to balance.

The full grammar I'm working on has several things that make hypothesis generate invalid code. I just copied the current code of hypothesis' lark extra with a one-char change, but then I had to make my terminal COMMENT explicit because, I assume, the presence of '^' and '$' in my regexp allows hypothesis to insert it everywhere. Another particular pain point would be the generation of application, which in that grammar is just "fn arg1 arg2 ...". To solve these problems I was thinking that we could generate a stateful machine that the programmer could inherit from and tweak however they need. Each terminal would get a bundle named "terminal_MYTERMINAL", and a rule named "create_terminal_MYTERMINAL" with the following code schema:

@rule(target=terminal_MYTERMINAL, value=st.from_regex(MYTERMINAL_regexp, fullmatch=True))
def create_terminal_MYTERMINAL(self, value):
    return value

(I would say that ignored and declared terminals are never generated, so the programmer fills the gaps.) Each non-terminal would get another bundle "nonterminal_rulename" and a rule "create_nonterminal_rulename" following the shape of the grammar rule. For instance, the rule

_datatype_deriving : _NL? KEYWORD_DERIVING _LPAREN _derivations_list _RPAREN
                   | _NL? KEYWORD_DERIVING UPPER_IDENTIFIER

would generate something like:

@rule(target=nonterminal__datatype_deriving, value=(
    st.tuples(terminal__NL, terminal_KEYWORD_DERIVING, terminal__LPAREN, nonterminal__derivations_list, terminal__RPAREN).map(lambda args: u"".join(args))
    | st.tuples(terminal_KEYWORD_DERIVING, terminal__LPAREN, nonterminal__derivations_list, terminal__RPAREN).map(lambda args: u"".join(args))
    | st.tuples(terminal__NL, terminal_KEYWORD_DERIVING, terminal_UPPER_IDENTIFIER).map(lambda args: u"".join(args))
    | st.tuples(terminal_KEYWORD_DERIVING, terminal_UPPER_IDENTIFIER).map(lambda args: u"".join(args))))
def create_nonterminal__datatype_deriving(self, value):
    return value

Maybe (most likely) this schema is both too hard and the usage of the stateful machinery is a hack, but that's the first thing that comes to my mind. Creating a class would allow the programmer to inherit from it and/or manipulate whichever rules they want to tweak. What do you think?
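To make the shape of that concrete, here is an untested sketch of what such a generated class could look like for the little forall grammar above, written against the public hypothesis.stateful API; the class, bundle, and rule names are made up for this illustration and are not part of any proposed implementation.

import hypothesis.strategies as st
from hypothesis.stateful import Bundle, RuleBasedStateMachine, rule


class TypeExprMachine(RuleBasedStateMachine):
    # One bundle per terminal and per non-terminal, following the naming schema.
    terminal_LOWER_IDENTIFIER = Bundle("terminal_LOWER_IDENTIFIER")
    nonterminal_type_expr = Bundle("nonterminal_type_expr")

    @rule(
        target=terminal_LOWER_IDENTIFIER,
        value=st.from_regex(r"[a-z][a-z_]*", fullmatch=True),
    )
    def create_terminal_LOWER_IDENTIFIER(self, value):
        return value

    @rule(
        target=nonterminal_type_expr,
        name=terminal_LOWER_IDENTIFIER,
        body=terminal_LOWER_IDENTIFIER,
    )
    def create_nonterminal_type_expr(self, name, body):
        # Join the pieces with explicit padding so the result stays parseable.
        return "forall " + name + ". " + body


# Running the generated TestCase exercises the rules.
TestTypeExpr = TypeExprMachine.TestCase

A subclass could then override create_terminal_LOWER_IDENTIFIER, or add a rule that feeds nonterminal_type_expr values back into grammar.parse as a property check.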
I've just realized that the map in the code schema would (again) create unparsable code by joining the KEYWORD_DERIVING with the UPPER_IDENTIFIER, so inheriting would be too hackish. Another possibility is to do this with a command-line tool that generates the stateful-machine code. That way the programmer edits the generated code and re-runs the tool to keep it in sync with grammar changes.
Passing an explicit strategy to override your COMMENT terminal is pretty much exactly what that option is designed for. Note that the regex pattern for a terminal only applies within that terminal.

I don't think that a state machine is a good intermediate representation - there's way too much indirection and implementation complexity for subclassing to make a good API. In the limit, you might have noticed that Hypothesis strategies are parser combinators? So writing out your grammar as strategies is usually a pretty close match.
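For instance, a hand-rolled version of the little forall grammar above might look like this (an untested sketch using only public hypothesis.strategies combinators, with the required separators made explicit where they matter):

import hypothesis.strategies as st

# Terminals as strategies; fullmatch=True stops from_regex padding the match.
lower_identifier = st.from_regex(r"[a-z][a-z_]*", fullmatch=True)

# type_expr : KEYWORD_FORALL LOWER_IDENTIFIER+ "." LOWER_IDENTIFIER+
type_expr = st.tuples(
    st.just("forall"),
    st.lists(lower_identifier, min_size=1).map(" ".join),
    st.lists(lower_identifier, min_size=1).map(" ".join),
).map(lambda parts: f"{parts[0]} {parts[1]}. {parts[2]}")

Recursive rules can be expressed with st.deferred, and alternation with the | operator on strategies.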
I'm not really sure I follow your line of thought in this regard, and I haven't tested the explicit option yet.

That strategies are parser combinators is not something I had thought about - but yes, they can be regarded as such! How do you think we can proceed? Bottling all the magic within a single call to from_lark?
Yeah, the bottled-magic problem is hard. Unfortunately I haven't yet found a way to factor it out and allow parts to be overridden without compromising the simple case. My Python-generating strategy may be of interest? On the other hand I've basically hit the limits of grammar-based generation there, and I'm planning to generate and unparse a typed AST in the next round of work on it.
Even though I would like to, I can't make any promises about working on this. If I can, I will let you know.
Hello! Is there any progress regarding this issue?
I am facing exactly the same problem - of Hypothesis generating strings without white space between tokens - and I would like to know if there is any approach that is easier than (essentially) replicating the grammar with custom strategies. The discussion above suggests that there is no definitive answer.
Nobody has yet volunteered to overhaul hypothesis.extra.lark, so no progress yet. I think your options are to (1) change your grammar, (2) write a custom strategy, or (3) volunteer or pay someone to implement this issue. In the last case I'd be happy to help with design and code review but don't have time to do it myself at the moment.
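As an untested sketch of option (1), the forall grammar from the first comment could be rewritten so the separator is an explicit terminal rather than an %ignore'd one, which forces from_lark to generate it:

from lark import Lark
from hypothesis.extra.lark import from_lark

# Sketch only: _WS is part of the rule, so generated text has to include it.
source = r"""
    _WS : " "+
    KEYWORD_FORALL : "forall"
    LOWER_IDENTIFIER : /[a-z][a-z_]*/
    type_expr : KEYWORD_FORALL (_WS LOWER_IDENTIFIER)+ "." (_WS? LOWER_IDENTIFIER)+
"""
grammar = Lark(source, start="type_expr")
grammar.parse("forall a. a")  # still accepts the hand-written input
from_lark(grammar, start="type_expr").example()  # e.g. 'forall ab.cd'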
Thanks for the reply, Zac! How big of an undertaking do you think this is? Maybe I could lend a hand.
@cacheable
@defines_strategy(force_reusable_values=True)
def from_lark(
    grammar: lark.lark.Lark,
    *,
    start: Optional[str] = None,
    explicit: Optional[Dict[str, st.SearchStrategy[str]]] = None,
    ignored: Union[bool, str, Iterable[str]] = False,
) -> st.SearchStrategy[str]:
    # if `ignored` is True, use all ignored tokens as possible separators
    # if `ignored` is a string, use the specified token as a separator
    # if `ignored` is a list of strings, use any of the specified tokens as separators

It's relatively easy to imagine how white space would be handled, but a general-purpose solution for any kind of ignored token might be trickier.
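Continuing the forall example, a call with the proposed argument might look like this; note that ignored= is hypothetical and only exists in the sketch above, not in Hypothesis today.

# Hypothetical usage of the proposed `ignored` argument; PADDING is the
# %ignore'd terminal of the forall grammar from the first comment.
from_lark(grammar, start="type_expr", ignored="PADDING").example()
# intended result: something like 'forall a b. c', with PADDING drawn as separators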
My API design goal here is to avoid adding anything to the API: we should be able to get this working without asking users for help about how to handle ignored tokens. I think the plan below is a nontrivial but feasible undertaking if you'd like to try opening a PR for it.

Currently, we decide whether to generate an ignored token following each terminal, and then simply join all of the leaves as our result. The problem in this issue is that sometimes an ignored token is required between two terminals! Instead, we could unconditionally draw a flag and a non-empty ignored token each time we draw a terminal, and decide while joining the leaves whether each one actually needs to be included.

The performance characteristics here aren't ideal, but I think they'll be OK in practice, and certainly much better than the strategy not working at all for complex grammars.
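One possible shape for the "is a separator required here?" decision at join time; this is an illustrative sketch only, not Hypothesis internals, and a real implementation would consult the grammar's terminals rather than a crude word-character check.

import re

def join_leaves(terminals, separators):
    """Join generated terminal strings, keeping a drawn separator only where
    gluing two neighbours together would let them run into each other.

    `terminals` is the list of leaf strings and `separators` holds one
    candidate ignored token per gap; both names are made up for this sketch.
    """
    result = [terminals[0]]
    for sep, nxt in zip(separators, terminals[1:]):
        prev = result[-1]
        # Two word characters meeting across the gap would merge into a
        # single token, so the separator is required there.
        if prev and nxt and re.match(r"\w", prev[-1]) and re.match(r"\w", nxt[0]):
            result.append(sep)
        result.append(nxt)
    return "".join(result)

# join_leaves(["forall", "a", ".", "a"], [" ", " ", " "]) -> 'forall a.a'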
As of #1740 we support generating strategies from Lark grammars (🎉), but this is currently a bit under-tested, and I suspect it doesn't support a reasonably wide range of grammars. There are certainly features I know it doesn't handle - for example, it can't deal with the Lark example of a Python grammar because we don't currently support contextual parsing.
The obvious thing to do is to turn all of Lark's examples into tests and flush out any bugs that come up, both in our support and in Lark.
This is a tracking ticket for those bugs - we should make specific issues (or just pull requests) for individual bugs we find.
This ticket will be complete when we have a passing test for each example in the lark examples directory.
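A rough shape for such a test, sketched here with a toy grammar standing in for one loaded from Lark's examples directory:

from hypothesis import given, settings
from hypothesis.extra.lark import from_lark
from lark import Lark

# Placeholder grammar; a real test would load e.g. the JSON or calc grammar
# from lark's examples directory instead.
TOY_GRAMMAR = r"""
    list : "[" [NUMBER ("," NUMBER)*] "]"
    NUMBER : /[0-9]+/
"""
toy = Lark(TOY_GRAMMAR, start="list")


@given(text=from_lark(toy, start="list"))
@settings(max_examples=50)
def test_generated_text_parses(text):
    toy.parse(text)  # the property: everything we generate must parse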