Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to insert whitespace between tokens in lark strategy #2437

Open
windelbouwman opened this issue May 18, 2020 · 11 comments
Open

Add option to insert whitespace between tokens in lark strategy #2437

windelbouwman opened this issue May 18, 2020 · 11 comments
Labels
enhancement it's not broken, but we want it to be better lark issues related to support for Lark grammars

Comments

@windelbouwman
Copy link

Hi,

First off: compliments for the lark strategy, it is fantastic! One feature I would like to request is to be able to inject whitespace, or any token seperator text between the seperate terminals.

Given this python snippet:

from lark import Lark
from hypothesis.extra.lark import from_lark
from hypothesis import given, strategies as st

grammar_text = r"""
start: prog

prog: statement

statement: assignment

assignment: "int" ID "=" expr ";"

expr: NUM
    | expr op expr

op: "+"
  | "-"

%ignore / +/
%declare NUM ID

"""

grammar = Lark(grammar_text)
explicit = {
    'NUM': st.integers().map(str),
    'ID': st.text(alphabet='abcdefghijX', min_size=1)
}

@given(from_lark(grammar, explicit=explicit))
def test_c(prog):
    print(prog)

It generates text like this:

intbX=0-8388864;

Which is not valid C code, since we lack a space between int and bX.

Possible options for the api:

Add an extra keyword to from_lark, something like from_lark(spacing=' ')?

Another option would be to inspect the %ignore tag in the grammar, and generate text matching this ignore tag?

@Zac-HD Zac-HD added enhancement it's not broken, but we want it to be better lark issues related to support for Lark grammars labels May 19, 2020
@Zac-HD
Copy link
Member

Zac-HD commented May 19, 2020

It would definitely be nice to improve this - for Hypothesmith I've just worked around it with a brute-force filter, but that's not a great option.

I think the general solution would be something like "generate the non-ignored parts; while (it doesn't parse to what we generated), generate and insert ignored tokens". Not sure this is feasible and we may need to expose some user-facing control (e.g. preferred ignore rule(s)), but it would be completely automatic in simple cases and ensures that what we generate is always valid.

@windelbouwman
Copy link
Author

Okay, so you propose to reparse the generated string by lark itself, to verify the generated text? This would be a good check indeed, one issue might be the %declare symbols, provided by the explicit keyword. How would this be handled?

I agree on having a solution which works best out of the box as possible.

@Zac-HD
Copy link
Member

Zac-HD commented May 20, 2020

I don't know how it would be handled yet, but unfortunately I suspect it will have to be much more complicated than the PR you opened 😞

(for which: thank you! I really appreciate the way your contributing use-cases and design ideas via this issue, and that you're writing code too. I just suspect that this is going to be a very very tricky feature to get right)

@windelbouwman
Copy link
Author

I did not expect this change to be accepted straight away, in fact, I think this will take long to get right. My pull request was a proposal of some of the places where this might work.

What was the strategy in hypothesmith? Feed the resulting code to the compile function and check if the sourcecode compiles? I do not see any whitespace insertion going on.

Another idea would be to test if two terminals would together result in another single terminal, and if so, insert at least one ignore token between them. Again, the presence of the explicit dictionary makes this hard, since we do not know how the explicit entries parse / lex.

What are your other ideas on how to implement this?

@Zac-HD
Copy link
Member

Zac-HD commented May 20, 2020

What was the strategy in hypothesmith? Feed the resulting code to the compile function and check if the sourcecode compiles? I do not see any whitespace insertion going on.

Yep, that's right - here's the implementation. The trouble with forcing insertion of whitespace is that it would almost certainly mess up the indentation in a few cases, and would also prevent generating interesting valid cases like (a)or[b].

Another idea would be to test if two terminals would together result in another single terminal, and if so, insert at least one ignore token between them. Again, the presence of the explicit dictionary makes this hard, since we do not know how the explicit entries parse / lex.

This is the best idea I have so far - we can probably tell if it's valid even in many cases with explicit strategies, but there is indeed an awkward gap in the API. I'm inclined to at least try that out and see how far we get though; even if that's all we can manage I think it will be a substantial improvement on the status quo.

@windelbouwman
Copy link
Author

windelbouwman commented May 20, 2020

Okay, that sounds good, so the idea would be to loop over pairs of terminals, concatenate them, and them check if they match any terminal.

In the case of lark terminals, this could be done using the regex generated by the terminal terminal.pattern.to_regexp(), so we could pre-compile all those terminal patterns into a re object, and simply match the concatenation. If the concatenation matches, we would need to insert blanks.

As for the explicit set of terminals, is there a way to test if a given string could originate from a draw from the explicit set? Maybe users should be able to specify a regex for the explicit set? But then again, they might as well use the regex strategy.

@Zac-HD
Copy link
Member

Zac-HD commented May 20, 2020

It'll either work, or we'll have a better sense of the problem!

As for the explicit set of terminals, is there a way to test if a given string could originate from a draw from the explicit set?

Nope, unfortunately this is not possible in general. Even the regex strategy doesn't track the pattern it was created from.

@windelbouwman
Copy link
Author

windelbouwman commented May 21, 2020

Okay, regarding the explicit dictionary, we might provide the user with a forbidden or prevent keyword, which is a regular expression, which is used to check if glued terminals result in some forbidden string.

For example:

from_lark(
    grammar,
    explicit={'ID': st.text(alphabet='abcdefghijX', min_size=1)},
    prevent=r'[A-Za-z_]+'
)

This would then generate a[b]() but not inta=2, but instead generate space between int and a since inta matches the prevent regex.

@Zac-HD
Copy link
Member

Zac-HD commented May 22, 2020

Can we use the regexes for our known terminals for this? The loop would look roughly like:

  1. Generate the next terminal to add and append to our list.
  2. Concatenate the last two terminals.
  3. If this concatenation matches any of the terminal regexes, generate an ignored token such that (via filter) inserting it between the last two terminals means there are no terminal patterns which match.
  4. If this worked, insert the ignored token in the appropriate spot.

I think this could work for toy examples, and if so we can probably make it work in more complex cases too.

@windelbouwman
Copy link
Author

Okay, I implemented this idea in #2438 , and I must say, this works like a charm!

Things to note:

  • lark does not require the space between int and a, so inta parses fine into int and a. Compliments to lark for this ;)
  • The explicit symbols are ignored here, so these still can cause issues. Ideas needed for this, but might be postponed to later.
  • I now insert a single space (it works most of the time), since I do not know how to draw from the ignore set, help would be great here, on how to call the gen_ignore function with draw and draw_state arguments. Question might be here, do we want to draw from the ignore set, or insert a space always (if space is in the ignore set)

@cddouma-arm
Copy link

I did a small experiment with the lark strategy, and ran into this problem as well. I fixed this with adding an option to separate terminals with ignored terminals. Basically, in hypothesis-python/src/hypothesis/extra/lark.py change:

     def gen_ignore(self, data, draw_state):
-        if self.ignored_symbols and data.draw_bits(2) == 3:
+        with_ignore = data.draw_bits(2, forced = 3 if self.seperated_by_ignored_symbols else None)
+        if self.ignored_symbols and with_ignore == 3:
             emit = data.draw(st.sampled_from(self.ignored_symbols))
             self.draw_symbol(data, emit, draw_state)

But there might be other options here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement it's not broken, but we want it to be better lark issues related to support for Lark grammars
Projects
None yet
Development

No branches or pull requests

3 participants