Add option to insert whitespace between tokens in lark strategy #2437

windelbouwman · 2020-05-18T17:39:36Z

Hi,

First off: compliments for the lark strategy, it is fantastic! One feature I would like to request is to be able to inject whitespace, or any token seperator text between the seperate terminals.

Given this python snippet:

from lark import Lark
from hypothesis.extra.lark import from_lark
from hypothesis import given, strategies as st

grammar_text = r"""
start: prog

prog: statement

statement: assignment

assignment: "int" ID "=" expr ";"

expr: NUM
    | expr op expr

op: "+"
  | "-"

%ignore / +/
%declare NUM ID

"""

grammar = Lark(grammar_text)
explicit = {
    'NUM': st.integers().map(str),
    'ID': st.text(alphabet='abcdefghijX', min_size=1)
}

@given(from_lark(grammar, explicit=explicit))
def test_c(prog):
    print(prog)

It generates text like this:

intbX=0-8388864;

Which is not valid C code, since we lack a space between int and bX.

Possible options for the api:

Add an extra keyword to from_lark, something like from_lark(spacing=' ')?

Another option would be to inspect the %ignore tag in the grammar, and generate text matching this ignore tag?

The text was updated successfully, but these errors were encountered:

Zac-HD · 2020-05-19T01:05:03Z

It would definitely be nice to improve this - for Hypothesmith I've just worked around it with a brute-force filter, but that's not a great option.

I think the general solution would be something like "generate the non-ignored parts; while (it doesn't parse to what we generated), generate and insert ignored tokens". Not sure this is feasible and we may need to expose some user-facing control (e.g. preferred ignore rule(s)), but it would be completely automatic in simple cases and ensures that what we generate is always valid.

windelbouwman · 2020-05-19T18:05:00Z

Okay, so you propose to reparse the generated string by lark itself, to verify the generated text? This would be a good check indeed, one issue might be the %declare symbols, provided by the explicit keyword. How would this be handled?

I agree on having a solution which works best out of the box as possible.

Zac-HD · 2020-05-20T03:37:30Z

I don't know how it would be handled yet, but unfortunately I suspect it will have to be much more complicated than the PR you opened 😞

(for which: thank you! I really appreciate the way your contributing use-cases and design ideas via this issue, and that you're writing code too. I just suspect that this is going to be a very very tricky feature to get right)

windelbouwman · 2020-05-20T11:54:55Z

I did not expect this change to be accepted straight away, in fact, I think this will take long to get right. My pull request was a proposal of some of the places where this might work.

What was the strategy in hypothesmith? Feed the resulting code to the compile function and check if the sourcecode compiles? I do not see any whitespace insertion going on.

Another idea would be to test if two terminals would together result in another single terminal, and if so, insert at least one ignore token between them. Again, the presence of the explicit dictionary makes this hard, since we do not know how the explicit entries parse / lex.

What are your other ideas on how to implement this?

Zac-HD · 2020-05-20T12:13:35Z

What was the strategy in hypothesmith? Feed the resulting code to the compile function and check if the sourcecode compiles? I do not see any whitespace insertion going on.

Yep, that's right - here's the implementation. The trouble with forcing insertion of whitespace is that it would almost certainly mess up the indentation in a few cases, and would also prevent generating interesting valid cases like (a)or[b].

Another idea would be to test if two terminals would together result in another single terminal, and if so, insert at least one ignore token between them. Again, the presence of the explicit dictionary makes this hard, since we do not know how the explicit entries parse / lex.

This is the best idea I have so far - we can probably tell if it's valid even in many cases with explicit strategies, but there is indeed an awkward gap in the API. I'm inclined to at least try that out and see how far we get though; even if that's all we can manage I think it will be a substantial improvement on the status quo.

windelbouwman · 2020-05-20T12:47:15Z

Okay, that sounds good, so the idea would be to loop over pairs of terminals, concatenate them, and them check if they match any terminal.

In the case of lark terminals, this could be done using the regex generated by the terminal terminal.pattern.to_regexp(), so we could pre-compile all those terminal patterns into a re object, and simply match the concatenation. If the concatenation matches, we would need to insert blanks.

As for the explicit set of terminals, is there a way to test if a given string could originate from a draw from the explicit set? Maybe users should be able to specify a regex for the explicit set? But then again, they might as well use the regex strategy.

Zac-HD · 2020-05-20T12:54:52Z

It'll either work, or we'll have a better sense of the problem!

As for the explicit set of terminals, is there a way to test if a given string could originate from a draw from the explicit set?

Nope, unfortunately this is not possible in general. Even the regex strategy doesn't track the pattern it was created from.

windelbouwman · 2020-05-21T20:21:30Z

Okay, regarding the explicit dictionary, we might provide the user with a forbidden or prevent keyword, which is a regular expression, which is used to check if glued terminals result in some forbidden string.

For example:

from_lark(
    grammar,
    explicit={'ID': st.text(alphabet='abcdefghijX', min_size=1)},
    prevent=r'[A-Za-z_]+'
)

This would then generate a[b]() but not inta=2, but instead generate space between int and a since inta matches the prevent regex.

Zac-HD · 2020-05-22T03:34:06Z

Can we use the regexes for our known terminals for this? The loop would look roughly like:

Generate the next terminal to add and append to our list.
Concatenate the last two terminals.
If this concatenation matches any of the terminal regexes, generate an ignored token such that (via filter) inserting it between the last two terminals means there are no terminal patterns which match.
If this worked, insert the ignored token in the appropriate spot.

I think this could work for toy examples, and if so we can probably make it work in more complex cases too.

windelbouwman · 2020-05-22T08:28:13Z

Okay, I implemented this idea in #2438 , and I must say, this works like a charm!

Things to note:

lark does not require the space between int and a, so inta parses fine into int and a. Compliments to lark for this ;)
The explicit symbols are ignored here, so these still can cause issues. Ideas needed for this, but might be postponed to later.
I now insert a single space (it works most of the time), since I do not know how to draw from the ignore set, help would be great here, on how to call the gen_ignore function with draw and draw_state arguments. Question might be here, do we want to draw from the ignore set, or insert a space always (if space is in the ignore set)

cddouma-arm · 2022-03-29T12:47:25Z

I did a small experiment with the lark strategy, and ran into this problem as well. I fixed this with adding an option to separate terminals with ignored terminals. Basically, in hypothesis-python/src/hypothesis/extra/lark.py change:

     def gen_ignore(self, data, draw_state):
-        if self.ignored_symbols and data.draw_bits(2) == 3:
+        with_ignore = data.draw_bits(2, forced = 3 if self.seperated_by_ignored_symbols else None)
+        if self.ignored_symbols and with_ignore == 3:
             emit = data.draw(st.sampled_from(self.ignored_symbols))
             self.draw_symbol(data, emit, draw_state)

But there might be other options here.

Zac-HD added enhancement it's not broken, but we want it to be better lark issues related to support for Lark grammars labels May 19, 2020

windelbouwman mentioned this issue May 19, 2020

Ensure spacing tokens between other tokens. #2438

Closed

Zac-HD mentioned this issue Dec 1, 2020

Allow generator expressions in f-strings Instagram/LibCST#419

Merged

Zac-HD mentioned this issue May 22, 2021

Generated programs are missing spaces Zac-HD/hypothesmith#15

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to insert whitespace between tokens in lark strategy #2437

Add option to insert whitespace between tokens in lark strategy #2437

windelbouwman commented May 18, 2020

Zac-HD commented May 19, 2020

windelbouwman commented May 19, 2020

Zac-HD commented May 20, 2020

windelbouwman commented May 20, 2020

Zac-HD commented May 20, 2020

windelbouwman commented May 20, 2020 •

edited

Loading

Zac-HD commented May 20, 2020

windelbouwman commented May 21, 2020 •

edited

Loading

Zac-HD commented May 22, 2020

windelbouwman commented May 22, 2020

cddouma-arm commented Mar 29, 2022

Add option to insert whitespace between tokens in lark strategy #2437

Add option to insert whitespace between tokens in lark strategy #2437

Comments

windelbouwman commented May 18, 2020

Zac-HD commented May 19, 2020

windelbouwman commented May 19, 2020

Zac-HD commented May 20, 2020

windelbouwman commented May 20, 2020

Zac-HD commented May 20, 2020

windelbouwman commented May 20, 2020 • edited Loading

Zac-HD commented May 20, 2020

windelbouwman commented May 21, 2020 • edited Loading

Zac-HD commented May 22, 2020

windelbouwman commented May 22, 2020

cddouma-arm commented Mar 29, 2022

windelbouwman commented May 20, 2020 •

edited

Loading

windelbouwman commented May 21, 2020 •

edited

Loading