-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to insert whitespace between tokens in lark strategy #2437
Comments
It would definitely be nice to improve this - for Hypothesmith I've just worked around it with a brute-force filter, but that's not a great option. I think the general solution would be something like "generate the non-ignored parts; while (it doesn't parse to what we generated), generate and insert ignored tokens". Not sure this is feasible and we may need to expose some user-facing control (e.g. preferred ignore rule(s)), but it would be completely automatic in simple cases and ensures that what we generate is always valid. |
Okay, so you propose to reparse the generated string by lark itself, to verify the generated text? This would be a good check indeed, one issue might be the I agree on having a solution which works best out of the box as possible. |
I don't know how it would be handled yet, but unfortunately I suspect it will have to be much more complicated than the PR you opened 😞 (for which: thank you! I really appreciate the way your contributing use-cases and design ideas via this issue, and that you're writing code too. I just suspect that this is going to be a very very tricky feature to get right) |
I did not expect this change to be accepted straight away, in fact, I think this will take long to get right. My pull request was a proposal of some of the places where this might work. What was the strategy in hypothesmith? Feed the resulting code to the compile function and check if the sourcecode compiles? I do not see any whitespace insertion going on. Another idea would be to test if two terminals would together result in another single terminal, and if so, insert at least one ignore token between them. Again, the presence of the What are your other ideas on how to implement this? |
Yep, that's right - here's the implementation. The trouble with forcing insertion of whitespace is that it would almost certainly mess up the indentation in a few cases, and would also prevent generating interesting valid cases like
This is the best idea I have so far - we can probably tell if it's valid even in many cases with explicit strategies, but there is indeed an awkward gap in the API. I'm inclined to at least try that out and see how far we get though; even if that's all we can manage I think it will be a substantial improvement on the status quo. |
Okay, that sounds good, so the idea would be to loop over pairs of terminals, concatenate them, and them check if they match any terminal. In the case of lark terminals, this could be done using the regex generated by the terminal As for the explicit set of terminals, is there a way to test if a given string could originate from a draw from the explicit set? Maybe users should be able to specify a regex for the explicit set? But then again, they might as well use the regex strategy. |
It'll either work, or we'll have a better sense of the problem!
Nope, unfortunately this is not possible in general. Even the regex strategy doesn't track the pattern it was created from. |
Okay, regarding the For example: from_lark(
grammar,
explicit={'ID': st.text(alphabet='abcdefghijX', min_size=1)},
prevent=r'[A-Za-z_]+'
) This would then generate |
Can we use the regexes for our known terminals for this? The loop would look roughly like:
I think this could work for toy examples, and if so we can probably make it work in more complex cases too. |
Okay, I implemented this idea in #2438 , and I must say, this works like a charm! Things to note:
|
I did a small experiment with the lark strategy, and ran into this problem as well. I fixed this with adding an option to separate terminals with ignored terminals. Basically, in def gen_ignore(self, data, draw_state):
- if self.ignored_symbols and data.draw_bits(2) == 3:
+ with_ignore = data.draw_bits(2, forced = 3 if self.seperated_by_ignored_symbols else None)
+ if self.ignored_symbols and with_ignore == 3:
emit = data.draw(st.sampled_from(self.ignored_symbols))
self.draw_symbol(data, emit, draw_state) But there might be other options here. |
Hi,
First off: compliments for the lark strategy, it is fantastic! One feature I would like to request is to be able to inject whitespace, or any token seperator text between the seperate terminals.
Given this python snippet:
It generates text like this:
Which is not valid C code, since we lack a space between
int
andbX
.Possible options for the api:
Add an extra keyword to
from_lark
, something likefrom_lark(spacing=' ')
?Another option would be to inspect the
%ignore
tag in the grammar, and generate text matching this ignore tag?The text was updated successfully, but these errors were encountered: