adding unicode friendly clue parsing #92

Open
wants to merge 9 commits into base: master

Conversation

Canonelis
Contributor

@Canonelis Canonelis commented Jul 13, 2021

Changed the clue parsing algorithm to handle unicode characters. The regular expression it now mimics is ^\s*\a+(-\a+)?(\s*-?\s*\d+|\s*(-|\s)\s*inf)\s*$
where "\a" represents all legal clue letters. I needed to avoid any string library functions that allow matching with %d-style syntax when parsing a clue.
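As a rough illustration of what that expression accepts, here is a minimal Python sketch. It is not the PR's Lua code: `LETTER`, `CLUE_RE` and `is_valid_clue` are hypothetical names, and `[^\W\d_]` (roughly "any Unicode letter") is only an assumed stand-in for the "\a" class of legal clue letters.

```python
import re

# Assumption: "\a" (all legal clue letters) is approximated by "any word
# character that is not a digit or underscore". The real set in the PR differs.
LETTER = r"[^\W\d_]"

# Same shape as the expression quoted above, with "\a" replaced by LETTER.
# Note that Python's \d already matches Unicode decimal digits in str patterns.
CLUE_RE = re.compile(
    rf"^\s*{LETTER}+(-{LETTER}+)?(\s*-?\s*\d+|\s*(-|\s)\s*inf)\s*$"
)


def is_valid_clue(text: str) -> bool:
    """Return True if text looks like '<word>[-<word>] <number|inf>'."""
    return CLUE_RE.match(text) is not None
```

With this sketch, `is_valid_clue("apple 3")`, `is_valid_clue("ice-cream-2")` and `is_valid_clue("apple inf")` are accepted, while `"appleinf"` is rejected because "inf" must be preceded by a hyphen or whitespace.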

Changed the clue parsing algorithm to handle unicode characters. The regular expression it now mimics is "^\s*\a+(-\a+)?(\s*-?\s*\d+|\s*(-|\s+)inf)$"
where "\a" represents all legal clue letters. Needed to avoid any string library functions that allow matching with %d-style syntax when parsing a clue.
@Canonelis
Contributor Author

This accepts every letter matched by %a in Lua, plus any Unicode character above 0x0370 except for some whitespace characters and dashes. It is very versatile and still accepts all the same clue formats as before.
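The rule described here can be sketched as a per-character check. This is a Python illustration under stated assumptions, not the PR's Lua implementation: `is_legal_clue_char` is a hypothetical name, %a is assumed to mean ASCII letters only, and "some whitespace characters and dashes" is approximated with Unicode general categories rather than the explicit list the PR uses.

```python
import unicodedata

# Assumption: the excluded whitespace and dash characters are approximated by
# Unicode categories Zs/Zl/Zp (separators) and Pd (dash punctuation).
EXCLUDED_CATEGORIES = {"Zs", "Zl", "Zp", "Pd"}


def is_legal_clue_char(ch: str) -> bool:
    """Return True if a single character may appear in a clue word."""
    if ch.isascii():
        # Assumption: %a is treated as ASCII letters only.
        return ch.isalpha()
    if ord(ch) <= 0x0370:
        # Non-ASCII code points at or below 0x0370 are rejected in this sketch.
        return False
    return unicodedata.category(ch) not in EXCLUDED_CATEGORIES
```

Under these assumptions a Cyrillic or CJK letter passes, while an ideographic space (U+3000) or a non-ASCII dash such as U+2014 does not.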

@Canonelis
Contributor Author

If you're busy, I could provide a fairly exhaustive list of test cases. Is there anything I can do to help you add this to the project?

Fixed range of illegal characters
Make important character modifiers legal
Added compatibility with typing in foreign digit systems. It converts them to normal numbers before further processing.
Fixed a bug where a hyphen was accepted at the beginning of a clue if whitespace preceded it.
Added more whitespace characters.
@Canonelis
Contributor Author

Did some rigorous testing on it, found one flaw. Generated 2000 clues that should work and they did. Generated 5000 clues that shouldn't work and they didn't. This is ready.

Lowercase works well with unicode characters
@Canonelis Canonelis force-pushed the adding-unicode-friendly-clue-parsing branch from a455e14 to 1acaed9 Compare July 28, 2021 05:58
@Canonelis
Contributor Author

Canonelis commented Aug 6, 2021

This would be good to add pretty soon since you have so many foreign decks. Right now the set of characters allowed in clues is fairly arbitrary. If the character's code mod 256 is in the range of A-Z or a-z or À-ÿ then it accepts it, otherwise it rejects it.
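To make the "fairly arbitrary" behaviour concrete, here is a small Python sketch of the check as described in this comment. `old_accepts` is a hypothetical name, and the code only models the description above, not the actual existing Lua code.

```python
def old_accepts(ch: str) -> bool:
    """Model of the described check: only the low byte of the code point matters."""
    low = ord(ch) % 256
    return (
        0x41 <= low <= 0x5A      # A-Z
        or 0x61 <= low <= 0x7A   # a-z
        or 0xC0 <= low <= 0xFF   # À-ÿ
    )
```

For example, Cyrillic "я" (U+044F) has low byte 0x4F, which is "O", so it is accepted, while Cyrillic "а" (U+0430) has low byte 0x30, which is "0", so it is rejected. Two letters of the same alphabet get different results.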

I've played a few games with it now and I think it's done.

@Canonelis
Contributor Author

Canonelis commented Aug 25, 2021

Here are 2 files you can copy and paste from: "near legit clues(but not legit).txt" and "legit clues.txt".
They each were randomly generated and filtered by the regular expression
^\s*\a+(-\a+)?(\s*-?\s*\d+|\s*(-|\s)\s*inf)\s*$
So with the allowed character sets, it gets pretty weird, but for testing purposes it worked great.
Many other writing systems have their own characters for the digits 0-9, so I included them as well, which is why you might not see a normal number in each clue. For displaying and logging the clue, however, the parser writes each one as a normal digit.

Owner

@Ryan6578 Ryan6578 left a comment

Need to review getClueDetails, but these are what I found for now.

@@ -232,6 +233,84 @@ analytics =
sessions = {}
}

----------[ Character sets ]----------

digits_table = {
Owner

Why are these split up into 10 tables?

Contributor Author

The digits_table[0] table lists all the characters that are the number 0 in other languages.
The digits_table[1] table lists all the characters that are the number 1 in other languages.
etc.
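
A minimal Python sketch of that layout, for illustration only: the real `digits_table` in the PR is Lua and its contents are not shown here. The three non-ASCII digit blocks below (Arabic-Indic, Devanagari, Fullwidth) are just sample choices, and `DIGIT_BLOCK_STARTS`, `TO_ASCII` and `normalize_digits` are hypothetical names.

```python
# Assumption: a small sample of Unicode decimal-digit blocks, each of which
# holds the digits 0-9 in ten consecutive code points.
DIGIT_BLOCK_STARTS = [
    0x0030,  # ASCII 0-9
    0x0660,  # Arabic-Indic digits
    0x0966,  # Devanagari digits
    0xFF10,  # Fullwidth digits
]

# digits_table[n] lists every character in the sample that means the digit n.
digits_table = {
    value: [chr(start + value) for start in DIGIT_BLOCK_STARTS]
    for value in range(10)
}

# Reverse lookup: any listed digit character -> its ASCII digit.
TO_ASCII = {ch: str(value) for value, chars in digits_table.items() for ch in chars}


def normalize_digits(text: str) -> str:
    """Replace every known foreign digit with its ASCII equivalent."""
    return "".join(TO_ASCII.get(ch, ch) for ch in text)
```

Grouping by value means a single lookup both recognises a character as a digit and says which digit it is, which is what allows the clue to be displayed and logged with normal digits.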

Owner

What do you mean by that?

Fixed range of illegal characters.
Made a small fix to prevent regex backtracking overflow on chat messages with large gaps of whitespace in them, such as "!A_______________B" where _ is a space.
Condensed the code for adding ranges of illegal characters.
This was checked and produces the exact same table as before.
@Canonelis
Contributor Author

Here are the submitted changes to the code.

A correct greedy way to remove leading and trailing whitespace
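
One way to trim without any pattern backtracking is to scan inward from both ends. The sketch below is in Python rather than the PR's Lua, `trim` is a hypothetical name, and it is not claimed to be the approach the commit actually takes. Python's built-in `str.strip()` does the same job; the explicit loops are only there to show the linear scan.

```python
def trim(text: str) -> str:
    """Remove leading and trailing whitespace in a single linear pass."""
    start, end = 0, len(text)
    # Advance past leading whitespace.
    while start < end and text[start].isspace():
        start += 1
    # Step back over trailing whitespace.
    while end > start and text[end - 1].isspace():
        end -= 1
    return text[start:end]
```

Each character is inspected at most once from either side, so a long run of spaces inside the message, as in the "!A_______________B" example, costs nothing extra.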