I did some more debugging and it seems the sample_tokens function returns an empty list: because the "Chinese dot" (。) is not detected as punctuation, it is never stripped from the tokens, so the subsequent isalnum check fails. One option would be to strip all Unicode punctuation instead:
```python
import sys
import unicodedata

# Translation table mapping every Unicode punctuation character
# (category "P*") to None, so translate() removes all of them.
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)
```
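For reference, a quick demonstration of the underlying mismatch (a sketch, assuming tokens are stripped with string.punctuation as mentioned below): the ideographic full stop is Unicode punctuation (category Po) but not part of the ASCII-only string.punctuation, so stripping leaves it attached and the token fails isalnum:

```python
import string
import unicodedata

print('。' in string.punctuation)                    # False: string.punctuation is ASCII-only
print(unicodedata.category('。'))                    # 'Po': Unicode classifies it as punctuation
print('天气。'.strip(string.punctuation).isalnum())  # False: the dot survives the strip
```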
Alternatively, for this particular case we could just append the whole text if no tokens have been found (probably the worst option):
```python
if not tokens:
    tokens.append(inputstring)
```
Let me know if either of these options would work for you and I will open a PR.
Hi @reinoldus, thanks for the suggestions. This actually amounts to a tokenization issue.
I believe the most efficient approach is to store string.punctuation plus this character in a global variable and then use it in this line of the deduplication module (just above): `token = token.strip(string.punctuation)`
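A minimal sketch of what that change could look like (the constant name, the wrapper function, and the exact set of added characters are placeholders, not the module's actual code):

```python
import string

# Hypothetical global constant: ASCII punctuation extended with common
# CJK punctuation marks, including the ideographic full stop.
PUNCTUATION = string.punctuation + '。、，！？；：'

def clean_token(token):
    # Would replace token.strip(string.punctuation) in the deduplication module.
    return token.strip(PUNCTUATION)

print(clean_token('天气。'))            # -> 天气
print(clean_token('天气。').isalnum())  # -> True, so the token is kept
```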
Feel free to draft a pull request on it, I could review and integrate it.
Hi everyone,
I noticed that if this "Chinese dot" (。) is included in the input, the fingerprint hash is always "ffffffffffffffff".
I hope the encoding is not all messed up.
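That degenerate value is consistent with the empty token list described above. In a simhash-style fingerprint (a sketch of the general mechanism, not necessarily this library's actual implementation), zero input features leave every bit-position counter at 0, and a `>= 0` test then sets all 64 bits:

```python
import hashlib

def simhash64(tokens):
    # Per-bit counters: +1 if a token's hash has that bit set, -1 otherwise.
    sums = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode('utf-8')).digest()[:8], 'big')
        for bit in range(64):
            sums[bit] += 1 if (h >> bit) & 1 else -1
    # With no tokens every counter stays 0, so a ">= 0" test sets every bit.
    result = 0
    for bit in range(64):
        if sums[bit] >= 0:
            result |= 1 << bit
    return result

print(f'{simhash64([]):x}')               # ffffffffffffffff
print(f'{simhash64(["天气", "很好"]):x}')  # a non-degenerate value
```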