"Chinese dot" breaks the fingerprint method #782

Open
reinoldus opened this issue Feb 3, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@reinoldus

Hi everyone,

I noticed that if the "Chinese dot" (。, the ideographic full stop) is included in the input text, the resulting fingerprint hash is always "ffffffffffffffff":

from trafilatura.deduplication import Simhash

print(Simhash("行政長官岑浩").to_hex())    # 13dd8c82d4634a48
print(Simhash("欢迎与我交流").to_hex())    # 58429793861fa351
print(Simhash("行政長官岑浩。").to_hex())  # ffffffffffffffff
print(Simhash("欢迎与我交流。").to_hex())  # ffffffffffffffff

I hope the encoding is not all messed up.

@adbar adbar added the enhancement New feature or request label Feb 3, 2025
@reinoldus (Author) commented Feb 4, 2025

I did some more debugging, and it seems the sample_tokens function returns an empty list: the "Chinese dot" is not recognized as punctuation, so it is never stripped from the token, and the isalnum() check then fails:

        if token.isalnum():
            tokens.append(token)
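
This is easy to verify in a REPL; a minimal sketch, independent of trafilatura internals:

    import string

    token = "行政長官岑浩。"
    # string.punctuation only covers ASCII, so the strip below is a no-op
    # and the ideographic full stop stays in the token
    print(token.strip(string.punctuation))  # 行政長官岑浩。
    # the leftover punctuation makes the whole token non-alphanumeric
    print(token.isalnum())           # False
    print("行政長官岑浩".isalnum())   # True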

We have a couple of options here:

Use the "regex" lib to remove unicode punctuation

    import regex

    # \p{P} matches any Unicode punctuation character, including 。
    strip_punct = regex.compile(r'\p{P}+')
    token = strip_punct.sub('', token)
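
Applied to the failing token this yields, for instance:

    print(strip_punct.sub('', "行政長官岑浩。"))  # 行政長官岑浩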

Or this approach, using only the standard library:

    import sys
    import unicodedata

    # map every code point whose Unicode category starts with "P"
    # (punctuation) to None
    tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                        if unicodedata.category(chr(i)).startswith('P'))

    def remove_punctuation(text):
        return text.translate(tbl)
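
Which behaves the same way on the example above:

    print(remove_punctuation("行政長官岑浩。"))  # 行政長官岑浩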

Or, for this particular case, we could just append the whole input text if no tokens have been found (probably the worst option):

    if not tokens:
        tokens.append(inputstring)

Let me know if any of those options would work for you and I will open a PR.

@adbar (Owner) commented Feb 4, 2025

Hi @reinoldus, thanks for the suggestions; this actually amounts to a tokenization issue.

I believe the most efficient approach is to add the character to string.punctuation in a global variable and then to adapt this line in the deduplication module (just above the isalnum() check):
token = token.strip(string.punctuation)
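
A minimal sketch of what that could look like (PUNCTUATION is a hypothetical name, not necessarily what the module would use):

    import string

    # hypothetical module-level constant: ASCII punctuation plus the
    # ideographic full stop (further CJK punctuation could be added here)
    PUNCTUATION = string.punctuation + "。"

    # in sample_tokens, instead of token.strip(string.punctuation):
    token = token.strip(PUNCTUATION)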

Feel free to draft a pull request on it; I can review and integrate it.
