"Chinese dot" breaks the fingerprint method #782

Open
reinoldus opened this issue Feb 3, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@reinoldus

Hi everyone,

I noticed that if the "Chinese dot" (。, the ideographic full stop) is included in the input text, the resulting fingerprint hash is always "ffffffffffffffff":

from trafilatura.deduplication import Simhash

print(Simhash("行政長官岑浩").to_hex())    # 13dd8c82d4634a48
print(Simhash("欢迎与我交流").to_hex())    # 58429793861fa351
print(Simhash("行政長官岑浩。").to_hex())  # ffffffffffffffff
print(Simhash("欢迎与我交流。").to_hex())  # ffffffffffffffff

I hope the encoding is not all messed up.

@adbar adbar added the enhancement New feature or request label Feb 3, 2025
@reinoldus (Author) commented Feb 4, 2025

I did some more debugging, and it seems the sample_tokens function returns an empty list: the "Chinese dot" is not recognized as punctuation, so it is never stripped from the token, and the isalnum() check then fails:

        if token.isalnum():
            tokens.append(token)
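
This is easy to verify in a REPL; a minimal sketch, independent of trafilatura internals:

    import string

    token = "行政長官岑浩。"
    # string.punctuation only covers ASCII, so the strip below is a no-op
    # and the ideographic full stop stays in the token
    print(token.strip(string.punctuation))  # 行政長官岑浩。
    # the leftover punctuation makes the whole token non-alphanumeric
    print(token.isalnum())           # False
    print("行政長官岑浩".isalnum())   # True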

We have a couple of options here:

Use the "regex" lib to remove unicode punctuation

    import regex

    # \p{P} matches any Unicode punctuation character, including 。
    strip_punct = regex.compile(r'\p{P}+')
    token = strip_punct.sub('', token)
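
Applied to the failing token this yields, for instance:

    print(strip_punct.sub('', "行政長官岑浩。"))  # 行政長官岑浩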

Or this approach, using only the standard library:

    import sys
    import unicodedata

    # map every code point whose Unicode category starts with "P"
    # (punctuation) to None
    tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                        if unicodedata.category(chr(i)).startswith('P'))

    def remove_punctuation(text):
        return text.translate(tbl)
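
Which behaves the same way on the example above:

    print(remove_punctuation("行政長官岑浩。"))  # 行政長官岑浩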

Or, for this particular case, we could just append the whole input text if no tokens have been found (probably the worst option):

    if not tokens:
        tokens.append(inputstring)

Let me know if any of those options would work for you and I will open a PR.

@adbar (Owner) commented Feb 4, 2025

Hi @reinoldus, thanks for the suggestions; this actually amounts to a tokenization issue.

I believe the most efficient approach is to add the character to string.punctuation in a global variable and then to adapt this line in the deduplication module (just above the isalnum() check):
token = token.strip(string.punctuation)
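
A minimal sketch of what that could look like (PUNCTUATION is a hypothetical name, not necessarily what the module would use):

    import string

    # hypothetical module-level constant: ASCII punctuation plus the
    # ideographic full stop (further CJK punctuation could be added here)
    PUNCTUATION = string.punctuation + "。"

    # in sample_tokens, instead of token.strip(string.punctuation):
    token = token.strip(PUNCTUATION)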

Feel free to draft a pull request on it; I can review and integrate it.
