@zhaoyang868686

Hi guys,

There is a bug in case_insensitive_matching_strategy.

It uses

text = re.sub(anonymized, original, text, flags=re.IGNORECASE)

to replace each anonymized value with its original value in text.

For phone numbers that start with "+", such as "+1-235-234-8740x164", the call becomes:

text = re.sub(pattern='+1-235-234-8740x164', repl='XXX', string='XXX', flags=re.IGNORECASE)

The first parameter, pattern, is always interpreted as a regular expression, so a string that starts with "+" is parsed as regex syntax rather than as literal text, and this leads to an error.

In regular expressions, "+" matches one or more repetitions of the preceding expression, but in the phone number there is nothing before the leading "+" to repeat.
-> re.error: nothing to repeat at position 0
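
The error can be reproduced with the standard library alone, without Presidio. The snippet below is a minimal sketch; the sample sentence and the variable names are made up for illustration and are not taken from the library.

```python
import re

anonymized = "+1-235-234-8740x164"
text = "Call me at +1-235-234-8740X164 tomorrow."  # illustrative input

# Unescaped: the leading "+" is parsed as a quantifier with nothing to repeat.
try:
    re.sub(anonymized, "12345678", text, flags=re.IGNORECASE)
    raised = False
    error_message = ""
except re.error as exc:
    raised = True
    error_message = str(exc)

# Escaped: regex metacharacters in the value are matched as literal characters.
fixed = re.sub(re.escape(anonymized), "12345678", text, flags=re.IGNORECASE)

print(raised, error_message)
print(fixed)
```

Case-insensitive matching still works after escaping, because re.escape only affects metacharacters and leaves letters such as "x" untouched.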

How to reproduce:

anonymizer = PresidioReversibleAnonymizer()
anonymizer._deanonymizer_mapping.update(new_mapping={'PHONE_NUMBER': {'+1-235-234-8740x164': '12345678'}})
anonymizer.deanonymize(text_to_deanonymize='some text', deanonymizer_matching_strategy=case_insensitive_matching_strategy)

How to fix (escape the anonymized value with re.escape so that it is matched literally):

    for entity_type in deanonymizer_mapping:
        for anonymized, original in deanonymizer_mapping[entity_type].items():
            # Use regular expressions for case-insensitive matching and replacing
            text = re.sub(pattern=re.escape(anonymized),
                          repl=original,
                          string=text,
                          flags=re.IGNORECASE)
    return text
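
For reference, here is a self-contained sketch of the fixed loop that can be run outside the library. The function name deanonymize_case_insensitive and the MappingType alias are hypothetical and chosen only for this example. The sketch also passes the replacement as a function, which is an optional extra precaution and not part of the fix proposed above: re.sub processes backslash escapes in a replacement string, so an original value containing a backslash could otherwise be altered or raise an error.

```python
import re
from typing import Dict

# Hypothetical alias for this sketch: entity type -> {anonymized: original}
MappingType = Dict[str, Dict[str, str]]


def deanonymize_case_insensitive(text: str, deanonymizer_mapping: MappingType) -> str:
    """Replace anonymized values with originals, ignoring case."""
    for entity_type in deanonymizer_mapping:
        for anonymized, original in deanonymizer_mapping[entity_type].items():
            text = re.sub(
                # Escape so that "+", ".", "(" and similar are matched literally.
                pattern=re.escape(anonymized),
                # Assumption beyond the proposed fix: a callable replacement is
                # inserted verbatim, without backslash-escape processing.
                repl=lambda _match, original=original: original,
                string=text,
                flags=re.IGNORECASE,
            )
    return text


mapping = {"PHONE_NUMBER": {"+1-235-234-8740x164": "12345678"}}
result = deanonymize_case_insensitive("Phone: +1-235-234-8740X164", mapping)
print(result)
```

If original values are known never to contain backslashes, passing repl=original as in the fix above behaves the same way.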
