[Enterprise Usage] Cannot pass in domestic phone numbers #1513

@npatki


Enterprise users have access to additional RDTs that offer them features such as contextual anonymization.

Problem Description

For phone_number data, the SDV automatically assigns the AnonymizedGeoExtractor, which can parse phone numbers. If my phone numbers are domestic, they do not include an international country code. In this case, the transformer expects the default_country parameter to be provided.

For example, I may have (617) 253-3400, which is a US domestic phone number, so the transformer expects US as the default country.

What happens today

Today the data processor will assign a transformer without default_country attached, so there is an error if I pass in domestic phone numbers.

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.DataFrame(data={
    'id': [0, 1],
    'age': [29, 45],
    'domestic_numbers': ['(617) 253-3400', '(617) 495-1000']
})

metadata = SingleTableMetadata.load_from_dict({
    'primary_key': 'id',
    'columns': {
        'id': { 'sdtype': 'id' },
        'age': { 'sdtype': 'numerical' },
        'domestic_numbers': { 'sdtype': 'phone_number' }
    }
})

synth = GaussianCopulaSynthesizer(metadata, locales=['en_US'])
synth.fit(data)

Output:

ValueError: Phone number (617) 253-3400 is represented in national format. Please provide ``default_country`` for nationally represented numbers when creating the transformer instance.

Expected behavior

If there is a single locale provided (in the locales parameter), then the data processor should:

  1. Parse out the country code. (This is everything after the underscore. For example, en_US would be US.)
  2. Assign the phone_number sdtype to an AnonymizedGeoExtractor with that country code as the default_country.

synth.get_transformers()
{
    ...,
    'domestic_numbers': AnonymizedGeoExtractor(default_country='US')
}
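
The locale rule described above could be sketched roughly as follows. This is only an illustration of the parsing logic, not the SDV's actual implementation: the helper name default_country_from_locales is hypothetical, and how the data processor would pass the result to the transformer is left out.

```python
from typing import Optional, Sequence


def default_country_from_locales(locales: Optional[Sequence[str]]) -> Optional[str]:
    """Return a country code only when exactly one locale is given.

    Hypothetical helper for illustration; not part of the SDV API.
    """
    # Accept a bare string such as 'en_US' as a single locale.
    if isinstance(locales, str):
        locales = [locales]
    # No locale or multiple locales: do not pick a default country.
    if not locales or len(locales) != 1:
        return None
    # The country code is everything after the first underscore.
    _, sep, country = locales[0].partition('_')
    return country.upper() if sep and country else None


print(default_country_from_locales(['en_US']))           # US
print(default_country_from_locales(['en_US', 'fr_FR']))  # None
print(default_country_from_locales(['en']))              # None
```

Returning None for multiple locales matches the request that no default country be passed in that case, leaving the transformer to rely on international country codes.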

Additional context

  • If there are multiple locales, then do not pass in a default country. In this case, the phone numbers should have an international country code.
  • The AnonymizedGeoExtractor will check for both the default country and international numbers (as a fallback). So no need to worry about other cases.

Labels: feature request (Request for a new feature), feature:preprocessing (Related to transforming the raw data for modeling & sampling)
