
[FEAT] block_on Method Should Support Arrays And SUBSTR #2563

@ModeMonkey

Description

Is your proposal related to a problem?

The block_on method does not seem to support substr on array columns. I've provided sample code below to demonstrate the point. With the release of v4.0.6 and its new PairwiseStringDistanceFunctionAtThresholds function for comparing similarities between arrays, it seems reasonable to expect data to be stored in arrays and for block_on to work on those arrays.

Describe the solution you'd like

I would like the block_on method to support substr on array columns.

Describe alternatives you've considered

One alternative is to provide a string blocking rule like:

EXISTS (
    SELECT 1
    FROM UNNEST(l.surname) AS l_surname_struct,
         UNNEST(r.surname) AS r_surname_struct
    WHERE SUBSTR(l_surname_struct.unnest, 1, 3) = SUBSTR(r_surname_struct.unnest, 1, 3)
)

This blocking rule works with at least DuckDB.

However, this ends up being incredibly slow. Adding a second UNNEST/SUBSTR pair, e.g. for first_name, fails to complete in a timely fashion on a 32-core machine: after 30 minutes it still had not estimated the probability that two random records match on the fake_1000 sample.

Perhaps the omission of substr support on arrays in block_on is intentional, because it can't be made performant. If that's the case, perhaps an error message could notify the user of the performance issue and suggest the blocking rule string above for anyone who really wants this functionality?
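As a stopgap, one workaround is to precompute a derived array column of prefixes before loading the data, and then block on that column with a plain equality rule plus arrays_to_explode. This is just a sketch; the helper name prefix_array and the column name surname_prefix are my own, not part of Splink:

```python
def prefix_array(values, n=3):
    """Return the first n characters of each non-empty string in an array column."""
    return [v[:n] for v in values if isinstance(v, str) and v]

# Derive a prefix column before building the SettingsCreator, e.g.:
#   df["surname_prefix"] = df["surname"].apply(prefix_array)
# then block on it with an ordinary exploding rule:
#   block_on("surname_prefix", arrays_to_explode=["surname_prefix"])
print(prefix_array(["Smith", "Jones"]))  # ['Smi', 'Jon']
```

This trades storage for speed: the substr work happens once per record instead of once per candidate pair inside the UNNEST join.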

Additional context

Here is some sample code to demonstrate the point. I based it on the intro Splink tutorial. I made this while trying to understand how to implement the new PairwiseStringDistanceFunctionAtThresholds method.

from splink import splink_datasets
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on, DuckDBAPI

df = splink_datasets.fake_1000
df = df.drop(columns=["cluster"])
df.head(5)

def make_list(in_value):
    # Wrap non-empty strings in a single-element list; everything else becomes empty
    if isinstance(in_value, str) and len(in_value) > 0:
        return [in_value]
    return []

# turn all columns into arrays
for column in df.columns:
    if column != "unique_id":
        df[column] = df[column].apply(make_list)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.PairwiseStringDistanceFunctionAtThresholds("first_name", "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
        cl.PairwiseStringDistanceFunctionAtThresholds("surname",    "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
        cl.PairwiseStringDistanceFunctionAtThresholds("dob",        "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
        cl.PairwiseStringDistanceFunctionAtThresholds("city",       "levenshtein", [0,1,2,3,4,5,6,7,8,9]).configure(term_frequency_adjustments=True),
        cl.PairwiseStringDistanceFunctionAtThresholds("email",      "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "city", arrays_to_explode=["first_name","city"]),
        block_on("surname",            arrays_to_explode=["surname"]),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, db_api=DuckDBAPI())

deterministic_rules = [
    block_on("first_name", "dob",        arrays_to_explode=["first_name", "dob"]),
    block_on("email",                    arrays_to_explode=["email"]),
    block_on("substr(first_name, 1, 2)", arrays_to_explode=["first_name"]),
]

linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
# ^^^^^^^ Error appears here, seemingly with the addition of the substr block_on rule

# vvvvvvvv This is the rest of the code, which works after removing the third block_on rule with the substr: 
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
training_blocking_rule = block_on("first_name", "surname")
training_session_fname_sname = (
    linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
)
training_blocking_rule = block_on("dob")
training_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    training_blocking_rule
)
df_predictions = linker.inference.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.5
)
clusters.as_pandas_dataframe(limit=10)

I know the comparison rules above aren't great; I was just making a proof of concept. Overall, I really love the new PairwiseStringDistanceFunctionAtThresholds method: it makes things so much easier than the alternatives I've messed with. Great work on that feature!

Happy to help, though I don't really know where to start.

Metadata

Labels: enhancement (New feature or request)