Idaho - uid/dedupe #24

tarakc02 · 2024-09-18T01:50:24Z

As discussed, I took a look at the Idaho uid code. The code itself (using pd.factorize) makes sense to me, and I assume other cleaning steps (including flattening/deduplication of duplicate stints) happen downstream. These notes are only about the logic for the UID.
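For anyone reading along without the repo open, here is a minimal sketch of what a name-based uid built with pd.factorize might look like. This is not the actual Idaho code: the column names (`first_name`, `last_name`, `agency`), the normalization step, and the example rows are all made up for illustration.

```python
import pandas as pd

# Hypothetical employment-history rows; names and agencies are invented.
stints = pd.DataFrame({
    "first_name": ["Michael", "Mike", "Ana", "Ana"],
    "last_name": ["Smith", "Smith", "Lopez", "Lopez"],
    "agency": ["Agency A", "Agency B", "Agency C", "Agency D"],
})

# Normalize case and whitespace, then join first and last name into one key.
name_key = (
    stints["first_name"].str.strip().str.upper()
    + "|"
    + stints["last_name"].str.strip().str.upper()
)

# pd.factorize returns (codes, uniques); each distinct key gets one integer
# code, numbered in order of first appearance.
stints["uid"], _ = pd.factorize(name_key)
```

In this toy table the two "Ana Lopez" rows share a uid whether or not they are the same person, and "Michael Smith" and "Mike Smith" get different uids even if they are.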

It makes the assumption that first+last name consistently and uniquely identifies a single person. There are two ways this can cause issues:

collision on first+last name (overmatching): if two or more people share the same name, they will appear as a single person in this table. I did a bit of review, and couldn't find any obvious places where that was happening (based on looking at employment history - agencies and dates). But it's an assumption we can't really test.
the same individual is recorded with different names in different parts of their employment history (undermatching): depending on how the POST agency compiles the data, this may or may not be possible. I've seen releases in other states where a person with the same id number changes their last name (e.g. after getting married), as well as occasions where the first name is represented differently in different records (like "Mike" vs. "Michael" or whatever). It takes a bit of effort to test whether that's happening here, and it may not be happening at all if Idaho POST uses a canonical name table rather than recording the name for each line in the employment history.
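One way to make the overmatching review a bit more systematic is to flag uids with overlapping stints at different agencies. This is only a rough sketch under assumed column names (`uid`, `agency`, `start_date`, `end_date`) and made-up rows. An overlap is a reason to look closer, not proof of a collision, since one person can legitimately hold two positions at once.

```python
import pandas as pd

# Hypothetical stints table; column names and rows are assumptions.
stints = pd.DataFrame({
    "uid": [0, 0, 1, 1],
    "agency": ["Agency A", "Agency B", "Agency C", "Agency D"],
    "start_date": pd.to_datetime(
        ["2010-01-01", "2012-06-01", "2005-03-01", "2011-01-01"]
    ),
    "end_date": pd.to_datetime(
        ["2015-12-31", "2014-01-01", "2010-12-31", None]
    ),
})

# Treat a missing end date as "still employed" by filling in a far-future date.
filled = stints.assign(
    end_date=stints["end_date"].fillna(pd.Timestamp("2100-01-01"))
)

# Pair every stint with every other stint recorded under the same uid.
pairs = filled.merge(filled, on="uid", suffixes=("_a", "_b"))

# Keep each pair of different agencies once, and only if the date ranges overlap.
overlaps = pairs[
    (pairs["agency_a"] < pairs["agency_b"])
    & (pairs["start_date_a"] <= pairs["end_date_b"])
    & (pairs["start_date_b"] <= pairs["end_date_a"])
]

# uids worth a manual look.
suspect_uids = sorted(overlaps["uid"].unique().tolist())
```

Here uid 0 is flagged because the Agency A and Agency B stints overlap, while uid 1 is not because the second stint starts after the first one ends.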

In terms of the costs of being wrong, if we have to have one of these issues we'd probably prefer to err on the side of undermatching. But in general there's not a great way to measure if/how much this is happening. The code we have results in 8,342 unique currently active LEOs in Idaho (based on no end date), which is higher than the published stats I can find -- this doesn't prove anything but is at least consistent with more undermatching than overmatching.
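For clarity on what "currently active, based on no end date" means, the count can be sketched like this. The column names (`uid`, `end_date`) and the rows are hypothetical, and this toy table obviously does not reproduce the 8,342 figure.

```python
import pandas as pd

# Hypothetical stints table; a missing end_date means the stint is still open.
stints = pd.DataFrame({
    "uid": [0, 0, 1, 2, 2],
    "end_date": pd.to_datetime(
        ["2015-12-31", None, "2010-12-31", None, None]
    ),
})

# Keep open stints, then count distinct uids so that a person with two
# open stints is only counted once.
active = stints[stints["end_date"].isna()]
n_active = active["uid"].nunique()
```

Any undermatching inflates this number, because one person split across two uids with open stints is counted twice.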

If we were using this data to count, e.g. calculating police per capita or something, I would not feel comfortable making such strong assumptions to generate the uid. But in our case the actual use of the data is as a lookup tool. And we have at least one other state where we expect undermatching, which is California, because we don't dedupe across police and prison data.

But in this case it makes me wonder if there is any value to creating the id at all? Is there a reason we can't publish the table without the ID, since we cannot guarantee it is a unique person identifier?


tarakc02 self-assigned this Dec 19, 2024
