
Idaho - uid/dedupe #24

Open
tarakc02 opened this issue Sep 18, 2024 · 0 comments
As discussed, I took a look at the Idaho uid code. The code itself (using pd.factorize) makes sense to me; I assume other cleaning steps (including flattening/deduplication of duplicate stints) happen downstream. These notes are only about the logic for the UID.
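For reference, a minimal sketch of what a name-based uid built with `pd.factorize` looks like. The column names (`first_name`, `last_name`, `agency`), the sample rows, and the case/whitespace normalization are all assumptions for illustration, not the actual Idaho code.

```python
import pandas as pd

# Hypothetical employment-history rows; column names and values are made up.
stints = pd.DataFrame({
    "first_name": ["Ann", "Ann", "Bob", "ann "],
    "last_name": ["Lee", "Lee", "Ruiz", "LEE"],
    "agency": ["Boise PD", "Ada County SO", "Boise PD", "Nampa PD"],
})

# Assumed normalization: strip whitespace and upper-case so trivial
# variants of the same name share one key.
name_key = (
    stints["first_name"].str.strip().str.upper()
    + "|"
    + stints["last_name"].str.strip().str.upper()
)

# factorize assigns one integer code per distinct key,
# numbered in order of first appearance.
codes, uniques = pd.factorize(name_key)
stints["uid"] = codes
```

Every row with the same first+last key gets the same `uid`, which is exactly the assumption discussed below: the id is only as good as the name is as an identifier.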

It makes the assumption that first+last name consistently and uniquely identifies a single person. There are two ways this can cause issues:

  1. collision on first+last name (overmatching): if two or more people share the same name, they will appear as a single person in this table. I did a bit of review, and couldn't find any obvious places where that was happening (based on looking at employment history - agencies and dates). But it's an assumption we can't really test.
  2. the same individual is recorded with different names in different parts of their employment history (undermatching): depending on how the POST agency compiles the data, this may or may not be possible. I've seen releases in other states where a person with the same id number changes their last name (e.g. after getting married), as well as occasions where the first name is represented differently in different records (like "Mike" vs. "Michael" or whatever). It takes a bit of effort to test whether that's happening here, and it may not be happening at all if Idaho POST uses a canonical name table rather than recording the name for each line in the employment history.
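One way to automate the overmatching review described in item 1 is to flag any uid whose stints at different agencies overlap in time. This is only a sketch: the column names (`uid`, `agency`, `start_date`, `end_date`), the sample rows, and the helper `overlapping_uids` are hypothetical. A flagged uid is a candidate for manual review, not proof of a collision, since one person can legitimately hold two positions at once.

```python
import pandas as pd

# Hypothetical stints table; a missing end_date means the stint is ongoing.
stints = pd.DataFrame({
    "uid": [0, 0, 1, 1],
    "agency": ["A", "B", "A", "C"],
    "start_date": pd.to_datetime(
        ["2010-01-01", "2012-06-01", "2015-01-01", "2018-01-01"]
    ),
    "end_date": pd.to_datetime(
        ["2015-12-31", "2014-01-01", "2017-12-31", None]
    ),
})


def overlapping_uids(df):
    """Return uids with time-overlapping stints at different agencies."""
    # Self-join on uid to compare every pair of stints for the same uid.
    pairs = df.merge(df, on="uid", suffixes=("_a", "_b"))
    # Keep each unordered pair of distinct agencies once.
    pairs = pairs[pairs["agency_a"] < pairs["agency_b"]]
    # Two intervals overlap when each starts before the other ends;
    # a missing end date is treated as "has not ended yet".
    a_starts_before_b_ends = (
        (pairs["start_date_a"] <= pairs["end_date_b"])
        | pairs["end_date_b"].isna()
    )
    b_starts_before_a_ends = (
        (pairs["start_date_b"] <= pairs["end_date_a"])
        | pairs["end_date_a"].isna()
    )
    overlap = a_starts_before_b_ends & b_starts_before_a_ends
    return sorted(pairs.loc[overlap, "uid"].unique().tolist())


flagged = overlapping_uids(stints)
```

This check says nothing about undermatching; that would need some other signal (for example an agency-issued id number, if one exists in the release) to compare names against.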

In terms of the costs of being wrong, if we have to live with one of these issues, we'd probably prefer to err on the side of undermatching. But in general there's not a great way to measure whether, or how much, this is happening. The code we have results in 8,342 unique currently active LEOs in Idaho (based on no end date), which is higher than the published stats I can find. This doesn't prove anything, but it is at least consistent with more undermatching than overmatching.
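The "currently active" count is roughly the following: the number of distinct uids with at least one stint that has no end date. The column names and sample rows here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stints table; a missing end_date means the stint is ongoing.
stints = pd.DataFrame({
    "uid": [0, 0, 1, 2, 2],
    "end_date": pd.to_datetime(
        ["2014-01-01", None, "2017-12-31", None, None]
    ),
})

# Count distinct uids that have at least one open-ended stint.
active = stints.loc[stints["end_date"].isna(), "uid"].nunique()
```

Undermatching splits one person across several uids and pushes this count up, while overmatching merges people and pulls it down, which is why a count above the published figures leans toward undermatching.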

If we were using this data to count, e.g. calculating police per capita or something, I would not feel comfortable making such strong assumptions to generate the uid. But in our case the actual use of the data is as a lookup tool. And we have at least one other state where we expect undermatching, which is California, because we don't dedupe across police and prison data.

But in this case it makes me wonder if there is any value to creating the id at all? Is there a reason we can't publish the table without the ID, since we cannot guarantee it is a unique person identifier?

tarakc02 self-assigned this Dec 19, 2024