Cleaning Dirty Location Data
We identified the following 7 fields as containing location info: "Ocean", "Continent", "Country", "Department / Province / State", "County", "City/Town/Hamlet", "Specific Locale"
The listunique.py script was written to iterate over the CSV of dirty data and create 7 files containing all the non-empty values for each field (including duplicates). The results of this script are available in the dirty_locations folder.
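A minimal sketch of that extraction, assuming the dirty data is a CSV with a header row containing the 7 field names above (the input path and output file names here are placeholders, not necessarily what listunique.py uses):

```python
import csv
import os

# The 7 location columns identified above.
FIELDS = ["Ocean", "Continent", "Country", "Department / Province / State",
          "County", "City/Town/Hamlet", "Specific Locale"]

os.makedirs("dirty_locations", exist_ok=True)

with open("dirty_data.csv", newline="", encoding="utf-8") as src:
    reader = csv.DictReader(src)
    # One output file per field; every non-empty value is kept, duplicates included.
    outputs = {field: open(os.path.join("dirty_locations", field.replace("/", "_") + ".txt"),
                           "w", encoding="utf-8")
               for field in FIELDS}
    for row in reader:
        for field in FIELDS:
            value = (row.get(field) or "").strip()
            if value:
                outputs[field].write(value + "\n")
    for handle in outputs.values():
        handle.close()
```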
2 of the 7 fields contained no data at all (at least in the sample set): Ocean and City/Town/Hamlet.
The uniques.sh shell script was created to extract lists of unique values from the 7 output files created above. It also generates a word count file showing the number of lines in all the output files, which is useful for understanding what percentage of the data is invalid. This is the only script not written in Python, because this sort of task is trivial using standard unix commands. These results can also be found in the dirty_locations folder.
Two dictionaries were created from publicly available resources. The first is a list of ISO 3166-1 standard country short names, pulled from Wikipedia. The other is a large list of world cities with populations greater than 1000. This list came from GeoNames. These dictionaries are stored here.
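Loading those two dictionaries into sets might look roughly like this (the file names are placeholders, and the GeoNames layout is assumed to be the standard tab-separated dump with the place name in the second column):

```python
def load_countries(path="iso3166_countries.txt"):
    # One ISO 3166-1 short name per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def load_cities(path="cities1000.txt"):
    # GeoNames export: tab-separated fields, place name assumed in the second column.
    cities = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1:
                cities.add(fields[1])
    return cities
```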
Since we only had 2 good dictionaries, and the sample data had no values in the city field, that left only the Country field to attempt to correct with any confidence. The spellcheck.py script was created to iterate over the full list of Country values from the dirty data and perform the following tasks (a rough sketch follows below):
- Look for the value in the list of known countries
- If it is not found:
  - Add it to a list of misspelled countries
  - See if you can force it to match a value in the dictionary by editing a single character (an edit distance of 1)
  - If that works, add it to another list of "off by one" values along with the match it found
Both files and their corresponding _unique files are available in the dirty_locations folder.
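A rough sketch of that matching logic, assuming the country dictionary has been loaded into a set as above (spellcheck.py itself may differ in detail; raising max_dist to 2 corresponds to the edit-distance-2 idea mentioned below):

```python
def edit_distance(a, b):
    # Standard Levenshtein dynamic-programming recurrence.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def check_countries(values, countries, max_dist=1):
    """Split dirty Country values into misspellings and near-misses."""
    misspelled, off_by_one = [], []
    for value in values:
        if value in countries:
            continue                      # exact match, nothing to fix
        misspelled.append(value)
        for candidate in countries:
            if edit_distance(value, candidate) <= max_dist:
                off_by_one.append((value, candidate))
                break
    return misspelled, off_by_one
```

Note that the first candidate found within max_dist wins, which is how an ambiguous value like "holand" can end up matched to Poland (see the last item in the list of next steps below).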
Using the word count file we can make the following observations about the data:
Total rows of data: 74255
Total data present:
- Country 72084 (97%)
- Department / Province / State 65626 (88%)
- Continent (68%)
- County 48872 (66%)
- Specific Locale 27940 (38%)
- City/Town/Hamlet 0 (0%)
- Ocean 0 (0%)
Total VALID Country data: 22101 (30%)
Valid Country data within 1 edit distance: 63991 (83%) <-- THAT'S A BIG IMPROVEMENT!
Next steps:
- Repeat this process for the remaining location fields. Perhaps use the city dictionary against the locale or county field, or use the much larger "all countries" dataset from GeoNames, which actually contains EVERY location they track, not just countries.
- Modify the spellcheck script to attempt an edit distance of 2. The script already supports this, but one line has been commented out to disable it because running it would not complete within the hours of this hackathon.
- Update the country dictionary to contain historically accurate data (Burma, USSR, etc.).
- This same approach can be used to clean/validate taxonomy fields, given that the necessary dictionary files can be created.
- Use a natural language parser on the Locale field to extract the very precise location info contained there.
- Look for incorrectly corrected data. For example, "holand" is currently detected as a misspelling of Poland.