Cleaning Dirty Location Data
We identified the following 7 fields as containing location info: "Ocean", "Continent", "Country", "Department / Province / State", "County", "City/Town/Hamlet", "Specific Locale"
The listunique.py script was written to iterate over the CSV of dirty data and create 7 files containing all the non-empty values for each field (including duplicates). The results of this script are available in the dirty_locations folder.
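A minimal sketch of that extraction, assuming the dirty data is a CSV with a header row containing the 7 field names above (the input path and output file names here are placeholders, not necessarily what listunique.py uses):

```python
import csv
import os

# The 7 location columns identified above.
FIELDS = ["Ocean", "Continent", "Country", "Department / Province / State",
          "County", "City/Town/Hamlet", "Specific Locale"]

os.makedirs("dirty_locations", exist_ok=True)

with open("dirty_data.csv", newline="", encoding="utf-8") as src:
    reader = csv.DictReader(src)
    # One output file per field; every non-empty value is kept, duplicates included.
    outputs = {field: open(os.path.join("dirty_locations", field.replace("/", "_") + ".txt"),
                           "w", encoding="utf-8")
               for field in FIELDS}
    for row in reader:
        for field in FIELDS:
            value = (row.get(field) or "").strip()
            if value:
                outputs[field].write(value + "\n")
    for handle in outputs.values():
        handle.close()
```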
2 of the 7 fields contained no data at all (at least in the sample set): Ocean and City/Town/Hamlet.
The uniques.sh shell script was created to extract lists of unique values from the 7 output files created above. It also generates a word count file showing the number of lines in all the output files, which is useful for understanding what percentage of the data is invalid. This is the only script not written in Python, because this sort of task is trivial using standard unix commands. These results can also be found in the dirty_locations folder.
Two dictionaries were created from publicly available resources. The first is a list of ISO 3166-1 standard country short names, pulled from Wikipedia. The other is a large list of world cities with populations greater than 1000. This list came from GeoNames. These dictionaries are stored here.
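Loading those two dictionaries into sets might look roughly like this (the file names are placeholders, and the GeoNames layout is assumed to be the standard tab-separated dump with the place name in the second column):

```python
def load_countries(path="iso3166_countries.txt"):
    # One ISO 3166-1 short name per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def load_cities(path="cities1000.txt"):
    # GeoNames export: tab-separated fields, place name assumed in the second column.
    cities = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1:
                cities.add(fields[1])
    return cities
```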
Since we only had 2 good dictionaries, and the sample data had no values in the city field, that left only the Country field to attempt to correct with any confidence. The spellcheck.py script was created to iterate over the full list of Country values from the dirty data and perform the following tasks (a rough sketch follows below):
- Look for the value in the list of known countries
- If it is not found:
  - Add it to a list of misspelled countries
  - See if you can force it to match a value in the dictionary by editing a single character (an edit distance of 1)
  - If that works, add it to another list of "off by one" values along with the match it found
Both files and their corresponding _unique files are available in the dirty_locations folder.
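A rough sketch of that matching logic, assuming the country dictionary has been loaded into a set as above (spellcheck.py itself may differ in detail; raising max_dist to 2 corresponds to the edit-distance-2 idea mentioned below):

```python
def edit_distance(a, b):
    # Standard Levenshtein dynamic-programming recurrence.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def check_countries(values, countries, max_dist=1):
    """Split dirty Country values into misspellings and near-misses."""
    misspelled, off_by_one = [], []
    for value in values:
        if value in countries:
            continue                      # exact match, nothing to fix
        misspelled.append(value)
        for candidate in countries:
            if edit_distance(value, candidate) <= max_dist:
                off_by_one.append((value, candidate))
                break
    return misspelled, off_by_one
```

Note that the first candidate found within max_dist wins, which is how an ambiguous value like "holand" can end up matched to Poland (see the last item in the list of next steps below).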
Using the word count file we can make the following observations about the data:
Total rows of data: 74255
Total data present:
- Country 72084 (97%)
- Department / Province / State 65626 (88%)
- Continent (68%)
- County 48872 (66%)
- Specific Locale 27940 (38%)
- City/Town/Hamlet 0 (0%)
- Ocean 0 (0%)
Total VALID Country data: 22101 (30%)
Valid Country data within 1 edit distance: 63991 (83%) <-- THAT'S A BIG IMPROVEMENT!
Next steps:
- Repeat this process for the remaining location fields. Perhaps use the city dictionary against the locale or county field, or use the much larger "all countries" dataset from GeoNames, which actually contains EVERY location they track, not just countries.
- Modify the spellcheck script to attempt an edit distance of 2. The script already supports this, but one line has been commented out to disable it because running it would not complete within the hours of this hackathon.
- Update the country dictionary to contain historically accurate data (Burma, USSR, etc.).
- This same approach can be used to clean/validate taxonomy fields, given that the necessary dictionary files can be created.
- Use a natural language parser on the Locale field to extract the very precise location info contained there.
- Look for incorrectly corrected data. For example, "holand" is currently detected as a misspelling of Poland.