The goal is to find the appropriate website for each government entity in the United States by using a Google search.
Currently, it is limited to Texas government entities. The main script is located at search.py.
We read the Texas government entities from the first tab of data/Texas Local Governments.xlsx, perform a Google search query using the entity name, and use some domain-specific logic to either return a suitable url or no url. The results are written to a data/texas_websites_...xlsx file (the most recent is texas_websites_01_10_22.xlsx).
At a high level, we use a two-pass algorithm: on the first pass we look for entities that have their own website, and on the second pass we look for directory listings hosted on one of the websites listed in overrides/valid_urls.csv.
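The two-pass idea can be sketched roughly as follows. This is a minimal illustration, not the logic in search.py: the function names, the result dictionaries with "url" and "title" keys, and the title-matching heuristic are all assumptions made for the example, and the set of directory domains stands in for whatever is loaded from overrides/valid_urls.csv.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a url, lowercased, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def looks_like_own_site(entity_name, result):
    """Hypothetical heuristic: every significant word of the name is in the page title."""
    title = result["title"].lower()
    words = [w for w in entity_name.lower().split() if w not in {"of", "the"}]
    return all(w in title for w in words)


def pick_url(entity_name, results, directory_domains):
    """Return a url for the entity, or None if no search result is suitable."""
    # First pass: prefer a result that appears to be the entity's own website.
    for result in results:
        if domain_of(result["url"]) in directory_domains:
            continue
        if looks_like_own_site(entity_name, result):
            return result["url"]
    # Second pass: fall back to a listing on one of the accepted directory sites.
    for result in results:
        if domain_of(result["url"]) in directory_domains:
            return result["url"]
    return None
```

Keeping the two passes as separate loops means a directory listing is only returned when no result looks like the entity's own site, regardless of the order in which Google returns them.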
If a government entity returns an incorrect url under the current algorithm, you can override the returned url by adding an entry to the csv file overrides/overriden_entities.csv. Provide the entity name exactly as it appears in column A of the resulting data/texas_websites_...xlsx file.
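As a rough illustration of how such an override might be applied, the sketch below assumes each row of the csv file holds an entity name followed by its url, with no header row. The actual column layout of overrides/overriden_entities.csv is not described here, and both function names are hypothetical.

```python
import csv


def load_overrides(lines):
    """Build a mapping of entity name -> override url from csv lines.

    Assumed layout (not confirmed by the project): name in the first column,
    url in the second, no header row.
    """
    overrides = {}
    for row in csv.reader(lines):
        if len(row) >= 2 and row[0]:
            overrides[row[0]] = row[1].strip()
    return overrides


def apply_override(entity_name, generated_url, overrides):
    """Return the override url if the name matches exactly, else the generated url."""
    return overrides.get(entity_name, generated_url)
```

Because the lookup is an exact dictionary match, a name that differs in case or spacing from column A would not trigger the override, which is why the name has to be copied exactly. An open file object can be passed as `lines`, for example `load_overrides(open(path, newline=""))`.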
To install the necessary dependencies, run:
pip install -r requirements.txt
The main script is located at search.py, and the main method is iterate. There are a few tunable parameters to the main method that are worth mentioning.
parallel
- A flag indicating whether to run the urllib requests in parallel. Although this is faster, Google usually blocks requests after a few hundred calls because of the high load.
match_correct
- A flag that, when set to true, only finds urls for entities that have a labelled correct url (as indicated by an entry in column C of the texas_websites_...xlsx spreadsheet) and updates column D to indicate whether the generated url matches the correct url. If the flag is false, the script finds urls for all 5000+ entities.
access_url
- A flag indicating whether to perform an HTTP request on each website returned by the Google result. If false, we rely only on the url itself and the page title to determine the validity of a link. We've found experimentally that results are better when this flag is set to False.
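To make the trade-off behind the parallel flag concrete, here is a small sketch of how a flag like it could switch between sequential and threaded execution. It is an assumption for illustration only: how search.py actually parallelizes its requests is not stated here, and `run_searches`, `search_fn`, and `max_workers` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_searches(entity_names, search_fn, parallel=False, max_workers=8):
    """Call search_fn once per entity name and return the results in input order.

    search_fn is a placeholder for whatever performs a single search query.
    """
    if parallel:
        # Threads suit I/O-bound work such as HTTP requests; map() keeps input order.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(search_fn, entity_names))
    # Sequential fallback: slower, but sends requests at a gentler rate.
    return [search_fn(name) for name in entity_names]
```

Both branches return the same list, so the rest of a pipeline would not need to know which mode was used. The sequential path is the safer choice when the remote service throttles bursts of requests.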