GitHub - govwiki/GovernmentEntityScraper: Finds government websites by choosing the appropriate Google Search result

Overview

The goal is to find the appropriate website for each government entity in the United States by using a Google search. The main script is located at search.py.

Program Flow

We read the Texas government entities from the first tab of data/Texas Local Governments.xlsx, run a Google search query on each entity name, and apply some domain-specific logic to return either a suitable URL or no URL. The results are written to a data/texas_websites_...xlsx file (the most recent is texas_websites_01_10_22.xlsx).

Algorithm

At a high level, we use a two-pass algorithm: on the first pass we look for entities that have their own website, and on the second pass we look for directory listings on the websites listed in overrides/valid_urls.csv.
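The two passes can be sketched roughly as follows. This is a minimal illustration, not the logic in search.py: the function names, the (url, title) result shape, and the name-in-title heuristic are all assumptions made for the example.

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def name_tokens(entity_name):
    """Split an entity name into lowercase words, dropping a few filler words."""
    filler = {"of", "the", "and"}
    return [t for t in entity_name.lower().split() if t not in filler]

def pick_url(entity_name, results, directory_hosts):
    """Choose a URL from search results given as (url, page_title) pairs.

    directory_hosts stands in for the hosts listed in overrides/valid_urls.csv.
    """
    tokens = name_tokens(entity_name)

    # Pass 1 (assumed heuristic): a non-directory result whose title
    # contains every significant word of the entity name.
    for url, title in results:
        if host_of(url) in directory_hosts:
            continue
        if all(t in title.lower() for t in tokens):
            return url

    # Pass 2: fall back to the first result hosted on a known directory site.
    for url, _title in results:
        if host_of(url) in directory_hosts:
            return url

    # Neither pass found a suitable result.
    return None
```

Running both passes over the same result list keeps the entity's own website preferred over a directory listing, even when the directory ranks higher in the search results.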

Manual Overrides

If the current algorithm returns an incorrect URL for a government entity, you can override it by adding an entry to the CSV file overrides/overriden_entities.csv. Provide the entity name exactly as it appears in column A of the resulting data/texas_websites_...xlsx file.
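As an illustration of how such an override might be applied, the sketch below loads name/URL pairs into a dictionary and prefers them over the computed result. The two-column layout without a header row and the function names are assumptions; check overrides/overriden_entities.csv for the actual format.

```python
import csv
import io

def load_overrides(csv_file):
    """Read rows of (entity name, url) into a dict.

    Assumes a two-column CSV with no header row (hypothetical layout).
    """
    overrides = {}
    for row in csv.reader(csv_file):
        if len(row) >= 2 and row[0].strip():
            overrides[row[0].strip()] = row[1].strip()
    return overrides

def resolve_url(entity_name, computed_url, overrides):
    """Return the override for this entity if one exists, else the computed URL."""
    # The lookup is an exact string match, which is why the entity name
    # must be written exactly as it appears in column A.
    return overrides.get(entity_name, computed_url)

# Example with an in-memory file instead of the real CSV.
sample = io.StringIO("City of Austin,https://www.austintexas.gov/\n")
overrides = load_overrides(sample)
print(resolve_url("City of Austin", "https://wrong.example.com", overrides))
```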

Set Up

To install the necessary dependencies, run:

pip install -r requirements.txt

Running the Script

The main script is search.py, and its main method is iterate. A few tunable parameters of the main method are worth mentioning.

parallel - A flag indicating whether to run the urllib requests in parallel. Although this is faster, Google usually blocks requests after a few hundred calls because of the high load.
match_correct - When set to true, the script only finds URLs for entities that have a labelled correct URL (indicated by an entry in column C of the texas_websites_...xlsx spreadsheet) and updates column D to show whether the generated URL matches the correct one. When false, the script finds URLs for all 5000+ entities.
access_url - A flag indicating whether to perform an HTTP request on each website returned in the Google results. If false, we rely only on the URL itself and the page title to judge the validity of a link. We've found experimentally that results are better when this flag is set to False.
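For the match_correct check, one plausible way to decide whether a generated URL matches the labelled one is to compare normalized forms rather than raw strings. The sketch below is an assumption about how such a comparison could work, not a description of what search.py does; normalize and urls_match are hypothetical names.

```python
from urllib.parse import urlparse

def normalize(url):
    """Reduce a URL to a comparable (host, path) pair.

    Ignores the scheme, a leading 'www.', host casing, and a trailing slash.
    """
    url = url.strip()
    if "://" not in url:
        # Let urlparse see a host even when the scheme was omitted.
        url = "http://" + url
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path.rstrip("/")

def urls_match(generated, correct):
    """Return True when both URLs are present and normalize to the same pair."""
    if not generated or not correct:
        return False
    return normalize(generated) == normalize(correct)
```

A comparison like this treats https://www.austintexas.gov/ and http://austintexas.gov as the same site, which avoids marking a result as wrong over cosmetic differences.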

Locating files on Government Entity Websites

A second script in this repository, main.py, runs user-specified searches against each government website. Its main method is get_url. A few tunable parameters of the main method are worth mentioning.

input_file - The file containing the URLs to request. (required)
sheet_name - The name of the sheet in the input_file Excel workbook. (required)
column_number - The column in input_file that contains the URLs. (required)
output_file - The file where the results will be saved. (required)
config_file - The file containing the request templates. (required)
startRow - The row in input_file at which script execution begins. (optional, default = 2)
endRow - The row in input_file up to which the script runs. (optional, default = 6)
year - The year to use in the templates. (optional, default = 2021)

For example:

python main.py ./data/Local_Education_Authority_Web_Addresses.xlsx Sheet1 4 new.xlsx config.txt startRow=2 endRow=6 year=2021
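The command above passes five positional values followed by optional key=value settings. A parser for that shape might look like the sketch below; parse_args and the error handling are hypothetical and are not taken from main.py, and only the parameter names and defaults come from the list above.

```python
REQUIRED = ["input_file", "sheet_name", "column_number", "output_file", "config_file"]
OPTIONAL_DEFAULTS = {"startRow": 2, "endRow": 6, "year": 2021}

def parse_args(argv):
    """Turn a list of command-line arguments into a settings dict.

    The first five arguments are positional; any remaining ones must be
    key=value pairs for startRow, endRow, or year.
    """
    if len(argv) < len(REQUIRED):
        raise ValueError("expected: " + " ".join(REQUIRED))

    args = dict(zip(REQUIRED, argv[:len(REQUIRED)]))
    args["column_number"] = int(args["column_number"])

    # Start from the documented defaults, then apply any key=value overrides.
    args.update(OPTIONAL_DEFAULTS)
    for item in argv[len(REQUIRED):]:
        key, sep, value = item.partition("=")
        if not sep or key not in OPTIONAL_DEFAULTS:
            raise ValueError(f"unrecognized option: {item}")
        args[key] = int(value)
    return args
```

In a script this would typically be called as parse_args(sys.argv[1:]), so that the script name itself is not counted as an argument.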

