OGE.gov Scraper #8
Comments
Do you want help with this?
Hi @bnsmith3! I'd love some help on this one! Feel free to approach this however you'd like, though I do recommend standing up an independent repository solely focused on a scraper. As this would be a civic data service, I was considering implementation on morph.io:
Ok. Is it okay with you if the first version simply stores the PDFs, and maybe version two will involve actually parsing the PDFs?
If you want to just do the downloading part, I can take a swing at extracting the text from the PDFs.
Hi @zacherybohon! re: parsing. Thus far, we've been using CPI's pfd-parser project to parse the (somewhat) structured PDF files. It's a Node.js script and it's pretty ugly, but it seems to work well enough. That said, the outputs of that script assume you're parsing a set of Public Financial Disclosure Reports and want the data partitioned by table, indexed by filer. That may or may not make sense based on your use case. There's a use case for parsing one file into, say, JSON. Long story short, I wouldn't be opposed to seeing another Public Financial Disclosure Report parser crop up. I would again suggest any new parser should do one thing well and therefore be in a standalone repository.
Sounds good. Based on that, it made sense to me to make the parsing a separate issue, so I made one: #11.
The US Office of Government Ethics posts certain Public Financial Disclosure Reports on their website directly. In particular, there is a table of reports for "executive branch officials occupying positions for which the pay is set at Levels 1 and 2 of the Executive Schedule" sorted in reverse chronological order here:
https://extapps2.oge.gov/201/Presiden.nsf/PAS%20Filings%20by%20Date?OpenView
There is also a screen sorted by name which includes Ethics Agreements:
https://extapps2.oge.gov/201/Presiden.nsf/PAS%20Index?OpenView
The President and Vice President's reports are here:
https://extapps2.oge.gov/201/Presiden.nsf/President%20and%20Vice%20President%20Index
It seems reasonable to build a web scraper to search this site daily for new PFDs. That should probably be a separate GitHub project, but I wanted to document the idea here.
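As a rough illustration of the daily check, here is a minimal Python sketch using only the standard library. It assumes the index view is plain HTML whose report links are `<a>` tags with hrefs ending in `.pdf`; I have not verified that against the live page, so the filter will likely need adjusting. The names `LinkCollector`, `extract_pdf_links`, and `find_new_links` are made up for this example, and fetching the page and persisting the set of already-seen URLs between runs are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Index view sorted by date (URL taken from this issue).
INDEX_URL = "https://extapps2.oge.gov/201/Presiden.nsf/PAS%20Filings%20by%20Date?OpenView"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_pdf_links(html, base_url=INDEX_URL):
    """Return absolute URLs of links that look like PDFs, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        # Assumption: report links end in ".pdf" (case-insensitive),
        # ignoring any query string.
        path = absolute.split("?", 1)[0]
        if path.lower().endswith(".pdf") and absolute not in links:
            links.append(absolute)
    return links


def find_new_links(links, seen):
    """Return the links that were not seen on a previous run."""
    return [link for link in links if link not in seen]
```

A daily job would fetch `INDEX_URL`, pass the HTML to `extract_pdf_links`, compare the result against the stored set with `find_new_links`, and then download whatever is new.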