OGE.gov Scraper #8

gregoryfoster · 2017-04-03T19:27:37Z

The US Office of Government Ethics posts certain Public Financial Disclosure Reports on their website directly. In particular, there is a table of reports for "executive branch officials occupying positions for which the pay is set at Levels 1 and 2 of the Executive Schedule" sorted in reverse chronological order here:
https://extapps2.oge.gov/201/Presiden.nsf/PAS%20Filings%20by%20Date?OpenView

There is also a screen sorted by name which includes Ethics Agreements:
https://extapps2.oge.gov/201/Presiden.nsf/PAS%20Index?OpenView

The President and Vice President's reports are here:
https://extapps2.oge.gov/201/Presiden.nsf/President%20and%20Vice%20President%20Index

It seems reasonable to build a web scraper to search this site daily for new PFDs. That should probably be a separate GitHub project, but I wanted to document the idea here.
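To make the idea concrete, here is a minimal sketch of the "find new PFDs" step. It assumes the index pages are plain HTML whose report entries are ordinary `<a href>` links, which has not been verified against the live site. `LinkCollector`, `extract_links`, and `find_new` are hypothetical names for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Index URL taken from the issue text above.
INDEX_URL = (
    "https://extapps2.oge.gov/201/Presiden.nsf/"
    "PAS%20Filings%20by%20Date?OpenView"
)


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the index page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url=INDEX_URL):
    """Return all link targets found in the given HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def find_new(links, seen):
    """Return links not already in `seen`, in page order, without repeats."""
    new = []
    for link in links:
        if link not in seen and link not in new:
            new.append(link)
    return new
```

A daily run would fetch the index page, call `extract_links` on the response body, and pass the result to `find_new` along with the set of URLs recorded on earlier runs. Filtering the links down to actual report documents would depend on the site's real markup.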

bnsmith3 · 2017-04-12T21:48:43Z

Do you want help with this?

gregoryfoster · 2017-04-12T22:35:55Z

Hi @bnsmith3! I'd love some help on this one! Feel free to approach this however you'd like, though I do recommend standing up an independent repository solely focused on a scraper.

As this would be a civic data service, I was considering implementation on morph.io:
https://morph.io/
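For reference, my understanding is that morph.io scrapers are expected to write their results to a SQLite file named `data.sqlite` in the working directory, with `data` as the default table name; treat that as an assumption to check against the morph.io documentation. Under that assumption, recording discovered reports might look roughly like this. The column names (`url`, `filer`, `first_seen`) and the `save_reports` helper are hypothetical.

```python
import sqlite3


def save_reports(rows, db_path="data.sqlite"):
    """Insert (url, filer, first_seen) rows, skipping URLs already stored.

    Returns the number of rows actually inserted.
    """
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS data ("
            "url TEXT PRIMARY KEY, filer TEXT, first_seen TEXT)"
        )
        before = conn.total_changes
        # OR IGNORE makes repeat runs idempotent for known URLs.
        conn.executemany(
            "INSERT OR IGNORE INTO data (url, filer, first_seen) "
            "VALUES (?, ?, ?)",
            rows,
        )
        inserted = conn.total_changes - before
    conn.close()
    return inserted
```

Using the report URL as the primary key means a daily run can re-submit everything it sees, and only genuinely new reports produce new rows.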

bnsmith3 · 2017-04-12T23:50:21Z

Ok. Is it okay with you if the first version simply stores the PDFs, and maybe version two will involve actually parsing them?
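A "store only" first version could be as small as the sketch below. It assumes the PDF bytes have already been downloaded by some other step, and it names each file by the SHA-256 of its content so that re-downloading an unchanged report does not create a duplicate. The `store_pdf` helper and the `pdfs` directory name are hypothetical.

```python
import hashlib
from pathlib import Path


def store_pdf(content, out_dir="pdfs"):
    """Write PDF bytes to out_dir, named by content hash.

    Returns (path, created), where created is False if an identical
    file was already stored.
    """
    digest = hashlib.sha256(content).hexdigest()
    target = Path(out_dir) / f"{digest}.pdf"
    if target.exists():
        return target, False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target, True
```

Content-addressed names lose the original filename, so a real scraper would likely record the source URL alongside the hash somewhere else.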

hazel-bohon · 2017-04-13T00:13:16Z

If you want to just do the downloading part, I can take a swing at extracting the text from the PDFs.

gregoryfoster · 2017-04-13T00:23:46Z

Hi @zacherybohon!

re: parsing. Thus far, we've been using CPI's pfd-parser project to parse the (somewhat) structured PDF files. It's a Node.js script and it's pretty ugly, but it seems to work well enough.

That said, the outputs of that script assume you're parsing a set of Public Financial Disclosure Reports and want the data partitioned by table, indexed by filer. That may or may not make sense based on your use case. There's a use case for parsing one file into, say, JSON. Long story short, I wouldn't be opposed to seeing another Public Financial Disclosure Report parser crop up. I would again suggest any new parser should do one thing well and therefore be in a standalone repository.

bnsmith3 · 2017-04-13T00:41:33Z

Sounds good. Based on that, it made sense to me to make the parsing a separate issue, so I made one: #11.
