Parsing NFHS-5

The National Family Health Survey (NFHS) is a survey in India that collects information on health conditions, nutrition, family planning, domestic violence, and a host of other factors by surveying a random ("representative") sample of Indian households in all states. The fifth NFHS was conducted during 2019-21; the reports were released to the public in 2021 and can be found at this link.

One small problem, however, is that all the reports are provided as PDFs, which are pretty neat for humans to read, but terrible for computers to parse. This repo contains scripts that download all the district-wise reports (704 of them), extract data from the tables, and convert it into machine-friendly JSON. NFHS provides district-wise, state-wise and country-level (aggregate) reports. This repo currently contains code to download, parse and generate JSONs for district-wise reports only. District-wise reports contain information on 104 "indicators" (or questions asked in the survey). State-wise reports seem to contain some extra information that is not reported district-wise, and have 130+ indicators.

Note: I tried my best to make sure the data is being parsed correctly, but there is a possibility that some data in JSON might not be 100% accurate - there is no way I could have manually verified all 704 PDF files and their outputs, so I randomly sampled and verified a couple of files, all of which looked okay. If you want to replicate the data parsing from PDFs, feel free to go through the *.py files.
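The spot check described above could be sketched roughly as follows. This is a minimal illustration, not the procedure that was actually used: the function name `pick_spot_check_files`, the sample size and the seed are assumptions; only the `districtwise_data/json` directory comes from this README.

```python
import random
from pathlib import Path


def pick_spot_check_files(json_dir, k=5, seed=None):
    """Return up to k randomly chosen JSON files under json_dir for manual review."""
    base = Path(json_dir)
    if not base.is_dir():
        return []  # nothing to sample if the directory does not exist
    files = sorted(base.rglob("*.json"))  # sorted so that a fixed seed is reproducible
    rng = random.Random(seed)
    return rng.sample(files, min(k, len(files)))


if __name__ == "__main__":
    # Print the files to compare by hand against their source PDFs.
    for path in pick_spot_check_files("districtwise_data/json", k=5, seed=0):
        print(path)
```

Fixing the seed makes it possible to revisit the same sample later; leaving it as `None` draws a fresh sample each run.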

All code in this repository is released under the MIT License. The data (JSON, PDFs) are available as a Kaggle Dataset.

Downloading district-wise data

  1. District wise data is available at this link (web archive link).

  2. From this webpage, we get the links to each of the state-wise pages, which are saved in the statewise_district_links.csv file.

  3. Then, the get_districtwise_links.py script is used to compile the list of all district wise file URLs into districtwise_links.csv.

  4. download_all_districts.py is used to download PDFs and save them to districtwise_data/pdfs. During this process, it appears that the webpages for one state (Telangana) and one Union Territory (Chandigarh) currently point to a 404 page, so data for these two could not be downloaded this way.

  5. It looks like district-wise data for Telangana is available in the Telangana State Compendium - we slice this file up district-wise and save the PDFs in the districtwise_data/telangana folder. Chandigarh has only one district, which covers the entire union territory, so it probably won't have any separate "district-wise" data, as such.

  6. There are 704 district-wise PDF files, totalling approximately 450 MB of data.

  7. With all this done, we use parse_pdf.py to parse the PDFs and dump district-wise data to JSON (in the directory districtwise_data/json/). This script uses Tabula and pdfminer.six for parsing PDFs.

  8. In the first round of PDF parsing, we used the parse_pdf.py script at commit ce4f8ee. Out of the 704 PDFs, we could generate JSONs successfully for only 563 files. 141 PDFs resulted in errors, which are listed below, along with what was done to solve the errors:

| State | Failed | Total Files | Failed Filenames | Solving the issue |
|---|---|---|---|---|
| Madhya Pradesh | 50 | 50 | All files | Created a Tabula template file and used that. |
| Rajasthan | 33 | 33 | All files | Created a Tabula template file and used that. |
| telangana | 31 | 31 | All files | Turns out there was an image in the first (introduction) page, which I forgot to filter out. |
| Himachal Pradesh | 12 | 12 | All files | Created a Tabula template file and used that. |
| nct_of_delhi_ut | 11 | 11 | All files | Created a Tabula template file and used that. |
| Maharashtra | 2 | 36 | raigarh and thane | Raigarh: In general, all district-wise data files had only 6 pages, so I added an assert statement to ensure that the PDF file has exactly six pages. Turns out Raigarh's file has 7 pages (one extra blank page, on page 6). Also added a Tabula template file for this. Thane: the error was caused by a nan/empty value in the 'Indicator' column. |
| West Bengal | 1 | 20 | jalpaiguri | This file also has 7 pages instead of 6. Data tables are located on pages [3, 4, 6] (instead of the usual pages [3, 4, 5]); page 5 is blank. |
| Gujarat | 1 | 22 | kheda | On page 4 of this PDF, the heading "NFHS-5 (2019-20)" took two lines instead of one, causing the parsing script to fail. |
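The page-layout exceptions in the table (Raigarh and Jalpaiguri) suggest one possible way to make page selection tolerant of stray blank pages: rather than hard-coding pages [3, 4, 5], take the first three non-blank pages starting at page 3. The sketch below is a hypothetical alternative, not what parse_pdf.py does. The function name `pick_table_pages` and its input (one extracted-text string per page, which could come from a library such as pdfminer.six) are assumptions, and the page contents are placeholder strings.

```python
def pick_table_pages(page_texts, first_page=3, n_tables=3):
    """Return 1-indexed numbers of the first n_tables non-blank pages from first_page on.

    page_texts holds one string of extracted text per page, in page order.
    """
    picked = []
    for number, text in enumerate(page_texts, start=1):
        if number < first_page:
            continue  # tables start on page 3 in the usual layout
        if text.strip():  # whitespace-only text is treated as a blank page
            picked.append(number)
        if len(picked) == n_tables:
            return picked
    raise ValueError(f"expected {n_tables} table pages, found {len(picked)}")


# Usual layout: 6 pages, tables on pages 3, 4 and 5 (placeholder page contents).
usual = ["cover", "intro", "t1", "t2", "t3", "last"]
# Jalpaiguri-like layout: 7 pages, page 5 blank, tables on pages 3, 4 and 6.
blank_page_5 = ["cover", "intro", "t1", "t2", "  ", "t3", "last"]

print(pick_table_pages(usual))         # [3, 4, 5]
print(pick_table_pages(blank_page_5))  # [3, 4, 6]
```

A Raigarh-like file (blank page 6) still yields [3, 4, 5] under this rule, since the third table page is found before the blank one is reached. Raising an error when fewer than three pages are found keeps the fail-loudly behaviour of the original assert.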
  9. The Tabula template files were generated manually by dragging and selecting the tables using the Tabula Desktop app for Linux. The saved template files are located in the tabula_templates directory in this repository. Tabula 1.2.1 was used, downloaded from this link (sha256sum: fea6a5d26e2ab1abf2cc0a694d93810c59e93e0ce9190fce31541fdf6e7e6ece tabula-jar-1.2.1.zip).
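To confirm that a downloaded Tabula archive matches the checksum above, something like the following could be used. It is a sketch using only Python's standard `hashlib`; the helper name `sha256_of_file` and the assumption that the zip sits in the current directory are illustrative, while the filename and expected digest are the ones quoted above.

```python
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "fea6a5d26e2ab1abf2cc0a694d93810c59e93e0ce9190fce31541fdf6e7e6ece"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("tabula-jar-1.2.1.zip")  # assumed to be in the current directory
    if archive.exists():
        ok = sha256_of_file(archive) == EXPECTED_SHA256
        print("checksum OK" if ok else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

This is equivalent in spirit to running `sha256sum tabula-jar-1.2.1.zip` and comparing the output by eye.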