Restructure directory for pilot. Created jupyter subdir under pilot. …
…FHIR-2451
mingward committed Aug 20, 2024
1 parent 0f968e9 commit af91ed7
Showing 4 changed files with 141 additions and 24 deletions.
File renamed without changes.
File renamed without changes.
@@ -1,48 +1,174 @@
{
"cells": [
{
"cell_type": "raw",
"id": "1211f6c8-6ee5-4d5c-b926-1f3210c8704a",
"cell_type": "markdown",
"id": "4619fbca-7eaa-4846-8f0a-035ca7438254",
"metadata": {},
"source": [
"# Query pilot server for pheontype data.\n",
"## What FHIR server to use?\n",
"Note this sample code is using a synthetic data server at: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1\n",
"The real server for URECA study(phs002921) is at: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot1/x1. You will first need to make Controlled Data Access Request(DAR). https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=GeneralAAInstructions.pdf\n",
"The real server for URECA study(phs002921) is at: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot1/x1. You will first need to make Controlled Data Access Request(DAR) at dbGaP home page: https://www.ncbi.nlm.nih.gov/gap/. If you are new to applying for dbGaP controlled access data, please see: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=GeneralAAInstructions.pdf\n",
"## About the authorization token:\n",
"1. If you are using the synthetic server to try out FHIR pilot server, you do not need to have the \"real token\" in the token file. The script below still uses a token file so it works once you have the real token in the file.\n",
"2. If you want to use the real study data which is controlled-access, you need DAR approval. After your DAR is approved, go to https://www.ncbi.nlm.nih.gov/gap/power-user-portal/, login with your eRA account, scroll down and click on the \"Task Specific Token\" button to get the token file. Save the token in a text file. In the example below, it is saved to \"task-specific-token-all.txt\".\n",
"## What does this script do?\n",
"## What sample scripts do ths notebook have?\n",
"### A simple \"Hello World\" to dbGaP FHIR API\n",
"1. This script assumes that you have basic python knowledge and know what is FHIR resource(see: https://hl7.org/fhir/resourcelist.html). \n",
"This basic script uses module \"requests\" to conect to the FHIR server and fetch the ResearchStudy resource content for dbGaP study URECA (phs002921). You can see the study details at: \n",
" https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002921.v2.p1. \n",
"The real data is controlled-access. Here, we will get synthetic data from a test pilot server.\n",
"\n",
"2. Note, you will see the url used in the script to access FHIR server. You can use the url directly on the browser to see result as well. For example, \"https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchStudy?_id=phs002921\" will return the study summary data. This is handy for testing. Python script allows you to parse and analyze the data programmatically. \n",
"###\n",
"This sample script shows how to get the Study Subject Phenotype data. You can see the content of the Subject Phenotype dataset here: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/dataset.cgi?study_id=phs002921.v2.p1&pht=12614 including the data dictionary (https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs002921/phs002921.v2.p1/pheno_variable_summaries/phs002921.v2.pht012614.v1.ICAC_Subject_Phenotypes.data_dict.xml ) \n",
"Note that in dbGaP, the Subject Phenotype dataset usually includes demographic data in addition to phenotypic data.\n",
"## Script summary\n",
"This script first connects to the synthetic data server. Retrieves the patients of URECA study and saves it in a Python List: patient_ids. \n",
"The script then iterates through the \"patient_ids\", to get the data in \"subject phenotype\" file which is stored in FHIR Observation Resource. \n",
"The script saves the phenotype values in patient_observations.csv. \n"
"Note that in dbGaP, the Subject Phenotype dataset usually includes demographic data in addition to phenotypic data.\n"
]
},
{
"cell_type": "markdown",
"id": "655fdf8c-6e6e-4010-a7c3-304b9a74ab18",
"metadata": {},
"source": [
"########################################################################## \n",
"# Sample script -1 Hello to dbGaP FHIR Server\n",
"# \n",
"# 1. This scripts assumes that you have basic python knowledge and know what is FHIR resource(see: https://hl7.org/fhir/resourcelist.html). \n",
"# This basic script uses module \"requests\" to conect to the FHIR server and fetch the ResearchStudy resource content for \n",
"# dbGaP study URECA (phs002921). You can see the study details at: \n",
"# https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002921.v2.p1. \n",
"# The real data is controlled-access. Here, we will get synthetic data from a test pilot server.\n",
"# 2. Note - You can put \"https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchStudy?_id=phs002921\" in your browser to see the same data. \n",
"# This is handy for testing. Python script allows you to parse and analyze the data programmatically. \n",
"#\n",
"########################################################################## \n",
"\n",
"import requests\n",
"import json\n",
"\n",
"# Initialize a session and update headers\n",
"session = requests.Session()\n",
"session.headers.update({\n",
" 'Accept': 'application/fhir+json',\n",
" 'Content-Type': 'application/x-www-form-urlencoded',\n",
"})\n",
"\n",
"# Set up the FHIR server URL and query\n",
"dbgap_fhir_url = \"https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/\"\n",
"my_query = \"ResearchStudy?_id=phs002921\"\n",
"\n",
"my_fhir_url = dbgap_fhir_url + my_query\n",
"\n",
"# Print the full URL being used for the query\n",
"print(f\"my_fhir_url is {my_fhir_url}\")\n",
"\n",
"# Send the GET request to the FHIR server\n",
"response = session.get(my_fhir_url)\n",
"\n",
"# Check if the request was successful\n",
"if response.status_code == 200:\n",
" response_json = response.json()\n",
" \n",
" # Print the JSON response in a readable format\n",
" print(json.dumps(response_json, indent=4))\n",
"else:\n",
" print(f\"Failed to retrieve data: {response.status_code}\")\n"
]
},
{
"cell_type": "markdown",
"id": "fc3f26a4-a45e-427b-ab73-4abd76bf0937",
"metadata": {},
"source": [
"# Try out more queries in dbGaP FHIR pilot server\n",
"\n",
"dbGaP-FHIR-Resource-Intro.md introduces you to all the resources in dbGaP FHIR pilot server.\n",
"Go ahead and use the example query in this file and modfiy the sample script-1 to check out the content of these resources. \n",
"\n",
"# A few quick FHIR query tip\n",
"- To get a few entries in a resource, https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?study=phs001232&_count=10\n",
"- To get the count of entries in a resouce, https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?_summary=count\n",
"- To get the count of researchSubject for phs002921: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?study=phs002921&_summary=count\n",
"- \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37fe7a43-20ea-406a-b02d-bd0b86aa9be0",
"id": "b4fb038d-209c-4e16-a61b-cd8c936363a2",
"metadata": {},
"outputs": [],
"source": [
"## Script summary\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "37fe7a43-20ea-406a-b02d-bd0b86aa9be0",
"metadata": {},
"outputs": [
{
"ename": "ModuleNotFoundError",
"evalue": "No module named 'fhir_fetcher'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[10], line 22\u001b[0m\n\u001b[1;32m 20\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdatetime\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m datetime\n\u001b[1;32m 21\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mtime\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m sleep\n\u001b[0;32m---> 22\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mfhir_fetcher\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m fetch_all_data \u001b[38;5;66;03m# Ensure this module is available and handles paging through all records\u001b[39;00m\n\u001b[1;32m 25\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mfetch_patient_ids\u001b[39m(session, fhir_base_url, study_reference):\n\u001b[1;32m 28\u001b[0m query_url \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mfhir_base_url\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m/ResearchSubject?study=\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mstudy_reference\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n",
"\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'fhir_fetcher'"
]
}
],
"source": [
"########################################################################## \n",
"# Sample script -2 Get all the Observation data for a study from the dbGaP FHIR Server\n",
"# \n",
"# 1. This scripts assumes that you can write python scripts and understand the FHIR resource types used to represent dbGaP data. Please see dbGaP-FHIR-Resource-Intro.md \n",
"# 2. This script does the following in sequence: \n",
"# - Connects to the synthetic data server. \n",
"# - From *Patient* resource, it retrieves the patients of URECA study and saves it in a Python List: patient_ids in fetch_patient_ids().\n",
"# - For each patient_id in the patient_ids list, it fetches dbGaP subject phenotype data in *Observation* Resource in fetch_patient_observations().\n",
"# - It parses each patient's Observation json data to a python data dictionary *data* in extract_observation_data()\n",
"# What is in observation data? Please see URECA study page - phenotype dataset - ICAC_Subject_Phenotypes page: \n",
"# https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/dataset.cgi?study_id=phs002921.v2.p1&pht=12614\n",
"# You see that each demographic, measurement or observation is called a \"variable\" in dbGaP. Obsevation resource has both the \"variable\" and the value for the patient.\n",
"# The code add each patient's observation values for the list of variables in a list of data dictionary *output_table* \n",
"# - It then writes the content of the \"output_table\" in patient_observations.csv. \n",
"# 3. *Very Important* - if you have access to real data, please ensure this file is stored and viewed in a secure computing environment.\n",
"########################################################################## \n",
"import os\n",
"import requests\n",
"import csv\n",
"from datetime import datetime\n",
"from time import sleep\n",
"from fhir_fetcher import fetch_all_data # Ensure this module is available and handles paging through all records\n",
"\n",
"\n",
"def fetch_patient_ids(session, fhir_base_url, study_reference):\n",
" \n",
" \n",
" query_url = f\"{fhir_base_url}/ResearchSubject?study={study_reference}\"\n",
" \n",
" # query_url example: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?study=phs002921\n",
" \n",
" print ( query_url)\n",
" research_subjects = fetch_all_data(session, query_url, 0, 'n')\n",
" patient_ids = [entry['resource']['individual']['reference'].split('/')[-1] for entry in research_subjects]\n",
" return patient_ids\n",
"\n",
"\n",
"def fetch_patient_observations(session, fhir_base_url, patient_id):\n",
" qstr = f'Observation?subject=Patient/{patient_id}'\n",
" # above qstr example:\n",
" # https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/Observation?subject=Patient/4317770\n",
" \n",
" # query URL example: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/Observation?subject=Patient/4317770\n",
" \n",
" start_url = f\"{fhir_base_url}/{qstr}\"\n",
" observations = fetch_all_data(session, start_url, 0) # Fetch all observations for the patient\n",
" return observations\n",
"\n",
"\n",
"def extract_observation_data(observations):\n",
" data = {}\n",
" for entry in observations:\n",
@@ -57,16 +183,7 @@
" data[attribute_name] = value\n",
" return data\n",
"\n",
"def fetch_patient_ids(session, fhir_base_url, study_reference):\n",
" \n",
" \n",
" query_url = f\"{fhir_base_url}/ResearchSubject?study={study_reference}\"\n",
" # query_url example: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?study=phs002921\n",
" print ( query_url)\n",
" research_subjects = fetch_all_data(session, query_url, 0, 'n')\n",
" patient_ids = [entry['resource']['individual']['reference'].split('/')[-1] for entry in research_subjects]\n",
" return patient_ids\n",
"\n",
"def main():\n",
"\n",
" # \n",
@@ -115,7 +232,7 @@
" patient_ids = fetch_patient_ids(session, fhir_base_url, study_reference)\n",
" print(f\"Total patients fetched: {len(patient_ids)}\")\n",
"\n",
" data = []\n",
" output_table = []\n",
" columns = set()\n",
" patients_with_observations = 0\n",
" for patient_id in patient_ids:\n",
@@ -124,7 +241,7 @@
" if observation_data:\n",
" observation_data['Patient'] = patient_id\n",
" columns.update(observation_data.keys())\n",
" data.append(observation_data)\n",
" output_table.append(observation_data)\n",
" patients_with_observations += 1\n",
" # print(f\"Observations obtained for patient: {patient_id}\")\n",
" print(f\"Accumulative patients with observations: {patients_with_observations}\")\n",
@@ -137,7 +254,7 @@
" with open(output_file, 'w', newline='') as csvfile:\n",
" csvwriter = csv.DictWriter(csvfile, fieldnames=columns)\n",
" csvwriter.writeheader()\n",
" csvwriter.writerows(data)\n",
" csvwriter.writerows(output_table)\n",
"\n",
" print(f\"Data written to {output_file}\")\n",
"\n",
File renamed without changes.
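
Note on the `ModuleNotFoundError` recorded in the notebook output: sample script 2 imports `fetch_all_data` from a `fhir_fetcher` module that is not part of this diff. Until that module is on the path, a stand-in along the following lines may be enough to run the cell. This is only a sketch under assumptions: it guesses that the third argument is a page limit where 0 means "no limit" and that the fourth is a 'y'/'n' verbosity flag, and the real `fhir_fetcher` may define both differently. It relies only on the standard FHIR convention that a searchset Bundle lists its next page under `link` with `relation` set to `next`.

```python
def fetch_all_data(session, start_url, max_pages=0, verbose='n'):
    """Follow FHIR Bundle 'next' links and return all entries as one list.

    ASSUMPTION: max_pages=0 means "no limit" and verbose is a 'y'/'n' flag.
    These meanings are guesses; the real fhir_fetcher module is not shown
    in this commit and may interpret the arguments differently.
    """
    entries = []
    url = start_url
    pages = 0
    while url:
        response = session.get(url)
        response.raise_for_status()  # stop on HTTP errors instead of looping
        bundle = response.json()

        # A Bundle with no matches may omit 'entry' entirely.
        entries.extend(bundle.get('entry', []))
        pages += 1
        if verbose == 'y':
            print(f"page {pages}: {len(entries)} entries so far")
        if max_pages and pages >= max_pages:
            break

        # The server advertises the next page, if any, as link[relation="next"].
        url = next(
            (link['url'] for link in bundle.get('link', [])
             if link.get('relation') == 'next'),
            None,
        )
    return entries
```

With this in place, `fetch_all_data(session, query_url, 0, 'n')` returns a list whose items look like `{'resource': {...}}`, which is the shape that `fetch_patient_ids()` indexes into. `session` is expected to be a `requests.Session()` as in sample script 1.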
