Restructure directory for pilot. Created jupyter subdir under pilot. …

…FHIR-2451
ncbi · Aug 20, 2024 · af91ed7
1 parent 0f968e9
commit af91ed7
Showing 4 changed files with 141 additions and 24 deletions.
diff --git a/jupyter/pilot/README.md b/pilot/README.md
diff --git a/jupyter/pilot/dbGaP-FHIR-Resource-Intro.md b/pilot/dbGaP-FHIR-Resource-Intro.md
diff --git a/...1_phs002921_URECA_subject_phenotype.ipynb b/...1_phs002921_URECA_subject_phenotype.ipynb
@@ -1,48 +1,174 @@
 {
  "cells": [
   {
-   "cell_type": "raw",
-   "id": "1211f6c8-6ee5-4d5c-b926-1f3210c8704a",
+   "cell_type": "markdown",
+   "id": "4619fbca-7eaa-4846-8f0a-035ca7438254",
    "metadata": {},
    "source": [
     "# Query pilot server for phenotype data.\n",
     "## What FHIR server to use?\n",
     "Note this sample code is using a synthetic data server at: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1\n",
-    "The real server for URECA study(phs002921) is at: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot1/x1. You will first need to make Controlled Data Access Request(DAR). https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=GeneralAAInstructions.pdf\n",
+    "The real server for the URECA study (phs002921) is at: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot1/x1. You will first need to make a Controlled Data Access Request (DAR) at the dbGaP home page: https://www.ncbi.nlm.nih.gov/gap/. If you are new to applying for dbGaP controlled-access data, please see: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=GeneralAAInstructions.pdf\n",
     "## About the authorization token:\n",
     "1. If you are using the synthetic server to try out FHIR pilot server, you do not need to have the \"real token\" in the token file. The script below still uses a token file so it works once you have the real token in the file.\n",
     "2. If you want to use the real study data which is controlled-access, you need DAR approval. After your DAR is approved, go to https://www.ncbi.nlm.nih.gov/gap/power-user-portal/, login with your eRA account, scroll down and click on the \"Task Specific Token\" button to get the token file. Save the token in a text file. In the example below, it is saved to \"task-specific-token-all.txt\".\n",
-    "## What does this script do?\n",
+    "## What sample scripts does this notebook have?\n",
+    "### A simple \"Hello World\" to dbGaP FHIR API\n",
+    "1. This script assumes that you have basic Python knowledge and know what a FHIR resource is (see: https://hl7.org/fhir/resourcelist.html).  \n",
+    "This basic script uses the \"requests\" module to connect to the FHIR server and fetch the ResearchStudy resource content for dbGaP study URECA (phs002921). You can see the study details at: \n",
+    "      https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002921.v2.p1. \n",
+    "The real data is controlled-access. Here, we will get synthetic data from a test pilot server.\n",
+    "\n",
+    "2.  Note that the script shows the URL it uses to access the FHIR server. You can also open that URL directly in a browser to see the result. For example, \"https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchStudy?_id=phs002921\" will return the study summary data. This is handy for testing. A Python script allows you to parse and analyze the data programmatically. \n",
+    "###\n",
     "This sample script shows how to get the Study Subject Phenotype data.  You can see the content of the Subject Phenotype dataset here: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/dataset.cgi?study_id=phs002921.v2.p1&pht=12614 including the data dictionary (https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs002921/phs002921.v2.p1/pheno_variable_summaries/phs002921.v2.pht012614.v1.ICAC_Subject_Phenotypes.data_dict.xml ) \n",
-    "Note that in dbGaP, the Subject Phenotype dataset usually includes demographic data in addition to phenotypic data.\n",
-    "## Script summary\n",
-    "This script first connects to the synthetic data server. Retrieves the patients of URECA study and saves it in a Python List: patient_ids. \n",
-    "The script then iterates through the \"patient_ids\", to get the data in \"subject phenotype\" file which is stored in FHIR Observation Resource. \n",
-    "The script saves the phenotype values in patient_observations.csv.  \n"
+    "Note that in dbGaP, the Subject Phenotype dataset usually includes demographic data in addition to phenotypic data.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "655fdf8c-6e6e-4010-a7c3-304b9a74ab18",
+   "metadata": {},
+   "source": [
+    "########################################################################## \n",
+    "#  Sample script -1 Hello to dbGaP FHIR Server\n",
+    "#  \n",
+    "#  1.  This script assumes that you have basic Python knowledge and know what a FHIR resource is (see: https://hl7.org/fhir/resourcelist.html).  \n",
+    "#      This basic script uses the \"requests\" module to connect to the FHIR server and fetch the ResearchStudy resource content for \n",
+    "#      dbGaP study URECA (phs002921). You can see the study details at: \n",
+    "#      https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002921.v2.p1. \n",
+    "#      The real data is controlled-access. Here, we will get synthetic data from a test pilot server.\n",
+    "#  2.  Note - You can put \"https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchStudy?_id=phs002921\" in your browser to see the same data. \n",
+    "#      This is handy for testing. A Python script allows you to parse and analyze the data programmatically. \n",
+    "#\n",
+    "########################################################################## \n",
+    "\n",
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# Initialize a session and update headers\n",
+    "session = requests.Session()\n",
+    "session.headers.update({\n",
+    "    'Accept': 'application/fhir+json',\n",
+    "    'Content-Type': 'application/x-www-form-urlencoded',\n",
+    "})\n",
+    "\n",
+    "# Set up the FHIR server URL and query\n",
+    "dbgap_fhir_url = \"https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/\"\n",
+    "my_query = \"ResearchStudy?_id=phs002921\"\n",
+    "\n",
+    "my_fhir_url = dbgap_fhir_url + my_query\n",
+    "\n",
+    "# Print the full URL being used for the query\n",
+    "print(f\"my_fhir_url is {my_fhir_url}\")\n",
+    "\n",
+    "# Send the GET request to the FHIR server\n",
+    "response = session.get(my_fhir_url)\n",
+    "\n",
+    "# Check if the request was successful\n",
+    "if response.status_code == 200:\n",
+    "    response_json = response.json()\n",
+    "    \n",
+    "    # Print the JSON response in a readable format\n",
+    "    print(json.dumps(response_json, indent=4))\n",
+    "else:\n",
+    "    print(f\"Failed to retrieve data: {response.status_code}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fc3f26a4-a45e-427b-ab73-4abd76bf0937",
+   "metadata": {},
+   "source": [
+    "# Try out more queries in dbGaP FHIR pilot server\n",
+    "\n",
+    "dbGaP-FHIR-Resource-Intro.md introduces you to all the resources in the dbGaP FHIR pilot server.\n",
+    "Go ahead and use the example queries in that file and modify sample script 1 to check out the content of these resources. \n",
+    "\n",
+    "# A few quick FHIR query tips\n",
+    "- To get a few entries in a resource: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?study=phs001232&_count=10\n",
+    "- To get the count of entries in a resource: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?_summary=count\n",
+    "- To get the count of ResearchSubject entries for phs002921: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?study=phs002921&_summary=count\n",
+    "- \n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "37fe7a43-20ea-406a-b02d-bd0b86aa9be0",
+   "id": "b4fb038d-209c-4e16-a61b-cd8c936363a2",
    "metadata": {},
    "outputs": [],
    "source": [
+    "## Script summary\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "37fe7a43-20ea-406a-b02d-bd0b86aa9be0",
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "ModuleNotFoundError",
+     "evalue": "No module named 'fhir_fetcher'",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
+      "Cell \u001b[0;32mIn[10], line 22\u001b[0m\n\u001b[1;32m     20\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdatetime\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m datetime\n\u001b[1;32m     21\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mtime\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m sleep\n\u001b[0;32m---> 22\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mfhir_fetcher\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m fetch_all_data  \u001b[38;5;66;03m# Ensure this module is available and handles paging through all records\u001b[39;00m\n\u001b[1;32m     25\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mfetch_patient_ids\u001b[39m(session, fhir_base_url, study_reference):\n\u001b[1;32m     28\u001b[0m     query_url \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mfhir_base_url\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m/ResearchSubject?study=\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mstudy_reference\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n",
+      "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'fhir_fetcher'"
+     ]
+    }
+   ],
+   "source": [
+    "########################################################################## \n",
+    "#  Sample script -2 Get all the Observation data for a study from the dbGaP FHIR Server\n",
+    "#  \n",
+    "#  1. This script assumes that you can write Python scripts and understand the FHIR resource types used to represent dbGaP data. Please see dbGaP-FHIR-Resource-Intro.md \n",
+    "#  2. This script does the following in sequence: \n",
+    "#    - Connects to the synthetic data server. \n",
+    "#    - From the *ResearchSubject* resource, it retrieves the patients of the URECA study and saves their IDs in a Python list, patient_ids, in fetch_patient_ids().\n",
+    "#    - For each patient_id in the patient_ids list, it fetches the dbGaP subject phenotype data from the *Observation* resource in fetch_patient_observations().\n",
+    "#    - It parses each patient's Observation JSON data into a Python dictionary *data* in extract_observation_data().\n",
+    "#      What is in the observation data? Please see the URECA study page - phenotype dataset - ICAC_Subject_Phenotypes page: \n",
+    "#       https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/dataset.cgi?study_id=phs002921.v2.p1&pht=12614\n",
+    "#       You will see that each demographic, measurement or observation is called a \"variable\" in dbGaP. The Observation resource has both the \"variable\" and its value for the patient.\n",
+    "#       The code adds each patient's observation values for the list of variables to a list of dictionaries, *output_table*. \n",
+    "#    - It then writes the content of \"output_table\" to patient_observations.csv.  \n",
+    "#  3. *Very Important* - if you have access to real data, please ensure this file is stored and viewed in a secure computing environment.\n",
+    "########################################################################## \n",
     "import os\n",
     "import requests\n",
     "import csv\n",
     "from datetime import datetime\n",
     "from time import sleep\n",
     "from fhir_fetcher import fetch_all_data  # Ensure this module is available and handles paging through all records\n",
     "\n",
+    "\n",
+    "def fetch_patient_ids(session, fhir_base_url, study_reference):\n",
+    "    \n",
+    "    \n",
+    "    query_url = f\"{fhir_base_url}/ResearchSubject?study={study_reference}\"\n",
+    "    \n",
+    "    # query_url example: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?study=phs002921\n",
+    "    \n",
+    "    print(query_url)\n",
+    "    research_subjects = fetch_all_data(session, query_url, 0, 'n')\n",
+    "    patient_ids = [entry['resource']['individual']['reference'].split('/')[-1] for entry in research_subjects]\n",
+    "    return patient_ids\n",
+    "\n",
+    "\n",
     "def fetch_patient_observations(session, fhir_base_url, patient_id):\n",
     "    qstr = f'Observation?subject=Patient/{patient_id}'\n",
-    "    # above qstr example:\n",
-    "    # https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/Observation?subject=Patient/4317770\n",
+    "    \n",
+    "    # query URL example: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/Observation?subject=Patient/4317770\n",
+    "    \n",
     "    start_url = f\"{fhir_base_url}/{qstr}\"\n",
     "    observations = fetch_all_data(session, start_url, 0)  # Fetch all observations for the patient\n",
     "    return observations\n",
     "\n",
+    "\n",
     "def extract_observation_data(observations):\n",
     "    data = {}\n",
     "    for entry in observations:\n",
@@ -57,16 +183,7 @@
     "                data[attribute_name] = value\n",
     "    return data\n",
     "\n",
-    "def fetch_patient_ids(session, fhir_base_url, study_reference):\n",
-    "    \n",
     "    \n",
-    "    query_url = f\"{fhir_base_url}/ResearchSubject?study={study_reference}\"\n",
-    "    # query_url example: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1/ResearchSubject?study=phs002921\n",
-    "    print ( query_url)\n",
-    "    research_subjects = fetch_all_data(session, query_url, 0, 'n')\n",
-    "    patient_ids = [entry['resource']['individual']['reference'].split('/')[-1] for entry in research_subjects]\n",
-    "    return patient_ids\n",
-    "\n",
     "def main():\n",
     "\n",
     "    # \n",
@@ -115,7 +232,7 @@
     "    patient_ids = fetch_patient_ids(session, fhir_base_url, study_reference)\n",
     "    print(f\"Total patients fetched: {len(patient_ids)}\")\n",
     "\n",
-    "    data = []\n",
+    "    output_table = []\n",
     "    columns = set()\n",
     "    patients_with_observations = 0\n",
     "    for patient_id in patient_ids:\n",
@@ -124,7 +241,7 @@
     "        if observation_data:\n",
     "            observation_data['Patient'] = patient_id\n",
     "            columns.update(observation_data.keys())\n",
-    "            data.append(observation_data)\n",
+    "            output_table.append(observation_data)\n",
     "            patients_with_observations += 1\n",
     "            # print(f\"Observations obtained for patient: {patient_id}\")\n",
     "            print(f\"Cumulative patients with observations: {patients_with_observations}\")\n",
@@ -137,7 +254,7 @@
     "    with open(output_file, 'w', newline='') as csvfile:\n",
     "        csvwriter = csv.DictWriter(csvfile, fieldnames=columns)\n",
     "        csvwriter.writeheader()\n",
-    "        csvwriter.writerows(data)\n",
+    "        csvwriter.writerows(output_table)\n",
     "\n",
     "    print(f\"Data written to {output_file}\")\n",
     "\n",

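The query tips added in the notebook's markdown cell use `_summary=count` to ask the server for a count rather than the resources themselves. A minimal sketch of that request in Python is below, assuming the server answers with a standard FHIR search Bundle whose `total` field carries the count. `count_resources` is a hypothetical helper that is not part of this commit, and `session` is assumed to behave like a `requests.Session`.

```python
# Synthetic pilot server named in the notebook text.
FHIR_BASE_URL = "https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-pilot/x1"


def count_resources(session, resource_type, **search_params):
    """Return Bundle.total for a FHIR search run with _summary=count.

    Hypothetical helper for illustration; it is not defined in the notebook.
    """
    # Add _summary=count to whatever search parameters the caller supplied.
    params = dict(search_params, _summary="count")
    response = session.get(
        f"{FHIR_BASE_URL}/{resource_type}",
        params=params,
        timeout=30,
    )
    response.raise_for_status()
    bundle = response.json()
    # A count-only search returns a Bundle without entries; `total` is the count.
    return bundle.get("total")


# Example call (needs network access and the `requests` package):
#   import requests
#   with requests.Session() as session:
#       session.headers.update({"Accept": "application/fhir+json"})
#       print(count_resources(session, "ResearchSubject", study="phs002921"))
```

Checking the count first gives a cheap estimate of how many per-patient Observation requests sample script 2 will issue before starting the full download.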
diff --git a/jupyter/pilot/fhir_fetcher.py b/pilot/jupyter/fhir_fetcher.py
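This file is only renamed in the commit, so the body of `fetch_all_data` is not visible in the diff. For readers wondering what "handles paging through all records" involves, here is a rough sketch of how paging through a FHIR search result usually works: each Bundle page may carry a `link` entry with relation `next` that points at the following page. `fetch_all_pages` and `max_pages` are hypothetical names used for illustration and should not be read as the actual implementation or signature of `fetch_all_data`.

```python
def fetch_all_pages(session, start_url, max_pages=1000):
    """Collect every Bundle entry by following `next` links.

    Hypothetical stand-in for illustration only. `session` is assumed to
    behave like a requests.Session; `max_pages` is a safety cap so a
    misbehaving server cannot keep the loop running forever.
    """
    entries = []
    url = start_url
    pages = 0
    while url and pages < max_pages:
        response = session.get(url, timeout=30)
        response.raise_for_status()
        bundle = response.json()
        # A page with no matches may omit `entry` entirely.
        entries.extend(bundle.get("entry", []))
        # FHIR search Bundles advertise the next page as a link with relation "next".
        url = next(
            (
                link["url"]
                for link in bundle.get("link", [])
                if link.get("relation") == "next"
            ),
            None,
        )
        pages += 1
    return entries
```

Each returned item is a Bundle entry, so the resource itself sits under `entry["resource"]`, which matches how `fetch_patient_ids` in the notebook reads `entry['resource']['individual']['reference']`.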