Updated prompts

timmanik · timmanik · commit 4a7bf0fce1da · 2024-06-04T18:59:05.000-04:00
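
The updated prompts in fm-ad-notebook-processing.ipynb ask Copilot to check whether the records that share a `case_id` complement each other in their null and non-null values. The sketch below shows one possible reading of that check; it is not the code Copilot generates. It assumes that "complement" means at most one record per `case_id` holds a non-null value in any column after the first five, and that columns where every record is NaN are ignored. The function names and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd


def records_complement(group: pd.DataFrame, n_key_cols: int = 5) -> bool:
    """Return True if, in every column after the first n_key_cols,
    at most one record in the group holds a non-null value.
    Columns that are NaN in every record count as 0 and are thereby ignored."""
    values = group.iloc[:, n_key_cols:]
    non_null_per_column = values.notna().sum(axis=0)
    return bool((non_null_per_column <= 1).all())


def complement_report(df: pd.DataFrame, key: str = "case_id", n_key_cols: int = 5) -> dict:
    """Map each key value to whether its records complement each other."""
    return {
        case_id: records_complement(group, n_key_cols)
        for case_id, group in df.groupby(key)
    }


# Toy data (hypothetical): five leading key columns, then three value columns.
toy = pd.DataFrame({
    "case_id": ["x", "x", "y", "y"],
    "k1": [1, 1, 2, 2],
    "k2": [1, 1, 2, 2],
    "k3": [1, 1, 2, 2],
    "k4": [1, 1, 2, 2],
    "a": [1.0, np.nan, 1.0, 2.0],
    "b": [np.nan, 3.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

report = complement_report(toy)
print(report)

if all(report.values()):
    print("All records complement each other.")
else:
    print("Not all records complement each other.")
```

In the toy data, case `x` passes because each value column has at most one non-null entry, while case `y` fails because both of its records have a value in column `a`.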
diff --git a/01-cancer-data-analysis/fm-ad-notebook-exploration.ipynb b/01-cancer-data-analysis/fm-ad-notebook-exploration.ipynb
@@ -158,6 +158,7 @@
    "source": [
     "import boto3\n",
     "import pandas as pd\n",
+    "import numpy as np\n",
     "from botocore import UNSIGNED\n",
     "from botocore.config import Config\n",
     "from io import StringIO\n",
diff --git a/01-cancer-data-analysis/fm-ad-notebook-processing.ipynb b/01-cancer-data-analysis/fm-ad-notebook-processing.ipynb
@@ -34,9 +34,17 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# convert the combined_data.csv to dataframe called combined_df\n",
     "import pandas as pd\n",
-    "\n",
+    "import numpy as np"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# convert the combined_data.csv to dataframe called combined_df\n",
     "combined_df = pd.read_csv('combined_data.csv')"
    ]
   },
@@ -95,7 +103,14 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# list the name of columns that have more than or equal to 18004 unique values"
+    "# create a dictionary to store names of columns that have greater than or equal to 18004 unique values"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As we can see from the list above, the `case_id` column is included in the list. However, since we want to use this list to specify which columns to remove from the dataframe, we should remove `case_id` from this list."
    ]
   },
   {
@@ -104,7 +119,31 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# create a new dataframe and drop columns that have more than or equal to 18004 unique values. however, do not drop the 'case_id' column"
+    "# remove case_id from the dictionary above"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's verify that the `case_id` column has been removed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create a copy of the current dataframe\n",
+    "# drop columns from the dictionary above"
    ]
   },
   {
@@ -250,7 +289,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# show the records with the case_id aff95088-8760-46d2-a404-b545807e0735"
+    "# show the records with the case_id fcd9637f-00f2-49e9-bb87-94e556d5d7eb"
    ]
   },
   {
@@ -268,9 +307,10 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Write a Python code that selects the two records with the same 'case_id' value of 'aff95088-8760-46d2-a404-b545807e0735'.\n",
-    "# Display these records for visual inspection. Then, verify that these records complement each other in terms of null and non-null values for all columns after the first four columns.\n",
+    "# Select the two records with the same 'case_id' value of 'fcd9637f-00f2-49e9-bb87-94e556d5d7eb'.\n",
+    "# Display these records for visual inspection. Then, verify that these records complement each other in terms of null and non-null values for all columns after the first five columns.\n",
     "# In other words, if one record has NaN values in a column, the other record should have non-NaN values in that same column, and vice versa.\n",
+    "# Also, if both records have NaN values in the same column, ignore that column in the comparison.\n",
     "# If the two records complement each other, print \"The two records complement each other.\" Otherwise, print \"The two records do not complement each other.\""
    ]
   },
@@ -288,9 +328,19 @@
    "outputs": [],
    "source": [
     "# write a Python code snippet that iterates over all unique 'case_id' values. For each 'case_id', select all records associated with that 'case_id'.\n",
-    "# Verify that these records complement each other in terms of null and non-null values for all columns after the first four columns.\n",
+    "# Verify that these records complement each other in terms of null and non-null values for all columns AFTER the first five columns.\n",
     "# In other words, if one record has NaN values in a column, the other records should have non-NaN values in that same column, and vice versa.\n",
-    "# Print a dictionary where each 'case_id' is a key and the corresponding value is a boolean indicating whether all records with that 'case_id' perfectly complement each other in terms of null and non-null values."
+    "# Print a dictionary where each 'case_id' is a key and the corresponding value is a boolean indicating whether all records with that 'case_id' perfectly complement each other in terms of null and non-null values.\n",
+    "# For example, if all records with 'case_id' = 'fcd9637f-00f2-49e9-bb87-94e556d5d7eb' perfectly complement each other, the dictionary should have the key 'fcd9637f-00f2-49e9-bb87-94e556d5d7eb' with a value of True."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Check if all the values in the dictionary are True. If so, print \"All records complement each other.\" Otherwise, print \"Not all records complement each other.\""
    ]
   },
   {
@@ -489,7 +539,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Verify the range of the new age column is as expected."
+    "Verify that the range of the new age column is as expected. The ages should fall between 19 and 88."
    ]
   },
   {
@@ -501,6 +551,22 @@
     "# show statistical summary of the diagnoses.age_at_diagnosis_years column"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once the range is as expected, we can drop the diagnosis.age_at_diagnosis column so that in the next notebook, GitHub Copilot can automatically choose our single \"age column\" for visualizations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop diagnosis.age_at_diagnosis column"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
diff --git a/01-cancer-data-analysis/fm-ad-notebook-visualization.ipynb b/01-cancer-data-analysis/fm-ad-notebook-visualization.ipynb
@@ -42,8 +42,34 @@
    "outputs": [],
    "source": [
     "import pandas as pd\n",
-    "\n",
-    "combined_data_cleansed_df = pd.read_csv('combined_data_cleansed.csv')"
+    "import numpy as np\n",
+    "import seaborn as sns\n",
+    "import matplotlib.pyplot as plt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "combined_data_cleansed_df = pd.read_csv('combined_data_cleaned.csv')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's assign our dataframe to the shorter name df so that it will be easier to use the code suggestions from GitHub Copilot Chat."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = combined_data_cleansed_df"
    ]
   },
   {
@@ -82,21 +108,36 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "# show first few records"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "# show df shape"
+   ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "# show df columns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# show df columns and data types"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -112,20 +153,121 @@
     "#### 4.1 Data visualization"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### Distribution of disease types\n",
+    "\n",
+    "Understanding the distribution of disease types helps identify the most common and rare cancers in the dataset, which is crucial for allocating resources and prioritizing research."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "# create a bar graph of the diagnoses.age_at_diagnosis_years column to see the distribution of ages"
+    "# create a pie chart of top 10 cases.disease_type from df"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now let's share with GitHub copilot chat the columns in our dataset and what visualizations and correlations it thinks that we can create from these columns."
+    "##### Gender demographic\n",
+    "\n",
+    "Let's take a look at how the data is distributed with respect to gender."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# show the distribution of the column demographic.gender in bar chart"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As we can see from above, gender information was available for all but 9 samples and shows a slight bias toward females.\n",
+    "\n",
+    "According to the [study](https://aacrjournals.org/cancerres/article/77/9/2464/625134/High-Throughput-Genomic-Profiling-of-Adult-Solid), this bias can be explained in part by the large number of breast and GYN cancer samples within the dataset, since breast cancer occurs predominantly in females and gynecological cancers occur only in females. Let's check visually whether that is the case."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# show the relationship between cases.primary_site and demographic.gender"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A similar analysis we can look at is the relationship between the disease type and the gender of the patient.\n",
+    "\n",
+    "Identifying gender differences in disease prevalence can highlight gender-specific vulnerabilities or protective factors, influencing personalized treatment approaches."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# visualize the relationship between cases.disease_type and demographic.gender"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### Age distribution\n",
+    "\n",
+    "The study \"High-Throughput Genomic Profiling of Adult Solid Tumors\" used patient samples that were submitted to Foundation Medicine for genomic profiling as part of routine clinical care, so its data collection did not involve random sampling.\n",
+    "\n",
+    "That being said, let's see how close to a normal distribution the dataset is with respect to age."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# show distribution of diagnoses.age_at_diagnosis_years"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "What is the relationship between age at diagnosis and disease type?\n",
+    "\n",
+    "This question helps determine if certain cancers are more likely to occur at specific ages, which can inform targeted awareness and early detection efforts in particular demographics."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Is there a relationship between the primary diagnosis and the sample type?\n",
+    "\n",
+    "This question is important to understand if certain diagnoses are more likely to be made from specific types of samples, affecting diagnostic strategies and the feasibility of certain tests."
    ]
   },
   {
@@ -142,6 +284,13 @@
     "#### 4.X Additional analysis"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now let's share the columns in our dataset with GitHub Copilot Chat and ask which visualizations and correlations it thinks we can create from them."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,