Update readme and notebooks

timmanik · timmanik · commit 30e762aaca1a · 2024-06-03T14:55:30.000-04:00
diff --git a/00-intro-to-pandas/README.md b/00-intro-to-pandas/README.md
@@ -0,0 +1,49 @@
+# Introduction to Pandas
+
+Below is a quick primer on pandas. It covers the basic concepts you need as context for the exercises in this workshop.
+
+## What is Pandas?
+
+Pandas is a powerful Python library used for data manipulation and analysis. It provides data structures and functions designed to work efficiently with structured data, particularly large and complex datasets.
+
+## Key Features of Pandas
+
+- **Data Manipulation**: Perform basic and advanced data operations such as merging, reshaping, selecting, and cleaning.
+- **Time Series**: Extensive support for date-time data for financial and time-series analysis.
+- **Handling Missing Data**: Convenient methods for detecting, removing, and replacing missing data.
+- **Efficient I/O Tools**: Tools for reading data from a variety of formats (CSV, Excel, JSON, etc.) and writing data.
+
+## What is a DataFrame?
+
+A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is arguably the most central and widely used pandas object, akin to a spreadsheet.
+
+### Creating a DataFrame
+
+You can create a DataFrame from various sources, such as a list, a dictionary, or a CSV file.
+
+```python
+import pandas as pd
+
+# Creating from a dictionary
+data = {'Name': ['John', 'Anna'], 'Age': [28, 22]}
+df = pd.DataFrame(data)
+
+# Display the DataFrame
+print(df)
+```
+
+### Basic DataFrame Operations
+- **Viewing Data**: `df.head()` to see the first few rows.
+- **Describing Data**: `df.describe()` to view summary statistics.
+- **Data Selection**: `df['Column']` to select a column.
+
+## Why Use Pandas?
+
+Pandas simplifies data tasks that would otherwise require verbose, hand-written code, which makes data manipulation code more readable and concise. It is widely used by data scientists because of its data processing capabilities and its integration with other libraries such as NumPy and Matplotlib.
+
+
+## Further Learning
+
+For more detailed exploration, consider checking out the [Pandas Documentation](https://pandas.pydata.org/docs/).
+
+Also check out the introductory guide [Python for Data Analysis](https://wesmckinney.com/book/).
diff --git a/01-cancer-data-analysis/README.md b/01-cancer-data-analysis/README.md
@@ -14,7 +14,7 @@ This repository contains a collection of notebooks for the data analysis workflo
 
 To get started with the cancer data analysis, follow these steps:
 
-1. Clone this repository to your local machine.
+1. Make sure you have cloned this repository by following the steps in the [Getting Started section of the main README](../README.md#getting-started).
 2. Open the notebooks in Visual Studio Code.
 3. Run the notebooks in the specified order (exploration, processing, visualization) to perform the complete data analysis workflow.
 
diff --git a/01-cancer-data-analysis/fm-ad-notebook-processing.ipynb b/01-cancer-data-analysis/fm-ad-notebook-processing.ipynb
@@ -84,7 +84,7 @@
    "source": [
     "According to the dataset documentation, there are 18,004 records in the study. Additionally, results from section 2.6 of [fm-ad-notebook-exploration.ipynb](fm-ad-notebook-exploration.ipynb) indicate that the `case_id` column has 18,004 unique values.\n",
     "\n",
-    "Moreover, other columns also have 18,004 unique values. These columns likely serve as unique identifiers similar to the `case_id` column, making them redundant.\n",
+    "Moreover, other columns also have 18,004 unique values. These columns likely serve as unique identifiers similar to the `case_id` column. For the purposes of this workshop, we can assume they are redundant.\n",
     "\n",
     "Let's create a prompt to identify which of these columns fit this criterion."
    ]
@@ -95,7 +95,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# list the name of columns that have more than or equal to 18004 unique values\n"
+    "# list the names of the columns that have 18004 or more unique values"
    ]
   },
   {
@@ -175,7 +175,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We saw in section 2.7 of the `fm-ad-notebook-exploration.ipynb` notebook that there were duplicate records. Let's go ahead and drop them."
+    "We saw in section 2.7 of the [fm-ad-notebook-exploration.ipynb](fm-ad-notebook-exploration.ipynb) notebook that there were duplicate records. Let's go ahead and drop them."
    ]
   },
   {
@@ -207,7 +207,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In section 2.7, you saw that there were records that shared the same case_id. Let's check if there are any other records share a case_id.\n"
+    "In section 2.7, you saw that there were records that shared the same `case_id`. Let's check whether any other records share a `case_id`.\n"
    ]
   },
   {
@@ -219,6 +219,13 @@
     "# count how many records share the same case_id"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's take a look at a visual representation of the distribution of the number of records that share a `case_id`."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -232,7 +239,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's take a look at the instance where a case_id is shared between records.\n",
+    "Let's take a look at the list of records that share a particular `case_id`.\n",
     "\n",
     "Create a prompt below to generate code to show you records that shares a case_id different from the case_id in section 2.7 of [fm-ad-notebook-exploration.ipynb](fm-ad-notebook-exploration.ipynb)."
    ]
@@ -246,6 +253,15 @@
     "# show the records with the case_id aff95088-8760-46d2-a404-b545807e0735"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So far, the prompts we have created are zero-shot prompts: they contain no examples and simply tell the model what we want it to do.\n",
+    "\n",
+    "Next, we will work with one-shot prompts. Like a zero-shot prompt, a one-shot prompt describes what you want, but it also includes a single example. This helps generate a more context-aware response."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -412,6 +428,13 @@
     "6947/19"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Based on the calculation above (6947 / 19 ≈ 365.6), the age column appears to be stored in days, so its normalization factor is 365. Let's create a new dataframe with a new column that holds the age in years, obtained by dividing the existing age column by 365."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -421,6 +444,13 @@
     "# create a new dataframe, create a new column 'diagnoses.age_at_diagnosis_years' by dividing 'diagnoses.age_at_diagnosis' by 365, and drop the 'diagonses.age_at_diagnosis' column"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The publication [“High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer Pathogenesis”](http://cancerres.aacrjournals.org/content/77/9/2464.long) removes records of patients aged 89 or older. Let's clean the data to reflect this."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -439,6 +469,13 @@
     "# drop the record with 'diagnosis.age_at_diagnosis_years' greater or equal to 89"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Currently, the age column stores the participants' ages as floats. The publication, however, describes and visualizes the age distribution using integers, so let's convert the ages to integers to match."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,