|
42 | 42 | "outputs": [],
|
43 | 43 | "source": [
|
44 | 44 | "import pandas as pd\n",
|
45 |
| - "\n", |
46 |
| - "combined_data_cleansed_df = pd.read_csv('combined_data_cleansed.csv')" |
| 45 | + "import numpy as np\n", |
| 46 | + "import seaborn as sns\n", |
| 47 | + "import matplotlib.pyplot as plt" |
| 48 | + ] |
| 49 | + }, |
| 50 | + { |
| 51 | + "cell_type": "code", |
| 52 | + "execution_count": null, |
| 53 | + "metadata": {}, |
| 54 | + "outputs": [], |
| 55 | + "source": [ |
| 56 | + "combined_data_cleansed_df = pd.read_csv('combined_data_cleaned.csv')" |
| 57 | + ] |
| 58 | + }, |
| 59 | + { |
| 60 | + "cell_type": "markdown", |
| 61 | + "metadata": {}, |
| 62 | + "source": [ |
| 63 | + "Let's alias our DataFrame as `df` so that the code suggestions from GitHub Copilot Chat are easier to reuse." |
| 64 | + ] |
| 65 | + }, |
| 66 | + { |
| 67 | + "cell_type": "code", |
| 68 | + "execution_count": null, |
| 69 | + "metadata": {}, |
| 70 | + "outputs": [], |
| 71 | + "source": [ |
| 72 | + "df = combined_data_cleansed_df" |
47 | 73 | ]
|
48 | 74 | },
|
49 | 75 | {
|
|
82 | 108 | "execution_count": null,
|
83 | 109 | "metadata": {},
|
84 | 110 | "outputs": [],
|
85 |
| - "source": [] |
| 111 | + "source": [ |
| 112 | + "# show first few records" |
| 113 | + ] |
86 | 114 | },
|
87 | 115 | {
|
88 | 116 | "cell_type": "code",
|
89 | 117 | "execution_count": null,
|
90 | 118 | "metadata": {},
|
91 | 119 | "outputs": [],
|
92 |
| - "source": [] |
| 120 | + "source": [ |
| 121 | + "# show df shape" |
| 122 | + ] |
93 | 123 | },
|
94 | 124 | {
|
95 | 125 | "cell_type": "code",
|
96 | 126 | "execution_count": null,
|
97 | 127 | "metadata": {},
|
98 | 128 | "outputs": [],
|
99 |
| - "source": [] |
| 129 | + "source": [ |
| 130 | + "# show df columns" |
| 131 | + ] |
| 132 | + }, |
| 133 | + { |
| 134 | + "cell_type": "code", |
| 135 | + "execution_count": null, |
| 136 | + "metadata": {}, |
| 137 | + "outputs": [], |
| 138 | + "source": [ |
| 139 | + "# show df columns and data types" |
| 140 | + ] |
100 | 141 | },
|
101 | 142 | {
|
102 | 143 | "cell_type": "markdown",
|
|
112 | 153 | "#### 4.1 Data visualization"
|
113 | 154 | ]
|
114 | 155 | },
|
| 156 | + { |
| 157 | + "cell_type": "markdown", |
| 158 | + "metadata": {}, |
| 159 | + "source": [ |
| 160 | + "##### Distribution of disease types\n", |
| 161 | + "\n", |
| 162 | + "Understanding the distribution of disease types helps identify the most common and rare cancers in the dataset, which is crucial for allocating resources and prioritizing research." |
| 163 | + ] |
| 164 | + }, |
115 | 165 | {
|
116 | 166 | "cell_type": "code",
|
117 | 167 | "execution_count": null,
|
118 | 168 | "metadata": {},
|
119 | 169 | "outputs": [],
|
120 | 170 | "source": [
|
121 |
| - "# create a bar graph of the diagnoses.age_at_diagnosis_years column to see the distribution of ages" |
| 171 | + "# create a pie chart of top 10 cases.disease_type from df" |
122 | 172 | ]
|
123 | 173 | },
|
124 | 174 | {
|
125 | 175 | "cell_type": "markdown",
|
126 | 176 | "metadata": {},
|
127 | 177 | "source": [
|
128 |
| - "Now let's share with GitHub copilot chat the columns in our dataset and what visualizations and correlations it thinks that we can create from these columns." |
| 178 | + "##### Gender demographic\n", |
| 179 | + "\n", |
| 180 | + "Let's take a look at how the data is distributed with respect to gender." |
| 181 | + ] |
| 182 | + }, |
| 183 | + { |
| 184 | + "cell_type": "code", |
| 185 | + "execution_count": null, |
| 186 | + "metadata": {}, |
| 187 | + "outputs": [], |
| 188 | + "source": [ |
| 189 | + "# show the distribution of the column demographic.gender in a bar chart" |
| 190 | + ] |
| 191 | + }, |
| 192 | + { |
| 193 | + "cell_type": "markdown", |
| 194 | + "metadata": {}, |
| 195 | + "source": [ |
| 196 | + "As we can see above, gender information was available for all but 9 samples and showed a slight bias toward females.\n", |
| 197 | + "\n", |
| 198 | + "According to the [study](https://aacrjournals.org/cancerres/article/77/9/2464/625134/High-Throughput-Genomic-Profiling-of-Adult-Solid), this bias can be explained in part by the large number of breast and gynecological cancer samples in the dataset: gynecological cancers are specific to females, and breast cancer occurs overwhelmingly in females. Let's check visually whether that is the case." |
| 199 | + ] |
| 200 | + }, |
| 201 | + { |
| 202 | + "cell_type": "code", |
| 203 | + "execution_count": null, |
| 204 | + "metadata": {}, |
| 205 | + "outputs": [], |
| 206 | + "source": [ |
| 207 | + "# show the relationship between cases.primary_site and demographic.gender" |
| 208 | + ] |
| 209 | + }, |
| 210 | + { |
| 211 | + "cell_type": "markdown", |
| 212 | + "metadata": {}, |
| 213 | + "source": [ |
| 214 | + "A similar analysis looks at the relationship between disease type and patient gender.\n", |
| 215 | + "\n", |
| 216 | + "Identifying gender differences in disease prevalence can highlight gender-specific vulnerabilities or protective factors, influencing personalized treatment approaches." |
| 217 | + ] |
| 218 | + }, |
| 219 | + { |
| 220 | + "cell_type": "code", |
| 221 | + "execution_count": null, |
| 222 | + "metadata": {}, |
| 223 | + "outputs": [], |
| 224 | + "source": [ |
| 225 | + "# visualize the relationship between cases.disease_type and demographic.gender" |
| 226 | + ] |
| 227 | + }, |
| 228 | + { |
| 229 | + "cell_type": "markdown", |
| 230 | + "metadata": {}, |
| 231 | + "source": [ |
| 232 | + "##### Age distribution\n", |
| 233 | + "\n", |
| 234 | + "The study \"High-Throughput Genomic Profiling of Adult Solid Tumors\" used patient samples collected during routine clinical care and submitted to Foundation Medicine for genomic profiling, so its data collection did not involve random sampling.\n", |
| 235 | + "\n", |
| 236 | + "That being said, let's see how close to a normal distribution the dataset is with respect to age." |
| 237 | + ] |
| 238 | + }, |
| 239 | + { |
| 240 | + "cell_type": "code", |
| 241 | + "execution_count": null, |
| 242 | + "metadata": {}, |
| 243 | + "outputs": [], |
| 244 | + "source": [ |
| 245 | + "# show distribution of diagnoses.age_at_diagnosis_years" |
| 246 | + ] |
| 247 | + }, |
| 248 | + { |
| 249 | + "cell_type": "markdown", |
| 250 | + "metadata": {}, |
| 251 | + "source": [ |
| 252 | + "What is the relationship between age at diagnosis and disease type?\n", |
| 253 | + "\n", |
| 254 | + "This question helps determine if certain cancers are more likely to occur at specific ages, which can inform targeted awareness and early detection efforts in particular demographics." |
| 255 | + ] |
| 256 | + }, |
| 257 | + { |
| 258 | + "cell_type": "code", |
| 259 | + "execution_count": null, |
| 260 | + "metadata": {}, |
| 261 | + "outputs": [], |
| 262 | + "source": [] |
| 263 | + }, |
| 264 | + { |
| 265 | + "cell_type": "markdown", |
| 266 | + "metadata": {}, |
| 267 | + "source": [ |
| 268 | + "Is there a relationship between the primary diagnosis and the sample type?\n", |
| 269 | + "\n", |
| 270 | + "This question is important to understand if certain diagnoses are more likely to be made from specific types of samples, affecting diagnostic strategies and the feasibility of certain tests." |
129 | 271 | ]
|
130 | 272 | },
|
131 | 273 | {
|
|
142 | 284 | "#### 4.X Additional analysis"
|
143 | 285 | ]
|
144 | 286 | },
|
| 287 | + { |
| 288 | + "cell_type": "markdown", |
| 289 | + "metadata": {}, |
| 290 | + "source": [ |
| 291 | + "Now let's share the columns in our dataset with GitHub Copilot Chat and ask which visualizations and correlations it thinks we can create from them." |
| 292 | + ] |
| 293 | + }, |
145 | 294 | {
|
146 | 295 | "cell_type": "code",
|
147 | 296 | "execution_count": null,
|
|
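The prompt-style comments added in this change (`# show df shape`, `# show the relationship between cases.primary_site and demographic.gender`, and so on) are left for GitHub Copilot to complete. As a rough illustration of what such completions might compute, here is a minimal pandas sketch. The column names come from the prompts in the diff; the rows are invented stand-ins for the contents of the CSV file, and the variable names `gender_counts`, `site_by_gender`, and `top_disease_types` are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the cleaned dataset; every value below is invented
# for illustration and does not come from the real CSV file.
df = pd.DataFrame({
    "cases.primary_site": ["Breast", "Breast", "Ovary", "Lung", "Lung", "Colon"],
    "cases.disease_type": [
        "Ductal and Lobular Neoplasms",
        "Ductal and Lobular Neoplasms",
        "Cystic, Mucinous and Serous Neoplasms",
        "Adenomas and Adenocarcinomas",
        "Adenomas and Adenocarcinomas",
        "Adenomas and Adenocarcinomas",
    ],
    "demographic.gender": ["female", "female", "female", "male", "female", "male"],
    "diagnoses.age_at_diagnosis_years": [52, 61, 58, 70, 66, 49],
})

# Basic inspection: first records, shape, and column data types.
print(df.head())
print(df.shape)
print(df.dtypes)

# Number of samples per gender (the data behind a gender bar chart).
gender_counts = df["demographic.gender"].value_counts()

# Primary site versus gender as a contingency table of counts.
site_by_gender = pd.crosstab(df["cases.primary_site"], df["demographic.gender"])

# The ten most frequent disease types (the data behind a pie chart).
top_disease_types = df["cases.disease_type"].value_counts().head(10)

print(gender_counts)
print(site_by_gender)
print(top_disease_types)
```

With matplotlib installed, each of these tables can be drawn directly, for example `gender_counts.plot(kind="bar")`, `site_by_gender.plot(kind="bar", stacked=True)`, or `top_disease_types.plot(kind="pie")`. Computing the counts first keeps the numbers easy to check before any chart is rendered.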