|
84 | 84 | "source": [
|
85 | 85 | "According to the dataset documentation, there are 18,004 records in the study. Additionally, results from section 2.6 of [fm-ad-notebook-exploration.ipynb](fm-ad-notebook-exploration.ipynb) indicate that the `case_id` column has 18,004 unique values.\n",
|
86 | 86 | "\n",
|
87 |
| - "Moreover, other columns also have 18,004 unique values. These columns likely serve as unique identifiers similar to the `case_id` column, making them redundant.\n", |
| 87 | + "Moreover, other columns also have 18,004 unique values. These columns likely serve as unique identifiers similar to the `case_id` column. For the purposes of this workshop, we can assume they are redundant.\n", |
88 | 88 | "\n",
|
89 | 89 | "Let's create a prompt to identify which of these columns fit this criterion."
|
90 | 90 | ]
|
|
95 | 95 | "metadata": {},
|
96 | 96 | "outputs": [],
|
97 | 97 | "source": [
|
98 |
| - "# list the name of columns that have more than or equal to 18004 unique values\n" |
| 98 | + "# list the names of columns that have at least 18004 unique values" |
99 | 99 | ]
|
100 | 100 | },
|
101 | 101 | {
|
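For reference, the prompt in the code cell above could yield code along these lines. This is only a minimal sketch, assuming pandas and a small made-up dataframe `df`; the `submitter_id` and `gender` columns and the `n_records` variable are illustrative and not taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the workshop dataframe.
df = pd.DataFrame({
    "case_id": ["a", "b", "c", "d"],
    "submitter_id": ["s1", "s2", "s3", "s4"],  # illustrative column name
    "gender": ["f", "m", "f", "m"],
})

n_records = len(df)  # 18004 in the workshop dataset

# nunique() counts the distinct non-null values in each column.
unique_counts = df.nunique()

# Columns with as many distinct values as there are records behave like identifiers.
id_like_columns = unique_counts[unique_counts >= n_records].index.tolist()
print(id_like_columns)
```

Columns returned here are candidates for removal, since each one identifies a record as uniquely as `case_id` does.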
|
175 | 175 | "cell_type": "markdown",
|
176 | 176 | "metadata": {},
|
177 | 177 | "source": [
|
178 |
| - "We saw in section 2.7 of the `fm-ad-notebook-exploration.ipynb` notebook that there were duplicate records. Let's go ahead and drop them." |
| 178 | + "We saw in section 2.7 of the [fm-ad-notebook-exploration.ipynb](fm-ad-notebook-exploration.ipynb) notebook that there were duplicate records. Let's go ahead and drop them." |
179 | 179 | ]
|
180 | 180 | },
|
181 | 181 | {
|
|
207 | 207 | "cell_type": "markdown",
|
208 | 208 | "metadata": {},
|
209 | 209 | "source": [
|
210 |
| - "In section 2.7, you saw that there were records that shared the same case_id. Let's check if there are any other records share a case_id.\n" |
| 210 | + "In section 2.7, you saw that there were records that shared the same `case_id`. Let's check whether any other records share a `case_id`.\n" |
211 | 211 | ]
|
212 | 212 | },
|
213 | 213 | {
|
|
219 | 219 | "# count how many records share the same case_id"
|
220 | 220 | ]
|
221 | 221 | },
|
| 222 | + { |
| 223 | + "cell_type": "markdown", |
| 224 | + "metadata": {}, |
| 225 | + "source": [ |
| 226 | + "Let's visualize the distribution of the number of records that share each `case_id`." |
| 227 | + ] |
| 228 | + }, |
222 | 229 | {
|
223 | 230 | "cell_type": "code",
|
224 | 231 | "execution_count": null,
|
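The counting step described in this hunk can be sketched as follows. The example assumes pandas and a tiny hypothetical dataframe; only the `case_id` column name comes from the notebook, and the code the workshop prompt actually generates may look different.

```python
import pandas as pd

# Hypothetical toy data: "a" appears twice and "c" three times.
df = pd.DataFrame({
    "case_id": ["a", "a", "b", "c", "c", "c"],
    "value": [1, 2, 3, 4, 5, 6],  # illustrative column
})

# Number of records for each case_id.
records_per_case = df.groupby("case_id").size()

# Keep only the case_ids that are shared by more than one record.
shared = records_per_case[records_per_case > 1]

# Distribution: how many case_ids have 1, 2, 3, ... records.
distribution = records_per_case.value_counts().sort_index()

print(shared)
print(distribution)
```

If matplotlib is installed, `distribution.plot(kind="bar")` is one way to turn the last series into the kind of chart the markdown cell refers to.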
|
232 | 239 | "cell_type": "markdown",
|
233 | 240 | "metadata": {},
|
234 | 241 | "source": [
|
235 |
| - "Let's take a look at the instance where a case_id is shared between records.\n", |
| 242 | + "Let's take a look at the records that share a particular `case_id`.\n", |
236 | 243 | "\n",
|
237 | 244 | "Create a prompt below to generate code that shows the records sharing a `case_id` different from the one in section 2.7 of [fm-ad-notebook-exploration.ipynb](fm-ad-notebook-exploration.ipynb)."
|
238 | 245 | ]
|
|
246 | 253 | "# show the records with the case_id aff95088-8760-46d2-a404-b545807e0735"
|
247 | 254 | ]
|
248 | 255 | },
|
| 256 | + { |
| 257 | + "cell_type": "markdown", |
| 258 | + "metadata": {}, |
| 259 | + "source": [ |
| 260 | + "So far, the prompts we have created are zero-shot prompts: they contain no examples, only a description of what we want.\n", |
| 261 | + "\n", |
| 262 | + "Next, we will work with one-shot prompts. In addition to describing what you want, as in a zero-shot prompt, a one-shot prompt includes a single example. This helps generate a more context-aware response." |
| 263 | + ] |
| 264 | + }, |
249 | 265 | {
|
250 | 266 | "cell_type": "code",
|
251 | 267 | "execution_count": null,
|
|
412 | 428 | "6947/19"
|
413 | 429 | ]
|
414 | 430 | },
|
| 431 | + { |
| 432 | + "cell_type": "markdown", |
| 433 | + "metadata": {}, |
| 434 | + "source": [ |
| 435 | + "Based on the calculation above, the column's normalization factor is roughly 365, which suggests the ages are stored in days. Let's divide the existing age column by 365 and store the result in a new column of a new dataframe." |
| 436 | + ] |
| 437 | + }, |
415 | 438 | {
|
416 | 439 | "cell_type": "code",
|
417 | 440 | "execution_count": null,
|
|
421 | 444 | "# create a new dataframe, create a new column 'diagnoses.age_at_diagnosis_years' by dividing 'diagnoses.age_at_diagnosis' by 365, and drop the 'diagnoses.age_at_diagnosis' column"
|
422 | 445 | ]
|
423 | 446 | },
|
| 447 | + { |
| 448 | + "cell_type": "markdown", |
| 449 | + "metadata": {}, |
| 450 | + "source": [ |
| 451 | + "The publication “High-Throughput Genomic Profiling of Adult Solid Tumors Reveals Novel Insights into Cancer Pathogenesis”, http://cancerres.aacrjournals.org/content/77/9/2464.long, removes records of patients aged 89 or older. Let's do some data cleaning to reflect this." |
| 452 | + ] |
| 453 | + }, |
424 | 454 | {
|
425 | 455 | "cell_type": "code",
|
426 | 456 | "execution_count": null,
|
|
439 | 469 | "# drop the records with 'diagnoses.age_at_diagnosis_years' greater than or equal to 89"
|
440 | 470 | ]
|
441 | 471 | },
|
| 472 | + { |
| 473 | + "cell_type": "markdown", |
| 474 | + "metadata": {}, |
| 475 | + "source": [ |
| 476 | + "Currently, the age column stores the participants' ages as floats. The publication, however, describes and visualizes the age distribution using integers, so let's transform the column to match." |
| 477 | + ] |
| 478 | + }, |
442 | 479 | {
|
443 | 480 | "cell_type": "code",
|
444 | 481 | "execution_count": null,
|
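Taken together, the three age-related prompts in this diff (convert to years, drop ages of 89 or older, convert to integers) could produce something like the sketch below. It assumes pandas, a made-up three-row dataframe, and that whole years are obtained by truncation rather than rounding; the publication's exact method is not stated here, and `df_years` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy data; the age column is assumed to hold days.
df = pd.DataFrame({
    "case_id": ["a", "b", "c"],
    "diagnoses.age_at_diagnosis": [6947.0, 20075.0, 32850.0],
})

# New dataframe with the age in years; the original column is dropped.
df_years = df.assign(**{
    "diagnoses.age_at_diagnosis_years": df["diagnoses.age_at_diagnosis"] / 365
}).drop(columns=["diagnoses.age_at_diagnosis"])

# Remove records of patients aged 89 or older.
df_years = df_years[df_years["diagnoses.age_at_diagnosis_years"] < 89].copy()

# Truncate to whole years (assumption: truncation, not rounding).
df_years["diagnoses.age_at_diagnosis_years"] = (
    df_years["diagnoses.age_at_diagnosis_years"].astype(int)
)

print(df_years)
```

Note that `astype(int)` raises an error if the column contains missing values, so any such rows would need to be handled before the final step.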
|