|
97 | 97 | },
|
98 | 98 | {
|
99 | 99 | "cell_type": "markdown",
|
| 100 | + "metadata": { |
| 101 | + "collapsed": false, |
| 102 | + "jupyter": { |
| 103 | + "outputs_hidden": false |
| 104 | + } |
| 105 | + }, |
100 | 106 | "source": [
|
101 | 107 | "We will now import the LC-MS data with the metadata (Y variables) and feature annotation for LC-MS.\n",
|
102 | 108 | "\n",
|
|
112 | 118 | "Y - represents the sex of the individuals (0: Female, 1: Male) used as the response variable for the PLS-DA model\n",
|
113 | 119 | "\n",
|
114 | 120 | "##### NB - Full data available from: [https://zenodo.org/doi/10.5281/zenodo.4053166](https://zenodo.org/doi/10.5281/zenodo.4053166)."
|
115 |
| - ], |
116 |
| - "metadata": { |
117 |
| - "collapsed": false |
118 |
| - } |
| 121 | + ] |
119 | 122 | },
|
120 | 123 | {
|
121 | 124 | "cell_type": "code",
|
122 | 125 | "execution_count": null,
|
| 126 | + "metadata": { |
| 127 | + "collapsed": false, |
| 128 | + "jupyter": { |
| 129 | + "outputs_hidden": false |
| 130 | + } |
| 131 | + }, |
123 | 132 | "outputs": [],
|
124 | 133 | "source": [
|
125 | 134 | "# Load the dataset\n",
|
|
138 | 147 | "# Extract the retention times and m/z to use in 2D plots of the dataset\n",
|
139 | 148 | "retention_times = np.array([x.split('_')[0] for x in variable_names], dtype='float')/60\n",
|
140 | 149 | "mz_values = np.array([x.split('_')[1][0:-3] for x in variable_names], dtype='float')"
|
141 |
| - ], |
142 |
| - "metadata": { |
143 |
| - "collapsed": false |
144 |
| - } |
| 150 | + ] |
145 | 151 | },
|
146 | 152 | {
|
147 | 153 | "cell_type": "code",
|
148 | 154 | "execution_count": null,
|
| 155 | + "metadata": { |
| 156 | + "collapsed": false, |
| 157 | + "jupyter": { |
| 158 | + "outputs_hidden": false |
| 159 | + } |
| 160 | + }, |
149 | 161 | "outputs": [],
|
150 | 162 | "source": [
|
151 | 163 | "# Compare binary response to labels\n",
|
152 | 164 | "print(dementia_rpos_dataset['Gender'].value_counts())\n",
|
153 | 165 | "print(pd.DataFrame(gender_y).value_counts())"
|
154 |
| - ], |
155 |
| - "metadata": { |
156 |
| - "collapsed": false |
157 |
| - } |
| 166 | + ] |
158 | 167 | },
|
159 | 168 | {
|
160 | 169 | "cell_type": "code",
|
161 | 170 | "execution_count": null,
|
| 171 | + "metadata": { |
| 172 | + "collapsed": false, |
| 173 | + "jupyter": { |
| 174 | + "outputs_hidden": false |
| 175 | + } |
| 176 | + }, |
162 | 177 | "outputs": [],
|
163 | 178 | "source": [
|
164 | 179 | "# Split data into training and test sets\n",
|
165 | 180 | "X_train, X_test, y_train, y_test = train_test_split(x_data, gender_y, test_size=0.10, random_state=42)\n"
|
166 |
| - ], |
167 |
| - "metadata": { |
168 |
| - "collapsed": false |
169 |
| - } |
| 181 | + ] |
170 | 182 | },
|
171 | 183 | {
|
172 | 184 | "cell_type": "markdown",
|
|
267 | 279 | {
|
268 | 280 | "cell_type": "code",
|
269 | 281 | "execution_count": null,
|
| 282 | + "metadata": { |
| 283 | + "collapsed": false, |
| 284 | + "jupyter": { |
| 285 | + "outputs_hidden": false |
| 286 | + } |
| 287 | + }, |
270 | 288 | "outputs": [],
|
271 | 289 | "source": [
|
272 | 290 | "# # Run this cell to fit a PLS-DA model with 2 components and UV scaling\n",
|
273 | 291 | "# pls_da = ChemometricsPLSDA(n_components=2, x_scaler=scaling_object_uv)\n",
|
274 | 292 | "# pls_da.fit(X_train, y_train)\n",
|
275 | 293 | "# x_test_log = X_test.copy()"
|
276 |
| - ], |
277 |
| - "metadata": { |
278 |
| - "collapsed": false |
279 |
| - } |
| 294 | + ] |
280 | 295 | },
|
281 | 296 | {
|
282 | 297 | "cell_type": "code",
|
|
308 | 323 | },
|
309 | 324 | {
|
310 | 325 | "cell_type": "markdown",
|
| 326 | + "metadata": { |
| 327 | + "collapsed": false, |
| 328 | + "jupyter": { |
| 329 | + "outputs_hidden": false |
| 330 | + } |
| 331 | + }, |
311 | 332 | "source": [
|
312 | 333 | "The *plot_scores* methods of `ChemometricsPLS` and `ChemometricsPLSDA` objects share the same functionality as `ChemometricsPCA.plot_scores`. Score plot data points can be colored by levels of a continuous or discrete covariate by using the `color` argument and setting the ```discrete``` argument to ```False``` (continuous) or ```True``` (discrete), accordingly. The index (row index of the data matrix **X**) of outlying samples can be labeled with ```label_outliers=True``` and the plot title changed with the argument ```plot_title```."
|
313 |
| - ], |
314 |
| - "metadata": { |
315 |
| - "collapsed": false |
316 |
| - } |
| 334 | + ] |
317 | 335 | },
|
318 | 336 | {
|
319 | 337 | "cell_type": "code",
|
320 | 338 | "execution_count": null,
|
| 339 | + "metadata": { |
| 340 | + "collapsed": false, |
| 341 | + "jupyter": { |
| 342 | + "outputs_hidden": false |
| 343 | + } |
| 344 | + }, |
321 | 345 | "outputs": [],
|
322 | 346 | "source": [
|
323 | 347 | "# Plot the scores\n",
|
324 | 348 | "pls_da.plot_scores(color=y_train, discrete=True, label_outliers=True, plot_title=None)"
|
325 |
| - ], |
326 |
| - "metadata": { |
327 |
| - "collapsed": false |
328 |
| - } |
| 349 | + ] |
329 | 350 | },
|
330 | 351 | {
|
331 | 352 | "cell_type": "markdown",
|
| 353 | + "metadata": { |
| 354 | + "collapsed": false, |
| 355 | + "jupyter": { |
| 356 | + "outputs_hidden": false |
| 357 | + } |
| 358 | + }, |
332 | 359 | "source": [
|
333 | 360 | "It is also possible to assess overfitting and general model performance using \"machine learning\" metrics such as accuracy, precision, recall, F1 score, and ROC curves with their area under the curve (AUC). These metrics are calculated by comparing the Y values predicted by the model with the true Y values.\n",
|
334 | 361 | "\n",
|
|
345 | 372 | "$$F1 = 2 * \\frac{Precision * Recall}{Precision + Recall}$$\n",
|
346 | 373 | "\n",
|
347 | 374 | "AUC is the area under the ROC curve. The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The TPR is the same as the recall, and the FPR is 1 - specificity. The AUC is a measure of how well a model can distinguish between classes. An AUC of 1 means the model can perfectly distinguish between classes, and an AUC of 0.5 means the model cannot distinguish between classes at all.\n"
|
348 |
| - ], |
349 |
| - "metadata": { |
350 |
| - "collapsed": false |
351 |
| - } |
| 375 | + ] |
352 | 376 | },
|
353 | 377 | {
|
354 | 378 | "cell_type": "code",
|
355 | 379 | "execution_count": null,
|
| 380 | + "metadata": { |
| 381 | + "collapsed": false, |
| 382 | + "jupyter": { |
| 383 | + "outputs_hidden": false |
| 384 | + } |
| 385 | + }, |
356 | 386 | "outputs": [],
|
357 | 387 | "source": [
|
358 | 388 | "# Predict the response Y (sex) based on the test set\n",
|
|
373 | 403 | "print('Specificity', round(tn/(tn+fp), 3))\n",
|
374 | 404 | "print('F1 Score', round(f1_score(y_test, np.where(y_pred > 0.5, 1, 0)), 3))\n",
|
375 | 405 | "print('AUC', round(roc_auc_score(y_test, np.where(y_pred > 0.5, 1, 0)), 3))"
|
376 |
| - ], |
377 |
| - "metadata": { |
378 |
| - "collapsed": false |
379 |
| - } |
| 406 | + ] |
380 | 407 | },
|
381 | 408 | {
|
382 | 409 | "cell_type": "markdown",
|
|
464 | 491 | {
|
465 | 492 | "cell_type": "code",
|
466 | 493 | "execution_count": null,
|
| 494 | + "metadata": { |
| 495 | + "collapsed": false, |
| 496 | + "jupyter": { |
| 497 | + "outputs_hidden": false |
| 498 | + } |
| 499 | + }, |
467 | 500 | "outputs": [],
|
468 | 501 | "source": [
|
469 | 502 | "pls_da.scree_plot(x_train_log, y_train, total_comps=10)"
|
470 |
| - ], |
471 |
| - "metadata": { |
472 |
| - "collapsed": false |
473 |
| - } |
| 503 | + ] |
474 | 504 | },
|
475 | 505 | {
|
476 | 506 | "cell_type": "markdown",
|
|
611 | 641 | },
|
612 | 642 | {
|
613 | 643 | "cell_type": "markdown",
|
| 644 | + "metadata": { |
| 645 | + "collapsed": false, |
| 646 | + "jupyter": { |
| 647 | + "outputs_hidden": false |
| 648 | + } |
| 649 | + }, |
614 | 650 | "source": [
|
615 | 651 | "Now we can assess the model performance using test set samples which were not used during model fitting or cross-validation. Compare the confusion matrices and classification metrics of the model fitted with 2 latent variables and the model fitted with 4 latent variables."
|
616 |
| - ], |
617 |
| - "metadata": { |
618 |
| - "collapsed": false |
619 |
| - } |
| 652 | + ] |
620 | 653 | },
|
621 | 654 | {
|
622 | 655 | "cell_type": "code",
|
623 | 656 | "execution_count": null,
|
| 657 | + "metadata": { |
| 658 | + "collapsed": false, |
| 659 | + "jupyter": { |
| 660 | + "outputs_hidden": false |
| 661 | + } |
| 662 | + }, |
624 | 663 | "outputs": [],
|
625 | 664 | "source": [
|
626 | 665 | "# Predict the response Y (sex) based on the test set\n",
|
|
641 | 680 | "print('Specificity', round(tn/(tn+fp), 3))\n",
|
642 | 681 | "print('F1 Score', round(f1_score(y_test, np.where(y_pred > 0.5, 1, 0)), 3))\n",
|
643 | 682 | "print('AUC', round(roc_auc_score(y_test, np.where(y_pred > 0.5, 1, 0)), 3))"
|
644 |
| - ], |
645 |
| - "metadata": { |
646 |
| - "collapsed": false |
647 |
| - } |
| 683 | + ] |
648 | 684 | },
|
649 | 685 | {
|
650 | 686 | "cell_type": "markdown",
|
|
1070 | 1106 | {
|
1071 | 1107 | "cell_type": "code",
|
1072 | 1108 | "execution_count": null,
|
1073 |
| - "outputs": [], |
1074 |
| - "source": [], |
1075 | 1109 | "metadata": {
|
1076 |
| - "collapsed": false |
1077 |
| - } |
| 1110 | + "collapsed": false, |
| 1111 | + "jupyter": { |
| 1112 | + "outputs_hidden": false |
| 1113 | + } |
| 1114 | + }, |
| 1115 | + "outputs": [], |
| 1116 | + "source": [] |
1078 | 1117 | }
|
1079 | 1118 | ],
|
1080 | 1119 | "metadata": {
|
|
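Review note on the metrics cells: the markdown cell describes the ROC curve as TPR against FPR "at various threshold settings", while both metrics cells pass `np.where(y_pred > 0.5, 1, 0)` to `roc_auc_score`. The sketch below is a minimal pure-Python illustration of what that thresholding does to the AUC, assuming `y_pred` holds continuous scores. The toy values and the helper names `confusion_counts` and `rank_auc` are hypothetical and do not come from the notebook.

```python
def confusion_counts(y_true, y_label):
    """Return (tn, fp, fn, tp) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_label))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def rank_auc(y_true, y_score):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true classes and continuous model scores
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.8, 0.9]

# Same 0.5 cut-off as in the notebook cells
y_label = [1 if s > 0.5 else 0 for s in y_score]

tn, fp, fn, tp = confusion_counts(y_true, y_label)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # sensitivity, i.e. TPR
specificity = tn / (tn + fp)       # 1 - FPR
f1 = 2 * precision * recall / (precision + recall)

auc_from_scores = rank_auc(y_true, y_score)   # uses every threshold
auc_from_labels = rank_auc(y_true, y_label)   # one threshold only

print('F1 Score', round(f1, 3))                      # 0.75
print('AUC (continuous scores)', auc_from_scores)    # 0.8125
print('AUC (thresholded labels)', auc_from_labels)   # 0.75
```

On these toy values the AUC computed from thresholded labels equals the mean of sensitivity and specificity, because the ROC curve collapses to a single operating point. If the notebook's `y_pred` is indeed continuous, passing it to `roc_auc_score` without thresholding may be closer to the definition given in the markdown cell.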