
Commit 6eafe42

PCA correction
Corrected the PCA and applied the correction to the Jupyter notebook
1 parent 8da6956 commit 6eafe42

10 files changed: 31713 additions & 949 deletions

.ipynb_checkpoints/Multivariate Analysis - PCA-checkpoint.ipynb

Lines changed: 1213 additions & 0 deletions
Large diffs are not rendered by default.

.ipynb_checkpoints/Multivariate Analysis - Supervised Analysis with PLS-DA-checkpoint.ipynb

Lines changed: 261 additions & 517 deletions
Large diffs are not rendered by default.

Multivariate Analysis - PCA.ipynb

Lines changed: 187 additions & 270 deletions
Large diffs are not rendered by default.

Multivariate Analysis - Supervised Analysis with PLS-DA.ipynb

Lines changed: 91 additions & 52 deletions
@@ -97,6 +97,12 @@
 },
 {
 "cell_type": "markdown",
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "source": [
 "We will now import the LC-MS data with the metadata (Y variables) and feature annotation for LC-MS.\n",
 "\n",
@@ -112,14 +118,17 @@
 "Y - represents the sex of the individudals (0: Female, 1: Male) used as the response variable for the PLS-DA model\n",
 "\n",
 "##### NB - Full data available from: [https://zenodo.org/doi/10.5281/zenodo.4053166](https://zenodo.org/doi/10.5281/zenodo.4053166)."
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "code",
 "execution_count": null,
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "outputs": [],
 "source": [
 "# Load the dataset\n",
@@ -138,35 +147,38 @@
 "# Extract the retention times and m/z to use in 2D plots of the dataset\n",
 "retention_times = np.array([x.split('_')[0] for x in variable_names], dtype='float')/60\n",
 "mz_values = np.array([x.split('_')[1][0:-3] for x in variable_names], dtype='float')"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "code",
 "execution_count": null,
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "outputs": [],
 "source": [
 "# Compare binary response to labels\n",
 "print(dementia_rpos_dataset['Gender'].value_counts())\n",
 "print(pd.DataFrame(gender_y).value_counts())"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "code",
 "execution_count": null,
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "outputs": [],
 "source": [
 "# Split data into train test\n",
 "X_train, X_test, y_train, y_test = train_test_split(x_data, gender_y, test_size=0.10, random_state=42)\n"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "markdown",
"cell_type": "markdown",
@@ -267,16 +279,19 @@
267279
{
268280
"cell_type": "code",
269281
"execution_count": null,
282+
"metadata": {
283+
"collapsed": false,
284+
"jupyter": {
285+
"outputs_hidden": false
286+
}
287+
},
270288
"outputs": [],
271289
"source": [
272290
"# # Run this cell to fit a PLS-DA model with 2 components and UV scaling\n",
273291
"# pls_da = ChemometricsPLSDA(n_components=2, x_scaler=scaling_object_uv)\n",
274292
"# pls_da.fit(X_train, y_train)\n",
275293
"# x_test_log = X_test.copy()"
276-
],
277-
"metadata": {
278-
"collapsed": false
279-
}
294+
]
280295
},
281296
{
282297
"cell_type": "code",
@@ -308,27 +323,39 @@
 },
 {
 "cell_type": "markdown",
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "source": [
 "The *plot_scores* methods from `ChemometricsPLS` and `ChemometricsPLSDA` objects share the same functionality as `ChemometricsPCA.plot_scores`. Score plot data points can be colored by levels of a continuous or discrete covariate by using the `color` argument, and setting the ```discrete``` argument to ```True``` or ```False```, accordingly). The index (row index of the data matrix **X**) of the outlying can be labeled with ```label_outliers=True``` and the plot title changed with the argument```plot_title```."
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "code",
 "execution_count": null,
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "outputs": [],
 "source": [
 "# Plot the scores\n",
 "pls_da.plot_scores(color=y_train, discrete=True, label_outliers=True, plot_title=None)"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "markdown",
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "source": [
 "It is also possible to assess the overfitting and general model performance by using \"machine learning\" metrics such as accuracy, precision, recall, f1, ROC curves and their respective area under the curve (AUC). These metrics are calculated by comparing the predicted Y values from the model with the true Y values. \n",
 "\n",
@@ -345,14 +372,17 @@
 "$$F1 = 2 * \\frac{Precision * Recall}{Precision + Recall}$$\n",
 "\n",
 "AUC is the area under the ROC curve. The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The TPR is the same as the recall, and the FPR is 1 - specificity. The AUC is a measure of how well a model can distinguish between classes. An AUC of 1 means the model can perfectly distinguish between classes, and an AUC of 0.5 means the model cannot distinguish between classes at all.\n"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "code",
 "execution_count": null,
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "outputs": [],
 "source": [
 "# Predict the response Y (sex) based on the test set\n",
@@ -373,10 +403,7 @@
 "print('Specificity', round(tn/(tn+fp), 3))\n",
 "print('F1 Score', round(f1_score(y_test, np.where(y_pred > 0.5, 1, 0)), 3))\n",
 "print('AUC', round(roc_auc_score(y_test, np.where(y_pred > 0.5, 1, 0)), 3))"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "markdown",
@@ -464,13 +491,16 @@
 {
 "cell_type": "code",
 "execution_count": null,
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "outputs": [],
 "source": [
 "pls_da.scree_plot(x_train_log, y_train, total_comps=10)"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "markdown",
@@ -611,16 +641,25 @@
 },
 {
 "cell_type": "markdown",
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "source": [
 "Now we can assess the model performance using test set samples which were not used during model fitting or cross-validation. Compare the confusion matrices and classification metrics of the model fitted with 2 latent variables and the model fitted with 4 latent variables."
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "code",
 "execution_count": null,
+"metadata": {
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
 "outputs": [],
 "source": [
 "# Predict the response Y (sex) based on the test set\n",
@@ -641,10 +680,7 @@
 "print('Specificity', round(tn/(tn+fp), 3))\n",
 "print('F1 Score', round(f1_score(y_test, np.where(y_pred > 0.5, 1, 0)), 3))\n",
 "print('AUC', round(roc_auc_score(y_test, np.where(y_pred > 0.5, 1, 0)), 3))"
-],
-"metadata": {
-"collapsed": false
-}
+]
 },
 {
 "cell_type": "markdown",
@@ -1070,11 +1106,14 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"outputs": [],
-"source": [],
 "metadata": {
-"collapsed": false
-}
+"collapsed": false,
+"jupyter": {
+"outputs_hidden": false
+}
+},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {
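
The metric cells in this diff pass `np.where(y_pred > 0.5, 1, 0)` to `roc_auc_score`, while the notebook text defines the ROC curve over "various threshold settings". Assuming `y_pred` holds continuous predictions (as the 0.5 cut-off suggests), thresholding first collapses the curve to a single operating point, so the reported AUC can differ from the one computed on the raw scores. The sketch below illustrates the difference on made-up data. `pairwise_auc`, `y_true`, and `y_score` are hypothetical names introduced only for this illustration; the helper is a plain-Python stand-in for the pairwise definition of AUC and is not part of the notebook or of scikit-learn.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is a hypothetical helper for
    illustration, not the notebook's code.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up stand-ins for y_test and a continuous y_pred (assumption)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90]

# Thresholded at 0.5 first, as in the diffed metric cells
y_label = [1 if s > 0.5 else 0 for s in y_score]
auc_thresholded = pairwise_auc(y_true, y_label)

# Computed directly on the continuous scores
auc_continuous = pairwise_auc(y_true, y_score)

print("AUC from thresholded labels:", auc_thresholded)
print("AUC from continuous scores:", auc_continuous)
```

On this toy data the thresholded value equals the mean of sensitivity and specificity at the 0.5 cut-off, whereas the score-based value reflects the full ranking. scikit-learn's `roc_auc_score` also accepts continuous scores as its second argument, so the notebook could pass `y_pred` directly if that is the intended metric.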
