"""
Jul 26 2024
"Parkinson's Disease Detection using Machine Learning - Python"
Utilized by this video tutorial : https://youtu.be/HbyN_ey-JVc?si=0SMd4H0ybejVNkVx
Codes are changed & rearranged based on my own decisions and practices.
Dataset: https://www.kaggle.com/datasets/thecansin/parkinsons-data-set
Will be also shared with the repository.
Reference: Please refer with names or Github links which about Siddardhan S, Oğuzhan Memiş (Oguzhan Memis), and the source of the dataset.
Contacts are welcomed.
CODE ORGANIZATION:
The codes are separated into 7 different cells based on 5 steps.
Read the descriptions and run the codes cell by cell. You can also run the whole code at once, if you want.
The steps are as follows:
1) Importing the data
2) EDA
3) Preprocessing by 3 options. Run 1 of them and compare the results.
4) Model training and tuning
5) Model usage
"""
#%% 1) Imports and Data
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
'''
KEY INFORMATION ABOUT THE DATASET
__________________________________
Oxford Parkinson's Disease Detection Dataset
2008-06-26
Cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering
Parkinson's is a progressive neurological disease caused by a lack of certain neurotransmitters in the brain, such as dopamine.
It harms the fine-tuned control of body movements. As it progresses, symptoms such as tremor, body stiffness and loss of balance are observed.
Dataset:
Instances (samples): 195
Attributes (features): 22 + 1 (label)
Focused on various acoustic properties of voice recordings, which are used to detect and analyze Parkinson's disease.
Voice measurements from 31 people, 23 with Parkinson's disease (PD). About 6 recordings per subject.
The main aim is to discriminate healthy=0 from PD=1 as a binary classification task.
How these measurements are performed:
Voice Recording: Subjects are typically asked to sustain a vowel sound (often 'ahhh') for several seconds.
Signal Processing: The voice recordings are then processed with specialized software that extracts these acoustic features.
MDVP: Many of these features are extracted using the Multi-Dimensional Voice Program (MDVP), a software tool for voice analysis.
Advanced Analysis: Some of the more complex measures (like RPDE, D2, DFA) require advanced mathematical analysis of the voice signal.
22 features:
MDVP:Fo(Hz) -- Average frequency of vocal fold vibration during speech. Typically 85-180 Hz for males, 165-255 Hz for females.
MDVP:Fhi(Hz) -- The highest frequency of vocal fold vibration; reduced in some PD. Usually under 500 Hz.
MDVP:Flo(Hz) -- Minimum vocal fundamental frequency. Typically above 50 Hz.
MDVP:Jitter(%) -- Variation of fundamental frequency, expressed as a percentage. Typically <1% in healthy voices; can be higher in PD, indicating voice instability.
MDVP:Jitter(Abs) -- Usually in microseconds, typically <83.2 μs for healthy voices.
MDVP:RAP -- Short-term (cycle-to-cycle) irregularity measure. Typically <0.680% in healthy voices.
MDVP:PPQ -- Another measure of frequency irregularity over a slightly longer term. Typically <0.840% in healthy voices.
Jitter:DDP -- Difference of Differences of Periods; variability of the fundamental frequency. Similar range.
MDVP:Shimmer -- Variability of the peak-to-peak amplitude, in percentage. Typically <3.810% in healthy voices, often increased in PD.
MDVP:Shimmer(dB) -- The same on a logarithmic scale. Typically <0.350 dB in healthy voices.
Shimmer:APQ3 -- Short-term (3-point) amplitude variability measure. Typically <3.070% in healthy voices.
Shimmer:APQ5 -- Similar to APQ3, but uses 5 points instead of 3. Typically <4.230% in healthy voices.
MDVP:APQ -- Longer-term amplitude variability. Typically <3.810% in healthy voices.
Shimmer:DDA -- Similar range to the other shimmer measures, typically low in healthy voices.
NHR: Noise-to-Harmonics Ratio -- Amount of noise in the signal. Higher in PD due to increased breathiness. Typically <0.190 in healthy voices.
HNR: Harmonics-to-Noise Ratio -- Inverse of NHR, measures voice clarity. Lower in PD. Typically >20 dB in healthy voices.
RPDE: Recurrence Period Density Entropy -- Quantifies the repetitiveness of a time series, i.e. the predictability of vocal fold vibrations. Ranges from 0 to 1, with higher values indicating more complexity.
D2: Correlation Dimension -- Measure of signal complexity. Can detect subtle nonlinear changes in voice. Typically between 1 and 3.
DFA: Detrended Fluctuation Analysis -- Statistical self-similarity of a signal at different time scales. Typically between 0.5 and 1, with 0.5 indicating uncorrelated noise.
spread1 -- Nonlinear measure of fundamental frequency variation. Higher values indicate more variation.
spread2 -- Complements spread1 in measuring F0 (fundamental frequency) variability.
PPE: Pitch Period Entropy -- Unpredictability of pitch periods. Typically between 0 and 1, with higher values indicating more unpredictability, as seen in PD patients.
'''
pdata = pd.read_csv("parkinsons.csv")
# Import errors can be solved in different ways. If you encounter problems, try changing the file path or the working directory the code runs in.
pd.set_option('display.float_format', '{:.6f}'.format)
pdata.head() # further observations can be done in the EDA step.
y = pdata.iloc[:, [17]].to_numpy() # labels are located in the 17th column ("status")
x = pdata.iloc[:, 1:24].to_numpy()
x = np.delete(x, 16, axis=1) # the label column is removed from the feature matrix
print(str(np.sum(y))) # 147 of 195 samples are Parkinson's. The dataset is imbalanced.
'''
The problem of imbalanced data can be addressed by various methods such as downsampling (rejection sampling), augmentation techniques, GANs, etc.
'''
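# A minimal optional sketch of one such approach: random downsampling of the
# majority (PD) class with pandas, so both classes end up with equal counts.
# This is only an illustration and is not used by the rest of the script;
# the variable name "balanced" below is introduced here just for the example.
balanced = pd.concat([
    pdata[pdata['status'] == 1].sample(n=(pdata['status'] == 0).sum(), random_state=42),
    pdata[pdata['status'] == 0]
]).sample(frac=1, random_state=42)  # shuffle the rows after concatenation
print(balanced['status'].value_counts())  # both classes now have 48 samples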
#%% 2) Exploratory Data Analysis
pdata.info() # Getting basic information about columns and numbers of non-null data
fulldescribe = pdata.describe() # To observe the basic statistics
# Plot1 : Statistics of dataset
plt.figure(figsize=(16, 12))
sns.heatmap(fulldescribe.transpose(), annot= True, cmap="Purples", fmt=".2f") # numeric format can be changed.
plt.show()
# Obtain the rows belonging to the label we want to inspect
status = pdata.sort_values(by="status", ascending=False) # Sort by the status column to see the differences.
parkinsons = pdata[pdata['status'] == 1] # Filter rows where status is 1
healthies = pdata[pdata['status'] == 0]
# Plot 2: Comparison between labels
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 10)) # to make a subplot.
sns.heatmap(parkinsons.describe().transpose(), annot= True, cmap="Purples", fmt=".2f", ax=axes[0])
axes[0].set_title('Parkinsons')
sns.heatmap(healthies.describe().transpose(), annot= True, cmap="Purples", fmt=".2f", ax=axes[1])
axes[1].set_title('Healthy')
plt.tight_layout()
plt.show()
# Plot3 : Correlation between the features
numerical_data = pdata.select_dtypes(include=[float, int]) # Select only numerical columns
correlations = numerical_data.corr() # Shows correlation among the columns, which columns are similar
plt.figure(figsize=(16, 12))
sns.heatmap(correlations, annot= True, cmap="Purples")
plt.show()
'''
We can clearly see that the scale of the features varies; in simpler terms, some features have much bigger numbers than others.
Also, some features are negatively correlated.
'''
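# To make both observations above concrete, the short optional sketch below reuses the
# "correlations" and "numerical_data" variables computed above: it lists the most
# negatively correlated feature pairs and shows how widely the feature ranges differ.
corr_pairs = correlations.unstack().sort_values()  # flatten the matrix into (feature, feature) -> r
corr_pairs = corr_pairs[corr_pairs.index.get_level_values(0) < corr_pairs.index.get_level_values(1)]  # drop self- and duplicate pairs
print(corr_pairs.head(5))  # the most negatively correlated feature pairs
print(numerical_data.max() - numerical_data.min())  # per-feature ranges differ by orders of magnitude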
#%% 3.a) Dataset with Normalization by fitting to all
# Reshape the label array to shape (195,) so that scikit-learn models work in a more compatible way
y = y.reshape(195,)
'''
If some features are much bigger than others, it will reduce the overall score.
Here the data is transformed by applying Z-score normalization, which assumes the data follows a Normal Distribution.
It helps in comparing features that have different units or scales.
'''
scale = StandardScaler()
scale.fit(x) # Adapting to our data
x_scaled = scale.transform(x) # Apply the transformation to obtain the data in new scales.
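# As a quick sanity check of the Z-score idea described above, the same transformation
# can be computed by hand as (x - column mean) / column std. This verification sketch
# is optional and not required by the rest of the cell.
x_manual = (x - x.mean(axis=0)) / x.std(axis=0)  # Z-score normalization done manually
print(np.allclose(x_manual, x_scaled))  # expected to print True (up to floating-point tolerance)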
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y,test_size=0.4 , random_state=42)
#%% 3.b) Dataset with Normalization by fitting just to training
'''
Fitting the scaler on both the training and test data introduces data leakage.
This means the model can indirectly "see" the influence of the test data on the training data: it has access to a small amount of information from the test set during training.
By using only the training statistics, you ensure that the model is evaluated purely on how well it adapts to unseen data.
Therefore, the scaler should be fit on the training set only, and then used to transform both the training and test sets.
'''
y = y.reshape(195,)
x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.4 , random_state=42)
scale = StandardScaler()
scale.fit(x_train) # Adapting to just training set
x_train = scale.transform(x_train) # Transform
x_test = scale.transform(x_test)
#%% 3.c) Dataset without Normalization
y = y.reshape(195,)
# Larger feature scales can make the kernel (dimensionality-increasing) process take much longer.
# The polynomial kernel, in particular, takes an enormously long time.
# You can try to scale down some features manually; it can affect the results in a good or a bad way.
'''
x[:, 0] = x[:, 0]/1000
x[:, 1] = x[:, 1]/1000
x[:, 2] = x[:, 2]/1000
x[:, 15] = x[:, 15]/10
x[:, 18] = x[:, 18]/10
'''
# Train-test split with randomization
x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.4 , random_state=42)
# DO NOT RUN ALL OF THEM AT THE SAME TIME; RUN ONLY ONE OF THESE 3 PREPROCESSING SECTIONS AND COMPARE THE DIFFERENCE
#%% 4) SVM with GridSearchCV
'''
The Support Vector Machine is a frequently used algorithm for both classification and regression tasks.
The main idea is finding the optimal plane that linearly separates the data points in feature space.
Kernel functions can act as a form of feature engineering, deriving additional features from the data
in order to increase the dimensionality and find a way to separate the classes linearly.
The algorithm tries to separate the data points with the decision boundary, which is a plane between the data points.
Support vectors are the points of each class that lie closest to the decision boundary; the parallel planes passing through them bound the margin.
The distance between these planes, with the decision boundary centered between them, is called the margin.
In higher dimensions these planes are called "hyperplanes", and the distance between them is calculated using normal vectors.
SVM aims to find the decision boundary that maximizes the margin between the classes.
The algorithm increases this margin by using distance calculations and adjusting the hyperplane, in order to optimize the classification result.
This is achieved through an optimization process that involves solving a convex optimization problem.
Hyperparameters:
- Kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid kernels, used to map the data into a linearly separable form.
- The C value tells the algorithm how strongly to avoid misclassifications. A high C value can cause overfitting, a low C value can cause underfitting,
  and an optimal C value adds the required amount of regularization.
- The gamma value controls how far the influence of a single training point reaches when determining the boundary:
  a low value also takes farther points into account, while a high value considers only the few closest points.
'''
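# A minimal side illustration of the ideas above, assuming the train/test split from one of
# the preprocessing cells has already been run: fit a plain linear SVC and inspect the
# pieces named in the description (support vectors, the hyperplane's normal vector w,
# and the margin width 2/||w||). This demo is independent of the grid search below.
demo_svm = svm.SVC(kernel='linear', C=1.0)
demo_svm.fit(x_train, y_train)
w = demo_svm.coef_[0]  # normal vector of the separating hyperplane (available for the linear kernel)
print("Support vectors per class:", demo_svm.n_support_)
print("Margin width (2 / ||w||):", 2 / np.linalg.norm(w))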
mymodel = svm.SVC(probability=True) # Classifier model
mymodel.get_params() # Look up which parameters can be chosen
'''
Grid search simply iterates over all of the defined parameter combinations to obtain the best-performing model.
It also uses cross-validation, and a parameter grid needs to be defined.
'''
# Define the parameter grid
param_grid = {
'C': [0.001, 0.01, 0.05, 0.1, 0.2, 0.6, 1, 8, 9, 10, 11, 12, 13],
'kernel': ['linear', 'rbf' , 'sigmoid' ],
'gamma': ['scale', 'auto', 0.01, 0.1, 0.25, 0.3, 0.33, 0.4, 0.5, 0.7, 1, 2, 10 , 15]
}
grid_search = GridSearchCV(mymodel, param_grid, cv=4, n_jobs=-1, verbose=0)
# Setting n_jobs to -1 means that the algorithm will use all available CPU cores on your machine
# Verbose setting is for observing the details of the ongoing process
grid_search.fit(x_train, y_train)
# Getting the best model
best_model = grid_search.best_estimator_
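# Optionally, the cross-validation scores gathered during the grid search can be inspected
# directly. This small sketch prints the best mean CV accuracy and the top-ranked
# parameter combinations from grid_search.cv_results_.
print(f"Best mean CV accuracy: {grid_search.best_score_:.4f}")
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results.sort_values('rank_test_score')[['params', 'mean_test_score', 'std_test_score']].head())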
# Calculate the metrics for classification
predictions = best_model.predict(x_test) # Make predictions
s1 = accuracy_score(y_test, predictions)
s2 = f1_score(y_test, predictions, average='weighted')
s3 = recall_score(y_test, predictions, average='weighted')
s4 = precision_score(y_test, predictions) # binary (positive-class) precision
print(f"Accuracy: {s1*100:.2f}%")
print(f"F1 Score: {s2*100:.2f}%")
print(f"Recall: {s3*100:.2f}%")
print(f"Precision: {s4*100:.2f}%")
# Confusion matrix
cm = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(cm)
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues') # Heatmap of confusion matrix
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix')
plt.show()
# Print the best model parameters
print("Best parameters:")
for param, value in grid_search.best_params_.items():
    print(f"{param}: {value}")
'''
Without normalization:
Best parameters:
C: 1
gamma: scale
kernel: linear
Accuracy: 85.90%
With standard normalization:
Best parameters:
C: 8 or 10
gamma: 0.25 or 0.3
kernel: rbf
Accuracy: 93.59%
'''
#%% 5) Prediction system with model import
# For real-life usage, the model can be saved in a special format.
# The model file can then be imported and used for individual predictions on new data.
# An example prediction system using our current data follows:
from joblib import dump, load
# Saving
model_file_path = "svm_parkinson.joblib"
dump(best_model, model_file_path)
print(f"Model saved to {model_file_path}")
# Importing
modell = load(model_file_path)
# Select a row. Do not forget to apply your transformation steps (the fitted scaler) to the selected row as well.
# (If you ran option 3.c without normalization, predict on the raw row instead.)
select = int(input("Please enter a number to choose a row in dataset : "))
prediction2 = modell.predict(scale.transform(x[select,:].reshape(1,-1)))
# Print the result
print(prediction2)
if prediction2[0] == 1:
    print("Parkinson")
elif prediction2[0] == 0:
    print("Healthy")
# Testing our result
if prediction2[0] == y[select]:
    print("The prediction is true")
else:
    print("The prediction is wrong")