
Commit 972f87a

Add files via upload
0 parents · commit 972f87a

Showing 16 changed files with 58,323 additions and 0 deletions.

DataAugmentation/bird.jpg

52.3 KB

DataAugmentation/featureEngineering.ipynb

+401
Large diffs are not rendered by default.

RF&DT/RF&DT.ipynb

+1
Large diffs are not rendered by default.

RF&DT/randomForest.ipynb

+369
Large diffs are not rendered by default.

RF&DT/summer-products-with-rating-and-performance_2020-08.csv

+1,576
Large diffs are not rendered by default.

SVMkernels/KNN.ipynb

+52
@@ -0,0 +1,52 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus Assignments\n",
    "<ul>\n",
    "<li> What are the disadvantages of the KNN classifier?</li>\n",
    "1. Does not work well with large datasets:\n",
    "In large datasets, the cost of calculating the distance between the new point and every existing point is huge, which degrades the performance of the algorithm.\n",
    "\n",
    "2. Does not work well with high dimensions:\n",
    "The KNN algorithm does not work well with high-dimensional data because, with a large number of dimensions, it becomes difficult for the algorithm to calculate a meaningful distance in each dimension.\n",
    "\n",
    "3. Needs feature scaling:\n",
    "We need to perform feature scaling (standardization or normalization) before applying the KNN algorithm to any dataset. If we don't, KNN may generate wrong predictions.\n",
    "\n",
    "4. Sensitive to noisy data, missing values, and outliers:\n",
    "KNN is sensitive to noise in the dataset. We need to manually impute missing values and remove outliers.\n",
    "<li> How to optimize the KNN algorithm?</li>\n",
    "For a given test sample x:\n",
    "\n",
    " - find the K most similar samples from the training set, according to a similarity measure s\n",
    "\n",
    " - return the majority vote of the classes of that set\n",
    " \n",
    "Consequently, the only thing that defines KNN besides K is the similarity measure s. There is literally nothing else in this algorithm (it is three lines of pseudocode). On the other hand, finding \"the best similarity measure\" is as hard a problem as learning a classifier itself, so there is no general method for doing so; people usually end up either using something simple (Euclidean distance) or adapting s to the problem at hand with domain knowledge.\n",
    "</ul>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.7 ('base')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.7"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "5179d32cf6ec497baf3f8a3ef987cc77c5d2dc691fdde20a56316522f61a7323"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
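
The markdown cell above makes two practical points that are easy to demonstrate: KNN needs feature scaling, and the similarity measure s (the distance metric) is essentially the only design choice besides K. Below is a minimal sketch of both, assuming scikit-learn is available; the wine dataset and the parameter values are illustrative stand-ins, not part of this commit.

# Minimal KNN sketch: feature scaling matters, and the distance metric
# is the main knob besides K. Dataset and parameters are stand-ins.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Without scaling, features with large ranges dominate the Euclidean distance.
unscaled = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
unscaled.fit(X_train, y_train)

# With scaling, every feature contributes comparably to the distance.
scaled = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
)
scaled.fit(X_train, y_train)

print("unscaled accuracy:", unscaled.score(X_test, y_test))
print("scaled accuracy:", scaled.score(X_test, y_test))

Swapping metric="euclidean" for "manhattan", "cosine", or a custom callable is how the similarity measure s is adapted to a problem in practice.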

SVMkernels/SupervisedLearningModels.ipynb

+385
Large diffs are not rendered by default.

constructFeatureMatrix/data/census/test.csv

+16,282
Large diffs are not rendered by default.

constructFeatureMatrix/data/census/train.csv

+32,562
Large diffs are not rendered by default.

0 commit comments