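CODE 0
```python
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 'winter' is a colormap name, not a style sheet, so plt.style.use('winter') fails;
# 'ggplot' is a stand-in here, pick any name from plt.style.available
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)  # show every column when printing
```
CODE 1
```python
import numpy as np
from sklearn.impute import SimpleImputer

# X is the feature matrix (assumed to be defined already)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform(X)
```
- Import Stuff using [CODE0]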
- Train Test split
- Create and get error of a baseline model (like average or median)
- Open the dataset
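A minimal sketch of the train/test split and baseline steps, assuming a regression problem. The data is synthetic and `DummyRegressor` is just one convenient way to get a mean or median baseline:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): y depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: always predict the training mean (strategy="median" for the median).
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
print(f"Baseline MAE: {baseline_mae:.3f}")
```

Any real model should beat `baseline_mae` on the same test split, otherwise it has not learned anything useful.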
- Data Understanding
  - DataFrame: `.shape`, `.head()` and `.tail()`, `.dtypes`, `.describe()`, `.info()`
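The Data Understanding calls can be tried on a tiny frame. This is only a sketch; the columns and values are made up:

```python
import pandas as pd

# Hypothetical toy dataset standing in for the real one.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Oslo", "Lima", "Pune", "Oslo"],
    "income": [40000.0, 52000.0, None, 61000.0],
})

print(df.shape)       # (rows, columns)
print(df.head(2))     # first rows
print(df.tail(2))     # last rows
print(df.dtypes)      # type of each column
print(df.describe())  # summary statistics for the numeric columns
df.info()             # non-null counts and memory usage (prints directly)
```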
- Data Preparation
  - Drop irrelevant columns and rows
    - Use `.columns` to obtain the list of columns, then copy and paste that list
    - Remove the unnecessary columns from the list
    - Set a variable to the dataset indexed with the remaining columns and add `.copy()` at the end
    - You can also use `.drop(['cols you wanna drop'], axis=1, inplace=True)`
  - Change column types where necessary, using `dataset[column] = pd.to_TheFinalDataType(dataset[column])`
    - You can use `pd.to_numeric`, `pd.to_datetime`, etc.
  - Identify duplicate rows
    - Use `.duplicated()` to flag rows that repeat an earlier row
    - Use `df.loc[df.duplicated()]` to look at all duplicate rows
    - Use `.duplicated().sum()` to get the number of duplicated rows
    - Use `.duplicated(subset=[])` to check for duplicates in specific columns only
    - To remove duplicates, filter with `~` in front to invert the mask, e.g. `df.loc[~df.duplicated()]`
    - Use `.reset_index(drop=True)` to reset the index afterwards
  - Rename Columns
    - To rename, use `.rename(columns={'old_name': 'new_name'}, inplace=True)`
  - Feature Creation
  - Remove `NaN` values or handle them some other way
    - Using pandas
      - Identify using `.isna()`
      - Get the total using `.isna().sum()`
      - Replace using `.fillna(value)`, where the value can be a constant or the mean, median, mode, etc.
    - Using Scikit-Learn
      - Replace using `mean`, `most_frequent` or `constant` for `strategy`, use [CODE1]
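The Data Preparation steps chain together roughly like this. It is a sketch on a made-up frame; the column names (`id`, `price`, `date`, `notes`, `price_usd`) are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, string-typed numbers and a gap.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": ["10", "20", "20", None],
    "date": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"],
    "notes": ["a", "b", "b", "c"],
})

# Keep only the relevant columns; .copy() avoids writing into a view of `raw`.
df = raw[["id", "price", "date"]].copy()

# Convert column types.
df["price"] = pd.to_numeric(df["price"])
df["date"] = pd.to_datetime(df["date"])

# Count duplicate rows, drop them with the inverted mask, then reset the index.
n_duplicates = df.duplicated().sum()
df = df.loc[~df.duplicated()].reset_index(drop=True)

# Rename a column.
df = df.rename(columns={"price": "price_usd"})

# Count missing values and fill them with the column mean.
n_missing = df["price_usd"].isna().sum()
df["price_usd"] = df["price_usd"].fillna(df["price_usd"].mean())
```

Reassigning the result (`df = df.rename(...)`) does the same job as `inplace=True` and is easier to chain.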
- Feature Understanding (Univariate Analysis): use `ax =` to save any plot as a matplotlib Axes object
  - Value Count: use `[column].value_counts()` to see the unique values and their counts
  - Histogram: use `.plot(kind='hist', title="Plot0")` or `.hist()` on any pd df (set the title through `.plot`, since `.hist()` forwards extra keywords to matplotlib's `hist`, which has no `title` argument)
  - KDE: use `.plot(kind='kde')`
  - Bar Graph: use `.plot(kind='bar', title="Plot1")`, or `barh` instead of `bar` for a horizontal plot
  - Box Plot
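A small sketch of the univariate plots on random data. The `Agg` backend and the column names are assumptions made so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, size=100),
    "group": rng.choice(["a", "b", "c"], size=100),
})

# Unique values and how often each occurs.
counts = df["group"].value_counts()

# Histogram; keeping the Axes in `ax` allows further matplotlib tweaks.
ax = df["height"].plot(kind="hist", bins=20, title="Plot0")
ax.set_xlabel("height")
plt.close(ax.figure)

# Bar graph of the value counts.
ax_bar = counts.plot(kind="bar", title="Plot1")
plt.close(ax_bar.figure)

# Box plot of a single numeric column.
ax_box = df["height"].plot(kind="box")
plt.close(ax_box.figure)
```

In a notebook you would drop the `Agg` line and the `plt.close` calls so the figures show up inline.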
- Feature Relationship
  - Scatter Plot
    - For basic, use `.plot(kind='scatter', x='column', y='column too')`
    - For Pro, use `sns.scatterplot(data=df, x='column', y='column 2')`; you can use the `hue=` property to color points by another column
  - Pair Plot
    - Use `sns.pairplot(data=df, vars=[a very long array of stuff])`
    - Use `sns.pairplot(data=df, x_vars=[…], y_vars=[…])` to get only specific stuff
    - You can use `hue=` here as well.
  - Correlation: use `.corr()`; you can pass `method=` as `'pearson'`, `'kendall'`, `'spearman'` or a callable
    - Use `sns.heatmap()` for a better visualization; use `annot=True` to see the values inside the cells
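A sketch of the scatter and correlation steps using only pandas plotting. The data is synthetic, with `y` built to follow `x` and `z` left independent:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 3 * x + rng.normal(scale=0.1, size=200),  # strongly tied to x
    "z": rng.normal(size=200),                     # unrelated noise
})

# Basic scatter plot straight from the DataFrame.
ax = df.plot(kind="scatter", x="x", y="y")
plt.close(ax.figure)

# Correlation matrices: Pearson is the default, Spearman is rank-based.
pearson = df.corr()
spearman = df.corr(method="spearman")
print(pearson.round(2))
```

Passing `pearson` to `sns.heatmap(pearson, annot=True)` gives the annotated heatmap mentioned above.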
- Use `sklearn.compose.make_column_transformer` to apply different transformers to individual columns.
- Use `sklearn.pipeline.make_pipeline` to make pipelines to automate stuff. A column transformer can be added to it as a step.
- Dimensionality Reduction
  - PCA: `sklearn.decomposition.PCA`
  - t-SNE: `sklearn.manifold.TSNE`
  - UMAP: gotta install `umap-learn`, then `import umap.umap_ as umap` and use it as if it was t-SNE or PCA
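A minimal sketch of a column transformer inside a pipeline, plus PCA on the numeric columns. The data, the column names and the choice of `LogisticRegression` are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num_a": rng.normal(size=60),
    "num_b": rng.normal(size=60),
    "cat": rng.choice(["red", "green", "blue"], size=60),
})
y = (df["num_a"] > 0).astype(int)  # toy target

# Scale the numeric columns, one-hot encode the categorical one.
preprocess = make_column_transformer(
    (StandardScaler(), ["num_a", "num_b"]),
    (OneHotEncoder(handle_unknown="ignore"), ["cat"]),
)

# The column transformer is simply the first step of the pipeline.
model = make_pipeline(preprocess, LogisticRegression())
model.fit(df, y)
train_accuracy = model.score(df, y)

# PCA: project the two numeric columns onto one component.
pca = PCA(n_components=1)
reduced = pca.fit_transform(df[["num_a", "num_b"]])
```

`TSNE` exposes the same `fit_transform` call, so it can be swapped in for `PCA` here.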
- Scaling
  - Normalizer: `sklearn.preprocessing.Normalizer`
  - StandardScaler: `sklearn.preprocessing.StandardScaler`
  - MinMaxScaler: `sklearn.preprocessing.MinMaxScaler`
  - RobustScaler: `sklearn.preprocessing.RobustScaler`
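To see how the four scalers differ, here is a sketch on a tiny made-up matrix with one outlier:

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

# Made-up data; the 1000.0 in the second column acts as an outlier.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 1000.0]])

standardized = StandardScaler().fit_transform(X)  # per column: zero mean, unit variance
min_maxed = MinMaxScaler().fit_transform(X)       # per column: rescaled to [0, 1]
robust = RobustScaler().fit_transform(X)          # per column: minus median, divided by IQR
unit_rows = Normalizer().fit_transform(X)         # per ROW: scaled to unit L2 norm
```

Note that `Normalizer` works on rows, while the other three work on columns.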
- Feature expansion:
  - PolynomialFeatures: `sklearn.preprocessing.PolynomialFeatures`
- Use Categorical Encoding
  - One Hot Encoding: `sklearn.preprocessing.OneHotEncoder`
  - Ordinal Encoding: `sklearn.preprocessing.OrdinalEncoder`
  - LabelEncoder (similar to ordinal but only for the target): `sklearn.preprocessing.LabelEncoder`
  - Embeddings
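The three encoders side by side, as a sketch on made-up categories. The explicit `categories=` order for `OrdinalEncoder` is an assumption about what the "right" order is:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# One categorical feature as a 2D column (made-up values).
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# One column per category; the default output is sparse, so convert for viewing.
one_hot = OneHotEncoder().fit_transform(sizes).toarray()

# Integer codes that follow the given order: small=0, medium=1, large=2.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(sizes)

# LabelEncoder takes a 1D target and assigns codes in sorted order.
labels = LabelEncoder().fit_transform(["cat", "dog", "cat"])
```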
- Missing Values
  - SimpleImputer: `sklearn.impute.SimpleImputer`
  - KNNImputer: `sklearn.impute.KNNImputer`
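A sketch comparing the two imputers on a small made-up matrix; `n_neighbors=2` is an arbitrary choice:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up matrix with one gap in each column.
X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0], [5.0, np.nan]])

# Fill each gap with the mean of its column.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Fill each gap with the average of that feature over the 2 nearest rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```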
- Data Transformation / Mapping
  - KBinsDiscretizer: `sklearn.preprocessing.KBinsDiscretizer`
  - FunctionTransformer: `sklearn.preprocessing.FunctionTransformer`
  - Binarizer (thresholding, turns numbers into 0/1): `sklearn.preprocessing.Binarizer`
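A sketch of the three transformers on one made-up column. The bin count, the `np.log1p` function and the threshold are illustrative choices:

```python
import numpy as np
from sklearn.preprocessing import Binarizer, FunctionTransformer, KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Three equal-width bins over [0, 5], returned as integer-like bin ids.
binned = KBinsDiscretizer(
    n_bins=3, encode="ordinal", strategy="uniform"
).fit_transform(X)

# Wrap any function as a transformer, here log(1 + x).
logged = FunctionTransformer(np.log1p).fit_transform(X)

# Values above the threshold become 1, the rest 0.
binary = Binarizer(threshold=2.5).fit_transform(X)
```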
- Clustering
  - KMeans: use `sklearn.cluster.KMeans` to create a new feature; it would be a categorical feature. KMeans takes in `n_clusters`. Use dummies with this to get the final features.
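A sketch of turning KMeans cluster ids into dummy features. The two blobs are synthetic and `n_clusters=2` is chosen to match them:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])
df = pd.DataFrame(points, columns=["x", "y"])

# The cluster id becomes a new categorical feature.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["x", "y"]])

# One dummy column per cluster, joined back onto the original features.
cluster_dummies = pd.get_dummies(df["cluster"], prefix="cluster")
features = pd.concat([df[["x", "y"]], cluster_dummies], axis=1)
```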
- Text Transforms
  - CountVectorizer (bag of words): `sklearn.feature_extraction.text.CountVectorizer`
  - Tf-IDF Vectorizer: `sklearn.feature_extraction.text.TfidfVectorizer`
  - TfidfTransformer (turns bag-of-words counts into tf-idf): `sklearn.feature_extraction.text.TfidfTransformer`
  - Hashing Vectorizer: `sklearn.feature_extraction.text.HashingVectorizer`
- Feature Selection
  - Keep the top k features: `sklearn.feature_selection.SelectKBest()` / `sklearn.feature_selection.SelectPercentile()`
  - Recursive feature elimination: `sklearn.feature_selection.RFE()` / `sklearn.feature_selection.RFECV()`
  - Concatenate all the features obtained previously to get a new dataset. For concatenation use `pd.concat([], axis=1)`
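A sketch of both selection approaches on synthetic regression data. The scoring function `f_regression`, the `LinearRegression` estimator and the choice of 3 features are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, only 3 of which carry signal.
X, y = make_regression(
    n_samples=100, n_features=8, n_informative=3, noise=0.1, random_state=0
)

# Keep the 3 features with the best univariate score.
k_best = SelectKBest(score_func=f_regression, k=3)
X_k_best = k_best.fit_transform(X, y)

# Recursively drop the weakest feature until 3 remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
X_rfe = rfe.fit_transform(X, y)
print(rfe.support_)  # boolean mask of the kept features
```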
- `sklearn.compose.make_column_selector`: Automatically select numerical/categorical columns, e.g. for a column transformer.
- `sklearn.utils.validation.check_is_fitted(model)`: check if the model is trained (raises `NotFittedError` if it is not).
- `sklearn.model_selection.cross_val_score`: cross validation in one line.
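A sketch of these three helpers together. The frame, the `"number"` dtype filter and `LinearRegression` are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils.validation import check_is_fitted

# Column selection by dtype on a made-up frame.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
numeric_columns = make_column_selector(dtype_include="number")(df)
categorical_columns = make_column_selector(dtype_exclude="number")(df)

# check_is_fitted raises NotFittedError for an untrained model.
model = LinearRegression()
try:
    check_is_fitted(model)
    fitted_before = True
except NotFittedError:
    fitted_before = False

# Cross validation in one line: one score per fold (R^2 for regressors).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=50)
scores = cross_val_score(model, X, y, cv=5)

model.fit(X, y)
check_is_fitted(model)  # passes silently once the model is trained
```

A selector can also be passed directly as the column spec in `make_column_transformer`, so the column lists never need to be written out by hand.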
Use this flow chart ![[ml_map.svg]]