Codes

CODE 0

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

plt.style.use(‘winter’)
pd.set_option(‘max_columns’, None)

CODE 1

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy=‘mean’)
imp.fit_transform(X)

General

Import Stuff using [CODE0]
Train Test split
Create and get error of a baseline model (like average or median)

Data Exploration

Open the dataset
Data Understanding
- DataFrame .shape
- .head() and .tail()
- .dtypes
- .describe()
- .info()
Data Preparation
- Drop irrelevant columns and rows
  - use .columns to obtain list of columns
  - copy and paste column list
  - Remove unnecessary columns
  - Set a variable to be the dataset indexed with the remaining columns, add .copy() at the end
  - you can also use .drop([‘cols you wanna drop’], axis=1, inplace=True)
- change necessary column type, using dataset[column] = pd.to_TheFinalDataType(dataset[column])
  - You can use .to_numeric, .to_datetime, etc.
- Identify duplicate columns
  - use .duplicated()
  - use .loc[.duplicated()] to look at all duplicate rows
  - use .duplicated().sum() to get number of duplicated rows
  - use .duplicated(subset=[]) to get duplicated of specific columns
  - To remove duplicated use slicing with ~ in front for inverse
  - use .reset_index(drop=True) to reset index
- Rename Columns
  - To rename, .rename(columns={’old_name’: ‘new_name’}, inplace=True)
- Feature Creation
- Remove or do some other action to NaN values
  - Using pandas
    - Identify using .isna()
    - Get total using .isna().sum()
    - Replace using .fillna(value or mean, median, mode, etc)
  - Using Scikit-Learn
    - Replace using mean, most_frequent or constant for strategy, use [CODE1]
Feature Understanding (Univariate Analysis) use ax = to save any plot into a matplotlib plot
- Value Count use [column].value_counts() to see number of unique values and their count
- Histogram, use .hist(title=“Plot0”) or .plot(kind=‘hist’) on any pd df
- KDE, use .plot(kind=‘kde’)
- Bar Graph, use .plot(kind=‘bar’, title=“Plot1”) or barh instead of bar for horizontal plot
- Box Plot
Feature Relationship
- Scatter Plot
  - For basic, use .plot(kind=‘scatter’, x=’column’, y=’column too’)
  - For Pro, use sns.scatterplot(data=df, x=‘column’, y=‘column 2’), you can use hue= property to use other columns as the color
- Pair Plot
  - Use sns.pairplot(data=df, vars=[a very long array of stuff])
  - Use sns.pairplot(data=df, vars_x=[…], vars_y=[…]) to get only specific stuff
  - use can also use hue= for this also.
- Correlation, use .corr(), you can pass in the type of [pearson, kendall, spearman or callable]
  - Use sns.heatmap() to get better visualization use annot=True to see the values inside

Feature Selection + Data Preparation

Use sklearn.compose.make_column_transformer to apply feature selection for individual columns.
Use sklearn.pipeline.make_pipeline to make pipelines to automate stuff. Column transform can be added to it as a process.
Dimensionality Reduction
- PCA: sklearn.decomposition.PCA
- t-SNE: sklean.mainfold.TSNE
- UMAP: gotta install umap-learn, umap.umap_ as umap and use as if it was t-SNE or PCA
Scaling
- Normalizer: sklearn.preprocessing.Normalizer
- StandardSalar: sklearn.preprocessing.StandardScalar
- MinMaxSalar: sklearn.preprocessing.MinMaxScalar
- RobustScalar: sklean.preprocessing.RobustScalar
Feature expansion:
- PolynomialFeatures: sklearn.preprocessing.PolynomialFeatures
Use Categorical Encoding
- One Hot Encoding: sklearn.preprocessing.OneHotEncoder
- Ordinal Encoding: sklearn.preprocessing.OrdinalEncoder
- LabelEncoder (similar to ordinal but only for target): sklearn.preprocessing.Labelencoder
- Embeddings
Missing Values
- SimpleImputer: sklearn.impute.SimpleImputer
- KNNImputer: sklearn.impute.KNNImputer
Data Transformation / Mapping
- KBinsDiscretizer: sklearn.preprocessing.KBinsDiscretizer
- FunctionTransformer: sklearn.preprocessing.FunctionTransformer
- Binarizer (Thresholding (turns numbers into 0/1)): sklean.preprocessing.Binarizer
Clustering
- KMeans: use sklearn.cluster.KMeans, to create new feature, it would be a categorical feature. KMeans takes in n_clusters. Use dummies with this to get final features.
Text Transforms
- CountVectorizer (bag of words): sklearn.feature_extraction.text.CountVectorizer
- Tf-IDF Vectorizer: sklearn.feature_extraction.text.TfidfVectorizer
- tfidf and bow: sklearn.feature_extraction.text.TfidfTransformer
- Hashing Vectorizer: sklearn.feature_extraction.text.HashingVectorizer
Feature Selection
- Keep the top k features: sklearn.feature_selection.SelectKBest() / sklearn.feature_selection.SelectPercentile()
- Recursive feature elimination: sklearn.feature_selection.RFE() / sklearn.feature_selection.RFECV()
- concatenate all the features obtained previously to get a new dataset. For concatenation use pd.concat([], axis=1)

Other very cool stuff

sklearn.compose.make_column_selector: Automatically select numerical/categorical columns.
sklearn.compose.make_column_selector: Automatically select columns for column transformer.
[any model].check_is_fitted(): check if the model is trained.
sklearn.model_selection.cross_val_score: cross validation in one line.

Model Selection

Use this flow chart ![[ml_map.svg]]

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Codes

General

Data Exploration

Feature Selection + Data Preparation

Other very cool stuff

Model Selection

About

Uh oh!

Releases

Packages

kargneq/ML-PROCESSES

Folders and files

Latest commit

History

Repository files navigation

Codes

General

Data Exploration

Feature Selection + Data Preparation

Other very cool stuff

Model Selection

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages