Merge pull request #3 from Virgula0/interceptor_improvements
Interceptor improvements
Virgula0 authored Jan 16, 2025
2 parents fb04d51 + e6096e0 commit dece392
Showing 17 changed files with 69,231 additions and 53,203 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/markdown-converter.yaml
@@ -18,7 +18,7 @@ jobs:
steps:
- name: Checkout Repository
uses: actions/checkout@v4

- uses: awalsh128/cache-apt-pkgs-action@latest
with:
packages: pandoc texlive-xetex python3 python3-pip
@@ -131,3 +131,5 @@ jobs:
with:
commit_message: "Generate PDF"
file_pattern: docs/docs.pdf
skip_checkout: true
push_options: --force
141 changes: 75 additions & 66 deletions README.md
@@ -26,6 +26,7 @@ In the project the following files have the described functions:
- `classifiers.py` -> defines a big list of classifiers for `algo_chooser.py`
- `noiser.py` -> helper script for sending normal http requests
- `detector.py` -> runs a real-time demo of the model using previously mentioned scripts internally
- `export_model.py` -> utility for exporting the model
- `model` directory -> contains exported model
- `merger.py` -> merges 2 datasets in one single dataset
- `datasets/delayed/merged.csv` -> contains an additional dataset for calculating accuracy and other stats
@@ -68,31 +69,30 @@ Understanding TCP connections is important for building the dataset. When a TCP
## Example Dataset Row (normal TCP scan on a closed port 3306)

```csv
start_request_time,end_request_time,start_response_time,end_response_time,duration,src_ip,dst_ip,src_port,dst_port,SYN,ACK,FIN,RST,URG,PSH,label
2025-01-08 13:52:55.274814,2025-01-08 13:52:55.274814,2025-01-08 13:52:55.274874,2025-01-08 13:52:55.274874,6e-05,"['172.31.0.1', '172.31.0.2']","['172.31.0.1', '172.31.0.2']","['44031', '3306']","['44031', '3306']",1,1,0,1,0,0,1
2025-01-15 12:49:08.025898,2025-01-15 12:49:08.025898,2025-01-15 12:49:08.025946,2025-01-15 12:49:08.025946,4.8e-05,"['172.31.0.2', '172.31.0.1']","['172.31.0.2', '172.31.0.1']","['52666', '22']","['52666', '22']",1,1,0,1,0,0,1
```
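
The `duration` column above appears to be `end_response_time - start_request_time` (6e-05 s in the first row). A minimal sketch that recomputes it, assuming exactly that definition:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def compute_duration(start_request_time: str, end_response_time: str) -> float:
    """Recompute the duration feature from the raw session timestamps."""
    start = datetime.strptime(start_request_time, FMT)
    end = datetime.strptime(end_response_time, FMT)
    return (end - start).total_seconds()

# First example row above: prints 6e-05
print(compute_duration("2025-01-08 13:52:55.274814", "2025-01-08 13:52:55.274874"))
```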

- Sessions are grouped using the `src_ip`, `dst_ip`, `src_port`, and `dst_port` tuple as keys. However, these grouping keys are excluded from the model's training phase.
- Sessions are grouped using the `src_port` and `dst_port` tuple as keys. However, these grouping keys are excluded from the model's training phase (see the sketch after this list).

- The `duration` feature provides valuable information for distinguishing between legitimate traffic and `NMAP` scans, as legitimate HTTP requests may exhibit similar flag behaviour but differ in timing.

- The session window in `interceptor.py` is set to **0.5 seconds** by default, as this is typically enough to capture an `NMAP` scan attempt.
- The session window in `interceptor.py` is set to **0.5 seconds** by default, as this is typically enough to capture an `NMAP` scan attempt on a single port.
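
As a rough illustration of this grouping, here is a hedged sketch (names and structure are assumptions, not the actual `interceptor.py` code) that buckets packets into sessions keyed by the port tuple and closes a session once the 0.5-second window has elapsed:

```python
from typing import Dict, List, Tuple

SESSION_WINDOW = 0.5  # seconds, mirroring the interceptor.py default

Packet = Tuple[float, int, int, str]  # (timestamp, src_port, dst_port, flags)

def group_sessions(packets: List[Packet]) -> List[List[Packet]]:
    """Group packets into sessions keyed by (src_port, dst_port)."""
    open_sessions: Dict[Tuple[int, int], List[Packet]] = {}
    closed: List[List[Packet]] = []
    for pkt in sorted(packets):
        ts, src_port, dst_port, _flags = pkt
        key = (src_port, dst_port)
        session = open_sessions.get(key)
        # Close the session when the window since its first packet expires.
        if session and ts - session[0][0] > SESSION_WINDOW:
            closed.append(open_sessions.pop(key))
        open_sessions.setdefault(key, []).append(pkt)
    closed.extend(open_sessions.values())
    return closed
```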

More technical explanations are given in the comments in `interceptor.py`. The script can take a while to write all the data successfully when many requests are performed.
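
Since `pyshark` is the project's capture dependency (see the external dependencies below), the capture loop conceptually looks like this hedged sketch (the interface name and field access are assumptions, not the actual `interceptor.py` implementation):

```python
import pyshark

# Capture only TCP traffic; the interface name is an assumption.
capture = pyshark.LiveCapture(interface="eth0", bpf_filter="tcp")

for packet in capture.sniff_continuously():
    tcp = packet.tcp
    # tshark exposes each TCP flag as its own boolean field.
    flags = {
        "SYN": tcp.flags_syn,
        "ACK": tcp.flags_ack,
        "FIN": tcp.flags_fin,
        "RST": tcp.flags_reset,
        "URG": tcp.flags_urg,
        "PSH": tcp.flags_push,
    }
    print(packet.sniff_time, tcp.srcport, tcp.dstport, flags)
```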

> [!NOTE]
> During the data collection, some ports were opened intentionally on the host to differentiate some rows in the dataset. For example, an HTTP server on port 1234 was opened using `python3 -m http.server 1234`, in addition to other ports in the range 0-5000 that had already been opened by other services.
> During the data collection, some ports were opened intentionally on the host to differentiate some rows in the dataset. For example, an HTTP server on port `1234` was opened using `python3 -m http.server 1234`, in addition to other ports in the range 0-5000 that had already been opened by other services.
## Common `NMAP` Scans

The following commands were run from the `traffic_generator` container while `sudo python3 interceptor.py` was running locally on the host.

```bash
nmap -sT 172.31.0.1 -p 0-5000 # TCP Scan
nmap -sS 172.31.0.1 -p 0-5000 # Stealth Scan
nmap -sF 172.31.0.1 -p 0-5000 # FIN Scan
nmap -sN 172.31.0.1 -p 0-5000 # NULL Scan
nmap -sX 172.31.0.1 -p 0-5000 # XMAS Scan
nmap -sT 172.31.0.1 -p 0-2500 # TCP Scan
nmap -sS 172.31.0.1 -p 0-2500 # Stealth Scan
nmap -sF 172.31.0.1 -p 0-2500 # FIN Scan
nmap -sN 172.31.0.1 -p 0-2500 # NULL Scan
nmap -sX 172.31.0.1 -p 0-2500 # XMAS Scan
```
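
For reference, these scan types differ mainly in which TCP flags their probes set (standard `NMAP` behaviour), which is exactly what the flag features in the dataset capture. Summarised as a small Python lookup:

```python
# TCP flags set in the probe packets of each scan type (standard NMAP behaviour).
SCAN_FLAGS = {
    "-sT": "SYN, then completes the full three-way handshake",
    "-sS": "SYN only, never completes the handshake (stealth)",
    "-sF": "FIN only",
    "-sN": "no flags at all",
    "-sX": "FIN + PSH + URG (the 'christmas tree' packet)",
}
```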

The result is the creation of `bad.csv`
@@ -110,9 +110,9 @@ The final dataset consists of a merge (`merged.csv`) used for training the model

The `XGBClassifier` was selected as the final model due to its reliable performance in key areas:

1. High accuracy score (`~0.95`)
2. Fast prediction speed (`~3ms` on average for 15,000 rows)
3. High MCC score (`~0.91`)
1. High accuracy score (`~0.99`)
2. Fast prediction speed (`~4ms` on average for `24,511` rows)
3. High MCC score (`~0.98`)
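
A minimal sketch of how numbers like these can be reproduced, mirroring the metrics that `algo_chooser.py` reports (the test-split size is an assumption):

```python
import time

import pandas as pd
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("./datasets/train/merged.csv")
X = df[["duration", "SYN", "ACK", "FIN", "RST", "URG", "PSH"]]
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

clf = XGBClassifier(n_estimators=210)
clf.fit(X_train, y_train)

start = time.perf_counter()
predictions = clf.predict(X_test)
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"Accuracy: {accuracy_score(y_test, predictions):.4f}, "
      f"MCC: {matthews_corrcoef(y_test, predictions):.6f}, "
      f"Prediction time: {elapsed_ms:.0f}ms")
```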

## Why is the accuracy metric important?

@@ -124,52 +124,50 @@ MCC should normally be preferred when unbalanced datasets are present. This is n
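
For reference, MCC combines all four confusion-matrix cells into a single score in $[-1, 1]$:

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$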

## Why is the prediction speed so important?

The prediction speed played a significant role in choosing this model, as it allows efficient analysis of large volumes of network traffic in real time. The `RandomForestClassifier` is similar in accuracy (sometimes maybe even better), but its average prediction time of `~15ms` is slower than the `~3ms` of `XGBClassifier`. Needless to say, even though `DeepSVDD` predicts in `1ms`, its low accuracy rate rules it out entirely.
The prediction speed played a significant role in choosing this model, as it allows efficient analysis of large volumes of network traffic in real time. The `RandomForestClassifier` is similar in accuracy (sometimes maybe even better), but its average prediction time of `~23ms` is slower than the `~4ms` of `XGBClassifier`. Needless to say, even though `DeepSVDD` predicts in `1ms`, its low accuracy rate rules it out entirely.

## Model Performance

```
Dataset loaded with 15192 records.
Dataset loaded with 24511 records.
Dataset preprocessed successfully.
+------------+-----+-----+-----+-----+-----+-----+-------+
| duration | SYN | ACK | FIN | RST | URG | PSH | label |
+------------+-----+-----+-----+-----+-----+-----+-------+
| 0.000030 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000013 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000012 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000010 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000011 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
+------------+-----+-----+-----+-----+-----+-----+-------+
+------------+-----+-----+-----+-----+-----+-----+-------+
| Duration | SYN | ACK | FIN | RST | URG | PSH | Label |
+------------+-----+-----+-----+-----+-----+-----+-------+
| 0.000048 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000016 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000015 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000014 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000015 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
+------------+-----+-----+-----+-----+-----+-----+-------+
Dataset split into training and testing sets.
RandomForestClassifier (n_estimators=210): Accuracy: 0.9566, Train time: 411ms, Prediction time: 16ms, MCC: 0.914073, TP: 744, TN: 710, FN: 16, FP: 50
KNeighborsClassifier (n_estimators=N/A): Accuracy: 0.9910, Train time: 13ms, Prediction time: 271ms, MCC: 0.982114, TP: 1238, TN: 1192, FN: 18, FP: 4
....
XGBClassifier (n_estimators=1): Accuracy: 0.9572, Train time: 9ms, Prediction time: 2ms, MCC: 0.915337, TP: 744, TN: 711, FN: 16, FP: 49
RandomForestClassifier (n_estimators=210): Accuracy: 0.9902, Train time: 650ms, Prediction time: 23ms, MCC: 0.980464, TP: 1238, TN: 1190, FN: 18, FP: 6
....
XGBClassifier (n_estimators=210): Accuracy: 0.9910, Train time: 86ms, Prediction time: 4ms, MCC: 0.982114, TP: 1238, TN: 1192, FN: 18, FP: 4
....
DeepSVDD (n_estimators=N/A): Accuracy: 0.3868, Train time: 14032ms, Prediction time: 1ms, MCC: -0.348550, TP: 5, TN: 583, FN: 755, FP: 177
DeepSVDD (n_estimators=N/A): Accuracy: 0.6970, Train time: 22739ms, Prediction time: 1ms, MCC: 0.492361, TP: 526, TN: 1183, FN: 730, FP: 13
....
Best Classifier based on Accuracy
Classifier: XGBClassifier
n_estimators: 1
Accuracy Score: 0.9572
n_estimators: 210
Accuracy Score: 0.9910
Best Classifier based on MCC
Classifier: XGBClassifier
n_estimators: 1
MCC Score: 0.915337
n_estimators: 210
MCC Score: 0.982114
Best Classifier based on prediction time
Classifier: DeepSVDD
Time : 1.000000ms
```

---

## How the Training Dataset was created (detailed)

The training dataset, `datasets/train/merged.csv`, is generated using the following steps:
@@ -197,11 +195,11 @@ The training dataset, `datasets/train/merged.csv`, is generated using the following steps:
- `label`: `0` for legitimate traffic, `1` for `NMAP` scans
4. Run `NMAP` scans from the container:
```bash
nmap -sT 172.31.0.1 -p 0-5000
nmap -sS 172.31.0.1 -p 0-5000
nmap -sF 172.31.0.1 -p 0-5000
nmap -sN 172.31.0.1 -p 0-5000
nmap -sX 172.31.0.1 -p 0-5000
nmap -sT 172.31.0.1 -p 0-2500
nmap -sS 172.31.0.1 -p 0-2500
nmap -sF 172.31.0.1 -p 0-2500
nmap -sN 172.31.0.1 -p 0-2500
nmap -sX 172.31.0.1 -p 0-2500
```
5. Run noise traffic (legitimate requests) from the container:
```bash
@@ -217,49 +215,62 @@ The training dataset, `datasets/train/merged.csv`, is generated using the following steps:
```bash
python3 algo_chooser.py
```
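
For reference, the earlier merge step handled by `merger.py` conceptually amounts to concatenating the two captures. A hedged sketch (the `good.csv` name for the legitimate-traffic capture is an assumption):

```python
import pandas as pd

# Assumed inputs: bad.csv (NMAP scans) and good.csv (legitimate traffic).
bad = pd.read_csv("./datasets/train/bad.csv")
good = pd.read_csv("./datasets/train/good.csv")

merged = pd.concat([bad, good], ignore_index=True)
# Shuffle so the two labels are not stored in contiguous blocks.
merged = merged.sample(frac=1, random_state=42).reset_index(drop=True)
merged.to_csv("./datasets/train/merged.csv", index=False)
```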
---

## Delayed Dataset
## Delayed Dataset (Making things harder)

A delayed dataset can be created by introducing delays between requests:

```bash
nmap -p 1-10000 --scan-delay 1s 172.31.0.1
nmap -p 1-5000 --scan-delay 1s 172.31.0.1
```

You can also adjust the delay in legitimate requests by modifying `SLEEP_SECOND` in `noiser.py`.
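
As a rough sketch of what `noiser.py` does (the request loop below is an assumption; only `SLEEP_SECOND` and the script's purpose come from the repository):

```python
import time
import urllib.request

SLEEP_SECOND = 1  # delay between legitimate requests; tune this to vary the dataset

# Hypothetical target: one of the HTTP ports opened during data collection.
TARGET = "http://172.31.0.1:1234/"

while True:
    try:
        urllib.request.urlopen(TARGET, timeout=2).read()
    except OSError:
        pass  # refused connections are expected while ports change
    time.sleep(SLEEP_SECOND)
```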

Results confirm the choice of `XGBClassifier`.
With this dataset, the results are a little different and worse.

The reasons why this happens are the following:

1. Here we have less data, since it takes some hours to construct this dataset.
2. We added a scan delay that introduces a one-second delay between each `NMAP` attempt.
3. The attack type used by `NMAP` with the above command is a `Stealth Attack` by default, so the flags of normal `HTTP` requests and
`Stealth Attacks` are practically the same; the only usable information is the duration (the backup feature introduced for these situations, when distinguishing anomalous from normal packets using flags is impossible).

Given the 3 points above, and relying only on the duration feature in this kind of situation, an accuracy of `~90%` seems quite reasonable.

We still prefer `XGBClassifier`, for the same reasons discussed for the training dataset.

```
Dataset loaded with 11351 records.
Dataset preprocessed successfully.
+------------+-----+-----+-----+-----+-----+-----+-------+
| duration | SYN | ACK | FIN | RST | URG | PSH | label |
+------------+-----+-----+-----+-----+-----+-----+-------+
| 0.000060 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000068 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000062 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000057 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000074 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
+------------+-----+-----+-----+-----+-----+-----+-------+
Dataset loaded with 10000 records.
Dataset preprocessed successfully.
+------------+-----+-----+-----+-----+-----+-----+-------+
| duration | SYN | ACK | FIN | RST | URG | PSH | label |
+------------+-----+-----+-----+-----+-----+-----+-------+
| 0.000067 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000051 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000062 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000037 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0.000042 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
+------------+-----+-----+-----+-----+-----+-----+-------+
Dataset split into training and testing sets.
RandomForestClassifier (n_estimators=210): Accuracy: 1.0000, Train time: 327ms, Prediction time: 11ms, MCC: 1.000000, TP: 752, TN: 384, FN: 0, FP: 0
RandomForestClassifier (n_estimators=210): Accuracy: 0.8940, Train time: 483ms, Prediction time: 20ms, MCC: 0.789968, TP: 436, TN: 458, FN: 70, FP: 36
....
XGBClassifier (n_estimators=210): Accuracy: 1.0000, Train time: 28ms, Prediction time: 3ms, MCC: 1.000000, TP: 752, TN: 384, FN: 0, FP: 0
XGBClassifier (n_estimators=210): Accuracy: 0.8940, Train time: 52ms, Prediction time: 4ms, MCC: 0.789968, TP: 436, TN: 458, FN: 70, FP: 36
....
DeepSVDD (n_estimators=N/A): Accuracy: 0.4833, Train time: 10273ms, Prediction time: 1ms, MCC: 0.270419, TP: 173, TN: 376, FN: 579, FP: 8
Best Classifier based on Accuracy
Classifier: XGBClassifier
n_estimators: 210
Accuracy Score: 1.0000
Classifier: KNeighborsClassifier
n_estimators: N/A
Accuracy Score: 0.8990
Best Classifier based on MCC
Classifier: XGBClassifier
n_estimators: 210
MCC Score: 1.000000
Classifier: KNeighborsClassifier
n_estimators: N/A
MCC Score: 0.798304
Best Classifier based on prediction time
Classifier: DeepSVDD
@@ -284,8 +295,7 @@ sudo python3 detector.py

> [!TIP]
> When running the script, a log file called `logs`, containing all events, is created in the main project directory.
---
> When running the script, some other connections directed to the localhost interface may be collected in the process.
---
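
A hedged sketch of the real-time classification step inside `detector.py` (loading the exported model and scoring one session's feature row; the exact feature assembly is an assumption):

```python
import joblib
import pandas as pd

model = joblib.load("./model/model.pkl")

# One captured session, in the same feature order used for training.
features = pd.DataFrame(
    [[0.000048, 1, 1, 0, 1, 0, 0]],
    columns=["duration", "SYN", "ACK", "FIN", "RST", "URG", "PSH"],
)

label = int(model.predict(features)[0])
print("NMAP scan detected!" if label == 1 else "Legitimate traffic")
```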

@@ -302,5 +312,4 @@ https://github.com/user-attachments/assets/f10773c6-742e-4394-913e-42beb0cc3683

# External Dependencies

- `pyshark`
- `python-nmap`
- `pyshark`
5 changes: 1 addition & 4 deletions algo_chooser.py
@@ -1,5 +1,4 @@
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

@@ -13,7 +12,7 @@
# Main script
if __name__ == "__main__":
# Load Dataset
df = pd.read_csv('./datasets/train/merged.csv')
df = pd.read_csv('./datasets/delayed/merged.csv')
print(f"Dataset loaded with {len(df)} records.")

# Preprocess Dataset
@@ -132,5 +131,3 @@
# Export the model, preferring the best-accuracy model over best_mcc
# The dump is commented out because XGBClassifier does not provide the best accuracy every time
# The reason why it has been chosen over other models can be found in README.md

# joblib.dump(best_acc_clf, "./model/model.pkl")
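# For reference, a hedged sketch of what the separate export utility
# (export_model.py) might look like. The feature columns and the
# ./model/model.pkl path come from this repository; everything else,
# including the n_estimators value, is an assumption:
#
#   import joblib
#   import pandas as pd
#   from xgboost import XGBClassifier
#
#   df = pd.read_csv('./datasets/train/merged.csv')
#   X = df[['duration', 'SYN', 'ACK', 'FIN', 'RST', 'URG', 'PSH']]
#   clf = XGBClassifier(n_estimators=210).fit(X, df['label'])
#   joblib.dump(clf, './model/model.pkl')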
