Merge pull request #4 from Virgula0/capturer-trainer-script
Capturer trainer script
Virgula0 authored Jan 20, 2025
2 parents 71d1f31 + 5a97b2a commit 0d6e00f
Showing 11 changed files with 373 additions and 113 deletions.
65 changes: 46 additions & 19 deletions README.md
@@ -26,6 +26,7 @@ In the project the following files have the described functions:
- `algo_chooser.py` -> script for choosing the best machine learning algorithm
- `injector.py` -> injects nmap scans or runs normal http requests
- `classifiers.py` -> defines a big list of classifiers for `algo_chooser.py`
- `dataset_and_train.py` -> creates a dataset locally with bad and good data and produces a model trained on such data
- `noiser.py` -> helper script for sending normal http requests
- `detector.py` -> runs a real-time demo of the model using the previously mentioned scripts internally
- `export_model.py` -> utility for model exportation
@@ -36,9 +37,9 @@ In the project the following files have the described functions:
- `datasets/runtime` -> contains generated runtime datasets when running `detector.py`

> [!WARNING]
> The entire software is designed for running on a Linux environment and although the only changes needed for running on another operating system are interface names, some checks may fail due to different environments.
> For example: `detector.py` contains a check to run with `root` privileges, and this check can be disabled if the environment is different from Unix. To disable checks you can pass the `--no-strict-check` argument to the script.
> This disclaimer has been inserted because I noticed the usage of other operating systems during the course lectures. Running `detector.py` in a virtual environment should work because everything is set to run on the `localhost` interface, but even here, the dataset creation phases may not work because of the interfaces used by pyshark on a virtual machine. A solution can be to create another container and let the two isolated containers communicate, but this is left to the reader to investigate.
> The entire software is designed to run on a Linux environment; however, the only changes needed for running on another operating system are interface names.
> For example, Windows does not have the `lo` loopback interface, and IP addresses must be adapted.
> This disclaimer has been inserted because I noticed the usage of other operating systems during the course lectures. Running `detector.py` in a virtual environment should work because everything is set to run on the `localhost` interface.
## Requirements

@@ -47,7 +48,7 @@ Install dependencies with:
```bash
# python3.13 has been tested to have some problems when trying to install catboost
python3.11 -m venv venv && \
source venv/bin/activate && \
pip install -r requirements.txt
pip install -r requirements.txt # full requirements, needed for running algo_chooser.py too; to skip it, install requirements-minimal.txt instead
```

> [!WARNING]
@@ -85,7 +86,7 @@ start_request_time,end_request_time,start_response_time,end_response_time,durati
More technical explanations are available as comments in `interceptor.py`. The script takes a while to write all the data successfully when a lot of requests are performed.

> [!NOTE]
> During the data collection, some ports were opened intentionally on the host to differentiate some rows in the dataset. For example, an HTTP server on port `1234` has been opened using the following method: `python3 -m http.server 1234`plus, eventually other ports that had already been opened from other services between the range 0-5000.
> During the data collection, some ports were opened intentionally on the host to differentiate some rows in the dataset. For example, an HTTP server on port `1234` has been opened using the following method: `python3 -m http.server 1234`, plus, possibly, other ports that had already been opened by other services in the range 0-5000.
## Common `NMAP` Scans

@@ -210,10 +211,14 @@ The training dataset, `datasets/train/merged.csv`, is generated using the follow
```bash
cd datasets
python3 merger.py
```
7. Train the model:
7. Choose the model:
```bash
python3 algo_chooser.py
```
8. Train and export the model:
```bash
python3 export_model.py
```
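
For reference, the merge performed by `datasets/merger.py` above can be as simple as concatenating the two labeled captures and shuffling the rows. The sketch below is an assumption about what `merge_flow_datasets` might look like, not its actual code:

```python
# Hypothetical sketch of merging two labeled flow captures.
# Assumption: the real logic lives in datasets/merger.py as merge_flow_datasets.
import pandas as pd

def merge_flow_datasets(bad_csv: str, good_csv: str, out_csv: str) -> None:
    bad = pd.read_csv(bad_csv)    # flows labeled 1 (nmap scans)
    good = pd.read_csv(good_csv)  # flows labeled 0 (normal http traffic)
    merged = pd.concat([bad, good], ignore_index=True)
    # shuffle so train/test splits see both classes
    merged = merged.sample(frac=1, random_state=42).reset_index(drop=True)
    merged.to_csv(out_csv, index=False)
```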
---

## Delayed Dataset (Making things harder)
@@ -241,15 +246,7 @@ We still prefer `XGBClassifier` for the same reasons discussed for t

```
Dataset loaded with 10000 records.
+----------+-----+-----+-----+-----+-----+-----+-------+
| duration | SYN | ACK | FIN | RST | URG | PSH | label |
+----------+-----+-----+-----+-----+-----+-----+-------+
| 0.000067 | 1   | 1   | 0   | 1   | 0   | 0   | 1     |
| 0.000051 | 1   | 1   | 0   | 1   | 0   | 0   | 1     |
| 0.000062 | 1   | 1   | 0   | 1   | 0   | 0   | 1     |
| 0.000037 | 1   | 1   | 0   | 1   | 0   | 0   | 1     |
| 0.000042 | 1   | 1   | 0   | 1   | 0   | 0   | 1     |
+----------+-----+-----+-----+-----+-----+-----+-------+
....
RandomForestClassifier (n_estimators=210): Accuracy: 0.8940, Train time: 483ms, Prediction time: 20ms, MCC: 0.789968, TP: 436, TN: 458, FN: 70, FP: 36
....
XGBClassifier (n_estimators=210): Accuracy: 0.8940, Train time: 52ms, Prediction time: 4ms, MCC: 0.789968, TP: 436, TN: 458, FN: 70, FP: 36
```
@@ -276,7 +273,7 @@ Time : 1.000000ms
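
The metrics in this output follow the standard scikit-learn definitions. As a minimal sketch (not the actual `algo_chooser.py` code), one comparison line can be produced like this:

```python
# Minimal sketch of scoring one classifier in the format shown above.
# Assumption: algo_chooser.py differs in details (splits, classifier list).
import time
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef
from xgboost import XGBClassifier

def evaluate(clf, X_train, y_train, X_test, y_test):
    t0 = time.time()
    clf.fit(X_train, y_train)
    train_ms = (time.time() - t0) * 1000
    t0 = time.time()
    y_pred = clf.predict(X_test)
    pred_ms = (time.time() - t0) * 1000
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print(f"{type(clf).__name__} (n_estimators={clf.n_estimators}): "
          f"Accuracy: {accuracy_score(y_test, y_pred):.4f}, "
          f"Train time: {train_ms:.0f}ms, Prediction time: {pred_ms:.0f}ms, "
          f"MCC: {matthews_corrcoef(y_test, y_pred):.6f}, "
          f"TP: {tp}, TN: {tn}, FN: {fn}, FP: {fp}")

# evaluate(XGBClassifier(n_estimators=210), X_train, y_train, X_test, y_test)
```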
# Running the Detector

`detector.py` is the live demo for NMAP attack detection.
The host is required to have `nmap` installed.
The host is required to have `nmap` and `tshark` (Wireshark on Windows) installed.

Depending on your distro:

@@ -286,7 +283,34 @@
```bash
sudo pacman -Sy && sudo pacman -S nmap
flatpak install nmap
```

To run the detector:
You need to install `tshark`, which is required by `pyshark`:

```bash
sudo apt update -y && sudo apt install tshark -y
sudo pacman -Sy && sudo pacman -S tshark
```

At this point you need at least the minimal requirements (first create a virtual environment):

```bash
pip install -r requirements-minimal.txt
```

> [!CAUTION]
> This is important. The `duration` feature is system dependent and is calculated using time differentials on a given local system.
> Since the duration feature can be slightly different on another system because of pyshark timings, the behaviour of the pre-trained model (and so of the live demo `detector.py`) can be affected by this.
>
> A solution is to re-create the training set using `dataset_and_train.py`, or alternatively using the train dataset creation procedure already described (more complex but more precise), and then re-export the model by training it on the newly intercepted data.
> This is needed because otherwise normal HTTP requests will be recognized as anomalies due to the different duration times captured on another system.
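
As a rough illustration of why `duration` is system dependent, the sketch below shows how a flow duration could be derived from the capture timestamps already present in the dataset columns. This is an assumption about how `interceptor.py` computes it, not a copy of its code:

```python
# Hypothetical sketch: deriving the `duration` feature from capture timestamps.
# Column names (start_request_time, end_response_time) come from the dataset header.
def flow_duration(start_request_time: float, end_response_time: float) -> float:
    """Seconds between the first request packet and the last response packet."""
    return end_response_time - start_request_time

# The absolute values depend on the capturing machine (scheduler load,
# pyshark dissection overhead), so a model trained on one system may
# mis-score durations captured on another.
print(flow_duration(1737378000.000010, 1737378000.000067))  # ~5.7e-05 s
```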
Create the dataset and train the model:

```bash
sudo python3 dataset_and_train.py
```

> [!TIP]
> If you have processes running on localhost which may interfere with the capture and you notice that the `bad.csv` or `good.csv` monitor loop does not exit, you can force it to continue by pressing Ctrl+C. `bad.csv` and `good.csv` should have at least 12k data points per file.
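
As a quick sanity check on the captures, a hypothetical helper (not part of the repo) can count the rows with pandas:

```python
# Hypothetical sanity check: verify both captures reached ~12k data points.
# Paths match the datasets/runtime files used by dataset_and_train.py.
import pandas as pd

for path in ("datasets/runtime/bad.csv", "datasets/runtime/good.csv"):
    rows = len(pd.read_csv(path))
    print(f"{path}: {rows} rows", "(ok)" if rows >= 12_000 else "(too few)")
```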
```bash
sudo python3 detector.py
```
@@ -299,14 +323,17 @@ sudo python3 detector.py
> [!TIP]
> When running the script, a log file called `logs`, containing all events, is created in the main project directory.

> [!WARNING]
> Some other connections directed to the localhost interface may be collected in the process. Actually, this gives a real-world perspective of the problem.
---

# Demonstration Video

https://github.com/user-attachments/assets/f10773c6-742e-4394-913e-42beb0cc3683
https://www.youtube.com/watch?v=Nsazb0cxeR8

[![https://www.youtube.com/watch?v=Nsazb0cxeR8](https://img.youtube.com/vi/Nsazb0cxeR8/0.jpg)](https://www.youtube.com/watch?v=Nsazb0cxeR8)

# References

@@ -317,4 +344,4 @@

# External Dependencies

- `pyshark`
- `pyshark`
111 changes: 111 additions & 0 deletions dataset_and_train.py
@@ -0,0 +1,111 @@
import subprocess
import time
import threading
import os
import hashlib
import sys
from interceptor import capture_packets
from noiser import generate_noise
from utils import run_command, OS_INSTALLATION_PATHS
from export_model import export_model
from datasets.merger import merge_flow_datasets

IP_CAPTURE = '127.0.0.1'

LOCALHOST = 'localhost'

NMAP_COMMANDS_INJECTOR = [
f"nmap -sT {LOCALHOST} -p 0-2500",
f"nmap -sS {LOCALHOST} -p 0-2500",
f"nmap -sF {LOCALHOST} -p 0-2500",
f"nmap -sN {LOCALHOST} -p 0-2500",
f"nmap -sX {LOCALHOST} -p 0-2500"
]

INTERFACE = 'lo'
CAPTURE_BAD = os.path.join('datasets','runtime','bad.csv')
CAPTURE_GOOD = os.path.join('datasets','runtime','good.csv')
MERGED = os.path.join('datasets','runtime','merged.csv')
MODEL_PATH = os.path.join('model','model.pkl')

def calculate_md5(file_path):
    """Calculate the MD5 checksum of a file."""
    hash_md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def produce_dataset_for_this_system():
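    """Capture nmap scan traffic (label 1) and benign HTTP noise (label 0) on the
    loopback interface, then merge the two captures into one labeled dataset."""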
    stop_event = threading.Event()
    capture_thread = threading.Thread(target=capture_packets, args=(INTERFACE, IP_CAPTURE, CAPTURE_BAD, 1, stop_event), daemon=True)
    capture_thread.start()

    time.sleep(3)  # give the monitor time to start

    for cmd in NMAP_COMMANDS_INJECTOR:
        run_command(cmd)

    md5_checksum = ""

    # iterate until the md5 stops changing for 2 seconds; if other connections keep coming in, this can be a problem.
    # you can press ctrl+c to force the loop to exit
    try:
        while md5_checksum != calculate_md5(CAPTURE_BAD):
            print("Monitor is still saving data... press ctrl+c to force it to proceed (do it only if you know what you're doing)")
            md5_checksum = calculate_md5(CAPTURE_BAD)
            time.sleep(2)
    except KeyboardInterrupt:
        print("\nOperation interrupted by user. Exiting loop.")

    stop_event.set()  # stop thread

    print("BAD data captured")
    stop_event = threading.Event()
    capture_thread = threading.Thread(target=capture_packets, args=(INTERFACE, IP_CAPTURE, CAPTURE_GOOD, 0, stop_event), daemon=True)
    capture_thread.start()

    time.sleep(3)  # give the monitor time to start

    # noiser
    generate_noise(target_host=LOCALHOST)

    md5_checksum = ""

    # iterate until the md5 stops changing for 2 seconds; if other connections keep coming in, this can be a problem.
    # you can press ctrl+c to force the loop to exit
    try:
        while md5_checksum != calculate_md5(CAPTURE_GOOD):  # monitor the good capture file here
            print("Monitor is still saving data... press ctrl+c to force it to proceed (do it only if you know what you're doing)")
            md5_checksum = calculate_md5(CAPTURE_GOOD)
            time.sleep(2)
    except KeyboardInterrupt:
        print("\nOperation interrupted by user. Exiting loop.")

    stop_event.set()  # stop thread

    print("Good data captured")
    print("Merging")
    merge_flow_datasets(CAPTURE_BAD, CAPTURE_GOOD, MERGED)


if __name__ == "__main__":

    no_strict_check_flag = "--no-strict-check" in sys.argv

    if not any(os.path.exists(os.path.join(x, "nmap")) for x in OS_INSTALLATION_PATHS):
        print("Nmap is required to be installed on the host")
        sys.exit(-1)

    if not any(os.path.exists(os.path.join(x, "wireshark")) for x in OS_INSTALLATION_PATHS):
        print("Tshark is required to be installed on the host")
        sys.exit(-1)

    if not no_strict_check_flag:
        if os.geteuid() != 0:
            print("Pyshark needs root privileges for capturing data on interfaces")
            sys.exit(-1)

    produce_dataset_for_this_system()
    export_model(dataset_path=MERGED, model_path=MODEL_PATH)
    print(f"Model Exported in {MODEL_PATH}")
14 changes: 5 additions & 9 deletions detector.py
@@ -14,14 +14,13 @@
from utils import preprocess_dataset, save_logs, RUNTIME_CAPTURE, OS_INSTALLATION_PATHS

SLEEP_SECONDS = 1
MODEL_PATH = 'model/model.pkl'
MODEL_PATH = os.path.join('model','model.pkl')
ANOMALY_PERCENTAGE = 30

INTERFACE = 'lo'  # change this to your interface
IP = "127.0.0.1"

console = Console()
threads = []

# Graceful shutdown handler
def signal_handler(sig, frame):
@@ -52,15 +51,14 @@ def generate_output(data, anomaly_detected, normal_count, anomaly_count, normal_

def main():

    if not any(os.path.exists(os.path.join(x, "nmap")) for x in OS_INSTALLATION_PATHS):
        console.print("[bold red]Nmap is required to be installed on the host[/bold red]")
        sys.exit(-1)

    no_strict_check_flag = "--no-strict-check" in sys.argv

    # bypass the checks for nmap installed on the host and for root privileges
    if not no_strict_check_flag:

        if not any(os.path.exists(os.path.join(x, "nmap")) for x in OS_INSTALLATION_PATHS):
            console.print("[bold red]Nmap is required to be installed on the host[/bold red]")
            sys.exit(-1)

        if os.geteuid() != 0:
            console.print("[bold red]Pyshark needs root privileges for capturing data on interfaces[/bold red]")
            sys.exit(-1)
@@ -78,11 +76,9 @@ def main():

    # Start background threads
    capture_thread = threading.Thread(target=capture_packets, args=(INTERFACE, IP, RUNTIME_CAPTURE), daemon=True)
    threads.append(capture_thread)
    capture_thread.start()

    injector_thread = threading.Thread(target=run_injector, daemon=True)
    threads.append(injector_thread)
    injector_thread.start()

    time.sleep(SLEEP_SECONDS)  # wait a second before proceeding
Binary file modified docs/docs.pdf
10 changes: 7 additions & 3 deletions export_model.py
@@ -4,9 +4,10 @@
from utils import preprocess_dataset
import pandas as pd
import joblib
import os

if __name__ == "__main__":
    df = pd.read_csv('./datasets/train/merged.csv')
def export_model(dataset_path=os.path.join('datasets', 'train', 'merged.csv'), model_path=os.path.join('model', 'model.pkl')):
    df = pd.read_csv(dataset_path)
    print(f"Dataset loaded with {len(df)} records.")

    # Preprocess Dataset
@@ -37,4 +38,7 @@

print(f"Accuracy {acc_score}, MCC {mcc}, TN: {tn} FP: {fp} FN: {fn} TP: {tp}")

joblib.dump(clf, "./model/model.pkl")
joblib.dump(clf, model_path)

if __name__ == "__main__":
    export_model()
5 changes: 4 additions & 1 deletion injector.py
@@ -5,17 +5,20 @@
from utils import save_logs

SLEEP_TIME = 1
PROBABILITY_INJECTION = 10 # 30% of probabilities to inject an anomaly each SLEEP_TIME seconds
PROBABILITY_INJECTION = 10  # 10% probability of injecting NUMBER_OF_PORTS anomalies every SLEEP_TIME seconds
NUMBER_OF_PORTS = 30
INJECTION_TIME_SLEEP = 5

# It is advisable to disable -sT and -sS by commenting them out, as their detection can be system dependent and a
# new dataset built and re-trained on the new system may be required to detect them successfully.
INJECT_OPTIONS = [
    "-sT",
    "-sS",
    "-sF",
    "-sN",
    "-sX"
]

def should_inject() -> bool:
    """
    Determines if an injection should happen based on PROBABILITY_INJECTION.