|
1 | 1 | # DeepURLBench Dataset
|
2 | 2 |
|
3 |
| -This repository contains the dataset **DeepURLBench** for the paper: |
4 |
| -**"A New Dataset and Methodology for Malicious URL Classification"** |
5 |
| -by Deep Instinct's Research Team. |
| 3 | +This repository contains the dataset **DeepURLBench**, introduced in the paper **"A New Dataset and Methodology for Malicious URL Classification"** by Deep Instinct's research team. |
6 | 4 |
|
7 |
| ---- |
| 5 | +## Dataset Overview |
8 | 6 |
|
9 |
| -## Dataset Description |
| 7 | +The repository includes two directories of Parquet files: |
10 | 8 |
|
11 |
| -The repository includes two directories in Parquet format: |
| 9 | +1. **`urls_with_dns`**: |
| 10 | + - Contains the following fields: |
| 11 | + - `url`: The URL being analyzed. |
| 12 | + - `first_seen`: The timestamp when the URL was first observed. |
| 13 | + - `TTL` (Time to Live): The time-to-live value of the DNS record. |
| 14 | + - `label`: Indicates whether the URL is malware, phishing or benign. |
| 15 | + - `IP addresses`: The associated IP addresses. |
12 | 16 |
|
13 |
| -1. **`urls_with_dns`**: Contains URLs with associated DNS data. |
14 |
| -2. **`urls_without_dns`**: Contains URLs without DNS data. |
| 17 | +2. **`urls_without_dns`**: |
| 18 | + - Contains the following fields: |
| 19 | + - `url`: The URL being analyzed. |
| 20 | + - `first_seen`: The timestamp when the URL was first observed. |
| 21 | + - `label`: Indicates whether the URL is malware, phishing or benign. |
15 | 22 |
|
16 |
| ---- |
| 23 | +## Usage Instructions |
17 | 24 |
|
18 |
| -## Loading the Dataset |
19 |
| - |
20 |
| -You can load the dataset using **pandas** in Python. Here's an example: |
| 25 | +To load the dataset with pandas in Python: |
21 | 26 |
|
22 | 27 | ```python
|
23 | 28 | import pandas as pd
|
24 | 29 |
|
25 |
| -# Load a Parquet file |
26 |
| -df = pd.read_parquet('path_to_directory') |
| 30 | +# Replace 'directory' with the path to the parquet file or directory |
| 31 | +df = pd.read_parquet("directory") |
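As a minimal sketch of how the loading step might be extended, the snippet below reads one of the two directories, checks that the fields listed in the README are present, and counts rows per label. The helper names (`load_split`, `check_columns`, `label_counts`) are hypothetical and not part of the repository, the column names are taken from the field lists above, and a parquet engine such as `pyarrow` or `fastparquet` is assumed to be installed.

```python
from pathlib import Path

import pandas as pd

# Fields the README lists for both directories. `urls_with_dns` is described
# as also carrying TTL and IP-address fields; their exact column names are
# not verified here, so they are left out of the check.
REQUIRED_COLUMNS = ("url", "first_seen", "label")


def check_columns(df, required=REQUIRED_COLUMNS):
    """Raise ValueError if any required column is absent; otherwise return df."""
    missing = set(required) - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df


def load_split(directory):
    """Read a parquet file, or a directory of parquet files, into one DataFrame."""
    return check_columns(pd.read_parquet(Path(directory)))


def label_counts(df):
    """Return the number of rows per label value as a plain dict."""
    return df["label"].value_counts().to_dict()
```

With the dataset downloaded locally, `label_counts(load_split("urls_with_dns"))` would give the class balance of that directory. The actual label strings should be confirmed with `df["label"].unique()` rather than assumed.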