Features | Project Layout | Released Dataset | Quick Start | Installation | Configuration | Usage | Outputs & Idempotency | Performance & Tips | Troubleshooting & FAQ | Roadmap | Contributing | Citation | License | Acknowledgements
TL;DR: CTI-Crawler fetches threat intelligence reports and threat encyclopedia entries (HTML only) from major CTI platforms. It currently includes 31 report crawlers and 6 encyclopedia crawlers, and works best on Linux or in Docker.
- Broad coverage: 31 CTI report crawlers and 6 threat encyclopedia crawlers. You can check the website list here: list of security websites.
- Reproducible runs: URL-to-file maps ensure already-downloaded HTML is skipped, enabling safe re-runs.
- Two modes: run a single crawler for testing, or run a curated subset / all crawlers at once.
- Docker-friendly: a simple `docker compose up -d` gets you going on servers.
- Config-first: central settings for crawler folders, concurrency, loop mode, browser paths, and crawler selection.
- Proxy support: optional integration with `proxy_pool` for rate-limit-friendly crawling.
```
cti-crawler/
├── config/
│   ├── root_settings.py                         # Global knobs: paths, threads, loop, browser, selected crawlers.
│   ├── cti_reports_crawlers_settings/           # Per-site settings for CTI reports.
│   └── threat_encyclopedia_crawlers_settings/   # Per-site settings for threat encyclopedias.
├── cti_reports_crawlers/
│   ├── url_to_filename_maps/                    # Records webpages that were already crawled.
│   └── ...                                      # Individual report crawlers.
├── threat_encyclopedia_crawlers/
│   ├── url_to_filename_maps/                    # Records webpages that were already crawled.
│   └── ...                                      # Individual encyclopedia crawlers.
├── output/
│   ├── cti_reports/                             # HTML outputs for report crawlers.
│   └── threat_encyclopedia_reports/             # HTML outputs for encyclopedia crawlers.
├── utils/
│   ├── base_crawler.py
│   ├── scraper.py
│   ├── multithreaded_task_scheduler.py
│   └── ...
└── main.py
```
Each crawler directory contains a `url_to_filename_maps/` folder that records downloaded pages and prevents duplicates during re-runs.
We also release the dataset collected by running CTI-Crawler. You can download it here: Google Drive link.

The released dataset contains 129,393 reports. Here is its structure:
```
cti-crawler_dataset_20250905-full/
├── cti_reports/
│   ├── cti_reports_platform_html/                   # All blogs crawled from one CTI report platform.
│   │   ├── blog_name.html
│   │   └── ...
│   └── ...
├── threat_encyclopedia_reports/
│   ├── threat_encyclopedia_reports_platform_html/   # All blogs crawled from one threat encyclopedia platform.
│   │   ├── blog_name.html
│   │   └── ...
│   ├── threat_encyclopedia_reports_platform/        # Without the _html suffix: a platform with multiple report types, holding per-category subfolders.
│   │   ├── type_html/
│   │   │   ├── blog_name.html
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── report_url_to_filename_maps/                     # JSON files mapping URLs to filenames for the CTI report platforms.
│   ├── cti_reports_platform.json
│   └── ...
└── threat_url_to_filename_maps/                     # JSON files mapping URLs to filenames for the threat encyclopedia platforms.
    ├── threat_encyclopedia_reports_platform.json
    └── ...
```
If you want to continue crawling from this dataset without re-downloading articles, copy the dataset's report_url_to_filename_maps and threat_url_to_filename_maps contents into the crawler's cti_reports_crawlers/url_to_filename_maps/ and threat_encyclopedia_crawlers/url_to_filename_maps/ folders, respectively.
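A minimal Python sketch of that copy step follows; the paths assume the dataset and the crawler checkout sit side by side, so adjust them to your layout:

```python
# Minimal sketch: seed the crawler's maps from the released dataset.
# Paths are assumptions based on the layouts above -- adjust as needed.
import shutil
from pathlib import Path

dataset = Path("cti-crawler_dataset_20250905-full")
crawler = Path("cti-crawler")

pairs = [
    (dataset / "report_url_to_filename_maps",
     crawler / "cti_reports_crawlers" / "url_to_filename_maps"),
    (dataset / "threat_url_to_filename_maps",
     crawler / "threat_encyclopedia_crawlers" / "url_to_filename_maps"),
]

for src, dst in pairs:
    # dirs_exist_ok requires Python 3.8+, matching the project's prerequisite.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    print(f"copied {src} -> {dst}")
```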
The fastest way is Docker. Ensure Docker and Docker Compose are installed.
```
git clone https://github.com/peng-gao-lab/cti-dataset.git
cd cti-crawler
```

Start the stack:

```
docker-compose build && docker-compose up -d   # Linux host
# or
docker-compose build ; docker-compose up -d    # Windows host
```

`-d` runs containers in the background; remove it if you want to see live logs. Follow logs at any time:

```
docker logs --follow crawlers_crawler_1
```

A host volume is mounted, so files downloaded inside the container are available on your machine.
Prefer running with Docker. If you need a local install (Python 3.8+):

```
git clone https://github.com/peng-gao-lab/cti-dataset.git
cd cti-crawler
pip3 install -r requirements.txt
```

Prerequisites:
- Python: 3.8+
- OS: Linux (recommended), Windows, macOS
- Browser: Google Chrome (or Chrome for Testing) + matching Chromedriver
- Docker/Docker Compose (if using containers)
You may want to tweak configs before running.
Some crawlers require a real browser session.
- Set `CHROME_DRIVER_PATH` in `config/root_settings.py` appropriately for your OS and Chrome version.
- Download a matching Chrome/Chromedriver pair (e.g., from Chrome for Testing).
- If you place the browser under `config/`, adjust `chromedriver_name` (or the similarly named field) in `config/root_settings.py`.
`config/root_settings.py` controls:
- crawler folder paths
- concurrency (threads for searching & downloading)
- browser path
- loop mode (run once vs. run in a loop)
- which crawler scripts to execute
Tip: You can comment out entries to run a smaller subset.
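To make the knobs concrete, here is a hypothetical sketch of what `config/root_settings.py` might contain. Only `CHROME_DRIVER_PATH` and `LOOP_TIME` are named elsewhere in this README; every other field name is illustrative, so check the actual file:

```python
# Hypothetical sketch of config/root_settings.py -- field names other than
# CHROME_DRIVER_PATH and LOOP_TIME are illustrative, not the project's API.
CHROME_DRIVER_PATH = "/usr/local/bin/chromedriver"  # must match your Chrome version

SEARCH_THREADS = 4        # illustrative: threads used to discover article URLs
DOWNLOAD_THREADS = 8      # illustrative: threads used to download HTML

LOOP_TIME = None          # run once by default; set an interval to run in a loop

# Illustrative: comment out entries to run a smaller subset of crawlers.
SELECTED_CRAWLERS = [
    "cti_reports_crawlers/<crawler_name>.py",
    # "threat_encyclopedia_crawlers/<crawler_name>.py",
]
```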
- `config/cti_reports_crawlers_settings/`: all links used by the CTI report crawlers, plus the map path.
- `config/threat_encyclopedia_crawlers_settings/`: the same for the threat encyclopedia crawlers.
You can (a) run one crawler for testing, or (b) run many at once via root_settings.
```
# Example (adjust the file name)
python cti_reports_crawlers/<crawler_name>.py
# or
python threat_encyclopedia_crawlers/<crawler_name>.py
```

A corresponding folder under `output/cti_reports/` or `output/threat_encyclopedia_reports/` will be created.
Proxy

For testing single crawlers, `with_proxy` is typically `False`. If supported by a crawler, you can enable a proxy via the `BaseCrawler` setting (integrates with `proxy_pool`), as sketched below. Note: some crawlers always require the browser instead of a simple HTTP client.
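For illustration only, a subclass might opt in like this; the constructor signature is an assumption, not taken from the code, so mirror `utils/base_crawler.py` for the real interface:

```python
# Illustrative only: BaseCrawler's constructor signature is assumed.
from utils.base_crawler import BaseCrawler

class ExamplePlatformCrawler(BaseCrawler):  # hypothetical crawler
    def __init__(self):
        # with_proxy=True would route requests through the proxy_pool
        # integration; keep it False when testing a single crawler.
        super().__init__(with_proxy=True)
```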
Locally

```
python main.py
```

- `main.py` runs once by default.
- To run continuously, set `LOOP_TIME` in `config/root_settings.py`.
Docker

```
docker-compose build && docker-compose up -d   # Linux host
# or
docker-compose build ; docker-compose up -d    # Windows host
```

- Downloaded HTML files are stored under `output/cti_reports/` and `output/threat_encyclopedia_reports/`.
- Each crawler keeps a `url_to_filename_maps/` record to avoid re-downloading the same page (sketched below).
- Re-crawling: simply re-run the crawler; existing HTML files are skipped.
- To force a re-crawl, remove the corresponding mapping entries or delete the target output files.
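Conceptually, the map works like the sketch below; the JSON layout and helper names are assumptions, and the real files live under each crawler's `url_to_filename_maps/`:

```python
# Sketch of the skip logic implied by url_to_filename_maps/ -- the JSON
# structure and function names here are assumptions, not the project's code.
import json
from pathlib import Path

MAP_FILE = Path("cti_reports_crawlers/url_to_filename_maps/example_platform.json")
url_map = json.loads(MAP_FILE.read_text()) if MAP_FILE.exists() else {}

def already_crawled(url: str) -> bool:
    """URLs present in the map are skipped on re-runs."""
    return url in url_map

def record(url: str, filename: str) -> None:
    """Remember a downloaded page so future runs do not fetch it again."""
    url_map[url] = filename
    MAP_FILE.parent.mkdir(parents=True, exist_ok=True)
    MAP_FILE.write_text(json.dumps(url_map, indent=2))
```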
- Increase or decrease the search/download thread counts in `root_settings.py` based on your network/CPU.
- For rate-limited sites, enable proxies and/or reduce concurrency; otherwise you may need to modify the code.
- Headless browser runs are often more stable on servers (ensure correct Chrome/Chromedriver pairing).
Q: Chromedriver version error (session not created / cannot find Chrome)?
A: Ensure the Chrome and Chromedriver versions match, then update `CHROME_DRIVER_PATH` and related fields in the configs.

Q: Docker command not found / permission issues?
A: Install Docker/Compose; add your user to the docker group (Linux) and re-login. On macOS/Windows, ensure Docker Desktop is running.

Q: Getting 403/429 (rate-limited)?
A: Enable `with_proxy` (if supported), throttle concurrency, or use loop mode to pace requests.

Q: Where are my files?
A: Check `output/` (host-mounted when using Docker). Logs: `docker logs --follow crawlers_crawler_1`.

Q: Does this crawl images?
A: No, only HTML is downloaded. The crawler does not parse or transform the HTML; it saves the raw files as-is.
Issues and PRs are welcome!

For new crawlers:

- Review existing ones under `cti_reports_crawlers/` or `threat_encyclopedia_crawlers/`.
- Keep outputs consistent and ensure `url_to_filename_maps/` entries are properly maintained.
- Add site-specific configs under the corresponding `config/..._settings/`.
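A hypothetical skeleton for a new crawler is below; `BaseCrawler`'s interface is assumed here, so copy the structure of an existing crawler rather than these method names:

```python
# Hypothetical skeleton for a new report crawler -- method names are
# illustrative; mirror an existing crawler for the real interface.
from utils.base_crawler import BaseCrawler

class NewPlatformCrawler(BaseCrawler):  # hypothetical
    def collect_urls(self):
        """Yield article URLs for the new platform (illustrative hook)."""
        yield from []

    def run(self):
        for url in self.collect_urls():
            # Assumed: the base class records each download in
            # url_to_filename_maps/ and skips already-seen URLs.
            self.download(url)

if __name__ == "__main__":
    NewPlatformCrawler().run()
```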
If you make use of our code or dataset in your research, we would appreciate it if you cite the following papers:

```
@inproceedings{cheng2025ctinexus,
  title={CTINexus: Automatic Cyber Threat Intelligence Knowledge Graph Construction Using Large Language Models},
  author={Cheng, Yutong and Bajaber, Osama and Tsegai, Saimon Amanuel and Song, Dawn and Gao, Peng},
  booktitle={Proceedings of the IEEE 10th European Symposium on Security and Privacy},
  pages={923--938},
  year={2025},
  series={EuroS\&P '25}
}

@inproceedings{gao2024threatkg,
  title={ThreatKG: An AI-Powered System for Automated Open-Source Cyber Threat Intelligence Gathering and Management},
  author={Gao, Peng and Liu, Xiaoyuan and Choi, Edward and Ma, Sibo and Yang, Xinyu and Song, Dawn},
  booktitle={Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis},
  pages={1--12},
  numpages={12},
  year={2024},
  series={LAMPS '24}
}
```

CTI-Crawler is released under the MIT License. By using the crawler, you agree to the terms and conditions of the license.
- CTI report sources: see the maintained list of security websites (Google Sheet).