ciberlab-report-generator is a Python library developed by the Grupo de Robótica that automates the dynamic analysis of malware samples and generates PDF reports ready for human interpretation.
It covers data processing, page generation, integration with LLMs (when necessary), and documentation built with Sphinx.
- Dynamic malware analysis output processing (slicing, normalisation).
- Automated PDF report generation.
- Package structure for reuse in other projects.
- Best practices: packaging with Poetry, testing with pytest, Ruff + Black linter/formatter.
- Documentation generated with Sphinx.
The project now uses a hybrid configuration model:
- YAML files for non-sensitive configuration (paths, mode, model, limits).
- Environment variables for secrets (`OPENAI_API_KEY`, `VT_API_KEY_X`) and runtime overrides.
Base and profile files:
- `config/base.yaml`
- `config/profiles/local.yaml`
- `config/profiles/cape.yaml`
- `config/profiles/batch_uploader.yaml`
Canonical env template:
```bash
cp config/.env.example .env
```

Which env file is applied:
- If `CIBERLABREPORT_ENV_FILE` is set, that file is loaded.
- Otherwise, the nearest `.env` found from the current working directory is loaded.
- Existing shell variables always win over file values (`override=False`).
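
The selection rules above can be sketched with the standard library only. This is a minimal illustration, not the project's loader: `resolve_env_file` and `apply_env_file` are hypothetical helpers, the upward search through parent directories is an assumption about what "nearest" means, and the `KEY=VALUE` parser is deliberately simplistic.

```python
import tempfile
from pathlib import Path
from typing import Mapping, MutableMapping, Optional


def resolve_env_file(environ: Mapping[str, str], cwd: Path) -> Optional[Path]:
    """Explicit CIBERLABREPORT_ENV_FILE first, otherwise the nearest .env."""
    explicit = environ.get("CIBERLABREPORT_ENV_FILE")
    if explicit:
        return Path(explicit)
    # Assumption: "nearest" means walking from cwd up to the filesystem root.
    for directory in (cwd, *cwd.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None


def apply_env_file(path: Path, environ: MutableMapping[str, str]) -> None:
    """Load KEY=VALUE lines; existing variables are kept (override=False)."""
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        environ.setdefault(key.strip(), value.strip().strip('"'))


# Demonstration in a throwaway directory with a simulated environment.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".env").write_text("MODE=DEV\nLOG_LEVEL=DEBUG\n", encoding="utf-8")
    nested = root / "a" / "b"
    nested.mkdir(parents=True)

    fake_env = {"MODE": "PRO"}  # pretend MODE is already exported in the shell
    env_file = resolve_env_file(fake_env, nested)
    found_name = env_file.name if env_file else None
    if env_file:
        apply_env_file(env_file, fake_env)
```

In this run the shell value `MODE=PRO` survives, while `LOG_LEVEL` is filled in from the file because it was not set beforehand.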
Final settings precedence (each layer overrides the previous one):
- Base YAML (`CIBERLABREPORT_CONF_FILE`, default `config/base.yaml`)
- Profile YAML (`CIBERLABREPORT_PROFILE`, default `local`)
- Environment variables (from shell and/or loaded env file)
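
A layered resolution like the one listed above could look roughly as follows. Plain dictionaries stand in for the parsed YAML files, and `deep_merge`, `resolve_settings`, the `ENV_OVERRIDES` mapping and the `mode` / `ai_model` key names are all assumptions made for illustration; only `families.source` and `families.json_file` appear in the documented YAML.

```python
import copy
from typing import Any, Dict, Mapping, Tuple


def deep_merge(base: Dict[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of base where values from override win, recursing into dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical mapping from environment variable to a path inside the settings.
ENV_OVERRIDES: Dict[str, Tuple[str, ...]] = {
    "AI_MODEL": ("ai_model",),
    "MODE": ("mode",),
    "FAMILIES_SOURCE": ("families", "source"),
}


def resolve_settings(
    base: Dict[str, Any], profile: Mapping[str, Any], environ: Mapping[str, str]
) -> Dict[str, Any]:
    """Apply the three layers in order: base YAML, profile YAML, environment."""
    settings = deep_merge(copy.deepcopy(base), profile)
    for var, path in ENV_OVERRIDES.items():
        if var in environ:
            node = settings
            for part in path[:-1]:
                node = node.setdefault(part, {})
            node[path[-1]] = environ[var]
    return settings


base = {
    "mode": "PRO",
    "ai_model": "gpt-5",
    "families": {"source": "json", "json_file": "families.json"},
}
profile = {"families": {"source": "db"}}  # e.g. what a cape profile might set
environ = {"MODE": "DEV"}                 # runtime override from the shell

settings = resolve_settings(base, profile, environ)
```

Here the profile changes only `families.source`, the environment changes only `mode`, and everything else falls through from the base layer.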
Environment variables used by the loader:
| Required | Description |
|---|---|
| 🟢 | Obligatory |
| ⚪ | Optional |
| 🔴 | Critical |
| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | 🔴 | - | OpenAI API key (secret). |
| `VT_API_KEY_1` | 🟢 | - | First VirusTotal API key (secret). |
| `VT_API_KEY_X` | ⚪ | - | Additional VT keys (`VT_API_KEY_2`, `VT_API_KEY_3`, ...). |
| `CIBERLABREPORT_CONF_FILE` | 🟢 | `config/base.yaml` | Base YAML config file path. |
| `CIBERLABREPORT_PROFILE` | 🟢 | `local` | Profile name under `config/profiles/`. |
| `CIBERLABREPORT_ENV_FILE` | ⚪ | - | Optional env file loaded before resolving settings. |
| `MAX_VT_KEYS` | ⚪ | YAML `runtime.max_keys_list` | Maximum number of sequential `VT_API_KEY_X` to read. |
| `INPUT_PATH_DEFAULT` | ⚪ | From YAML | Runtime override for input path. |
| `OUTPUT_PATH_DEFAULT` | ⚪ | From YAML | Runtime override for output path. |
| `CONFIG_PATH` | ⚪ | From YAML | Runtime override for config path. |
| `PROMPTS_PATH` | ⚪ | From YAML | Runtime override for prompts path. |
| `SCHEMAS_PATH` | ⚪ | From YAML | Runtime override for schemas path. |
| `TMP_PATH` | ⚪ | From YAML | Runtime override for temporary path. |
| `AI_MODEL` | ⚪ | From YAML | Runtime override for LLM model. |
| `MODE` | ⚪ | From YAML | Runtime override for execution mode (`PRO`, `DEV`). |
| `IMG_THRESHOLD` | ⚪ | From YAML | Runtime override for image threshold. |
| `FAMILIES_SOURCE` | ⚪ | From YAML `families.source` | Families source override (`json` or `db`). |
| `LOG_LEVEL` | ⚪ | `INFO` | Logging level used by scripts. |
> [!IMPORTANT]
> It is highly recommended to configure at least one VirusTotal API key. To add more keys, define them sequentially by replacing the `X` in `VT_API_KEY_X` (`VT_API_KEY_2`, `VT_API_KEY_3`, ...).
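
As a rough sketch of how sequential keys might be gathered, the snippet below reads `VT_API_KEY_1`, `VT_API_KEY_2`, ... up to a maximum. `collect_vt_keys` is a hypothetical helper, and stopping at the first missing index is an assumption; the project's loader may behave differently.

```python
from typing import List, Mapping


def collect_vt_keys(environ: Mapping[str, str], max_keys: int) -> List[str]:
    """Read VT_API_KEY_1..VT_API_KEY_<max_keys>, stopping at the first gap."""
    keys: List[str] = []
    for index in range(1, max_keys + 1):
        value = environ.get(f"VT_API_KEY_{index}", "").strip()
        if not value:
            break  # assumption: a missing index ends the sequence
        keys.append(value)
    return keys


# Simulated environment: index 3 is missing, so index 4 is never reached.
fake_env = {
    "VT_API_KEY_1": "key-one",
    "VT_API_KEY_2": "key-two",
    "VT_API_KEY_4": "skipped",
    "MAX_VT_KEYS": "5",
}
vt_keys = collect_vt_keys(fake_env, int(fake_env.get("MAX_VT_KEYS", "3")))
```

Under that assumption, numbering keys without gaps matters: a key defined after a missing index would be ignored.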
```bash
git clone https://github.com/uleroboticsgroup/ciberlab-report-generator.git
cd ciberlab-report-generator
poetry install
```

Recommended usage is to resolve settings first and then build `ReportGenerator`:
```python
from ciberlabreport.core import ReportGenerator
from ciberlabreport.settings import load_settings

# Resolve settings from the YAML files and environment variables.
s = load_settings()

# Build the generator from the resolved settings.
rg = ReportGenerator(
    openai_api_key=s.openai_api_key,
    vt_api_keys=s.vt_api_keys,
    max_files=s.max_files,
    max_completion_tokens=s.max_completion_tokens,
    input_path_default=s.input_path_default,
    output_path_default=s.output_path_default,
    config_path=s.config_path,
    prompts_path=s.prompts_path,
    schemas_path=s.schemas_path,
    tmp_path=s.tmp_path,
    ai_model=s.ai_model,
    mode=s.mode,
    img_threshold=s.img_threshold,
    families_list=s.families_list,
)

# Generate the PDF report for a single input file.
rg.generate("input.json")
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `openai_api_key` | `str` | required | OpenAI API key. |
| `vt_api_keys` | `list \| None` | `None` | VirusTotal API key list (e.g. `["key1", "key2"]`). |
| `max_files` | `int` | `3` | Maximum number of input JSON files to process when `input_data` is a directory. If set to `-1`, all files in the directory are processed (no limit). |
| `preprocess_limits` | `PreprocessLimits \| None` | `PreprocessLimits()` | Custom limits for CAPE preprocessing (e.g. maximum signatures, processes, behaviors kept in the reduced report). If `None`, a default `PreprocessLimits()` instance is created. |
| `max_completion_tokens` | `int` | `25000` | Upper limit on the number of tokens the LLM may generate in its output. Change it only if execution fails with `finish_reason=length`. |
| `input_path_default` | `str \| Path \| None` | `None` | Base input path. If omitted, the current working directory is used. |
| `output_path_default` | `str \| Path \| None` | `None` | Base output path. If omitted, the current working directory is used. |
| `config_path` | `str \| Path \| None` | `None` | Base config path (JSON configs). If omitted, the current working directory is used. |
| `prompts_path` | `str \| Path \| None` | `None` | Prompt templates directory. Defaults to `config_path/prompts`. |
| `schemas_path` | `str \| Path \| None` | `None` | JSON schema directory. Defaults to `config_path/schemas`. |
| `tmp_path` | `str \| Path \| None` | `None` | Temporary files directory. If omitted, the current working directory is used. |
| `ai_model` | `str` | `gpt-5` | AI model used in LLM calls. |
| `mode` | `str` | `PRO` | Execution mode (`PRO`, `DEV`). |
| `img_threshold` | `int` | `9` | Minimum image count to trigger the image processing flow. |
| `families_list` | `list \| None` | `None` | Allowed malware families. If set, postprocessing forces unknown values to `Desconocida`. |

- `local` and `batch_uploader`: use `config/families.json` (via `families.source: json`).
- `cape`: uses database `SELECT DISTINCT` (via `families.source: db`).

Base YAML supports both modes:
```yaml
families:
  source: json
  json_file: families.json
  db:
    host_env: DB_HOST
    port_env: DB_PORT
    user_env: MARIADB_ROOT_USER
    password_env: MARIADB_ROOT_PASSWORD
    database_env: DATABASE
    table: samples
    column: family
    where: "family IS NOT NULL AND family <> ''"
```

> [!CAUTION]
> `max_files=-1` may cause errors if the specified directory contains a very large number of files.

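
To make the `db` families source more concrete, the sketch below builds a `SELECT DISTINCT` query from the `table`, `column` and `where` values shown in the base YAML and runs it against an in-memory SQLite database. SQLite stands in for the real database purely so the example is runnable; `build_families_query` and `normalise_family` are hypothetical helpers, and the exact-match rule for falling back to `Desconocida` is an assumption.

```python
import sqlite3
from typing import Dict, List, Optional

# Values taken from the families.db section of the base YAML.
DB_CONFIG: Dict[str, str] = {
    "table": "samples",
    "column": "family",
    "where": "family IS NOT NULL AND family <> ''",
}


def build_families_query(cfg: Dict[str, str]) -> str:
    """Compose the query; identifiers come from trusted config, not user input."""
    return f"SELECT DISTINCT {cfg['column']} FROM {cfg['table']} WHERE {cfg['where']}"


def normalise_family(value: str, allowed: Optional[List[str]]) -> str:
    """Map families outside the allowed list to 'Desconocida' (exact match assumed)."""
    if allowed is None:
        return value
    return value if value in allowed else "Desconocida"


# In-memory stand-in database with duplicate, empty and NULL families.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (md5 TEXT, family TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("a", "Emotet"), ("b", "Emotet"), ("c", ""), ("d", None), ("e", "LockBit")],
)
query = build_families_query(DB_CONFIG)
families = sorted(row[0] for row in conn.execute(query))
conn.close()
```

The resulting list plays the role of `families_list`: any family name the LLM returns that is not in it would be replaced by `Desconocida`.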
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_data` | `str` | required | Path to a single JSON file or to a directory containing JSON files. If it is a directory, all `*.json` files are collected and optionally limited by `max_files`. If it is a relative path to a file and it does not exist, the generator tries to resolve it under `INPUT_PATH_DEFAULT`. |
| `output_data` | `str \| None` | `None` | Output path. It can be: (1) an absolute `*.pdf` path, (2) a relative `*.pdf` name (saved under `OUTPUT_PATH_DEFAULT`), (3) a directory (one PDF per input), or `None`, in which case output files are created under `OUTPUT_PATH_DEFAULT` following the pattern `<stem>-report.pdf`. |

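
The input and output rules in the table above can be approximated as follows. `collect_inputs` and `resolve_output` are hypothetical helpers written for illustration. Two details are assumptions rather than documented behaviour: files are taken in sorted order when `max_files` limits them, and the directory case reuses the `<stem>-report.pdf` naming pattern.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def collect_inputs(input_data: Path, max_files: int = 3) -> List[Path]:
    """Return the JSON files to process; -1 disables the limit."""
    if input_data.is_dir():
        files = sorted(input_data.glob("*.json"))  # assumption: sorted order
        return files if max_files == -1 else files[:max_files]
    return [input_data]


def resolve_output(
    input_file: Path, output_data: Optional[str], output_default: Path
) -> Path:
    """Pick the PDF path for one input following the documented cases."""
    default_name = f"{input_file.stem}-report.pdf"
    if output_data is None:
        return output_default / default_name
    target = Path(output_data)
    if target.suffix.lower() == ".pdf":
        # Absolute PDF paths are used as-is; relative names go under the default.
        return target if target.is_absolute() else output_default / target
    # Anything else is treated as a directory: one PDF per input (naming assumed).
    return target / default_name


with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp)
    for name in ("a.json", "b.json", "c.json", "d.json", "notes.txt"):
        (in_dir / name).write_text("{}", encoding="utf-8")
    limited = [p.name for p in collect_inputs(in_dir, max_files=3)]
    unlimited = [p.name for p in collect_inputs(in_dir, max_files=-1)]

out_default = Path("out")
by_default = resolve_output(Path("a.json"), None, out_default)
by_name = resolve_output(Path("a.json"), "custom.pdf", out_default)
by_dir = resolve_output(Path("a.json"), "reports", out_default)
```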
A simple usage example is available in `main.py`.
For more code details, check here
To run the linter, the formatter and the tests, and to generate the Sphinx documentation locally, use `ci-local.sh`:

```bash
chmod +x scripts/ci-local.sh
./scripts/ci-local.sh
```

This module gathers all the transformations that must be applied after the LLM call and before the result is passed to the PDFGenerator.
Its behaviour relies on configuration entries stored in `regex_config.json`. Each entry has this structure:
```json
{
  "description": "Description of the changes to apply",
  "pattern": "Regex to compile and apply. Be careful to escape the necessary characters.",
  "replace": "Regex or str to replace if the pattern matches"
},
```

> [!IMPORTANT]
> The `n_configs` counter must match the number of entries, and the version number must be increased each time the file changes.

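
A minimal sketch of how such entries could be applied is shown below. The top-level key names `version` and `configs`, the helpers `load_entries` and `apply_entries`, and both sample rules are assumptions for illustration; only the `description` / `pattern` / `replace` fields and the `n_configs` counter come from the description above.

```python
import json
import re
from typing import Dict, List

# Hypothetical file content; backslashes are doubled because this is JSON.
CONFIG_TEXT = r"""
{
  "version": 2,
  "n_configs": 2,
  "configs": [
    {"description": "Collapse runs of spaces", "pattern": " {2,}", "replace": " "},
    {"description": "Defang URLs", "pattern": "http(s?)://", "replace": "hxxp\\1://"}
  ]
}
"""


def load_entries(text: str) -> List[Dict[str, str]]:
    """Parse the config and check that n_configs matches the entry count."""
    data = json.loads(text)
    entries = data["configs"]
    if data["n_configs"] != len(entries):
        raise ValueError("n_configs does not match the number of entries")
    return entries


def apply_entries(text: str, entries: List[Dict[str, str]]) -> str:
    """Apply every pattern/replace pair in order."""
    for entry in entries:
        text = re.sub(entry["pattern"], entry["replace"], text)
    return text


entries = load_entries(CONFIG_TEXT)
cleaned = apply_entries("Beacon  to   http://evil.example/path", entries)
```

Because the entries run in order, a later pattern sees the output of the earlier ones, so the ordering inside the file is part of the behaviour.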
The ReportGenerator.generate method now returns a report as a dict object. This report contains:
```text
{
  "input_files": <List[str]>,
  "output_files": <List[str]>,
  "time_spent": <float>,
  "money_spent": <float> | <str>,
  "out_files_data": <List[dict]>
}
```

| Field | Type | Description |
|---|---|---|
| `input_files` | `List[str]` | List of names or paths of the input files. |
| `output_files` | `List[str]` | List of names or paths of the generated output files. |
| `time_spent` | `float` | Time spent (in minutes) to perform the operation. |
| `money_spent` | `float` or `str` | Money spent (in $) to perform the operation. Explanation in `str` if it could not be calculated. |
| `out_files_data` | `List[dict]` | List of dictionaries containing important values for each processed input. |

```text
[
  {
    "input": <md5_hash>,
    "out_file_data": {
      "malware_type": <List[str]>,
      "family": <str>,
      "is_ransomware": <bool>,
      "signatures": [
        {
          "name": <str>,
          "description": <str>,
          "severity": <int>,
          "confidence": <int>
        },
        ...
      ],
      "analysis_summary": <str>,
      "initial_recommendations": <str>,
      "iocs": [
        {
          "type": <str>,
          "value": <str>,
          "observations": <str>
        },
        ...
      ]
    }
  },
  ...
]
```

| Field | Type | Description |
|---|---|---|
| `input` | `str` | MD5 hash that identifies the analyzed input sample. |
| `out_file_data` | `dict` | Aggregated report data extracted for that input sample. |
| `out_file_data.malware_type` | `List[str]` | List of detected malware categories (for example: trojan, spyware). |
| `out_file_data.family` | `str` | Detected malware family name. |
| `out_file_data.is_ransomware` | `bool` | Indicates whether the sample is classified as ransomware. |
| `out_file_data.signatures` | `List[dict]` | List of behavioral signatures found during analysis. |
| `out_file_data.signatures[].name` | `str` | Signature name. |
| `out_file_data.signatures[].description` | `str` | Signature description. |
| `out_file_data.signatures[].severity` | `int` | Signature severity score. |
| `out_file_data.signatures[].confidence` | `int` | Signature confidence score. |
| `out_file_data.analysis_summary` | `str` | Executive summary of the analysis results. |
| `out_file_data.initial_recommendations` | `str` | Initial mitigation and response recommendations. |
| `out_file_data.iocs` | `List[dict]` | List of extracted Indicators of Compromise (IOCs). |
| `out_file_data.iocs[].type` | `str` | IOC type (for example: domain, IP, hash, URL). |
| `out_file_data.iocs[].value` | `str` | IOC value. |
| `out_file_data.iocs[].observations` | `str` | Additional context or notes for the IOC. |

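
As an illustration of consuming the returned dictionary, the sketch below condenses a report into a few text lines. `format_cost` and `summarise` are hypothetical helpers, and every value in `sample_report` is made up to match the documented field types; it is not real analysis output.

```python
from collections import Counter
from typing import Any, Dict, List, Union


def format_cost(money_spent: Union[float, str]) -> str:
    """money_spent is a float when known, otherwise an explanatory string."""
    if isinstance(money_spent, (int, float)):
        return f"${money_spent:.2f}"
    return f"n/a ({money_spent})"


def summarise(report: Dict[str, Any]) -> List[str]:
    """Build one header line plus one line per processed input."""
    lines = [
        f"{len(report['output_files'])} report(s) in "
        f"{report['time_spent']:.1f} min, cost {format_cost(report['money_spent'])}"
    ]
    for item in report["out_files_data"]:
        data = item["out_file_data"]
        ioc_counts = Counter(ioc["type"] for ioc in data["iocs"])
        max_sev = max((s["severity"] for s in data["signatures"]), default=0)
        lines.append(
            f"{item['input']}: family={data['family']}, "
            f"ransomware={data['is_ransomware']}, "
            f"max severity={max_sev}, IOCs={dict(ioc_counts)}"
        )
    return lines


# Illustrative data only, shaped like the documented structure.
sample_report = {
    "input_files": ["input.json"],
    "output_files": ["input-report.pdf"],
    "time_spent": 2.5,
    "money_spent": "pricing unavailable",
    "out_files_data": [
        {
            "input": "d41d8cd98f00b204e9800998ecf8427e",
            "out_file_data": {
                "malware_type": ["trojan"],
                "family": "Desconocida",
                "is_ransomware": False,
                "signatures": [
                    {"name": "example_sig", "description": "placeholder",
                     "severity": 3, "confidence": 80}
                ],
                "analysis_summary": "placeholder",
                "initial_recommendations": "placeholder",
                "iocs": [
                    {"type": "domain", "value": "evil.example", "observations": ""},
                    {"type": "domain", "value": "bad.example", "observations": ""},
                    {"type": "IP", "value": "203.0.113.7", "observations": ""},
                ],
            },
        }
    ],
}

summary = summarise(sample_report)
```

Checking the type of `money_spent` before formatting avoids a failure when the cost could not be calculated and the field carries an explanation instead of a number.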
See the guide here: batch_upload/README.md
See the guide here: cape/README.md
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
- Canonical legal text for this repository: LICENSE
- Human-readable summary: https://creativecommons.org/licenses/by-nc-sa/4.0/
- Official legalcode reference: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode