Ciberlab-report-generator

Description

ciberlab-report-generator is a Python library developed by the Grupo de Robótica that automates the dynamic analysis of malware samples and generates PDF reports ready for human interpretation.
It includes data processing, page generation, LLM integration (used only when necessary), and Sphinx documentation support.

Key features

Dynamic malware analysis output processing (slicing, normalisation).
Automated PDF report generation.
Package structure for reuse in other projects.
Best practices: packaging with Poetry, testing with pytest, Ruff + Black linter/formatter.
Documentation generated with Sphinx.

Prerequisites

Python >= 3.12
Poetry >= 2.0.0 to manage the environment and dependencies.

Configuration (YAML + ENV)

The project now uses a hybrid configuration model:

YAML files for non-sensitive configuration (paths, mode, model, limits).
Environment variables for secrets (OPENAI_API_KEY, VT_API_KEY_X) and runtime overrides.

Base and profile files:

config/base.yaml
config/profiles/local.yaml
config/profiles/cape.yaml
config/profiles/batch_uploader.yaml

Canonical env template:

cp config/.env.example .env

Which env file is applied:

If CIBERLABREPORT_ENV_FILE is set, that file is loaded.
Otherwise, the nearest .env file found from the current working directory is loaded.
Existing shell variables always win over file values (override=False).
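
The selection rules above can be sketched with the standard library alone. This is a minimal illustration, not the project's loader: the helper names `pick_env_file` and `apply_env_values` are hypothetical, and it assumes that "nearest" means walking upward from the current working directory until a .env file is found.

```python
from pathlib import Path


def pick_env_file(environ, cwd):
    """Return the env file that would be loaded, or None if there is none."""
    explicit = environ.get("CIBERLABREPORT_ENV_FILE")
    if explicit:
        return Path(explicit)
    # Assumption: "nearest" means searching cwd first, then each parent folder.
    cwd = Path(cwd).resolve()
    for folder in (cwd, *cwd.parents):
        candidate = folder / ".env"
        if candidate.is_file():
            return candidate
    return None


def apply_env_values(values, environ):
    """Copy file values into environ, never overriding keys already set."""
    for key, value in values.items():
        environ.setdefault(key, value)  # mirrors override=False
    return environ
```

With this sketch, a variable already exported in the shell keeps its value even if the env file defines it too.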

Final settings precedence (later sources override earlier ones):

Base YAML (CIBERLABREPORT_CONF_FILE, default config/base.yaml)
Profile YAML (CIBERLABREPORT_PROFILE, default local)
Environment variables (from shell and/or loaded env file)
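
The layering above can be pictured as successive dictionary merges. The sketch below is illustrative only: `deep_merge` and `resolve_settings` are hypothetical names, and whether the real loader merges nested sections key by key (as done here) is an assumption.

```python
def deep_merge(lower, higher):
    """Return a new dict where values from `higher` override `lower`."""
    merged = dict(lower)
    for key, value in higher.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Assumption: nested sections are merged key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_settings(base, profile, env_overrides):
    """Apply the three layers in order: base, then profile, then environment."""
    return deep_merge(deep_merge(base, profile), env_overrides)
```

For example, a profile that only sets `families.source` would keep the `families.json_file` value from the base file under this scheme.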

Environment variables used by the loader:

Required	Description
🟢	Mandatory
⚪	Optional
🔴	Critical

Variable	Required	Default	Description
`OPENAI_API_KEY`	🔴	—	OpenAI API key (secret).
`VT_API_KEY_1`	🟢	—	First VirusTotal API key (secret).
`VT_API_KEY_X`	⚪	—	Additional VT keys (`VT_API_KEY_2`, `VT_API_KEY_3`, ...).
`CIBERLABREPORT_CONF_FILE`	🟢	`config/base.yaml`	Base YAML config file path.
`CIBERLABREPORT_PROFILE`	🟢	`local`	Profile name under `config/profiles/`.
`CIBERLABREPORT_ENV_FILE`	⚪	—	Optional env file loaded before resolving settings.
`MAX_VT_KEYS`	⚪	YAML `runtime.max_keys_list`	Maximum number of sequential `VT_API_KEY_X` to read.
`INPUT_PATH_DEFAULT`	⚪	From YAML	Runtime override for input path.
`OUTPUT_PATH_DEFAULT`	⚪	From YAML	Runtime override for output path.
`CONFIG_PATH`	⚪	From YAML	Runtime override for config path.
`PROMPTS_PATH`	⚪	From YAML	Runtime override for prompts path.
`SCHEMAS_PATH`	⚪	From YAML	Runtime override for schemas path.
`TMP_PATH`	⚪	From YAML	Runtime override for temporary path.
`AI_MODEL`	⚪	From YAML	Runtime override for LLM model.
`MODE`	⚪	From YAML	Runtime override for execution mode (`PRO`, `DEV`).
`IMG_THRESHOLD`	⚪	From YAML	Runtime override for image threshold.
`FAMILIES_SOURCE`	⚪	From YAML `families.source`	Families source override (`json` or `db`).
`LOG_LEVEL`	⚪	`INFO`	Logging level used by scripts.

Important

Having at least one VT API key is highly recommended. To add more VirusTotal API keys, define them sequentially by replacing the X in VT_API_KEY_X with consecutive numbers (VT_API_KEY_2, VT_API_KEY_3, ...).
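
To show why the numbering must be sequential, here is a minimal sketch of how such keys could be collected. The function `collect_vt_keys` is hypothetical, and stopping at the first missing index is an assumption about the loader's behaviour, not something the project confirms.

```python
def collect_vt_keys(environ, max_keys):
    """Read VT_API_KEY_1..VT_API_KEY_<max_keys> from a mapping of variables."""
    keys = []
    for index in range(1, max_keys + 1):
        value = environ.get(f"VT_API_KEY_{index}")
        if not value:
            # Assumption: a gap in the numbering ends the scan.
            break
        keys.append(value)
    return keys
```

Under this assumption, defining VT_API_KEY_1, VT_API_KEY_2 and VT_API_KEY_4 would yield only the first two keys.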

Installation

git clone https://github.com/uleroboticsgroup/ciberlab-report-generator.git
cd ciberlab-report-generator
poetry install

Usage

Recommended usage is to resolve settings first and then build ReportGenerator:

from ciberlabreport.core import ReportGenerator
from ciberlabreport.settings import load_settings

s = load_settings()
rg = ReportGenerator(
    openai_api_key=s.openai_api_key,
    vt_api_keys=s.vt_api_keys,
    max_files=s.max_files,
    max_completion_tokens=s.max_completion_tokens,
    input_path_default=s.input_path_default,
    output_path_default=s.output_path_default,
    config_path=s.config_path,
    prompts_path=s.prompts_path,
    schemas_path=s.schemas_path,
    tmp_path=s.tmp_path,
    ai_model=s.ai_model,
    mode=s.mode,
    img_threshold=s.img_threshold,
    families_list=s.families_list,
)
rg.generate("input.json")

`ReportGenerator` parameters

Parameter	Type	Default	Description
`openai_api_key`	`str`	required	OpenAI API key.
`vt_api_keys`	`list \| None`	`None`	VirusTotal API key list (e.g. `["key1", "key2"]`).
`max_files`	`int`	`3`	Maximum number of input JSON files to process when `input_data` is a directory. If set to `-1`, all files in the directory are processed (no limit).
`preprocess_limits`	`PreprocessLimits \| None`	`PreprocessLimits()`	Custom limits for CAPE preprocessing (e.g. maximum signatures, processes, behaviors kept in the reduced report). If `None`, a default `PreprocessLimits()` instance is created.
`max_completion_tokens`	`int`	`25000`	Maximum number of tokens the LLM may generate in its output. Change it only if execution fails with `finish_reason=length`.
`input_path_default`	`str \| Path \| None`	`None`	Base input path. If omitted, current working directory is used.
`output_path_default`	`str \| Path \| None`	`None`	Base output path. If omitted, current working directory is used.
`config_path`	`str \| Path \| None`	`None`	Base config path (JSON configs). If omitted, current working directory is used.
`prompts_path`	`str \| Path \| None`	`None`	Prompt templates directory. Defaults to `config_path/prompts`.
`schemas_path`	`str \| Path \| None`	`None`	JSON schema directory. Defaults to `config_path/schemas`.
`tmp_path`	`str \| Path \| None`	`None`	Temporary files directory. If omitted, current working directory is used.
`ai_model`	`str`	`gpt-5`	AI model used in LLM calls.
`mode`	`str`	`PRO`	Execution mode (`PRO`, `DEV`).
`img_threshold`	`int`	`9`	Minimum image count to trigger image processing flow.
`families_list`	`list \| None`	`None`	Allowed malware families. If set, postprocessing forces unknown values to `Desconocida`.
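
The effect of `families_list` can be illustrated with a small helper. This is a sketch under stated assumptions: `normalise_family` is a hypothetical name, and the case-insensitive comparison is a choice made for the example; the library may match names differently.

```python
UNKNOWN_FAMILY = "Desconocida"  # value used for families outside the allowed list


def normalise_family(family, families_list=None):
    """Return the family unchanged if allowed, else the unknown marker."""
    if families_list is None:
        return family  # no restriction configured
    # Assumption: matching ignores case and returns the canonical spelling.
    allowed = {name.casefold(): name for name in families_list}
    return allowed.get(family.casefold(), UNKNOWN_FAMILY)
```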

Families by profile

local and batch_uploader: use config/families.json (via families.source: json).
cape: uses database SELECT DISTINCT (via families.source: db).

Base YAML supports both modes:

families:
  source: json
  json_file: families.json
  db:
    host_env: DB_HOST
    port_env: DB_PORT
    user_env: MARIADB_ROOT_USER
    password_env: MARIADB_ROOT_PASSWORD
    database_env: DATABASE
    table: samples
    column: family
    where: "family IS NOT NULL AND family <> ''"
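
As an illustration of the `db` mode, the `table`, `column` and `where` entries map naturally onto a SELECT DISTINCT statement. The sketch below only builds the query string; `build_families_query` is a hypothetical helper, and the identifier check is an added precaution, not documented behaviour.

```python
import re

# Plain SQL identifiers only: letters, digits and underscores.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_families_query(db_cfg):
    """Build the SELECT DISTINCT query from the families.db section."""
    table, column = db_cfg["table"], db_cfg["column"]
    for name in (table, column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    query = f"SELECT DISTINCT {column} FROM {table}"
    where = db_cfg.get("where")
    if where:
        query += f" WHERE {where}"
    return query
```

With the values shown in the base YAML, this produces a query that selects the distinct non-empty `family` values from `samples`.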

Caution

max_files=-1 may produce errors if the specified directory contains a very large number of files.

`ReportGenerator.generate()` parameters

Parameter	Type	Default	Description
`input_data`	`str`	required	Path to a single JSON file or to a directory containing JSON files. If it is a directory, all `*.json` files are collected and optionally limited by `max_files`. If it is a relative path to a file and it does not exist, the generator tries to resolve it under `INPUT_PATH_DEFAULT`.
`output_data`	`str \| None`	`None`	Output path. It can be: (1) an absolute `.pdf` path, (2) a relative `.pdf` name (saved under `OUTPUT_PATH_DEFAULT`), (3) a directory (one PDF per input), or `None`, in which case output files are created under `OUTPUT_PATH_DEFAULT` following the pattern `<stem>-report.pdf`.
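
The `output_data` cases can be summarised in a few lines of path logic. This is a simplified sketch, not the library's code: `resolve_output` is a hypothetical helper, it treats any value that does not end in `.pdf` as a directory, it uses a relative directory as given, and it assumes the `<stem>-report.pdf` pattern also applies inside a directory.

```python
from pathlib import PurePosixPath


def resolve_output(input_file, output_data, output_default):
    """Return the PDF path for one input file, following the documented cases."""
    default_name = f"{PurePosixPath(input_file).stem}-report.pdf"
    if output_data is None:
        # No output given: <OUTPUT_PATH_DEFAULT>/<stem>-report.pdf
        return PurePosixPath(output_default) / default_name
    target = PurePosixPath(output_data)
    if target.suffix.lower() == ".pdf":
        # Absolute .pdf paths are kept; relative names go under the default.
        return target if target.is_absolute() else PurePosixPath(output_default) / target
    # Assumption: anything else is a directory that gets one PDF per input.
    return target / default_name
```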

A simple usage example is available in main.py.

For more code details, check here

Development and testing

To run the linter, formatter and tests, and to generate the Sphinx documentation locally, use ci-local.sh:

chmod +x scripts/ci-local.sh
./scripts/ci-local.sh

Postprocessing guideline

This module gathers all the transformations that must be applied after the LLM call and before the result is passed to the PDFGenerator. Its behaviour relies on configuration entries stored in regex_config.json. Each entry has this structure:

{
  "description": "Description of the changes to apply",
  "pattern": "Regex to compile and apply. Be careful to escape the necessary characters.",
  "replace": "Regex or str to replace if the pattern matches"
},

Important

It is mandatory to set the correct number of entries in n_configs and to increase the version number each time the file changes.
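
A minimal sketch of how such entries could be applied with Python's `re` module follows. The function `apply_regex_configs` and the top-level key `configs` are hypothetical; only `n_configs`, the version number and the three entry fields come from the description above.

```python
import re


def apply_regex_configs(text, config):
    """Apply every pattern/replace entry to the text, in file order."""
    entries = config["configs"]  # hypothetical key holding the entry list
    if config["n_configs"] != len(entries):
        raise ValueError(
            f"n_configs is {config['n_configs']} but {len(entries)} entries were found"
        )
    for entry in entries:
        text = re.sub(entry["pattern"], entry["replace"], text)
    return text
```

Checking `n_configs` before applying anything makes a forgotten counter update fail early instead of silently skipping or misapplying entries.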

Results

The ReportGenerator.generate method now returns a report as a dict object. This report contains:

{
  "input_files": <List[str]>,
  "output_files": <List[str]>,
  "time_spent": <float>,
  "money_spent": <float> | <str>,
  "out_files_data": <List[dict]>
}

Field	Type	Description
`input_files`	`List[str]`	List of names or paths of the input files.
`output_files`	`List[str]`	List of names or paths of the generated output files.
`time_spent`	`float`	Time spent (in minutes) to perform the operation.
`money_spent`	`float` or `str`	Money spent (in $) to perform the operation. Explanation in str if it could not be calculated.
`out_files_data`	`List[dict]`	List of dictionaries containing important values for each processed input.
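
Because `money_spent` can be either a number or an explanatory string, callers may want to handle both cases. The helper below is a hypothetical example of reading the returned dict; it is not part of the library.

```python
def summarise_report(report):
    """Build a one-line summary from the dict returned by generate()."""
    cost = report["money_spent"]
    if isinstance(cost, (int, float)):
        cost_text = f"${cost:.2f}"
    else:
        # The field carries an explanation when the cost was not calculated.
        cost_text = f"cost unavailable ({cost})"
    count = len(report["output_files"])
    return f"{count} report(s) in {report['time_spent']:.1f} min, {cost_text}"
```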

Content of `out_files_data`

[
  {
    "input": <md5_hash>,
    "out_file_data": {
      "malware_type": <List[str]>,
      "family": <str>,
      "is_ransomware": <bool>,
      "signatures": [
        {
          "name": <str>,
          "description": <str>,
          "severity": <int>,
          "confidence": <int>
        },
        ...
      ],
      "analysis_summary": <str>,
      "initial_recommendations": <str>,
      "iocs": [
        {
          "type": <str>,
          "value": <str>,
          "observations": <str>
        },
        ...
      ]
    }
  },
  ...
]

Field	Type	Description
`input`	`str`	MD5 hash that identifies the analyzed input sample.
`out_file_data`	`dict`	Aggregated report data extracted for that input sample.
`out_file_data.malware_type`	`List[str]`	List of detected malware categories (for example: trojan, spyware).
`out_file_data.family`	`str`	Detected malware family name.
`out_file_data.is_ransomware`	`bool`	Indicates whether the sample is classified as ransomware.
`out_file_data.signatures`	`List[dict]`	List of behavioral signatures found during analysis.
`out_file_data.signatures[].name`	`str`	Signature name.
`out_file_data.signatures[].description`	`str`	Signature description.
`out_file_data.signatures[].severity`	`int`	Signature severity score.
`out_file_data.signatures[].confidence`	`int`	Signature confidence score.
`out_file_data.analysis_summary`	`str`	Executive summary of the analysis results.
`out_file_data.initial_recommendations`	`str`	Initial mitigation and response recommendations.
`out_file_data.iocs`	`List[dict]`	List of extracted Indicators of Compromise (IOCs).
`out_file_data.iocs[].type`	`str`	IOC type (for example: domain, IP, hash, URL).
`out_file_data.iocs[].value`	`str`	IOC value.
`out_file_data.iocs[].observations`	`str`	Additional context or notes for the IOC.
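
As an example of consuming this structure, the sketch below groups IOC values by type across all processed inputs. The function name `iocs_by_type` is hypothetical; the field names are the ones listed in the table above.

```python
from collections import defaultdict


def iocs_by_type(report):
    """Group IOC values by their type across every processed input."""
    grouped = defaultdict(list)
    for item in report["out_files_data"]:
        for ioc in item["out_file_data"]["iocs"]:
            grouped[ioc["type"]].append(ioc["value"])
    return dict(grouped)
```

The same traversal works for `signatures`, for instance to keep only entries above a chosen severity.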

Automatic batch process

See the guide here: batch_upload/README.md

Cape module integration

See the guide here: cape/README.md

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Canonical legal text for this repository: LICENSE
Human-readable summary: https://creativecommons.org/licenses/by-nc-sa/4.0/
Official legalcode reference: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
