uleroboticsgroup/ciberlab-report-generator

Ciberlab-report-generator

CC BY-NC-SA 4.0

Description

ciberlab-report-generator is a Python library developed by the Grupo de Robótica that automates the dynamic analysis of malware samples and generates human-readable PDF reports.
It includes data processing, page generation, integration with LLMs (when necessary), and support for documentation with Sphinx.

Key features

  • Dynamic malware analysis output processing (slicing, normalisation).
  • Automated PDF report generation.
  • Package structure for reuse in other projects.
  • Best practices: packaging with Poetry, testing with pytest, Ruff + Black linter/formatter.
  • Documentation generated with Sphinx.

Prerequisites

  • Python >= 3.12
  • Poetry >= 2.0.0 to manage the environment and dependencies.

Configuration (YAML + ENV)

The project now uses a hybrid configuration model:

  • YAML files for non-sensitive configuration (paths, mode, model, limits).
  • Environment variables for secrets (OPENAI_API_KEY, VT_API_KEY_X) and runtime overrides.

Base and profile files:

  • config/base.yaml
  • config/profiles/local.yaml
  • config/profiles/cape.yaml
  • config/profiles/batch_uploader.yaml

Canonical env template:

cp config/.env.example .env

Which env file is applied:

  1. If CIBERLABREPORT_ENV_FILE is set, that file is loaded.
  2. Otherwise, the nearest .env file found starting from the current working directory is loaded.
  3. Existing shell variables always win over file values (override=False).

Final settings precedence (later sources override earlier ones):

  1. Base YAML (CIBERLABREPORT_CONF_FILE, default config/base.yaml)
  2. Profile YAML (CIBERLABREPORT_PROFILE, default local)
  3. Environment variables (from shell and/or loaded env file)
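
A minimal sketch of this layering, assuming each source has already been parsed into a plain dict and that nested sections are merged key by key. Both deep_merge and resolve_settings are hypothetical names; the real loader may combine the sources differently.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from override replace those in base."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections are merged recursively instead of replaced whole.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_settings(
    base: dict[str, Any],
    profile: dict[str, Any],
    env_overrides: dict[str, Any],
) -> dict[str, Any]:
    """Apply the layers in order: base YAML, then profile YAML, then environment."""
    return deep_merge(deep_merge(base, profile), env_overrides)


# Illustrative values only; "example-model" is a placeholder.
base = {"ai_model": "gpt-5", "families": {"source": "json", "json_file": "families.json"}}
profile = {"families": {"source": "db"}}
env_overrides = {"ai_model": "example-model"}
settings = resolve_settings(base, profile, env_overrides)
```

In this example the profile changes only families.source, the environment changes only ai_model, and every other base value survives.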

Environment variables used by the loader:

| Required | Description |
|---|---|
| 🟢 | Obligatory |
| (blank) | Optional |
| 🔴 | Critical |

| Variable | Required | Default | Description |
|---|---|---|---|
| OPENAI_API_KEY | 🔴 | | OpenAI API key (secret). |
| VT_API_KEY_1 | 🟢 | | First VirusTotal API key (secret). |
| VT_API_KEY_X | | | Additional VT keys (VT_API_KEY_2, VT_API_KEY_3, ...). |
| CIBERLABREPORT_CONF_FILE | 🟢 | config/base.yaml | Base YAML config file path. |
| CIBERLABREPORT_PROFILE | 🟢 | local | Profile name under config/profiles/. |
| CIBERLABREPORT_ENV_FILE | | | Optional env file loaded before resolving settings. |
| MAX_VT_KEYS | | YAML runtime.max_keys_list | Maximum number of sequential VT_API_KEY_X to read. |
| INPUT_PATH_DEFAULT | | From YAML | Runtime override for input path. |
| OUTPUT_PATH_DEFAULT | | From YAML | Runtime override for output path. |
| CONFIG_PATH | | From YAML | Runtime override for config path. |
| PROMPTS_PATH | | From YAML | Runtime override for prompts path. |
| SCHEMAS_PATH | | From YAML | Runtime override for schemas path. |
| TMP_PATH | | From YAML | Runtime override for temporary path. |
| AI_MODEL | | From YAML | Runtime override for LLM model. |
| MODE | | From YAML | Runtime override for execution mode (PRO, DEV). |
| IMG_THRESHOLD | | From YAML | Runtime override for image threshold. |
| FAMILIES_SOURCE | | From YAML families.source | Families source override (json or db). |
| LOG_LEVEL | | INFO | Logging level used by scripts. |

Important

It is highly recommended to configure at least one VirusTotal (VT) API key. To add more keys, define them sequentially, replacing the X in VT_API_KEY_X with consecutive numbers (VT_API_KEY_2, VT_API_KEY_3, ...).
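
The sequential naming matters because the keys are read in numeric order up to MAX_VT_KEYS. A rough sketch of that idea follows; collect_vt_keys is a hypothetical helper, and stopping at the first missing number is an assumption about how the loader treats gaps.

```python
from collections.abc import Mapping


def collect_vt_keys(environ: Mapping[str, str], max_keys: int) -> list[str]:
    """Read VT_API_KEY_1, VT_API_KEY_2, ... in order, up to max_keys entries."""
    keys: list[str] = []
    for index in range(1, max_keys + 1):
        value = environ.get(f"VT_API_KEY_{index}", "").strip()
        if not value:
            # Assumption: a gap in the numbering ends the scan, so
            # VT_API_KEY_4 is ignored when VT_API_KEY_3 is missing.
            break
        keys.append(value)
    return keys


# Placeholder values, not real keys.
example_env = {"VT_API_KEY_1": "key-a", "VT_API_KEY_2": "key-b", "VT_API_KEY_4": "key-d"}
found_keys = collect_vt_keys(example_env, max_keys=5)
```

Under that assumption, found_keys holds only the first two keys, which is why the numbering should not skip values.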

Installation

git clone https://github.com/uleroboticsgroup/ciberlab-report-generator.git
cd ciberlab-report-generator
poetry install

Usage

The recommended usage is to resolve the settings first and then build a ReportGenerator:

from ciberlabreport.core import ReportGenerator
from ciberlabreport.settings import load_settings

s = load_settings()
rg = ReportGenerator(
    openai_api_key=s.openai_api_key,
    vt_api_keys=s.vt_api_keys,
    max_files=s.max_files,
    max_completion_tokens=s.max_completion_tokens,
    input_path_default=s.input_path_default,
    output_path_default=s.output_path_default,
    config_path=s.config_path,
    prompts_path=s.prompts_path,
    schemas_path=s.schemas_path,
    tmp_path=s.tmp_path,
    ai_model=s.ai_model,
    mode=s.mode,
    img_threshold=s.img_threshold,
    families_list=s.families_list,
)
rg.generate("input.json")

ReportGenerator parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| openai_api_key | str | required | OpenAI API key. |
| vt_api_keys | list \| None | None | VirusTotal API key list (e.g. ["key1", "key2"]). |
| max_files | int | 3 | Maximum number of input JSON files to process when input_data is a directory. If set to -1, all files in the directory are processed (no limit). |
| preprocess_limits | PreprocessLimits \| None | PreprocessLimits() | Custom limits for CAPE preprocessing (e.g. maximum signatures, processes, behaviors kept in the reduced report). If None, a default PreprocessLimits() instance is created. |
| max_completion_tokens | int | 25000 | Upper bound on the number of tokens the LLM may generate in its output. Change it only if execution fails with finish_reason=length. |
| input_path_default | str \| Path \| None | None | Base input path. If omitted, current working directory is used. |
| output_path_default | str \| Path \| None | None | Base output path. If omitted, current working directory is used. |
| config_path | str \| Path \| None | None | Base config path (JSON configs). If omitted, current working directory is used. |
| prompts_path | str \| Path \| None | None | Prompt templates directory. Defaults to config_path/prompts. |
| schemas_path | str \| Path \| None | None | JSON schema directory. Defaults to config_path/schemas. |
| tmp_path | str \| Path \| None | None | Temporary files directory. If omitted, current working directory is used. |
| ai_model | str | gpt-5 | AI model used in LLM calls. |
| mode | str | PRO | Execution mode (PRO, DEV). |
| img_threshold | int | 9 | Minimum image count to trigger image processing flow. |
| families_list | list \| None | None | Allowed malware families. If set, postprocessing forces unknown values to Desconocida. |

Families by profile

  • local and batch_uploader: use config/families.json (via families.source: json).
  • cape: reads the families from the database with a SELECT DISTINCT query (via families.source: db).

Base YAML supports both modes:

families:
  source: json
  json_file: families.json
  db:
    host_env: DB_HOST
    port_env: DB_PORT
    user_env: MARIADB_ROOT_USER
    password_env: MARIADB_ROOT_PASSWORD
    database_env: DATABASE
    table: samples
    column: family
    where: "family IS NOT NULL AND family <> ''"
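
As a sketch of how the db section could translate into the SELECT DISTINCT query used by the cape profile: build_families_query is a hypothetical helper, and the identifier check is an added precaution rather than documented behaviour. Connection details (the *_env entries name the environment variables holding them) are left out.

```python
import re

# Table and column names cannot be passed as SQL parameters, so this sketch
# only accepts plain identifiers.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_families_query(db_config: dict) -> str:
    """Build the SELECT DISTINCT statement from the families.db settings."""
    table = db_config["table"]
    column = db_config["column"]
    for name in (table, column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    query = f"SELECT DISTINCT {column} FROM {table}"
    where = db_config.get("where")
    if where:
        # The filter comes from trusted configuration, not from user input.
        query += f" WHERE {where}"
    return query


db_config = {
    "table": "samples",
    "column": "family",
    "where": "family IS NOT NULL AND family <> ''",
}
families_query = build_families_query(db_config)
```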

Caution

max_files=-1 may cause errors if the specified directory contains a very large number of files.

ReportGenerator.generate() parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| input_data | str | required | Path to a single JSON file or to a directory containing JSON files. If it is a directory, all *.json files are collected and optionally limited by max_files. If it is a relative path to a file and it does not exist, the generator tries to resolve it under INPUT_PATH_DEFAULT. |
| output_data | str \| None | None | Output path. It can be: (1) an absolute *.pdf path, (2) a relative *.pdf name (saved under OUTPUT_PATH_DEFAULT), (3) a directory (one PDF per input), or None, in which case output files are created under OUTPUT_PATH_DEFAULT following the pattern <stem>-report.pdf. |

A simple usage example is available in main.py.

For more code details, check here

Development and testing

To run the linter, the formatter and the tests, and to generate the Sphinx documentation locally, use ci-local.sh:

chmod +x scripts/ci-local.sh
./scripts/ci-local.sh

Postprocessing guideline

This module collects all the transformations that must be applied after the LLM call and before the result is passed to the PDFGenerator. Its behaviour is driven by configuration entries stored in regex_config.json. Each entry has this structure:

{
  "description": "Description of the changes to apply",
  "pattern": "Regex to compile and apply. Be careful to escape the necessary characters.",
  "replace": "Regex or str to replace if the pattern matches"
},

Important

It is mandatory to set n_configs to the correct number of entries, and to increase the version number each time the file changes.
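
A minimal sketch of how such entries could be applied with Python's re module. The top-level keys "configs" and "version" are assumptions about the layout of regex_config.json (only n_configs and the entry fields are named above), and apply_regex_configs is a hypothetical function.

```python
import re


def apply_regex_configs(text: str, config: dict) -> str:
    """Apply every pattern/replace entry to the text, in file order."""
    entries = config["configs"]  # hypothetical key holding the entry list
    if config.get("n_configs") != len(entries):
        # Mirrors the rule that n_configs must match the number of entries.
        raise ValueError("n_configs does not match the number of entries")
    for entry in entries:
        text = re.sub(entry["pattern"], entry["replace"], text)
    return text


# Illustrative configuration, not the project's real file.
example_config = {
    "version": 2,
    "n_configs": 2,
    "configs": [
        {"description": "Collapse repeated spaces", "pattern": r" {2,}", "replace": " "},
        {"description": "Strip bold markers", "pattern": r"\*\*(.+?)\*\*", "replace": r"\1"},
    ],
}
cleaned = apply_regex_configs("A  **bold**   claim", example_config)
```

Because the entries run in order, a later pattern sees the output of the earlier ones, which is worth keeping in mind when adding new entries.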

Results

The ReportGenerator.generate method now returns a report as a dict object. This report contains:

{
  "input_files": <List[str]>,
  "output_files": <List[str]>,
  "time_spent": <float>,
  "money_spent": <float> | <str>,
  "out_files_data": <List[dict]>
}

| Field | Type | Description |
|---|---|---|
| input_files | List[str] | List of names or paths of the input files. |
| output_files | List[str] | List of names or paths of the generated output files. |
| time_spent | float | Time spent (in minutes) to perform the operation. |
| money_spent | float or str | Money spent (in $) to perform the operation. Explanation in str if it could not be calculated. |
| out_files_data | List[dict] | List of dictionaries containing important values for each processed input. |

Content of out_files_data

[
  {
    "input": <md5_hash>,
    "out_file_data": {
      "malware_type": <List[str]>,
      "family": <str>,
      "is_ransomware": <bool>,
      "signatures": [
        {
          "name": <str>,
          "description": <str>,
          "severity": <int>,
          "confidence": <int>
        },
        ...
      ],
      "analysis_summary": <str>,
      "initial_recommendations": <str>,
      "iocs": [
        {
          "type": <str>,
          "value": <str>,
          "observations": <str>
        },
        ...
      ]
    }
  },
  ...
]

| Field | Type | Description |
|---|---|---|
| input | str | MD5 hash that identifies the analyzed input sample. |
| out_file_data | dict | Aggregated report data extracted for that input sample. |
| out_file_data.malware_type | List[str] | List of detected malware categories (for example: trojan, spyware). |
| out_file_data.family | str | Detected malware family name. |
| out_file_data.is_ransomware | bool | Indicates whether the sample is classified as ransomware. |
| out_file_data.signatures | List[dict] | List of behavioral signatures found during analysis. |
| out_file_data.signatures[].name | str | Signature name. |
| out_file_data.signatures[].description | str | Signature description. |
| out_file_data.signatures[].severity | int | Signature severity score. |
| out_file_data.signatures[].confidence | int | Signature confidence score. |
| out_file_data.analysis_summary | str | Executive summary of the analysis results. |
| out_file_data.initial_recommendations | str | Initial mitigation and response recommendations. |
| out_file_data.iocs | List[dict] | List of extracted Indicators of Compromise (IOCs). |
| out_file_data.iocs[].type | str | IOC type (for example: domain, IP, hash, URL). |
| out_file_data.iocs[].value | str | IOC value. |
| out_file_data.iocs[].observations | str | Additional context or notes for the IOC. |
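
To show how the returned dict might be consumed, the sketch below lists the inputs flagged as ransomware and groups IOC values by type. summarise_report is a hypothetical helper, and the sample report contains placeholder values, not real analysis results.

```python
from collections import defaultdict


def summarise_report(report: dict) -> dict:
    """Collect ransomware inputs and IOC values grouped by IOC type."""
    ransomware_inputs: list[str] = []
    iocs_by_type: dict[str, set[str]] = defaultdict(set)
    for item in report["out_files_data"]:
        data = item["out_file_data"]
        if data["is_ransomware"]:
            ransomware_inputs.append(item["input"])
        for ioc in data["iocs"]:
            iocs_by_type[ioc["type"]].add(ioc["value"])
    return {
        "ransomware_inputs": ransomware_inputs,
        # Sorted lists keep the summary stable between runs.
        "iocs_by_type": {kind: sorted(values) for kind, values in iocs_by_type.items()},
    }


# Placeholder report that follows the documented structure.
sample_report = {
    "input_files": ["sample.json"],
    "output_files": ["sample-report.pdf"],
    "time_spent": 1.5,
    "money_spent": "not calculated",
    "out_files_data": [
        {
            "input": "0" * 32,
            "out_file_data": {
                "malware_type": ["trojan"],
                "family": "Desconocida",
                "is_ransomware": True,
                "signatures": [],
                "analysis_summary": "",
                "initial_recommendations": "",
                "iocs": [
                    {"type": "domain", "value": "example.com", "observations": ""},
                    {"type": "domain", "value": "example.org", "observations": ""},
                ],
            },
        }
    ],
}
summary = summarise_report(sample_report)
```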

Automatic batch process

See the guide here: batch_upload/README.md

Cape module integration

See the guide here: cape/README.md

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

About

The Ciberlab project gathers dynamic analyses of multiple malware samples. In this project, the analysis output is sliced and processed to generate human-readable PDF reports.
