Explainable Predictive Models through Semantic Constraint Validation

SDM-TIB/ConstrainTree

License: MIT

ConstrainTree

ConstrainTree aims to enhance the explainability of predictive models through semantic constraint validation.

Preparation of the Environment

Machine Requirements

  • OS: Ubuntu 20.04.6 LTS or newer
  • Memory: 16 GiB

Software

  • Docker - v19.03.6 or newer
  • docker-compose - v1.26.0 or newer
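A quick way to compare the installed versions against the minimums listed above is sketched below. The helper names `version_ge` and `check_tool` are hypothetical and not part of this repository; the sketch assumes GNU `sort -V` is available and that both tools print an `x.y.z` version string when called with `--version`. On systems where Compose is only installed as the `docker compose` plugin, the `docker-compose` check will report it as not found.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: report whether an installed tool meets the minimum.
check_tool() {
  local name="$1" minimum="$2" found
  if ! command -v "$name" >/dev/null 2>&1; then
    echo "$name: not found"
    return 1
  fi
  # Take the first x.y.z-looking token from the tool's version banner.
  found="$("$name" --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
  if [ -n "$found" ] && version_ge "$found" "$minimum"; then
    echo "$name: $found (ok, needs >= $minimum)"
  else
    echo "$name: ${found:-unknown} (needs >= $minimum)"
    return 1
  fi
}

# Minimum versions are taken from the Software list above.
check_tool docker 19.03.6 || true
check_tool docker-compose 1.26.0 || true
```

The `|| true` keeps the script going after a failed check so that both tools are always reported.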

Bash Commands

The experiment scripts use the following bash commands:

  • basename
  • echo
  • sleep
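Whether these commands are available in the current shell can be checked with `command -v`, as in the following minimal sketch. The variable name `missing` is chosen for illustration only.

```shell
#!/usr/bin/env bash
# Report any of the commands used by the experiment scripts that are missing.
missing=0
for cmd in basename echo sleep; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found: $cmd"
  else
    echo "missing: $cmd"
    missing=1
  fi
done
# 0 means every command was found, 1 means at least one is missing.
echo "missing=$missing"
```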

Experiments

Research Questions

  1. What is the overhead of validating integrity constraints as part of an ML pipeline?
  2. Can the results from the integrity constraint validation be used to improve the performance of the ML model?

Data & Integrity Constraints

The SynthLC 1k, 10k, and 100k datasets [1] are used; they model the data of 1,000, 10,000, and 100,000 lung cancer patients, respectively. The integrity constraints are random combinations of biomarker, drug, and relapse. The SHACL shapes schema consists of 25 shapes. Two variants of the schema are evaluated; they differ only in how the constraints are represented. One variant uses SPARQL constraints, a feature of full SHACL. The other avoids the costly validation of SPARQL constraints by using a target query instead.

Prediction Task & Models

The predictive model is trained to predict whether a patient is likely to experience relapse. There are two baselines: (i) performing no validation, and (ii) performing a naive SHACL validation. ConstrainTree is evaluated with one, ten, and twenty of the shapes from each of the SHACL shapes schemas. Additionally, for the model analyst, the impact of using the validation results as a feature on the model's performance is studied.

How to reproduce?

To reproduce the execution time experiment, run:

./model_builder.sh

To reproduce the study of how using the validation results as a feature affects the model's performance, run:

./model_analyst.sh

License

ConstrainTree is licensed under the MIT license, see the license.

References

[1] Philipp D. Rohde, Maria-Esther Vidal. SynthLC. Leibniz Data Manager. 2024. DOI: 10.57702/oyfz6rmc