ConstrainTree aims to enhance the explainability of predictive models through semantic constraint validation.
- OS: Ubuntu 20.04.6 LTS or newer
- Memory: 16 GiB
- Docker: v19.03.6 or newer
- docker-compose: v1.26.0 or newer
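The version requirements above can be checked before starting the experiments. The following is a minimal sketch and not part of the repository: the helper names `extract_version`, `version_ge`, and `check_min` are hypothetical, and the comparison assumes GNU `sort` with its `-V` (version sort) option, as shipped with Ubuntu.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking minimum versions; not part of ConstrainTree.

# extract_version TEXT: print the first X.Y.Z number found in TEXT.
extract_version() {
  echo "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# version_ge A B: succeed if version A is greater than or equal to version B.
# Assumption: GNU sort is available, since -V is a GNU extension.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME INSTALLED REQUIRED: report whether INSTALLED satisfies REQUIRED.
check_min() {
  if version_ge "$2" "$3"; then
    echo "$1 $2 satisfies >= $3"
  else
    echo "$1 $2 is older than $3"
    return 1
  fi
}

# Possible usage (requires Docker and docker-compose to be installed):
#   check_min Docker "$(extract_version "$(docker --version)")" 19.03.6
#   check_min docker-compose "$(extract_version "$(docker-compose --version)")" 1.26.0
```

The block only defines functions, so sourcing it has no side effects; the commented calls at the end show how it might be applied to the two tools listed above.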
The experiment scripts use the following bash commands:
- basename
- echo
- sleep
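Whether these commands are available can be verified up front. This is a small sketch under the assumption that a POSIX-style `command -v` lookup is sufficient; the function name `missing_commands` is hypothetical and not part of the experiment scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every given command that cannot be found.
# command -v also resolves shell builtins such as echo.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Check the commands used by the experiment scripts.
missing="$(missing_commands basename echo sleep)"
if [ -n "$missing" ]; then
  echo "Missing commands: $missing"
else
  echo "All required commands are available"
fi
```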
- What is the overhead of validating integrity constraints as part of an ML pipeline?
- Can the results from the integrity constraint validation be used to improve the performance of the ML model?
The SynthLC 1k, 10k, and 100k datasets [1] are used; they model data of 1,000, 10,000, and 100,000 lung cancer patients, respectively. The integrity constraints are random combinations of biomarker, drug, and relapse. The SHACL shapes schema consists of 25 shapes. Two SHACL shapes schemas are evaluated; they differ only in how the constraints are represented. One uses SPARQL constraints, a feature of full SHACL. The other mitigates the costly validation of SPARQL constraints by using a target query.
The predictive model is trained to predict whether a patient is likely to experience a relapse. The baselines are (i) performing no validation and (ii) performing a naive SHACL validation. ConstrainTree is evaluated with one, ten, and twenty of the shapes of each SHACL shapes schema. Additionally, for the model analyst, the impact of using the validation results as a feature on the model's performance is studied.
The experiment regarding the execution time can be reproduced by executing:
./model_builder.sh
To reproduce the study of the impact of validation results as a feature on the model's performance, run:
./model_analyst.sh
ConstrainTree is licensed under the MIT license; see the license.
[1] Philipp D. Rohde, Maria-Esther Vidal. SynthLC. Leibniz Data Manager. 2024. DOI: 10.57702/oyfz6rmc