ConstrainTree

ConstrainTree aims to enhance the explainability of predictive models through semantic constraint validation.

Preparation of the Environment

Machine Requirements

OS: Ubuntu 20.04.6 LTS or newer
Memory: 16 GiB

Software

Docker - v19.03.6 or newer
docker-compose - v1.26.0 or newer

Bash Commands

The experiment scripts use the following bash commands:

basename
echo
sleep
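
Before running the experiments, the software and bash commands listed above can be checked programmatically. The following is a minimal sketch and not part of the repository; the function name `missing_commands` is hypothetical, and it only checks that each executable is found on the PATH, not that its version is recent enough.

```python
import shutil
from typing import Iterable, List

# Executables named in the Software and Bash Commands sections.
REQUIRED = ["docker", "docker-compose", "basename", "echo", "sleep"]


def missing_commands(commands: Iterable[str]) -> List[str]:
    """Return the commands that cannot be found on the PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required commands were found.")
```

Version requirements still need to be verified by hand, for example with `docker --version` and `docker-compose --version`.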

Experiments

Research Questions

What is the overhead of validating integrity constraints as part of an ML pipeline?
Can the results from the integrity constraint validation be used to improve the performance of the ML model?
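
One way to quantify the overhead asked about in the first question is to time the same pipeline with and without the validation step and report the relative difference. The sketch below assumes that definition, which this README does not state; `timed`, `relative_overhead`, and the pipeline callables in the usage comment are hypothetical names.

```python
import time
from typing import Callable


def timed(run: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one call to run()."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def relative_overhead(baseline_s: float, validated_s: float) -> float:
    """Extra time of the validated run as a fraction of the baseline run."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (validated_s - baseline_s) / baseline_s


# Hypothetical usage: both callables stand in for real pipeline runs.
# overhead = relative_overhead(timed(pipeline_without_validation),
#                              timed(pipeline_with_validation))
```

Under this assumed definition, a value of 0.5 means the run with validation took 50% longer than the run without it.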

Data & Integrity Constraints

The SynthLC 1k, 10k, and 100k datasets [1] are used; they model the data of 1,000, 10,000, and 100,000 lung cancer patients, respectively. The integrity constraints are random combinations of biomarker, drug, and relapse. The SHACL shapes schema consists of 25 shapes. Two variants of the schema are evaluated; they differ only in how the constraints are represented. The first variant uses SPARQL constraints, a feature of full SHACL. The second variant avoids the costly validation of SPARQL constraints by using a target query instead.

Prediction Task & Models

The predictive model is trained to predict whether a patient is likely to experience relapse. Two baselines are considered: (i) performing no validation, and (ii) performing a naive SHACL validation. ConstrainTree is evaluated with one, ten, and twenty of the shapes from each of the SHACL shapes schemas. Additionally, for the model analyst, the impact of using the validation results as a feature on the model's performance is studied.
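
Using validation results as a feature can be pictured as adding one column per constraint that records whether each patient satisfies it. The following is a minimal sketch under stated assumptions, not the repository's implementation: the patient records, the values `B1`, `D1`, and `D2`, the constraint itself, and the names `hypothetical_constraint` and `add_validation_feature` are all made up for illustration, and a plain Python function stands in for a SHACL shape.

```python
from typing import Callable, Dict, List

Patient = Dict[str, object]


def hypothetical_constraint(patient: Patient) -> bool:
    """Toy stand-in for a SHACL shape: biomarker B1 must not be combined with drug D2."""
    return not (patient["biomarker"] == "B1" and patient["drug"] == "D2")


def add_validation_feature(
    patients: List[Patient],
    constraint: Callable[[Patient], bool],
    name: str = "constraint_valid",
) -> List[Patient]:
    """Return copies of the records with a 0/1 column holding the validation result."""
    return [{**p, name: int(constraint(p))} for p in patients]


# Made-up records; the real data come from the SynthLC datasets.
patients = [
    {"id": "p1", "biomarker": "B1", "drug": "D1"},
    {"id": "p2", "biomarker": "B1", "drug": "D2"},
    {"id": "p3", "biomarker": "B2", "drug": "D2"},
]
enriched = add_validation_feature(patients, hypothetical_constraint)
```

The enriched records can then be passed to the model in place of the original ones, so that the model's performance with and without the extra column can be compared.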

How to reproduce?

The experiment regarding the execution time can be reproduced by executing:

./model_builder.sh

For reproducing the study about the impact of validation results as a feature on the model's performance, run:

./model_analyst.sh

Licence

ConstrainTree is licensed under the MIT license; see the LICENSE file.

References

[1] Philipp D. Rohde, Maria-Esther Vidal. SynthLC. Leibniz Data Manager. 2024. DOI: 10.57702/oyfz6rmc
