ModelSet is a labelled dataset of software models.
This repository contains:
- The ModelSet databases with the labelled datasets. See the Downloading ModelSet section for more information.
- The scripts to create the databases and generate the release. See the Building ModelSet section for more information.
You can find more information about ModelSet in the following Open Access paper: https://link.springer.com/article/10.1007%2Fs10270-021-00929-3.
To download ModelSet follow these steps:
1. Download the latest release from https://github.com/modelset/modelset-dataset/releases
2. Unzip the package in some location of your workspace.
3. The structure of the decompressed package is the following:

   + datasets
     + dataset.ecore/data/ecore.db
     + dataset.genmymodel/data/genmymodel.db
   + graph
   + raw-data
   + txt
3.1. The `datasets` folder contains the databases with the labelled models. The `.db` files are SQLite databases containing the information about the models. The database schema includes a table called `model` with model data (i.e., unique identifier, source repository and filename) and a table called `metadata` with label data (i.e., unique identifier and a JSON object with the label information). The following figure illustrates the database schema:
```
+-------------------+     +---------------------------------+
|       model       |     |            metadata             |
+-------------------+     +---------------------------------+
| id : VARCHAR {PK} |     | id : VARCHAR {PK, FK(model.id)} |
| source : VARCHAR  |     | json : TEXT                     |
| filename : TEXT   |     +---------------------------------+
+-------------------+
```
3.2. The `graph` folder contains the graph representation of the models.

3.3. The `raw-data` folder contains the models serialized in XMI.

3.4. The `txt` folder includes the strings of the models (e.g., to train simple NLP models).
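To give an idea of how the `txt` folder can feed a simple NLP pipeline, here is a minimal Python sketch that builds a TF-IDF matrix over the text files. It assumes the folder contains plain-text files (one per model) and that scikit-learn is installed; check the actual layout of your release and adjust the path and pattern accordingly.

```python
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: the txt folder holds plain-text files, one per model.
# Adjust the path and the glob pattern to the layout of the release you downloaded.
txt_dir = Path("txt")
documents = [p.read_text(encoding="utf-8") for p in sorted(txt_dir.rglob("*.txt"))]

# Turn the model strings into a sparse TF-IDF matrix, a common starting
# point for simple clustering or classification experiments.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)
print(tfidf.shape)  # (number of models, vocabulary size)
```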
Once you have downloaded ModelSet, you can query the databases using JDBC. For instance, the following Java code opens one of the SQLite databases and lists each model together with its metadata:
```java
// Open one of the SQLite databases (adjust the path to your workspace)
Connection dataset = DriverManager.getConnection("jdbc:sqlite:/path/to/dbfile");
// Retrieve each model together with its JSON metadata
PreparedStatement stm = dataset.prepareStatement("select mo.id, mo.filename, mm.metadata from models mo join metadata mm on mo.id = mm.id");
stm.execute();
ResultSet rs = stm.getResultSet();
while (rs.next()) {
    String id = rs.getString(1);
    String filename = rs.getString(2);
    String metadata = rs.getString(3);
    System.out.println(id + ": " + metadata);
}
```
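If you prefer plain Python over JDBC, the standard `sqlite3` module can issue the same query. The following sketch (the database path is a placeholder) first lists the tables found in the file, which is a quick way to confirm the schema shown above, and then runs the same join as the Java example:

```python
import sqlite3

# Placeholder path: point this to one of the .db files in the datasets folder
with sqlite3.connect("/path/to/dbfile") as conn:
    # List the tables actually present in the database
    for (name,) in conn.execute("select name from sqlite_master where type = 'table'"):
        print("table:", name)

    # Same join as in the JDBC example: each model with its metadata
    query = (
        "select mo.id, mo.filename, mm.metadata "
        "from models mo join metadata mm on mo.id = mm.id"
    )
    for model_id, filename, metadata in conn.execute(query):
        print(f"{model_id}: {metadata}")
```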
To use ModelSet in a typical Python/Jupyter setting, we recommend using the modelset-py Python library we have developed. Visit the corresponding repository for more information.
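If you are working in a notebook but have not installed modelset-py yet, a plain pandas query over the SQLite file gives you a DataFrame to start with. This is only a sketch (it does not use the modelset-py API), and the database path is a placeholder:

```python
import json
import sqlite3

import pandas as pd

# Placeholder path: one of the .db files in the datasets folder
conn = sqlite3.connect("/path/to/dbfile")
df = pd.read_sql_query(
    "select mo.id, mo.filename, mm.metadata "
    "from models mo join metadata mm on mo.id = mm.id",
    conn,
)
conn.close()

# The metadata column stores a JSON object with the labels; parse it into dicts
df["metadata"] = df["metadata"].apply(json.loads)
print(df.head())
```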
We provide some examples of how to use ModelSet in the examples repository.
Note: these steps are only required if you want to create a new release of ModelSet. If you just want to use ModelSet, you can download the latest release (see the Downloading ModelSet section in this file).
To create the ModelSet release, you have to follow these steps:
- Execute `./bin/download-data.sh` to recover the model files, which will be stored in the `raw-data` folder. These files are not stored here as they have already been published in existing GitHub repositories.
- Execute `./bin/generate.sh` to generate additional artifacts.
- Execute `./bin/build.sh` to build the ModelSet release package.
- The resulting ModelSet release package will be named `modelset.zip`.
If you find this dataset useful, please consider citing its associated paper: https://link.springer.com/article/10.1007/s10270-021-00929-3
@article{lopez2021modelset,
title = {{ModelSet: a dataset for machine learning in model-driven engineering}},
author = {L{\'o}pez, Jos{\'e} Antonio Hern{\'a}ndez and
C{\'a}novas Izquierdo, Javier Luis and
Cuadrado, Jes{\'u}s S{\'a}nchez},
journal = {Softw. Syst. Model.},
volume = {21},
number = {3},
pages = {967--986},
year = {2022},
url = {https://doi.org/10.1007/s10270-021-00929-3},
}
We welcome contributions of all kinds, including extensions to the dataset, new empirical studies, and new features. If you want to contribute to ModelSet, please review our contribution guidelines and our governance model.
Note that we have a code of conduct that we expect project participants to adhere to. Please read it before contributing.
This dataset is licensed under the GNU Lesser General Public License v3.0.