Praxi-Pipeline

Overview

On NERC, new container images are built every day. Because NERC is a research testbed, however, many of these images may rely on outdated or vulnerable packages. The problem is hard to spot, since most researchers focus on delivering their studies quickly and efficiently. We are therefore developing an automation tool that periodically inspects the dependencies of recently built images and raises an alert for those that could harm the cluster and its community.

The core design is motivated by Praxi, a software discovery tool previously developed by the AI-4-Cloud-Ops group: https://github.com/peaclab/praxi. We collect data (by generating package installation fingerprints), tokenize it, and generate predictions. Furthermore, we build a decomposable ML system that incorporates new packages for discovery efficiently and minimizes incremental training cost. The ML system is deployed and evaluated using OpenShift Data Science Pipelines (Kubeflow Pipelines) on NERC.

Project Progress

https://docs.google.com/presentation/d/127wQdDaU1EWnZZRln63-D4_kqwltRWXAjWwCevHPhnk/edit?usp=sharing

Inference Pipeline

In the inference pipeline, processing proceeds from left to right.

First, we design a component that pulls images from Docker Hub based on cluster observability data, i.e., Advanced Cluster Monitoring (Prometheus and Grafana). We target Docker Hub because it is a popular registry at the moment. Once the RHODS image registry is delivered, adapting to other registries should mainly involve changing the API used in this component.
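As a rough illustration of how observability data could drive image selection, the sketch below extracts image references from a Prometheus instant-query response. It assumes the cluster exposes the kube-state-metrics series `kube_pod_container_info`, whose `image` label names the container image. The helper name `images_from_prometheus` and the sample payload are hypothetical, and the actual component may select images differently.

```python
import json

# Assumed PromQL query; kube_pod_container_info is provided by kube-state-metrics.
QUERY = "kube_pod_container_info"


def images_from_prometheus(response_text):
    """Return the sorted, de-duplicated image names found in an instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    images = set()
    for series in payload["data"]["result"]:
        image = series["metric"].get("image")
        if image:
            images.add(image)
    return sorted(images)


# Hypothetical response body in the Prometheus HTTP API format.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"pod": "a", "image": "docker.io/library/python:3.9"}, "value": [0, "1"]},
        {"metric": {"pod": "b", "image": "docker.io/library/python:3.9"}, "value": [0, "1"]},
        {"metric": {"pod": "c", "image": "docker.io/library/redis:7"}, "value": [0, "1"]},
    ]},
})
print(images_from_prometheus(sample))
```

Keeping the parsing separate from the HTTP call means that only the registry-facing pull logic would need to change when another registry is targeted.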

Second, we generate the installation fingerprint by tokenizing the pathnames of the files changed in each image layer.
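To make the idea of a pathname-based fingerprint concrete, here is a minimal sketch that splits changed file paths into alphabetic tokens and counts them. It is not Praxi's actual tokenizer: the function name `fingerprint`, the splitting rule, and the example paths are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def fingerprint(changed_paths):
    """Count lowercase alphabetic tokens across the changed file pathnames."""
    tokens = Counter()
    for path in changed_paths:
        # Split on anything that is not a letter, so digits and separators are dropped.
        for token in re.split(r"[^A-Za-z]+", path):
            if len(token) > 1:  # skip empty strings and single letters
                tokens[token.lower()] += 1
    return tokens


# Hypothetical file changes recorded for one image layer.
layer_changes = [
    "/usr/lib/python3.9/site-packages/numpy/core/multiarray.py",
    "/usr/lib/python3.9/site-packages/numpy/linalg/linalg.py",
]
fp = fingerprint(layer_changes)
print(fp.most_common(3))
```

Tokens that recur across many paths of one installation, such as the package name, end up with high counts, which is what makes the fingerprint usable as a feature vector.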

Third, we generate predictions by feeding the fingerprints to our pretrained model, which is motivated by the Mixture of Experts approach.
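The following toy sketch shows the general shape of expert selection followed by per-expert prediction. It is only an analogy: plain token-overlap scoring stands in for the trained submodels, and the names `select_experts`, `predict`, the `experts` layout, and the threshold are hypothetical rather than taken from the repository.

```python
def select_experts(fingerprint, experts):
    """Keep only experts whose vocabulary shares at least one token with the fingerprint."""
    return [name for name, expert in experts.items()
            if fingerprint & expert["vocabulary"]]


def predict(fingerprint, experts, threshold=0.5):
    """Union the labels whose token overlap ratio meets the threshold, per selected expert."""
    labels = set()
    for name in select_experts(fingerprint, experts):
        for label, label_tokens in experts[name]["labels"].items():
            overlap = len(fingerprint & label_tokens) / len(label_tokens)
            if overlap >= threshold:
                labels.add(label)
    return sorted(labels)


# Hypothetical experts, each responsible for a disjoint group of packages.
experts = {
    "expert_a": {
        "vocabulary": {"numpy", "linalg", "scipy", "sparse"},
        "labels": {"numpy": {"numpy", "linalg"}, "scipy": {"scipy", "sparse"}},
    },
    "expert_b": {
        "vocabulary": {"flask", "jinja", "werkzeug"},
        "labels": {"flask": {"flask", "jinja", "werkzeug"}},
    },
}
image_tokens = {"numpy", "linalg", "core", "python"}
print(select_experts(image_tokens, experts))
print(predict(image_tokens, experts))
```

Because only the selected experts are evaluated, adding a new group of packages means training one additional expert instead of retraining a single monolithic model.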

The detailed steps are shown in Praxi-study/Praxi-Pipeline/Praxi-Pipeline-xgb.py.

Running

Installing dependencies

pip install -r Praxi-study/Praxi-Pipeline/requirements.txt

OpenShift Data Science Deployment

In Praxi-study/Praxi-Pipeline/Praxi-Pipeline-xgb.py,

Configure kubeflow_endpoint and bearer_token to access the OpenShift Data Science Pipelines endpoint.

Configure aws_access_key_id and aws_secret_access_key to save predictions.

Run

python3 Praxi-study/Praxi-Pipeline/Praxi-Pipeline-xgb.py
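One way to avoid hard-coding these four values is to read them from the environment before running the script. The sketch below is an assumption about how that could look, not how Praxi-Pipeline-xgb.py currently works: the environment variable names and the `load_settings` helper are hypothetical. If the script builds its client with the kfp SDK, the endpoint and token would correspond to the `host` and `existing_token` arguments of `kfp.Client`.

```python
import os

# Hypothetical environment variable names, mapped to the variables used in the script.
REQUIRED = {
    "kubeflow_endpoint": "PRAXI_KUBEFLOW_ENDPOINT",
    "bearer_token": "PRAXI_BEARER_TOKEN",
    "aws_access_key_id": "AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
}


def load_settings(environ=None):
    """Return the four settings, failing early if any of them is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [env for env in REQUIRED.values() if not environ.get(env)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(sorted(missing)))
    return {name: environ[env] for name, env in REQUIRED.items()}


# Demonstration with placeholder values instead of the real environment.
demo_env = {
    "PRAXI_KUBEFLOW_ENDPOINT": "https://example.invalid",
    "PRAXI_BEARER_TOKEN": "placeholder-token",
    "AWS_ACCESS_KEY_ID": "placeholder-key-id",
    "AWS_SECRET_ACCESS_KEY": "placeholder-secret",
}
settings = load_settings(demo_env)
print(sorted(settings))
```

Failing early on a missing value gives a clearer error than a rejected request to the pipeline endpoint or to S3 later in the run.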

Model Training and Testing Scripts

In Praxi-study/Praxi-Pipeline/prediction_XGBoost_openshift_image/function, the model training and testing scripts are grouped by package routing method: random assignment (nover), cosine-similarity-based assignment (clustering), and package version clustering (verpak).

Examples:

To train the model based on package version clustering,

python Praxi-study/Praxi-Pipeline/prediction_XGBoost_openshift_image/function/verpak/tagsets_XGBoost_pickCVbatch.py

To test the model based on package version clustering with expert selection,

python Praxi-study/Praxi-Pipeline/prediction_XGBoost_openshift_image/function/verpak/tagsets_XGBoost_pickCVbatch_on_demand_expert_selector.py

To calculate the cosine similarity of packages in submodels,

python Praxi-study/Praxi-Pipeline/prediction_XGBoost_openshift_image/function/verpak/tagsets_XGBoost_pickCVbatch_model_share_token_verpak.py
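For reference, cosine similarity between two packages can be computed directly from their token counts. The sketch below is a generic implementation under the assumption that each package is represented as a token-count mapping. It is not extracted from the script above, and the example counts are made up.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token-count mappings (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical token counts for two package fingerprints.
pkg_a = Counter({"numpy": 2, "core": 1})
pkg_b = Counter({"numpy": 2, "linalg": 1})
print(round(cosine_similarity(pkg_a, pkg_b), 3))
```

Packages that share many tokens score close to 1, which is presumably the reason to check similarity before deciding which packages are placed in the same submodel.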

Some kfp test examples: https://github.com/rh-datascience-and-edge-practice/kubeflow-examples/blob/main/pipelines/1_test_connection.py
