P4: Custos Deployment Issues
In deploying Custos on Jetstream 2 instances, we came across the following issues:
- We first attempted the deployment on Jetstream 2 and were not able to complete it. In two different attempts, some Keycloak and Vault pods got stuck in the CrashLoopBackOff state.
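A minimal sketch of how such stuck pods can be inspected, assuming Keycloak and Vault are deployed into namespaces of the same name; the pod name is a placeholder:

```sh
# List pods to spot the ones stuck in CrashLoopBackOff
kubectl get pods -n keycloak
kubectl get pods -n vault

# Describe a failing pod to see recent events (failed probes, image pulls, OOM kills)
kubectl describe pod <pod-name> -n keycloak

# Fetch logs from the previous (crashed) container instance
kubectl logs <pod-name> -n keycloak --previous
```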
- We then tried the deployment without Rancher by doing a bare-metal Kubernetes deployment on the VM. In this deployment as well, we faced issues with the pods.
- The cluster created on Rancher also sometimes got stuck in the provisioning state and would not start.
- In all of the above-mentioned scenarios, we tried to uninstall all the pods, services, and deployments for each of the following:
- cert-manager
- keycloak
- consul
- vault

We also deleted the PVs and PVCs for MySQL and Postgres and then restarted the entire setup (a rough sketch of this cleanup is shown below), but this caused further issues, as described next.
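A minimal sketch of the kind of cleanup involved, assuming each component was installed as a Helm release; the release, namespace, and volume names here are illustrative and differed between our attempts:

```sh
# Remove the Helm releases for the components listed above
# (release and namespace names are illustrative)
helm uninstall cert-manager -n cert-manager
helm uninstall keycloak-db-postgresql -n keycloak
helm uninstall consul -n vault
helm uninstall vault -n vault

# Delete the PVCs and the released PVs backing the MySQL and Postgres databases
kubectl delete pvc --all -n keycloak
kubectl get pv                       # identify the volumes that belonged to MySQL/Postgres
kubectl delete pv <mysql-pv-name> <postgres-pv-name>
```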
- Deleting the PVs and PVCs caused the namespaces to go into the Terminating state and get stuck.
- We also had to revert the Rancher setup on the instances. This caused some issues as well, and we were still not able to reuse the same instances.
- To avoid the above-mentioned scenarios, we had to delete the instances and create them from scratch for every deployment attempt.

Once the attempts on Jetstream 2 did not succeed, we tried to deploy the Custos application on Jetstream 1. On Jetstream 1 as well, we faced some issues in the beginning.
- Some of the minor issues on Jetstream 1 were that an instance took a very long time to create and become active, and the VMs were slower compared to Jetstream 2.
- If we made any mistake while setting up the deployment, we were still unable to delete the namespaces, as they got stuck in the Terminating state most of the time. We tried to delete them forcefully by running kubectl proxy and modifying the finalizers (sketched below), but since we were working from the VMs we ran into authorization issues when updating the finalizers, so we were unable to do so.
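The forced deletion we attempted was along the following lines (a sketch; the namespace name is a placeholder and jq is assumed to be available). In our case the final PUT against the finalize endpoint failed with an authorization error:

```sh
# Start a local proxy to the Kubernetes API server
kubectl proxy --port=8001 &

# Dump the stuck namespace, strip its finalizers, and push it back via the finalize endpoint
NS=<stuck-namespace>
kubectl get namespace "$NS" -o json | jq '.spec.finalizers = []' > ns-no-finalizers.json

curl -H "Content-Type: application/json" \
  -X PUT --data-binary @ns-no-finalizers.json \
  "http://127.0.0.1:8001/api/v1/namespaces/$NS/finalize"
```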
- Similar to Jetstream 2, we had to delete and recreate entire instances multiple times.
- We had some help from the steps provided by Team Terra and Team Scapsulators, as well as guidance on resolving some of these issues from Team Neo.
- In order to avoid the CrashLoopBackOff issues, we had to wait after executing certain commands until the pods they created were up and running (see the wait sketch after the commands). The commands are as follows:
```sh
helm install keycloak-db-postgresql bitnami/postgresql -f values.yaml -n keycloak --version 10.12.3
kubectl create -f https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/olm.yaml
kubectl apply -f custos-keycloak.yaml -n keycloak
```
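Rather than sleeping for a fixed amount of time, the pause can be made explicit by waiting for the relevant pods and deployments to become ready. A rough sketch, assuming the standard Bitnami instance label and the default OLM namespace; the selectors are illustrative:

```sh
# Wait for the Postgres pods from the Helm release to become Ready
kubectl wait --for=condition=Ready pod \
  -l app.kubernetes.io/instance=keycloak-db-postgresql \
  -n keycloak --timeout=600s

# Wait for the OLM operators to roll out before applying the Keycloak custom resource
kubectl rollout status deployment/olm-operator -n olm --timeout=600s
kubectl rollout status deployment/catalog-operator -n olm --timeout=600s
```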
- The bitnami folder that is created during installation had to be deleted every time we restarted the entire setup.
- The custos-messaging-core-service pod is constantly in the CrashLoopBackOff state even after a successful deployment. We are unsure of the consequences of this for the application.
- The deployed application did not perform well: stress testing with 1000 users gave a failure rate of about 97%.
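As a rough illustration of the scale of load involved (not the exact harness we used), a run with about 1000 concurrent users can be driven with a tool such as hey; the endpoint below is a placeholder, not the actual Custos URL:

```sh
# ~1000 concurrent workers issuing 10000 requests in total against a placeholder endpoint
hey -c 1000 -n 10000 https://<custos-host>/<endpoint>
```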