
March 12, 2017

Today I recreated the cluster and replayed the Cmd.io bootstrap procedure before documenting it. I also experimented with making changes to the Kubernetes cluster while Cmd.io was running. I spent all day tweaking timings and discovered an experimental drain feature in kops, but I was not able to achieve zero-downtime cluster changes. In the current configuration, I believe the up to 1 minute of downtime comes from the Kubernetes Service not updating quickly enough when kops cordons a node and evicts its pods.

I tried going without this and relying only on the ELB health checks, but the timing is very rough to get right. The ASG is created with the default cooldown of 5 minutes, which means that if you want to coordinate anything, you have to wait 5 minutes between actions. I could not find a way to tell k8s or kops to set this, and I didn't see a way to give AWS a different default without applying policies to the ASG (which we can't do, since Terraform doesn't manage it). Between this cooldown and the long delays of terminating nodes, spinning up replacements, and waiting for them to register, I'm not sure this approach would work without a node rolling interval of 10 minutes or more. However, CI will terminate a job after 10 minutes of inactivity, and it also becomes extremely obnoxious to iterate on because we're talking about 20-minute debug cycles.
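For reference, a rough sketch of how the cooldown could be inspected and overridden out-of-band with the AWS CLI; the ASG name here is a guess based on kops naming conventions, and a hand edit like this would drift from what kops manages:

```
# Inspect the cooldown on the kops-created ASG (group name is assumed)
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names nodes.cluster.infra.gl \
  --query 'AutoScalingGroups[].DefaultCooldown'

# One-off override of the 5 minute default; kops may put it back on the next update
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name nodes.cluster.infra.gl \
  --default-cooldown 60
```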

The current configuration uses kops draining, which for whatever reason still incurs some downtime. Perhaps this will be fixed when it comes out from behind the feature flag.
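For the record, this is roughly how I've been invoking it; the feature flag and cluster names are from memory, so treat them as assumptions:

```
# Enable the experimental drain/validate behavior (flag name assumed from the kops docs)
export KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate"

# Roll the nodes with a generous interval between each one
kops rolling-update cluster cluster.infra.gl --node-interval 10m --yes
```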

Now that everything works and I've stopped debugging with CI (on both the CI environment image and kops), I've rebased master to eliminate all the pointless experimental commits that were only there to debug and trigger CI. Luckily I don't think it matters, but it shouldn't become a habit; I should be able to do these experiments in a branch in the future.

-progrium

March 11, 2017

While trying to automate kops update cluster in CI, I had to find a way to make sure the config in manifold/cluster refers to the current/active infrastructure, since it needs to be kept in sync with the kops-managed config in S3. I decided to go by the "network ID" in the config, which is the VPC ID. I checked it manually to see if I needed to pull, and it was different from what I saw in the AWS console. I pulled and it was still different. I decided something must be wrong, so I tainted the VPC resource.
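A sketch of the check I was doing by hand; the config filename is a placeholder for whatever lives in manifold/cluster:

```
# What the repo's kops config thinks the VPC is
grep networkID manifold/cluster/cluster.spec

# What AWS actually has (make sure you're on the manifold role!)
aws ec2 describe-vpcs --query 'Vpcs[].[VpcId,Tags]'
```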

It turns out the AWS console had made me log in again, and when I was brought back to the VPC page it had not switched back to the manifold role, so I was looking at a different account. Role switching comes with some really weird UX bugs. As another example, if you're looking at a specific AWS resource and decide to switch roles, you're taken to a 404 because that resource won't exist in the switched-to account.

It was too late, though; I had already run a Terraform apply. It initially failed because I forgot to source .env, which has the TF_VAR variables set that the Kubernetes local provisioner command needs. Luckily you can just run it again. This time it ran, but the null_resource.cluster_ready resource did not run, letting null_resource.cluster_setup run immediately after null_resource.cluster, meaning right after kops create cluster, which will fail because the cluster won't be ready yet. I'm not sure why the ready resource didn't run, but again, luckily, I can just run it again. I'm not sure this case needs to be resolved, since an operator would know how to handle it (re-run), and it shouldn't happen in the normal automated CI workflow.

However, it's good to note down this behavior.
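For what it's worth, the manual recovery is just a re-run with the environment sourced; something like this, using the resource names from above:

```
# Restore the TF_VAR_* values the local provisioners need
source .env

# If cluster_ready got recorded without actually running, force it to run again
terraform taint null_resource.cluster_ready

# Re-running apply retries cluster_setup once cluster_ready has completed
terraform apply
```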

-progrium


It turns out aws s3 sync was not replacing cluster config files because they were the same file size; the main differences (the VPC ID, for example) would be the same length. So even after a make pull, the VPC ID would be wrong. Using --exact-timestamps does fix this, so the make pull task has been updated.
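The updated pull looks roughly like this; the bucket and local path are placeholders for whatever the Makefile actually uses:

```
# Compare timestamps rather than just size, so same-length changes like a new VPC ID get pulled
aws s3 sync s3://<kops-state-bucket>/<cluster-name> manifold/cluster --exact-timestamps
```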

-progrium


Finally got infra.gl to run to completion from nothing in CI. It helped identify certain assumptions, like the fact that the default public key installed by kops is your user's RSA public key, which the CI environment didn't have. So I added the gliderbot pubkey and made it use that.
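Roughly what that looks like; whether the key is passed at create time or registered as a secret depends on the setup, and the key path is a placeholder:

```
# Option 1: point kops at the gliderbot key when creating the cluster
kops create cluster --ssh-public-key ~/.ssh/gliderbot.pub ...

# Option 2: register it as the SSH public key secret for an existing cluster
kops create secret sshpublickey admin -i ~/.ssh/gliderbot.pub
```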

In the end, it's great to know it works from scratch in CI, but kops writes your kubectl config file with the admin user. If this happens on CI, then nobody has the admin credentials! Perhaps the right thing to do is to email them to us; however, for now, we don't need to run cluster creation from CI, since it only needs to happen once and whenever we rebuild from scratch for whatever reason. So I'm going to recreate it from my laptop.
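If creation ever does run on CI, an operator with access to the kops state store should be able to regenerate the kubectl credentials locally afterward; a sketch, with the cluster name assumed:

```
# Rebuild the ~/.kube/config entries for the cluster from the kops state store
kops export kubecfg cluster.infra.gl
```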

After that, the plan is to deploy the beta channel of Cmd.io (again), document the process, and smooth out any remaining issues. Then I'm going to test cluster changes (mainly resizing nodes) via the CI workflow and see that it works and that k8s keeps everything running. And then I'll try a rolling update, like changing the SSH key.
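The node resize test will roughly follow the usual kops flow; the instance group name is assumed to be the default "nodes":

```
# Bump minSize/maxSize in the instance group spec
kops edit ig nodes

# Push the config change, then roll it out to the cluster
kops update cluster --yes
kops rolling-update cluster --yes
```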

Once that's done, I think I'll mostly be done with infra.gl for now. I need to squash the commits from all the CI experiments, and then set up the Cmd.io beta channel to deploy via a beta branch. Then I'll write a run profile for alpha, deploy alpha on it, switch DNS to it, and make sure it runs a production app fine for a week or so. Lastly, I'll recreate the real production cluster in the right accounts after clearing them out, and migrate to THAT one. By then we'll know how to stand up new clusters and migrate our entire infrastructure pretty well!

-progrium
