"Talos Upgrade and the Stupid Things You Do When Tired"

Upgrading Talos from 1.0.0 to 1.0.5

So some days ago I became aware that I was falling quite a bit behind on upgrading my Talos Kubernetes cluster. The project is up to 1.0.5, and I was still running 1.0.0. As I sometimes lack time, I started upgrading my cluster a bit half-heartedly, as a left-hand job to be honest.

I admit this is a story of me not paying too much attention to detail. On a positive note, I learned some things!

The upgrade procedure

I was happy to find that upgrading was really easy - simply a matter of:

talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:vX.X.X

That sounded easy, and I took the (apparently) heroic decision of jumping straight from 1.0.0 to 1.0.5 - and doing it on my first control-plane node. Little did I know that this kind of decision was leading me down a perilous path. To my surprise, it didn't work. The node rebooted after some time, but was still running 1.0.0. Puzzled, I thought better of it and tried it out on a worker node instead. Same result. "Hmm... okay, so let's bump one version at a time," I thought to myself. And that worked better, and I got all the workers to 1.0.1.
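For reference, a minimal sketch of what that one-version-at-a-time stepping looks like, with the node IPs as placeholders - after each step you can confirm which version the node is actually running:

# upgrade a single node one release at a time
talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:v1.0.1

# once the node has rebooted, verify the Talos version it reports
talosctl version --nodes <node ip>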

The downfall (almost)

I now got back to that first control-plane node. The upgrade failed a number of times, somewhere in the first 3-4 phases that Talos Linux goes through during an upgrade. So I turned to the help output of talosctl - and noticed I could do

talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:v1.0.1 -s

-s was for stage - meaning the upgrade would be staged and performed after a reboot.
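In other words, a staged upgrade is a two-step affair - roughly like this sketch, where the reboot is triggered explicitly once you are ready:

# stage the upgrade; it is written to disk but not applied yet
talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:v1.0.1 --stage

# the staged upgrade is then performed on the next reboot
talosctl reboot --nodes <node ip>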

Now what follows was a lot of DOH! and "did I really do that" moments... In a bit of self-defense, I did this on a Friday evening, after a long work week with lots of challenges, and I was still a bit tired after the big company party Wednesday night. That being said, in retrospect, I was just plain careless.

When the control-plane host came up, it complained that it couldn't figure out what to do and needed bootstrapping - even offering up the command for doing so.

talosctl bootstrap 

The cluster that disappeared

At this point it dawned on me that I had lost access to my cluster, and it was DOWN. The alerts from UptimeRobot were flooding in.

I couldn't run kubectl commands, and talosctl wasn't working against the master node anymore. No contact.

I now tried changing the API endpoint in both my ~/.kube/config and my ~/.talos/config, since both were pointing at the now unreachable master node.
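For the record, both configs can also be repointed from the command line - a rough sketch, where the cluster name depends on your own kubeconfig:

# point talosctl at one of the surviving control-plane nodes
talosctl config endpoint <2nd master node ip>

# point kubectl at the same node's API server
kubectl config set-cluster <cluster name> --server=https://<2nd master node ip>:6443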

That worked somewhat. I was able to get a list of nodes from the cluster, and I could perform talosctl commands again.

My cluster was still down however.

RTFM and Talos' outstanding support

Reading a bunch of documentation did not make me any wiser. I decided to play with open cards and flaunt my ignorance and stupidity (I didn't feel especially smart at this point ;-)), and headed over to the Slack channel to ask what I could do.

I was mostly expecting the cluster to be lost. Except... I did have 3 control-plane nodes, so I thought I might be able to save it. I just had no idea how.

Andrey Smirnov answered a short time later and quickly sorted out the situation. He clarified that I didn't have an HA setup, since my control-plane endpoint was that one node. Now that came as both an "of course" realisation and a surprise.

I was surprised, because I had just assumed that having the 3 control-plane nodes would cover me. It did - but only halfway. I had not considered that the control-plane endpoint was tied to the node I had initially started out with.

Andrey pointed me to this guide on how to set up a virtual IP for the control-plane endpoint, and this guide describing the different possible options.

Although this is in no way new knowledge to me - having a virtual IP for the control-plane endpoint - I must admit I just hadn't thought of it.

Andrey did, however, first guide me to getting the master node back on track. Since my cluster did have 2 other control-plane nodes, I could scratch the bad node and have it rebuild its etcd config and rejoin the cluster.

  1. Have the node change its endpoint to the 2nd master node:
talosctl edit mc -n <1st master node ip>
     Find cluster.endpoint: and have it point to the 2nd master's IP address.
  2. Reset the node's config, reboot and rejoin:
talosctl -n <1st master node ip> reset --system-labels-to-wipe=EPHEMERAL --reboot --graceful=false

I did as suggested, and after some time the node joined the cluster and rebuilt its etcd configuration.
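If you want to double-check that the node really made it back, something like this should show all three etcd members again, and the node as Ready in Kubernetes:

# list etcd members as seen from a healthy control-plane node
talosctl -n <2nd master node ip> etcd members

# and confirm the node is back in the cluster
kubectl get nodes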

My cluster was back up!

Getting an HA setup

I now performed the process of adding a VIP for the control-plane endpoint.

I edited my controlplane.yaml and added the vip configuration as described in the documentation:

network:
  interfaces:
    - interface: eth0
      dhcp: true
      vip:
        ip: <available ip>

After adding that, I applied the updated config to the 3 control-plane nodes:

talosctl apply-config -n <ip of node> -f <path to controlplane.yaml>
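A quick sanity check that one of the control-plane nodes has claimed the VIP, and that the Kubernetes API answers on it - a rough sketch, and the exact output depends on your setup (addresses is, if I remember correctly, the Talos network resource listing the IPs on a node):

# the VIP should show up on one of the control-plane nodes
talosctl get addresses --nodes <ip of node> | grep <available ip>

# and the API server should answer on the VIP
curl -k https://<available ip>:6443/version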

The VIP was now added and available, and I changed my ~/.talos/config and ~/.kube/config to point to the new virtual IP, and everything worked again.

One more thing needed to be done - thanks to Andrey for putting all the details out there - and that was to get the nodes in the cluster to realize that the endpoint IP had changed.

So doing a talosctl edit mc and changing the control-plane endpoint was needed for the nodes to be aware of the change:

cluster:
  controlPlane:
    endpoint: https://<new control plane ip>:6443

Now ... Back to that upgrade

I was now back where I started, and it was almost midnight. I thought I might as well get my cluster up to the latest version, so I tried performing an upgrade of the master node again. This time it went as expected.

I was more than excited to see my cluster up and now upgraded to 1.0.1.

The thought now occurred to me that I had to try whether I could skip the intermediate upgrades and jump straight to the newest version. So I did, and it worked like a charm. Hurray for Talos and easy upgrading! The cluster is now on version 1.0.5 - which at the time of writing is the newest version.

And then the Kubernetes upgrade too

I wanted to be as up to date as possible, and upgrading Kubernetes turned out to be just as easy as upgrading the Talos Linux image.

Issuing:

talosctl --nodes <ip of master node> upgrade-k8s --to 1.23.6

Started the upgrade of all components needed, and some 5 minutes later, I had an updated Kubernetes as well :-D
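A quick check afterwards confirms the new version on the nodes - something like:

# the kubelet version per node should now report v1.23.6
kubectl get nodes -o wide

# client and server versions
kubectl version --short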

Conclusions

The first thing to conclude here is not to perform a task like this on auto-pilot and tired - especially not when it's the first time doing it. I realized during that Friday night that I was nowhere near familiar enough with Talos Linux and the upgrade procedure.

Not thinking things through, not researching a bit more, and launching myself head first into this with little thought about what I was doing - not a good cocktail!

I am thankful for the help I received on Slack, and I can't thank Andrey enough for the swift help! I am now wiser, and a bit more inclined to read up on the documentation the next time ;-)

As a final conclusion, Talos Linux is dead easy to upgrade - you just need to have the basics in place first. A good rule of thumb for any activity.

Next time

I look forward to upgrading to version 1.1.0, which is currently in alpha. It includes an upgrade to Kubernetes 1.24, and some improvements to the CLI - like --dry-run.

I will look into automating this process, since it gets tedious to do by hand. So hopefully a new article about that in the future.
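As a teaser, a very rough sketch of what a scripted version could look like - just a bash loop over the nodes with a crude wait in between, nothing fancy, and the node list and version are of course placeholders:

#!/usr/bin/env bash
# naive upgrade loop: upgrade one node at a time, wait, then check the reported version
set -euo pipefail

VERSION="v1.0.5"
NODES=("<worker 1 ip>" "<worker 2 ip>" "<control-plane ip>")

for node in "${NODES[@]}"; do
  echo "Upgrading ${node} to ${VERSION}..."
  talosctl upgrade --nodes "${node}" --image "ghcr.io/siderolabs/installer:${VERSION}"

  # crude wait: give the node time to reboot before checking what it reports
  sleep 300
  talosctl version --nodes "${node}"
done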
