Upgrading Talos from 1.0.0 to 1.0.5
So, some days ago I became aware that I was falling quite a bit behind on upgrading my Talos Kubernetes cluster. The project is up to 1.0.5, and I was still running 1.0.0. As I sometimes lack time, I started upgrading my cluster a bit half-heartedly, as a left-hand job to be honest.
I admit this is a story of me not paying too much attention to detail. On the positive side, I learned some things!
The upgrade procedure
I was so happy that upgrading was really easy. Simply a matter of:
talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:vX.X.X
That sounded easy, and I took the (apparently) heroic decision of jumping straight from 1.0.0 to 1.0.5 - and doing it on my first control-plane node. Little did I know that these kinds of decisions were leading me down a perilous path. To my surprise, it didn't work. The node rebooted after some time, but was still running 1.0.0. Puzzled, I thought better of it and tried it out on a worker node. Same result. "Hmm... okay, so let's bump one version at a time," I thought to myself. And that worked better - I got all the workers to 1.0.1.
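By the way, a quick way to check which version a node actually ended up on after a reboot is talosctl's version command (the node IP is a placeholder):
talosctl version --nodes <node ip>
It prints both the client and the server version, which made it easy to see that the node was still on 1.0.0.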
The downfall (almost)
I now got back to that first control-plane node. The upgrade failed a number of times, somewhere in the first 3-4 phases that Talos Linux goes through during an upgrade.
So I turned to the help output of talosctl and noticed I could do
talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:v1.0.1 -s
-s was for stage - meaning the upgrade would be staged and performed after a reboot.
Now, what follows was a lot of DOH! and "did I really do that?" moments... in my own defense, I did this on a Friday evening, after a long work week with lots of challenges, and I was still a bit tired after the big company party Wednesday night. That being said, in retrospect, I was just plain careless.
When the control-plane host came up, it complained that it couldn't figure out what to do and needed bootstrapping - even offering up the command for doing so.
talosctl bootstrap
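For the record - and as far as I understand the docs - bootstrap is meant to initialise etcd exactly once, on a single control-plane node of a brand-new cluster, so it is not something to fire off against a cluster that is already running:
# only for a brand-new cluster: initialises etcd on one control-plane node
talosctl bootstrap --nodes <node ip>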
The cluster that disappeared
At this point it dawned on me that I had lost access to my cluster, and it was DOWN. The alerts from UptimeRobot were flooding in.
I couldn't run kubectl commands, and talosctl wasn't working against the master node anymore. No contact.
I now tried changing the API endpoint in both my ~/.kube/config and my ~/.talos/config, where the endpoint was the now unreachable master node.
That worked somewhat. I was able to get a list of nodes from the cluster, and I could perform talosctl commands again.
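Rather than hand-editing the files, the endpoints can also be switched from the command line - a rough sketch, with the IP and the kubeconfig cluster name as placeholders:
# point talosctl at one of the surviving control-plane nodes
talosctl config endpoint <2nd master node ip>
# point kubectl's cluster entry at the same node's API server
kubectl config set-cluster <cluster name> --server=https://<2nd master node ip>:6443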
My cluster was still down however.
RTFM and Talos outstanding support
Reading a bunch of documentation did not make me any wiser. I decided to lay my cards on the table and flaunt my ignorance (I didn't feel especially smart at this point ;-)), so I headed over to the Slack channel and asked what I could do.
I was mostly expecting the cluster to be lost. Except... I did have 3 control-plane nodes, so I thought I might be able to save it. I just had no idea how.
Andrey Smirnov answered a short time later and quickly sorted out the situation. He clarified that I didn't have an HA setup, since my control-plane endpoint was that one node. That came as both an "of course" realisation and a surprise.
I was surprised because I had just assumed that having the 3 control-plane nodes would cover me. It did - but only halfway. I had not considered that the control-plane endpoint was tied to the node I had initially started out with.
Andrey pointed me to this guide on how to set up a virtual IP for the control-plane endpoint, and this guide describing the different possible options.
Although having a virtual IP for the control-plane endpoint is in no way new knowledge to me, I must admit I just hadn't thought of it.
Andrey did, however, first guide me to getting the master node back on track. Since my cluster did have 2 other control-plane nodes, I could scratch the bad node and have it rebuild its etcd config and rejoin the cluster.
- Have the node change its endpoint to the 2nd master node:
talosctl edit mc -n <1st master node ip>
- Find cluster.endpoint: and have it point to the 2nd master's IP address
- Reset the node's config, reboot and join again:
talosctl -n <1st master node ip> reset --system-labels-to-wipe=EPHEMERAL --reboot --graceful=false
I did as suggested, and after some time the node rejoined the cluster and rebuilt its etcd configuration.
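To convince myself the node was really back, a couple of read-only checks like these come in handy (node IP is a placeholder):
# list the etcd members as seen from a healthy control-plane node -
# the reset node should show up again once it has rejoined
talosctl -n <2nd master node ip> etcd members
# and the Kubernetes view of things
kubectl get nodes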
My cluster was back up!
Getting an HA setup
I now performed the process of adding a VIP for the control-plane endpoint.
I edited my controlplane.yaml and added the vip configuration as described in the documentation:
network:
  interfaces:
    - interface: eth0
      dhcp: true
      vip:
        ip: <available ip>
After adding that, I applied the updated config to the 3 control-plane nodes:
talosctl apply-config -n <ip of node> -f <path to controlplane.yaml>
The VIP was now added and available, and I changed my ~/.talos/config and ~/.kube/config to point to the new virtual IP, and everything worked again.
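A simple way to confirm the VIP actually answers, without relying on the edited config files, is to point talosctl at it explicitly (the VIP being the <available ip> from the config above):
talosctl --endpoints <available ip> --nodes <ip of node> version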
One more thing needed to be done - thanks to Andrey for putting all the details out there - and that was to get the nodes in the cluster to realize that the endpoint IP had changed.
So doing a talosctl edit mc and changing the control-plane endpoint was needed for the nodes to be aware of the change:
cluster:
  controlPlane:
    endpoint: https://<new control plane ip>:6443
Now ... Back to that upgrade
I was now back where I started, and it was almost midnight. I thought I might as well get my cluster up to the latest version, so I tried performing an upgrade of the master node again. This time it went as expected.
I was more than excited to see my cluster up and now upgraded to 1.0.1.
The thought now occurred to me that I had to try whether I could skip the intermediate versions and jump straight to the newest one. So I did, and it worked like a charm. Hurray for Talos and easy upgrading! The cluster is now on version 1.0.5 - which, at the time of writing, is the newest version.
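Concretely, that last jump was just the same upgrade command as before, now with the 1.0.5 installer image:
talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:v1.0.5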
And then kubernetes upgrade too
I wanted to get as up to date as possible, and as easy as the Talos Linux image upgrade is, upgrading Kubernetes turned out to be just the same.
Issuing:
talosctl --nodes <ip of master node> upgrade-k8s --to 1.23.6
This started the upgrade of all the needed components, and some 5 minutes later I had an updated Kubernetes as well :-D
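To confirm that the control plane and the kubelets actually picked up the new version, a quick check like this does the trick (plain kubectl, nothing Talos-specific):
# API server and client versions
kubectl version --short
# kubelet version per node
kubectl get nodes -o wide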
Conclusions
The first thing to conclude here is not to perform a task like this tired and on auto-pilot. Especially not when it's the first time doing it. I realized during this Friday night that I was nowhere near familiar enough with Talos Linux and the upgrade procedure.
Not thinking things through, and not researching a bit more before launching myself head first into this, is not a good cocktail!
I am thankful for the help I received on Slack - I can't thank Andrey enough for sorting me out so swiftly! I am now wiser, and a bit more inclined to read up on the documentation next time ;-)
As a final conclusion, Talos Linux is dead easy to upgrade - you just need to have the basics in place first. A good rule of thumb for any activity.
Next time
I look forward to upgrading to version 1.1.0, which is currently in alpha testing. It includes an upgrade to version 1.24 of Kubernetes, and some improvements to the CLI - like --dry-run.
I will look into automating this process, since it gets tedious to do by hand - so hopefully there will be a new article about that in the future.