"Talos Upgrade and the Stupid Things You Do When Tired"

Upgrading Talos from 1.0.0 to 1.0.5

So some days ago I became aware that I was falling quite a bit behind on upgrading my Talos Kubernetes cluster. The project is up to 1.0.5, and I was still running 1.0.0. As I sometimes lack time, I started upgrading my cluster a bit half-heartedly, as a left-hand job to be honest.

I admit this is a story of me not paying too much attention to detail. On a positive note, I learned some things!

The upgrade procedure

I was happy to find that upgrading was really easy - simply a matter of:

talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:vX.X.X

That sounded easy, and I took the (apparently) heroic decision of jumping straight from 1.0.0 to 1.0.5 - and doing it on my first control-plane node. Little did I know that this kind of decision was leading me down a perilous path. To my surprise, it didn't work. The node rebooted after some time, but was still running 1.0.0. Puzzled, I thought better of it and tried it out on a worker node instead. Same result. "Hmm... okay, so let's bump one version at a time," I thought to myself. And that worked better, and I got all the workers to 1.0.1.
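For reference, a minimal sketch of what that one-version-at-a-time stepping looks like, with the node IPs as placeholders - after each step you can confirm which version the node is actually running:

# upgrade a single node one release at a time
talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:v1.0.1

# once the node has rebooted, verify the Talos version it reports
talosctl version --nodes <node ip>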

The downfall (almost)

I now got back to that first control-plane node. The upgrade failed a number of times, somewhere in the first 3-4 phases that Talos Linux goes through during an upgrade. So I turned to the help output of talosctl - and noticed I could do

talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:v1.0.1 -s

-s was for stage - meaning the upgrade would be staged and performed after a reboot.
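In other words, a staged upgrade is a two-step affair - roughly like this sketch, where the reboot is triggered explicitly once you are ready:

# stage the upgrade; it is written to disk but not applied yet
talosctl upgrade --nodes <node ip> --image ghcr.io/siderolabs/installer:v1.0.1 --stage

# the staged upgrade is then performed on the next reboot
talosctl reboot --nodes <node ip>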

Now what follows was a lot of DOH! and "did I really do that" moments... In a bit of self-defense, I did this on a Friday evening, after a long work week with lots of challenges, and I was still a bit tired after the big company party Wednesday night. That being said, in retrospect, I was just plain careless.

When the control-plane host came up, it complained that it couldn't figure out what to do and needed bootstrapping - even offering up the command for doing so.

talosctl bootstrap 

The cluster that disappeared

At this point it dawned on me that I had lost access to my cluster, and it was DOWN. The alerts from UptimeRobot were flooding in.

I couldn't run kubectl commands, and talosctl wasn't working against the master node anymore. No contact.

I now tried changing the API endpoint in both my ~/.kube/config and my ~/.talos/config, since both were pointing at the now unreachable master node.
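For the record, both configs can also be repointed from the command line - a rough sketch, where the cluster name depends on your own kubeconfig:

# point talosctl at one of the surviving control-plane nodes
talosctl config endpoint <2nd master node ip>

# point kubectl at the same node's API server
kubectl config set-cluster <cluster name> --server=https://<2nd master node ip>:6443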

That worked somewhat. I was able to get a list of nodes from the cluster, and I could perform talosctl commands again.

My cluster was still down however.

RTFM and Talos' outstanding support

Reading a bunch of documentation did not make me any wiser. I decided to play with open cards and flaunt my ignorance and stupidity (I didn't feel especially smart at this point ;-)), and headed over to the Slack channel to ask what I could do.

I was mostly expecting the cluster to be lost. Except... I did have 3 control-plane nodes, so I thought I might be able to save it. I just had no idea how.

Andrey Smirnov answered a short time later and quickly sorted out the situation. He clarified that I didn't have an HA setup, since my control-plane endpoint was that one node. Now that came as both an "of course" realisation and a surprise.

I was surprised, because I had just assumed that having the 3 control-plane nodes would cover me. It did - but only halfway. I had not considered that the control-plane endpoint was tied to the node I had initially started out with.

Andrey pointed me to this guide on how to set up a virtual IP for the control-plane endpoint, and this guide describing the different possible options.

Although this is in no way new knowledge to me - having a virtual IP for the control-plane endpoint - I must admit I just hadn't thought of it.

Andrey did, however, first guide me to getting the master node back on track. Since my cluster did have 2 other control-plane nodes, I could scratch the bad node and have it rebuild its etcd config and rejoin the cluster.

  1. Have the node change its endpoint to the 2nd master node:
talosctl edit mc -n <1st master node ip>
     Find cluster.endpoint: and have it point to the 2nd master's IP address.
  2. Reset the node's config, reboot and rejoin:
talosctl -n <1st master node ip> reset --system-labels-to-wipe=EPHEMERAL --reboot --graceful=false

I did as suggested, and after some time the node joined the cluster and rebuilt its etcd configuration.
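If you want to double-check that the node really made it back, something like this should show all three etcd members again, and the node as Ready in Kubernetes:

# list etcd members as seen from a healthy control-plane node
talosctl -n <2nd master node ip> etcd members

# and confirm the node is back in the cluster
kubectl get nodes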

My cluster was back up!

Getting an HA setup

I now performed the process of adding a VIP for the control-plane endpoint.

I edited my controlplane.yaml and added the vip configuration as described in the documentation:

network:
  interfaces:
    - interface: eth0
      dhcp: true
      vip:
        ip: <available ip>

After adding that, I applied the updated config to the 3 control-plane nodes:

talosctl apply-config -n <ip of node> -f <path to controlplane.yaml>
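A quick sanity check that one of the control-plane nodes has claimed the VIP, and that the Kubernetes API answers on it - a rough sketch, and the exact output depends on your setup (addresses is, if I remember correctly, the Talos network resource listing the IPs on a node):

# the VIP should show up on one of the control-plane nodes
talosctl get addresses --nodes <ip of node> | grep <available ip>

# and the API server should answer on the VIP
curl -k https://<available ip>:6443/version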

The VIP was now added and available, and I changed my ~/.talos/config and ~/.kube/config to point to the new virtual IP, and everything worked again.

One more thing needed to be done - thanks to Andrey for putting all the details out there - and that was to get the nodes in the cluster to realize that the endpoint IP had changed.

So doing a talosctl edit mc and changing the control-plane endpoint was needed for the nodes to be aware of the change:

cluster:
  controlPlane:
    endpoint: https://<new control plane ip>:6443

Now ... Back to that upgrade

I was now back where I started, and it was almost midnight. I thought I might as well get my cluster up to the latest version, so I tried performing an upgrade of the master node again. This time it went as expected.

I was more than excited to see my cluster up and now upgraded to 1.0.1.

The thought now occurred to me that I had to try whether I could skip the intermediate upgrades and jump straight to the newest version. So I did, and it worked like a charm. Hurray for Talos and easy upgrading! The cluster is now on version 1.0.5 - which at the time of writing is the newest version.

And then the Kubernetes upgrade too

I wanted to be as up to date as possible, and upgrading Kubernetes turned out to be just as easy as upgrading the Talos Linux image.

Issuing:

talosctl --nodes <ip of master node> upgrade-k8s --to 1.23.6

Started the upgrade of all components needed, and some 5 minutes later, I had an updated Kubernetes as well :-D
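A quick check afterwards confirms the new version on the nodes - something like:

# the kubelet version per node should now report v1.23.6
kubectl get nodes -o wide

# client and server versions
kubectl version --short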

Conclusions

The first thing to conclude here is not to perform a task like this on auto-pilot and tired - especially not when it's the first time doing it. I realized during that Friday night that I was nowhere near familiar enough with Talos Linux and the upgrade procedure.

Not thinking things through, not researching a bit more, and launching myself head first into this with little thought about what I was doing - not a good cocktail!

I am thankful for the help I received on Slack, and I can't thank Andrey enough for the swift help! I am now wiser, and a bit more inclined to read up on the documentation the next time ;-)

As a final conclusion, Talos Linux is dead easy to upgrade - you just need to have the basics in place first. A good rule of thumb for any activity.

Next time

I look forward to upgrading to version 1.1.0, which is currently in alpha. It includes an upgrade to Kubernetes 1.24, and some improvements to the CLI - like --dry-run.

I will look into automating this process, since it gets tedious to do by hand. So hopefully a new article about that in the future.
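As a teaser, a very rough sketch of what a scripted version could look like - just a bash loop over the nodes with a crude wait in between, nothing fancy, and the node list and version are of course placeholders:

#!/usr/bin/env bash
# naive upgrade loop: upgrade one node at a time, wait, then check the reported version
set -euo pipefail

VERSION="v1.0.5"
NODES=("<worker 1 ip>" "<worker 2 ip>" "<control-plane ip>")

for node in "${NODES[@]}"; do
  echo "Upgrading ${node} to ${VERSION}..."
  talosctl upgrade --nodes "${node}" --image "ghcr.io/siderolabs/installer:${VERSION}"

  # crude wait: give the node time to reboot before checking what it reports
  sleep 300
  talosctl version --nodes "${node}"
done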
