Talos Linux security with Andrey Smirnov

In this episode, I dig into the security features of Talos Linux with Andrey Smirnov. Andrey explains how Talos focuses on immutability and a minimal attack surface. Discover how these design choices fortify your systems against vulnerabilities, ensuring a secure and resilient infrastructure. Join us as we explore the security advancements that make Talos Linux not only a super easy way to run Kubernetes, but also a very secure way.

Episode Links

This episode is also available as a podcast, search for “Open Source Security” on your favorite podcast player.

Episode Transcript

Josh Bressers (00:00) Today, Open Source Security is talking to Andrey Smirnov, engineering lead for Talos Linux at Sidero Labs. Andrey, I am so excited to have you on the show. Thank you for coming.

Andrey Smirnov (00:10) Thank you, Josh. Excited to join and talk about security. Yeah, that is.

Josh Bressers (00:13) Yes, yes. Okay. So I’m going to start with maybe a little preface. I told Andrey this before we hit record, but as a podcast host, whenever I learn of a project, I always want to try it out. And we had a listener some time ago suggest I talk to Andrey about Talos Linux security, and I apologize, I did not write down who suggested this. So if you’re listening, give me a shout and I’ll add your name to the show notes or something. But anyway, I’m like, okay, I’m going to take a look at this because I’ve heard of Talos and never used it. I installed Talos Linux and I started using it.

And like, holy crap, it has changed my life. I’m never going back to minikube or kind or any of the other small Kubernetes orchestrators, because this thing goes from tiny to huge and it’s amazing. So, okay, Andrey, let’s just start, and tell us what Talos is, because holy cow, it is cool.

Andrey Smirnov (01:01) Yeah, I think it started exactly like you described. It started with an idea. I mean, I’m not the founder, right? The founder is Andrew, almost the same as my name, but with a different letter at the end. He was managing, like many ops people do, a huge number of Kubernetes clusters rolled out with Kubespray and Ansible and all that stuff.

And I think the biggest problem was that it’s easy to roll out a cluster, but how do you keep it up to date, keep it running, updated, upgraded, whatever, right? And there came the idea that general Linux distros, whether it’s Debian or Ubuntu or Red Hat, whatever, might not work so well when your only goal is to run container-based workloads or Kubernetes in general. Because what…

Josh Bressers (01:35) Yup.

Andrey Smirnov (01:57) does Kubernetes need from the host OS? It’s probably not that much, right? And also, you don’t need users on the host. What are the users? Kubernetes doesn’t care. And then, one idea on top of another idea, let’s make it more secure. For example, what if we make the root file system immutable and read-only? That will improve the security. What if we drop some stuff that…

Josh Bressers (02:01) Yeah, yeah.

Andrey Smirnov (02:25) you don’t need? Like, there is no shell in the rootfs on Talos at all. And what if we make it small? What if we make it upgradable by changing the whole image? So there are many, many ideas layered one on top of another. Together, I think they form that feeling that Talos is very easy to run Kubernetes on, because literally you put Talos on and Kubernetes is up.

Josh Bressers (02:50) Yeah.

Andrey Smirnov (02:52) I don’t think in general that Talos is limited to Kubernetes, but I think the way our industry looks today is that everybody wants Kubernetes if they want to run anything in containers, even if Kubernetes is overkill for their specific use case. We have users who run single-node Kubernetes clusters, which makes you ask why, right? Kubernetes was designed for exactly the opposite use case, but still, people think Kubernetes is the way to run containers today. At least that’s how it is.

Josh Bressers (03:24) Yeah, yeah.

And I want to add, so if you look at the docs, there is the ability to run like basically a test instance on a single node, which I use often for like demos and things for work and whatnot. And you can also scale this thing to the moon, which I feel like there aren’t a lot of systems on the planet that can do very small and very big, which is cool.

Andrey Smirnov (03:50) Yeah. And also what we’re really proud of, it’s literally the same Talos image that runs anywhere. Whether you put it like on Amazon or you run it in QEMU or you run it on Raspberry Pi. Well, Raspberry Pi might be different architecture, but still like comparing ARM, like ARM AWS instance and ARM Raspberry Pi, it will be exactly same byte by byte Talos image that runs on it, which gives you like almost same experience, like plus minus.

Josh Bressers (04:16) Yeah, yeah.

Andrey Smirnov (04:20) The environment still makes a bit of a difference, but you can expect almost the same, 99% the same, Talos Linux everywhere it runs.

Josh Bressers (04:30) Yeah, yeah, yeah. And I want to stress the horrible experience of trying to manage a Kubernetes cluster if it’s just some Debian nodes, because that’s what I used to use in my basement when I would do Kubernetes testing. I had a bunch of Debian systems that I would SSH into (Talos doesn’t even have SSH, so it’s not an option), and then I’d say there was a 95% chance I was gonna screw something up so bad I just threw everything away and started over. Which is…

Andrey Smirnov (04:58) Yeah, exactly.

Josh Bressers (04:59) I think a lot of Kubernetes experiences for people where it’s just like, it’s broken so bad, it can’t be fixed and we’re just going to start again, which in the case of Talos, I can do with command, like one command and it works. Okay.

Andrey Smirnov (05:14) And I think it goes back in some way to something from before Talos. A long time ago, I was working on an open source tool, which is still alive. It is called aptly. And the whole idea of aptly was to be able to take snapshots of Debian package repositories. Because if you do apt update, apt upgrade, right, in general the result depends on the time you’re running this command, right? It might be a bit different.

Josh Bressers (05:40) Yep. Yep.

Andrey Smirnov (05:43) And if you want all your machines to be exactly in sync on the same package version, that tool was actually trying to do one step of it by providing you a way to snapshot. But snapshot was the basic function of it. So you take a snapshot of the upstream repository, publish that as a repository, and point all your machines to it so you know they run exactly the same versions. And when you want to upgrade, you upgrade, but it’s in your control, right?

And I think Talos takes it to the next level by saying there is no package manager at all. You just upgrade the base OS and you get the new set of packages, and it’s an immutable set that was tested together, and for most of them you don’t have a way to upgrade one component without touching another.
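The snapshot idea Andrey describes can be illustrated with a minimal Python sketch. This is not aptly's implementation or API; the package names and versions are made up, and `take_snapshot` is a hypothetical helper. It only shows why machines that install from a frozen snapshot stay in sync even after upstream moves on.

```python
from types import MappingProxyType


def take_snapshot(upstream):
    """Freeze the current package -> version mapping (hypothetical helper)."""
    # Copy first so later upstream changes do not leak into the snapshot,
    # then wrap in a read-only view so the snapshot itself cannot be edited.
    return MappingProxyType(dict(upstream))


def install_from(snapshot):
    """Every machine that installs from the same snapshot gets the same set."""
    return dict(snapshot)


if __name__ == "__main__":
    upstream = {"openssl": "3.0.1", "curl": "8.0.0"}  # illustrative versions
    snapshot = take_snapshot(upstream)
    upstream["openssl"] = "3.0.2"  # upstream publishes a new version later
    print(install_from(snapshot))  # still the versions captured at snapshot time
```

Upgrading then becomes an explicit step: take a new snapshot and point the machines at it.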

Josh Bressers (06:32) Yeah, yeah. And we should also add, you have a command line tool. It’s called talosctl. And it is how everything happens. All of the communication between the nodes, the magic, is your own protocol that is kind of built into, I guess, I don’t know what you call it. Okay, API, okay, yeah, yeah.

Andrey Smirnov (06:36) Yeah.

It’s Talos API, how we, yeah,

Talos API. And if you think about it on the surface, most of the Talos API is actually read-only. You want to know the state of the machine. Most of that state is exposed as resources, which are very similar to Kubernetes resources. So you can see how my time sync is going or what addresses are assigned to that machine. Those will all be resources.

And there are some commands which actually change the state, I would say, like you want to reboot a machine or you want to apply a new machine configuration to it. Applying a new machine configuration is the only path to actually change the machine. And Talos will just accept the new machine config and reconfigure itself on the fly to do something. I don’t know, whatever your machine config changes, whether you attach a new address statically to the interface or you want to change the kube-apiserver version, Talos will figure it out and reconfigure itself on the fly for the new config. That’s literally what replaces the shell. So there’s no shell, and the only way you apply a change is by applying the machine config.

Josh Bressers (08:03) And I must also say, I will embarrassingly say, my favorite feature I think you have is that I can acquire the config file I need to run kubectl anywhere I’m at by running one command. Because that is always such a pain. It’s like, wait, which kubeconfig do I have on this machine? Is it that one? No, it’s not that one. Crap. Where’d I put it? I can’t find it. And so that has been the most delightful part of this whole experience, just being able to easily grab the kubeconfig I need. And it just works. It’s amazing.

Andrey Smirnov (08:33) Yeah, yeah, exactly. So the idea of Talos, we even try to see it as the OS you shouldn’t be worried much about, right? It’s kind of a zero OS in the sense that if you figured out your machine configuration and figured out your provisioning process, you can scale up your cluster by throwing in more hardware and applying the same machine config to it. The machines will all join your cluster. And once you have the kubeconfig, you have the Kubernetes you were looking for. You weren’t looking for Talos specifically. Your idea is to run your workloads. Like you said, you have your kubeconfig and now you can start kubectl-applying your stuff or whatever you want to run on top of your cluster, and Talos gets out of the way. The next time you need it is when something goes wrong and you want to troubleshoot things, or apply a change.

Josh Bressers (09:28) Yeah, yeah, I haven’t broken it yet. So we’ll see if I change my opinion when I break something. Okay, so this is supposed to be a security show. Let’s talk about the security of this, because you folks have done, I think, some really clever things around, you know, immutability, not running SSH, kind of the minimal attack surface, kind of next level almost, I would say, in what you’ve done. So explain the security ideas and some of the things you’ve done and we’ll weave our way from there.

Andrey Smirnov (09:59) Yeah, I think there are three things that we can split this all into. One, which you mentioned, is: don’t put in something you don’t need. Basically, ship a minimal set of dependencies, and naturally, what you don’t ship you don’t have to worry about. Once again, going back to Ubuntu or something, in the default install you might have something that you actually don’t need or never use, but it might have a vulnerability of its own, which might force you to upgrade.

So being minimal is one thing. Another, I guess, is being bleeding edge. We try to update all our dependencies to absolutely whatever is available at the time Talos is released, whether it’s the Linux kernel or the Kubernetes version that we support. So we try to use the latest ones. Whenever a new Kubernetes release is available, there is a Talos release which supports this version. You don’t have to wait a year to try a new Kubernetes version.

And the third thing, I guess, is these design choices, like when you say your rootfs is read-only and immutable.

That’s how you design the UX. Or, for example, in Kubernetes we enable pod security by default. We enable pod security to make it the best for our users. Certainly what we get back is users asking, how can I deploy my workload? Now Kubernetes yells at me that, I don’t know, my privileged pods are not running. Yeah, so we have to teach and explain, and I think kind of push the community.

To explain, yes, this is a cool thing that actually enforces best security standards. Yes, it’s a bit painful if you want to run something which is privileged. You have to explicitly mark and say, OK, I’m OK with that being privileged in that namespace because I know I’m running whatever my rootFS have storage and it needs to be privileged for good reasons. But in general, to force things to be in the best way possible.

Also, I think we pushed a lot of third-party products, because, for example, once again I’m talking about storage solutions, what they love to do is to have a privileged pod, then from that pod enter the PID 1 namespace and execute bash from it. And that doesn’t work on Talos for the best reason: there is no bash, right? So come on, either you ship your tools with you, or, if you do need to get into the root namespace for whatever reason to run some tools, at least don’t rely on a shell being there. So actually many, many projects had to make some changes to run on Talos for that reason.
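The explicit opt-in Andrey mentions, marking a namespace as allowed to run privileged pods, is done in upstream Kubernetes with a Pod Security Admission label on the namespace. The Python sketch below only builds such a manifest as a dictionary; the namespace name `storage-system` is a made-up example, and how the manifest is applied to a cluster is left out.

```python
import json


def privileged_namespace(name):
    """Build a Namespace manifest that opts in to privileged pods."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # Pod Security Admission label; the other standard levels
                # are "baseline" and "restricted".
                "pod-security.kubernetes.io/enforce": "privileged",
            },
        },
    }


if __name__ == "__main__":
    # "storage-system" is a hypothetical namespace name.
    print(json.dumps(privileged_namespace("storage-system"), indent=2))
```

Keeping the exception at namespace level means everything else in the cluster still runs under the stricter default.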

Josh Bressers (13:01) That’s the best reason, right? I mean, this is, I think, the typical security attitude of: if it’s not default, no one uses it. By forcing the immutable file system, by forcing the secure pod, all of that, that changes behaviors because you have to change. Like, I love it.

Andrey Smirnov (13:21) Right.

Yeah, exactly. Yeah. The immutable file system is the same. Sometimes someone comes up in the community with a question: how do I run this workload? It tries to mount something from the host’s /etc as writable, and Talos doesn’t let you do that. So how do I run it? Does your workload actually need to mount something from the host’s /etc as writable? Right. Is it doing a good job? Probably not. Yeah.

Josh Bressers (13:47) The answer is no. The answer is always no. Like, you don’t need to do that. You just do it because it was easy and you got lazy and then you never went and fixed it again. There’s a joke in the world of machinists I’ve heard, which is that there’s nothing more permanent than a temporary fix. In software, I totally feel like that happens. So, okay, I want to talk about immutable file systems because I feel like that is not necessarily a totally understood concept. I think.

Andrey Smirnov (14:04) Thanks.

Josh Bressers (14:16) It is, it sort of makes sense, but I’m sure even I don’t totally understand it. So tell us what that means and how it works. Cause obviously when we think of immutable, like we can never change it. Right. And obviously that’s the goal, but we have to upgrade sometimes too.

Andrey Smirnov (14:32) Right, exactly. So if you think about it, Talos today ships as two files, if we simplify it. One is the Linux kernel image, and the other is what in the Linux world is called the initrd, initramfs, whatever we call it. And this is the whole of Talos. Today we even package that into a single file called the UKI.

Unified Kernel Image. It’s basically EFI bootable, so you can literally boot that single file over the network or from a USB stick, and that’s the whole of Talos. The Linux kernel we understand to be kind of immutable, because that’s the kernel image. And the initramfs itself, in traditional operating systems, might contain some

small initial set of tools, which is enough to mount the actual root file system, which will be on disk. In Talos, initramfs is actually self-contained. There is no root file system on disk at all. So inside the initramfs it’s more about Linux specifics, initramfs can be made read-only.

But inside the initramfs what we ship is a SquashFS image. And SquashFS is one of the read-only Linux file systems, not the only one, but one of the possible options. So what happens is that in early boot, the initramfs is available. Inside there is a SquashFS image, which is read-only by design. And what Talos does is mount that SquashFS and make it the root of the whole mount tree.

So when we boot, there is nothing writable by default, because everything under / is read-only at this point. And we start mounting on top the things that need to be writable. Like, we want /tmp to be writable, so we mount a tmpfs. We need some required file systems like /dev and stuff like that, which are Linux-specific things. They’re neither writable nor read-only. They’re kind of system file systems. But that’s the initial state of Talos. And Talos can, up to some point, run in the state where there is no block storage available yet. So everything is in memory. Most of that is read-only.

And then once we push configuration to Talos, or Talos finds its configuration that was already persisted to disk, it will start mounting additional things. And the only big writable part of the Talos file system tree is /var. That will basically come from disk. It will contain things like container images. If that’s a control plane machine, well, that’s where etcd by default stores its database. It’s where all these emptyDir volumes and container state live; all of that goes to /var. And in general, if it’s not a control plane machine, if it’s a worker, you can consider this /var to be somewhat ephemeral.

Because the container state is usually something that can be recreated if it is lost or wiped. If you want some persistence, for example you run a database, you would probably rather allocate a specific volume for its contents, right? You wouldn’t want to mix it with /var in general.

But that’s the only thing that is writable. Usually when a container starts, say a pod which is a bad pod from a pod security standards point of view, which tries to do hostPath mounts, still, when it starts, the only thing it can mount as writable is something under /var, in general.
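The layered mount tree described above (a read-only root with a few writable mounts on top) can be modelled in a few lines of Python. This is a simplified illustration, not how Talos or the kernel implements mounts: the table only lists the three mount points named in the conversation, and `is_writable` is a hypothetical helper that picks the deepest mount covering a path.

```python
import posixpath

# Simplified mount table based on the conversation; a real system has more entries.
MOUNTS = {
    "/": "ro",      # SquashFS root, read-only
    "/tmp": "rw",   # tmpfs mounted on top of the root
    "/var": "rw",   # writable data that comes from disk
}


def is_writable(path, mounts=MOUNTS):
    """Return True if the deepest mount covering `path` is read-write."""
    path = posixpath.normpath(path)
    best = "/"
    for mount_point in mounts:
        prefix = mount_point.rstrip("/") + "/"
        covers = path == mount_point or path.startswith(prefix)
        if covers and len(mount_point) > len(best):
            best = mount_point
    return mounts[best] == "rw"


if __name__ == "__main__":
    for candidate in ("/etc/hosts", "/tmp/scratch", "/var/lib/etcd"):
        print(candidate, "writable" if is_writable(candidate) else "read-only")
```

A workload that asks for a writable host mount under /etc falls through to the read-only root, which matches the behavior discussed earlier.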

Josh Bressers (18:30) Okay, so I have a question then that I don’t know the answer to. So, you’re talking about hosts being ephemeral. So if I reboot a Talos node, is the configuration information stored on disk or does it reacquire that from the cluster when it boots up?

Andrey Smirnov (18:46) Yes, the default behavior is that it is actually stored on disk. I probably skipped that part, because I was talking more about how the boot process works. But in general, when you install Talos, where installing literally means that you put a bootloader and that

Josh Bressers (18:53) Yeah.

Andrey Smirnov (19:05) boot asset file, which is the single UKI file, on disk, Talos will allocate some system partitions, which are very tiny. One of them will be used to store files, including the machine configuration. So once you install, it’s there. But some people, for example, want to run the complete opposite of that. They want to run an ephemeral Talos node, which boots over the network.

Everything is tmpfs, and if you reboot that node, which is usually a worker, obviously, in the Kubernetes world, it’s all gone. There is no trace left. It is completely fresh. But usually people, yeah, they would rather have some persistence.

Josh Bressers (19:39) Yeah. Yeah.

For sure, for sure. Okay, so what does an upgrade look like then in this immutable universe?

Andrey Smirnov (19:50) Upgrade is very simple. We literally pull new boot assets over the network from some source. In the case of Talos, usually it’s the container registry, so it’s another image. We unpack that, write the new boot assets to the boot partition, and we reboot. That’s it. The whole of Talos is contained in this UKI file, and this UKI file is what? It’s around 100 megabytes, probably a bit more today. Yeah, that’s the whole thing. You just lay down a new one and you reboot to use this new image and that’s it. You get the new Linux kernel, probably, if you use the new Talos version, and you get a new initramfs which contains that new immutable image, which comes with new tools or new Talos core code, and that’s it. There is…

Josh Bressers (20:18) It’s really small, yeah.

Andrey Smirnov (20:43) There is nothing else. And you can also roll back, right? We keep at least two boot assets available. So if things go wrong, you can basically say in the bootloader, I want to boot the previous version. It’s kind of this A/B scheme. I think it’s very simple when you talk about it. And it is very simple, but this idea is quite different, right? From what…

Josh Bressers (21:12) Yeah, yeah.

Andrey Smirnov (21:13) many operating systems do.
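The A/B scheme from this exchange can be sketched as a tiny state machine in Python. It is an illustration under simplifying assumptions, not the Talos bootloader logic: `BootSlots` is a hypothetical class, the version strings are placeholders, and only two slots are tracked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootSlots:
    """Two boot assets: the one in use and the one kept for rollback."""
    current: str
    previous: Optional[str] = None

    def upgrade(self, new_image: str) -> None:
        # The running image becomes the fallback; the new image takes over
        # on the next reboot.
        self.previous = self.current
        self.current = new_image

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous boot asset to roll back to")
        # Swap, so the image that misbehaved is still kept around.
        self.current, self.previous = self.previous, self.current


if __name__ == "__main__":
    slots = BootSlots(current="image-A")   # placeholder image names
    slots.upgrade("image-B")
    slots.rollback()
    print(slots)
```

Because an upgrade replaces the whole image rather than individual packages, rolling back is just choosing the other slot.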

Josh Bressers (21:15) It feels better though, because again, I’m not SSHing into a Debian system that I’ve messed up in more ways than I can count. Right? I love the idea. Immutability I don’t even see as much as a security feature as I see it as a way to protect myself from myself. Because present me hates future me and I’m always screwing that guy over. Right? I cannot do that with this. So I love it.

Okay, I want to talk about the minimal aspect as well, because you say on your website that there’s no sshd running in these systems, right? It is strictly over your API that things communicate, that configuration gets set, and I feel like that is a concept we are not used to in the world of computers, right? We are used to connecting to a system, sitting down at a system, monkeying with it, in my case breaking it, and then walking away, right? Whereas that is not at all what this does. And it is amazing, and I think terrifying at the very first, to think, like, you can’t do anything to this.

Andrey Smirnov (22:21) Yeah, you have to accept that, I guess, because there are certainly pros and cons. On one hand, like you said, if you know you can’t touch it manually or manually modify something, that means that the machine is totally in the state that was set by the configuration. On the other hand, you feel that you don’t have all the tools and all the power that decades of Unix and Linux have made available to you. Like, usually you want to run the ip utility. Now you have to learn that that’s a resource, and you have to do, I don’t know, get addresses if you want to see addresses, instead of “ip a” or something. And I think it’s always a compromise we have, because we try to build that new, completely new system that operates on a different paradigm, at the API level, right? And we certainly can’t ever be feature complete if you compare that to a regular Linux distro. Someone says, oh, I want to set up some VLAN and connect it to a bridge this way. And if we don’t support this today, we have to add support for it, right? So…

That is always there, that kind of going back and forth. And in the end, sometimes, until we add real first-class support for something, you have to rely on some hacks to make it work, which certainly is not the best. So you have to kind of accept the fact that you will be limited in some ways. If that works for you, if what Talos Linux offers is enough, you’re in a good spot, because you get all the benefits. If you need something that is absolutely not supported by Talos today, you probably either have to wait, right, or you might not have the best experience. So I guess there is this compromise.

Josh Bressers (24:37) That’s a fair point.

Yeah. It’s not something I’d really thought about, but I’m not doing anything exciting either. I’m just running simple Kubernetes instances, either on my laptop locally or on a couple of nodes I have that are just VMs, you know, out in my network. And I don’t have VLANs or anything. That’s interesting. That’s fair. And yes, it should be. Okay. Okay.

Andrey Smirnov (24:55) We do support VLANs, right? But it sometimes comes to this: I want to have a bridge, for example. I mean, dumb example, but still. Talos assumes that if it manages something, it manages it. For example, if you create a bridge with Talos and attach some ports to it, Talos will ensure that this bridge exists and only these ports are attached to it. And then…

Someone says, I want to run KubeVirt, and KubeVirt will attach something else to that bridge, or it could be not KubeVirt, I don’t remember exactly. And then Talos sees there is a new port that it’s not aware of, and removes it, right? Because what Talos is built on internally is the Kubernetes concept of constant reconciliation. If it notices a change, it will try to reconcile it back to the state it should be in. So it will, for example, revert. If you go and

Josh Bressers (25:27) ⁓ okay.

Andrey Smirnov (25:51) I don’t know what a good example is. Like, go and manually delete an address from a link that is managed by Talos. Talos will go and add it back. It will say, no, it should be there, so I will add it back. It won’t even let you drift from the configuration it was supposed to have.
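The constant reconciliation Andrey describes boils down to comparing desired state with observed state and undoing any drift. The Python sketch below shows that comparison for a set of bridge ports; it is a generic illustration, not Talos controller code, and the interface names are invented.

```python
def reconcile(desired, actual):
    """Return the actions needed to bring `actual` back to `desired`.

    Both arguments are sets of item names, for example ports on a bridge
    or addresses on a link.
    """
    to_add = sorted(desired - actual)      # missing: was deleted by hand
    to_remove = sorted(actual - desired)   # unexpected: attached by someone else
    return [("add", item) for item in to_add] + [("remove", item) for item in to_remove]


if __name__ == "__main__":
    desired_ports = {"eth0", "eth1"}          # what the machine config declares
    observed_ports = {"eth0", "veth-other"}   # eth1 removed, a foreign port attached
    for action, port in reconcile(desired_ports, observed_ports):
        print(action, port)
```

A real controller would run this comparison whenever it observes a change and then apply the resulting actions, which is why manual edits do not stick.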

Josh Bressers (26:08) I like this, because I am notorious for SSH-ing into a node, creating some weird firewall rule or port forward or something, forgetting I did it, and then spending like 10 hours in a month trying to figure out why something isn’t working. It’s because I rebooted the machine and now my firewall rule is gone. So I feel like that is a good thing.

Andrey Smirnov (26:33) Yeah, certainly, I do feel the same way, right? And we try to add as much stuff as we can with each release and support new things. Sometimes it takes time to actually figure it out. We don’t want to just copy or basically re-implement the way you, I mean, let’s take a simple example. How do you manage your storage in general?

In traditional Linux, you will SSH in, you will do something like gdisk or whatever, fdisk, partition that, mkfs something, right? Then go to fstab, mount that. Certainly, with systemd for example, things are changing and it offers better ways, like auto-mounting based on partition UUIDs and stuff like that. There are many changes there, but the traditional way is more manual; you have to configure this and that. And with Talos, we thought, okay, we don’t want to do that. We want that experience to be closer to how Kubernetes allows you to allocate resources. So we want to have something which is: I want to have a user volume, and I specify the size I want it to be, the file system it should be, the name of it, and, if I want to be specific, which disk or disk type it needs to use, like I want it to be on an NVMe disk.

That’s all I put. And Talos will figure it out and allocate that on a proper drive the first time. Next time, it’s there, and it will just use it. That’s a bit different from the traditional OS experience, and it’s what we’re offering in many places. And sometimes it takes time even to figure out what the correct way is to, I don’t know, allocate your disks, for example.
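The declarative volume request described here (name, size, file system, optional disk type) can be pictured with a small Python sketch. Everything in it is hypothetical: `VolumeRequest`, `Disk`, and `pick_disk` are invented for illustration and do not mirror the Talos machine configuration schema or its actual placement logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Disk:
    name: str
    transport: str   # e.g. "nvme" or "sata"
    free_gib: int


@dataclass
class VolumeRequest:
    name: str
    size_gib: int
    filesystem: str
    transport: Optional[str] = None   # only set when the user wants to be specific


def pick_disk(request: VolumeRequest, disks: List[Disk]) -> Optional[Disk]:
    """Return the first disk that satisfies the request, or None."""
    for disk in disks:
        if request.transport is not None and disk.transport != request.transport:
            continue
        if disk.free_gib >= request.size_gib:
            return disk
    return None


if __name__ == "__main__":
    disks = [Disk("sda", "sata", 500), Disk("nvme0n1", "nvme", 200)]
    request = VolumeRequest(name="data", size_gib=100, filesystem="xfs", transport="nvme")
    print(pick_disk(request, disks))
```

The user states what they need, and the system decides where it goes, in contrast to the partition, mkfs, fstab sequence.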

Josh Bressers (28:20) Yeah, yeah, I mean, your docs are pretty good. I figured that out quickly. And I’ll tell you my experience. You know what I did? I SSH into my system. I set up the file system. I forget to add it to the fstab. I mount it manually. And then when I reboot, the disk is gone and everything explodes. That’s my usual experience with this. So, you know, I think that’s the reason I love this so much, and why I was instantly enthralled by Talos: it has removed my ability to do dumb things.

Andrey Smirnov (28:37) Right.

Josh Bressers (28:49) That I think is the most appealing aspect of this. But, okay, Andrey, I could talk to you all day. I’m not going to. We’ll have to have you back at some point, like, let’s take us home. I’ll give you the floor, talk about whatever you want us to know, like what’s a good way to try it out? How can we get involved? What does this project look like? Like fill us in on kind of next steps for anyone listening that’s interested.

Andrey Smirnov (29:13) So, I mean, yes, everything is on GitHub, github.com/siderolabs/talos, yeah.

Josh Bressers (29:19) And I’ll have links in the show notes for everything. So anyone who’s interested, just go check those out.

Andrey Smirnov (29:24) So we are pretty open in the way we do development. Issues, pull requests, it’s all open. We even have our planning board in GitHub Projects, also open to everyone. We do need help in many areas, right? I think lots of projects would say that. We need more testing. We need great bug reports, and I really love our community when they find some very non-trivial bugs, when actually finding the bug or finding the root cause of the issue is way more complicated than fixing it. The fix might be a one-liner, right? But go figure out why this happens. We certainly do need help in these areas, and in the documentation.

And coding. For example, I can say that we hired two people to our engineering team just because they opened cool enough PRs. Yeah, you just see a person come up with a PR to add hardware watchdog support for Talos. And they never asked any questions. They just.

Josh Bressers (30:30) Awesome.

Andrey Smirnov (30:44) The PR is not perfect, but the way they figured out how to do all of this, how to do machine configuration changes, how to connect things together. And I say, my God, right? This person never asked a question and they’re able to do such a PR. They figured out how to test that and all that stuff. So yes, we need to talk. So certainly, PRs, we’re very, very open to that. We recently started adding more security-related stuff, for example SBOMs. I would say they certainly don’t guarantee security on their own, right? But we hit lots of interesting issues along the way, for example with the CVE database not being accurate all the way. You run the scan on what you think is your perfect SBOM. I mean, it perfectly describes all the components, but then you get CVEs which incorrectly report the fixed versions, or sometimes two-year-old CVEs randomly appear once again with the wrong version attached to them. So we started keeping this vulnerability exclusion file to fix it up, to actually have a clean run. But we try to be serious about it.

For example, you can’t merge a pull request to Talos if it doesn’t pass the Grype scan on the SBOM. So if there is a new vulnerability, and the same goes for Go dependencies, specifically with the govulncheck tool, we just say, okay, we have to stop whatever we were doing, go back and update something that needs to be updated, whether it’s the Go toolchain that reports a vulnerability or whatever.

If we need a new Linux release to get it fixed, we’ll go back and try to fix that. We kind of try to hit zero CVEs, at least at the moment it gets released, because certainly over time the Linux kernel will publish at least a handful of CVEs, I guess, in a month. Right.
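The merge gate plus exclusion file Andrey describes can be reduced to a simple filter, sketched below in Python. This does not use Grype's or govulncheck's real output or configuration formats; the finding dictionaries, the `id` field, and the placeholder identifiers are assumptions made only to show the logic.

```python
def unexcluded_findings(findings, exclusions):
    """Drop findings whose ID is listed in the exclusion file."""
    excluded = set(exclusions)
    return [finding for finding in findings if finding["id"] not in excluded]


def gate_passes(findings, exclusions):
    """The merge gate passes only when no unexcluded finding remains."""
    return not unexcluded_findings(findings, exclusions)


if __name__ == "__main__":
    # Placeholder identifiers, not real CVEs.
    scan = [
        {"id": "CVE-0000-0001", "package": "example-lib"},
        {"id": "CVE-0000-0002", "package": "example-tool"},
    ]
    known_false_positives = ["CVE-0000-0001"]
    print("merge allowed:", gate_passes(scan, known_false_positives))
```

The exclusion list handles scanner false positives, such as wrong fixed versions, while any genuinely new finding still blocks the merge.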

Josh Bressers (33:01) A handful? A bucketful, I think, with those guys.

Andrey Smirnov (33:06) That’s what we try to do. We try to have these patch releases which update the affected components. And we would really love our users to upgrade and get them. We try to make this upgrade as not scary, I would say, as we can. Because there is also, I think, a thing in the industry that people hate to upgrade in general.

That doesn’t help, right? And we try to make sure that at least there is a release available. I know some users trust us so much that they set up automatic upgrades, which certainly sounds a bit scary, right? But probably if it’s your home lab or something and you know a way you can recover, right, or roll back, that’s fine. But certainly, I mean, I would recommend to, yeah.

Josh Bressers (33:51) Very brave.

Andrey Smirnov (34:04) We have lots of stuff like that. And you mentioned a way to create local Talos clusters. We had support for QEMU on Linux for a long time. And in the very last release, we added support for macOS, basically, well, QEMU using the native hypervisor on macOS, because it turns out many people today who are doing stuff have their Mac laptops.

This is, I think, very cool, because the way it runs inside a VM, you can actually experiment with anything. I mean, you can boot your small cluster in the VM, apply your workloads, your production ones, and try out an upgrade or try out a config change, see how it goes. It certainly doesn’t replace a staging environment, but sometimes that’s way easier to experiment with, I would say.

Josh Bressers (35:05) And it runs in Docker. Like I’m running it on my laptop in Docker, which is amazing.

Andrey Smirnov (35:07) In Docker as well,

Yeah, Docker still doesn’t give you the full experience, like upgrades are not available and stuff like that. Or you run with the host Linux kernel, not with the Talos one.

Josh Bressers (35:14) Yeah, yeah.

Yeah, yeah. But still, it’s very easy. Like, very easy.

Andrey Smirnov (35:25) But yeah,

talking about security, for example, we have a kernel hardening checker running against our kernel config all the time. This cool project includes things like KSPP and other recommendations built in, so we try to make sure. We certainly do have exceptions, I should admit. We had to flag some things as acceptable for us, right? But we still try to.

Josh Bressers (35:33) Nice.

Andrey Smirnov (35:53) to do whatever the best security practices are in kernel configuration. Or, getting back to how I think security is painful, I can give one more example of that. For example, when we build the kernel, the Linux kernel I mean, we create a completely ephemeral signing key, which is discarded

Josh Bressers (35:58) Nice.

Andrey Smirnov (36:19) after the build and we sign the kernel modules with that key and the key itself goes into the kernel. Which means that after this is built, nobody in the world can actually build a kernel module that Talos will accept and load. Which certainly makes it painful sometimes when someone says, I have this proprietary driver kernel module that I want to run on Talos and there is no source available so we can’t like…

Josh Bressers (36:22) Yep. Yep.

Andrey Smirnov (36:48) really give it to you guys so you can include it into your build system and sign it. But at the same time, that means that you can’t, at least, deliver any exploits by loading a kernel module. That path is just not available.
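The ephemeral-key idea can be sketched in a few lines of Python. This is a conceptual stand-in, not how Talos or the Linux kernel implements it: the real kernel mechanism uses X.509 certificates and a signature on each module, whereas this sketch uses a Lamport one-time signature over a manifest of module hashes, only because that needs nothing beyond the standard library. All function names and the manifest format are hypothetical. What it does show is the property described above: the private key exists only during the build, the public key ships with the image, and nothing built afterwards can pass verification.

```python
import hashlib
import secrets


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _bits(digest: bytes):
    """Yield the 256 bits of a SHA-256 digest, most significant bit first."""
    for byte in digest:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def generate_keypair():
    """Create a Lamport one-time key: 256 pairs of secrets and their hashes."""
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    public = [(_h(a), _h(b)) for a, b in private]
    return private, public


def sign(private, message: bytes):
    """Reveal one secret per digest bit. A Lamport key must sign only once."""
    return [private[i][bit] for i, bit in enumerate(_bits(_h(message)))]


def verify(public, message: bytes, signature) -> bool:
    """Check that every revealed secret hashes to the matching public value."""
    if len(signature) != 256:
        return False
    return all(
        _h(sig) == public[i][bit]
        for i, (bit, sig) in enumerate(zip(_bits(_h(message)), signature))
    )


def make_manifest(modules: dict) -> bytes:
    """One 'name:sha256' line per module (hypothetical manifest format)."""
    lines = [f"{name}:{hashlib.sha256(blob).hexdigest()}"
             for name, blob in sorted(modules.items())]
    return "\n".join(lines).encode()


def build(modules: dict):
    """Simulate a build: sign the manifest once, then drop the private key."""
    private, public = generate_keypair()
    manifest = make_manifest(modules)
    signature = sign(private, manifest)
    del private  # never returned or stored, so it does not outlive the build
    return public, manifest, signature


def is_loadable(public, manifest: bytes, signature, name: str, blob: bytes) -> bool:
    """Accept a module only if it is listed in a validly signed manifest."""
    entry = f"{name}:{hashlib.sha256(blob).hexdigest()}"
    return verify(public, manifest, signature) and entry in manifest.decode().split("\n")


public, manifest, signature = build({"net.ko": b"driver code"})
print("built-in module accepted:",
      is_loadable(public, manifest, signature, "net.ko", b"driver code"))
print("later module accepted:",
      is_loadable(public, manifest, signature, "evil.ko", b"exploit"))
```

Because `build` returns only the public key, the manifest, and the signature, there is nothing left that could sign a module compiled later, which is also why a proprietary binary-only driver cannot simply be added after the fact.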

Josh Bressers (37:06) Right,

right. And you probably don’t have the tools, because I know like I used to work at Red Hat and we did exactly this, where we would sign all the modules with an ephemeral key, throw it out. And then if someone needed to sign or wanted to load signed modules, they had to then load another key into the kernel key store and then it could do it. But that’s a bunch of extra tooling and probably not as friendly on an ephemeral or I’m sorry, a read-only file system. Anyway.

Andrey Smirnov (37:35) Yeah, we were looking into that. Usually the problem is that you want that available super early. What if that thing is a network driver, right? So we’re looking into actually offering a way to embed your own custom key into the kernel image, which the kernel supports, but that would still require you to build a bit of your own image. But I mean, what I’m trying to say is that it is painful, and the same thing will apply to, for example, NVIDIA. We can’t…

Josh Bressers (37:41) Yeah, yeah. Right, right?

Right.

Yeah, yeah, yeah, super painful.

Andrey Smirnov (38:05) offer, like, the way the NVIDIA operator usually works is that it literally checks out the Linux source tree and the NVIDIA source, builds that kernel module on the same machine, and loads it. I’m sorry, this is not available with Talos. So we have to pre-build NVIDIA kernel modules and make them available as what we call extensions. So actually, when I say Talos is minimal, it’s still true. But when people want something else to be there, they have a…

variety of extensions they can pick from, and they use what we call Image Factory to actually create their own image, where you can pick with the checkboxes: I want this and this. So you can pick your NVIDIA extensions bundled in, and you will have your image which already contains the NVIDIA kernel module. So you just have to load that. But still, we can’t offer it the way NVIDIA envisioned, because of that module signing, which just won’t work.

Josh Bressers (39:03) Right?

Well, this is hard to do on a traditional Linux distro and you’re running it in hard mode.

Andrey Smirnov (39:12) Yeah.

Yeah. So we had lots of pains with that. Or NVIDIA, once again, strikes again: the way it works, it relies very heavily on glibc. It assumes that it is there on the host, and it will mount that into the actual container workloads which are enabled with NVIDIA. And Talos doesn’t have glibc, it’s based on musl, once again for security reasons. We try, like, the…

Josh Bressers (39:25) Yeah. Yeah.

Oh, right.

Andrey Smirnov (39:42) C library is musl. We don’t have glibc in the default install, but if you do NVIDIA, we actually have, once again, an extension which provides you glibc, but that is installed to a non-default location, because we can’t install to the default one since musl is there, right? But with that non-default location, once again, we have cascading pains: we have to teach the NVIDIA container runtime where it is, and it’s in a different location, you have to use it yet.

Josh Bressers (39:54) ⁓

Right.

Oh my goodness.

Andrey Smirnov (40:10) So lots

of, I guess, lots of pain we made for ourselves, but we’re trying to offer actually better security out of the box, I think.

Josh Bressers (40:21) I love it.

I love it. And we’re going to end it there. That’s perfect. Yes. Better security out of the box. That’s fantastic. And Andrey, I want to thank you so much. This has been fascinating and I have learned so much. I appreciate it.

Andrey Smirnov (40:34) Thank you, Josh. Thank you for having me. That was a great conversation. Yeah.
