In this episode, Josh and Otto dive into the world of Debian packaging, exploring the challenges of supply chain security and the importance of transparency in open source projects. They discuss Otto’s blog post about the XZ backdoor and how it’s a nearly impossible attack to detect. Otto does a great job breaking down an incredibly complex problem into understandable pieces.
Episode Links
This episode is also available as a podcast, search for “Open Source Security” on your favorite podcast player.
Episode Transcript
Josh Bressers (00:00) Today, Open Source Security is talking to Otto Kekäläinen, an independent consultant. I asked Otto to come on the show. He wrote a blog post that, let me pull up the title, I have to scroll from the bottom of this post here, but he wrote a post called, Could the XZ backdoor have been detected with better Git and Debian packaging practices? And I feel like this is one of the most reasonable and insightful and technically accurate blog posts about XZ I have ever read. So thank you so much, Otto, for agreeing to chat with me.
Otto Kekäläinen (00:28) Yes, nice to be here.
Josh Bressers (00:31) All right, so let’s just start with your post. I mean, let’s kind of start at the top. obviously if anyone listening who doesn’t know what XZ is, like just go figure it out, right? It’s been in the news and I’ve talked about it at length, but just let’s start at the top and kind of walk us through this post. Cause there’s so much, I red meat in this. I love it so much.
Otto Kekäläinen (00:56) Yeah, so I’m here with my Debian developer hat on, and I was curious to see: if a Debian developer is packaging and importing a new upstream release correctly, then what can the Debian developer see in the Git diffs and other places, and would it have been feasible to discover that there was something off with this package? And yeah.
So XZ is a compression library, and there was a very elaborate supply chain attack on it. A malicious actor was contributing to the project, gained maintainership in the project, and was able to add, obviously not a visible backdoor, but some pieces that together formed a masqueraded backdoor, and to make an actual release that would have gotten out to users at large and Linux distributions. And we were extremely lucky that Andres Freund, who was doing micro-benchmarking on SSH, noticed that it slowed down. The start of SSH slowed down by half a second. And he was curious enough to figure out what was going on, what it was actually loading, and he figured out that it was loading something extra with this compression library. And that’s how this got uncovered.
Josh Bressers (02:34) Yes. Okay. And now let’s talk about what a Debian package means in this context. Cause I have a suspicion there are going to be a lot of people listening that don’t actually understand what it means to go from, let’s say something like source code to a Debian package that gets installed. Because I think this is a fascinating process and obviously all the distributions kind of do it their own special way. But Debian I feel like is unique for maybe reasons that are good and bad.
Otto Kekäläinen (03:04) Yeah, that’s a great question Josh, because to be able to discuss the supply chain, we have to spend a bit of time on the individual links in the chain, what’s going on here. So I imagine most of the listeners have experience with running apt install on Debian or maybe on Ubuntu, or if not that, then at least dnf install on Fedora or something else. And people probably know that part, but people might not be that aware of what happened to the package before it ended up on the Linux distribution’s server. And in the case of Debian, there are about 1,300 people who have permissions to upload stuff to the Debian archives. And Debian, like Fedora and most Linux distributions, works slightly differently than modern mobile app stores, in that it’s not the original maintainer, or like the original upstream, that uploads their stuff themselves. In the Linux ecosystem, there’s a specific role of packagers, who choose what they think are important enough pieces of open source to be included in the Linux distribution. And then they take the open source upstream and do the so-called packaging process for it and put it in the Linux distribution so it’s available for the users there.
And there’s a whole bunch of integration work going on. So it’s not arbitrary that there are these packaging people. It’s kind of necessary to be able to get all of these libraries and everything to compile together, so they are in sync and integrated with each other. And because Linux distributions do it differently, it requires a bit of expertise in that domain to be able to do the packaging.
Now, if somebody is interested in seeing the Debian source code, you can actually just run apt source and then the package name on a Debian or Ubuntu system, and it will download the upstream source code package and also the Debian piece. And people who are familiar with RPM packaging might know that there’s a spec file in the RPM world.
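For reference, that command looks roughly like this (the package name and version numbers are just illustrative, and a deb-src line must be enabled in your apt sources):

```shell
# Download both the upstream tarball and the Debian packaging for a package.
# Requires a "deb-src" entry in /etc/apt/sources.list or sources.list.d/.
apt source xz-utils

# Typical results in the current directory (version numbers illustrative):
#   xz-utils_5.6.1.orig.tar.xz       upstream source tarball
#   xz-utils_5.6.1-1.debian.tar.xz   the debian/ packaging overlay
#   xz-utils_5.6.1-1.dsc             signed source control file
# plus an unpacked xz-utils-5.6.1/ tree with debian/ applied.
```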
And in the Debian world, there’s not one single spec file equivalent. There’s a subfolder called debian, and that has a whole bunch of files. And it has a slightly different philosophy, because it tries to be deterministic: the file lists and things are listed out there, while a spec file is more like an enhanced shell script, basically, that puts the thing together. In Debian there’s only one file, the rules file, which is kind of a shell script equivalent (it’s actually a makefile), and then there’s a lot of other files that are just lists, kind of deterministic in how they’re going to behave. So that is slightly more complexity, but it’s also more controlled in what it’s doing. And yeah, so these 1,300 people take the upstream source code and add the packaging, or they take an existing Debian package and import the new version of the upstream, and they might need to do some adjustments there, and then they upload it to the Debian archive. They upload the source code, and the Debian build machines will then build it into binaries, and those will go out to all the users. So this is the chain: you have clearly a separate upstream, which is the origin, and then you have the people in the Linux distributions doing the importing. And it could be a really good resource, or a service, to all the users that it’s not actually the same people: you have an extra pair of eyes looking at the upstream release before it goes out to all of the users, improving the quality there.
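For readers who haven’t seen one, a minimal debian/rules in the modern dh style looks something like this; the override is a hypothetical example, not taken from any specific package:

```make
#!/usr/bin/make -f
# debian/rules is a makefile, not a shell script: the dh sequencer
# drives the whole build, and overrides customize individual steps.
%:
	dh $@

# Hypothetical example: pass an extra flag to ./configure.
override_dh_auto_configure:
	dh_auto_configure -- --disable-static
```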
Josh Bressers (07:25) Right, well that’s the intent, right? And the devil’s in the details, of course. So this is where you kind of start in your blog post and you talk about verifying the checksums and the signatures of the upstream sources, which is there have been supply chain attacks in the past where malicious entities have modified source code on a mirror or something like that. And obviously this is where checking a digital signature, like an open PGP,
Otto Kekäläinen (07:28) Yeah.
Josh Bressers (07:55) signature or check even just that checks some of the file will kind of protect against that particular instance, right?
Otto Kekäläinen (08:02) Yeah, so there are a lot of different signature schemes at the moment, and there are new ones, Sigstore and others. But the traditional and most prevalent one is that people use OpenPGP. Each person has their own private/public key pair, and then you sign your stuff. So I, as a Debian developer, can upload stuff to Debian because a package that’s signed with my OpenPGP key will be accepted by the Debian servers when I upload it. And a lot of upstreams sign their releases with their OpenPGP keys. Well, in this case, all the keys matched everywhere, and all the git tags and everything matched, because this is a supply chain attack in the upstream project itself. It’s not that somebody diverted the download traffic or anything like that and tried to go in between. So here all the signatures are actually completely valid. But in my blog post I show how you check them, because in some cases somebody could try to circumvent that.
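A rough sketch of the checks being discussed (file names are illustrative; the gpg step assumes the upstream signing key is already in your keyring):

```shell
# Verify an upstream OpenPGP signature on a release tarball:
#   gpg --verify xz-5.6.1.tar.gz.sig xz-5.6.1.tar.gz

# Checksum verification works the same way; demonstrated on a local file:
echo "release contents" > demo.tar.gz
sha256sum demo.tar.gz > SHA256SUMS
sha256sum -c SHA256SUMS    # prints "demo.tar.gz: OK"
```

As Otto points out, in the XZ case every signature and checksum was valid because the attacker held the keys; these checks defend against tampered mirrors and downloads, not a malicious upstream.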
Josh Bressers (09:18) Right. Which is, I mean, I can’t remember which distro it was, it may have been Debian, not to pick on you too much, but recently Python changed to Sigstore, and then someone realized that one of the distros hadn’t been checking the PGP signatures, like accidentally, for years and years and years, and no one noticed. That’s rough.
Otto Kekäläinen (09:46) I don’t remember that one, but there are currently gaps in Debian’s infrastructure in that all of this key checking is not done as well as it should be, and I hope that we will have enough people to fix that going forward.
Josh Bressers (09:52) Yeah, sure, sure.
It’s a tough problem, right? I mean, public key infrastructure has always been horrible. But then, okay. So I want to keep going, cause I know we could complain about signatures for hours and not run out of content, but we’re not gonna.
Otto Kekäläinen (10:13) Yeah.
Yeah, and it’s challenging, because you need the technology, but the topic itself is also a little bit complex. So for people who just want to ship their software, if they mess up these kinds of quality issues, it’s not visible anywhere. It’s only visible once you get the security issue. And it’s hard to motivate people that we need to do this properly because of some abstract threat somewhere.
I think this blog post is interesting, and it was interesting for me to do, because this was not abstract. We have here an actual backdoor. So that motivated me to check what all the steps were here and whether we could have discovered it earlier.
Josh Bressers (10:56) Yeah. Right, right.
Okay, I wanna keep going, because I actually love this next one, where you talk about reviewing the changes between the two source packages using Diffoscope. I don’t even know what Diffoscope is, but I guess I can fathom the concept here, right? Where, when you say source packages, you are specifically talking about the Debian packages, right?
Otto Kekäläinen (11:22) So in Debian, when you do this apt source download, you will get both the upstream source package and the Debian part as well. So what I’m doing here with Diffoscope, well, with Diffoscope you can compare anything, and I’m comparing a couple of ways here. I’m comparing, for example, the upstream source package I can download myself from upstream and then the source package that existed in the Debian archives. So one good part here with the Debian infrastructure is that all the versions that ever existed in the Debian archives are archived forever. So you can go back and audit any old version as much as you want. And Diffoscope also originates from Debian; it was Debian developers who created it originally. And it’s a very handy and very
Josh Bressers (12:03) Yes.
Otto Kekäläinen (12:19) versatile tool. You can basically give it any two versions of the same file type, or you can even give it two directories, and it has a massive amount of plugins, so it can not just compare text but also different binary formats and show what the structural difference is between those binary formats. And I’m showing some of these tools in my blog post. So, Diffoscope is a generic tool to compare anything; it can be binary as well. And another tool that I like a lot is called Meld. It only compares text, but it’s a very nice visual representation of the difference between two or three text documents. And I use that for comparing source code, and it has an integration with Git. So if you use the git difftool command, it will automatically open stuff in Meld. So it’s easy to compare what the changes are.
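The comparisons described here look roughly like this on the command line (file names illustrative; both tools are packaged in Debian):

```shell
# Structurally compare two tarballs; diffoscope recurses into archives
# and many binary formats via its plugins:
diffoscope xz-utils_5.6.1.orig.tar.xz xz-5.6.1.tar.gz

# Use Meld as the visual diff viewer for git:
git config --global diff.tool meld
git difftool              # opens each changed file in Meld
```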
Josh Bressers (13:20) Nice. I had forgotten the name Meld. So, I worked at a company called Progeny Linux Systems like a million years ago, and I did a generous amount of Debian shenanigans for them. And I remember, yeah, creating backports from upstream and making them work was always bananas. Anyway. Okay. I also want to call out something in your blog post, this made me laugh out loud.
You have a section called reviewing Debian source packages and version control. And I’m gonna read this verbatim, because you say, as of today, only 93% of Debian source packages are tracked in Git on Debian’s GitLab instance. 93% would be a roaring success in nearly any other industry, and you said only 93%, which I love. But explain what that means to us, right?
Otto Kekäläinen (14:09) Yeah, so once a package is in Debian, it’s very permanently and very well archived. There are these snapshot servers where you can get any version of any package that was ever in Debian. But the development part: the Debian archives only archive the versions that were released. I think most developers in the modern world expect that there should be a Git repository somewhere, version control somewhere, where you could go and look at not just individual releases, but drill down into the commits that the developer was doing that led to that release. And for Debian, the most popular place right now is called salsa.debian.org. It’s a custom GitLab instance for Debian, and 93% of the packages are there.
So that’s good, but it’s not all of them. And there are key packages, like bash for example, that are not in version control anywhere. We are just trusting the bash maintainer that all the releases are good, and we track them by seeing the new releases in Debian.
And this comes down to the quality part: it’s not directly wrong. It works, and it has been working for 30 years. You can’t say it’s broken. It’s just: where do you set the bar, and what are the abstract expectations, and how do you convince a thousand, or 1,300, people to do a specific thing in a specific way?
Josh Bressers (15:59) Yeah, yeah, for sure. And you actually have a section in here. I don’t understand this one completely, but you talk about creating synthetic Debian packaging Git repositories. There’s a tool that does this. Yes, that’s cool.
Otto Kekäläinen (16:11) Yep. Yes.
So, git-buildpackage has this built-in feature. Normally you use git-buildpackage to manage your Debian packaging in git, but it also has this feature that it can take all the historic versions from the Debian snapshot servers and then, kind of synthetically, create the git history. But in that history, one commit represents one release. So it’s not granular and it doesn’t tell you anything extra, but it makes it easy: if you have some real history somewhere else, you put those in the same Git repository and then you can compare the branches and commits. So, my post is pretty long, it’s like 5,000 words, but it’s kind of necessary, because there are several ways to compare these raw packages, and then to use different git commands to compare versions in git itself, and then you can also synthetically create git history with git-buildpackage.
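If I read the feature right, the synthetic-history import is gbp import-dscs in its snapshot mode (package name illustrative; it needs the debsnap tool from devscripts and network access to snapshot.debian.org):

```shell
# Build a git repository with one commit per historical Debian upload,
# fetched from snapshot.debian.org, starting in a fresh directory:
mkdir xz-utils && cd xz-utils
gbp import-dscs --debsnap xz-utils
```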
Josh Bressers (17:22) Yeah, right, right. Which you then go on after that to show how you can compare the Debian Git repository, which obviously you can make if it doesn’t exist, to the upstream source code. And it’s like, we don’t need to go into all the details. I’m going to put a link to this in the show notes for anyone interested. This blog post is 100% worth the read. It is incredibly long, as Otto pointed out, but it is so good. But I mean, so this, I guess, comes back to the whole thing in the XZ universe.
Otto Kekäläinen (17:35) Yep.
Josh Bressers (17:51) Everything Debian was packaging completely matched upstream because obviously the problem was like literally in the upstream developer, right?
Otto Kekäläinen (18:02) So, XZ had a… I found that one changelog entry, which is just text describing what’s in the new version, wasn’t exactly the same in the Debian Git packaging repository as what was uploaded to the actual Debian archives. But mostly this package has been kept pretty well in the Git repository. And it has a structure where the Debian Git repository also has the upstream Git history. So you can git blame on files and see what upstream commit something actually originated from, even if you’re looking at the Debian branch. And you can also freely use any Git commands to compare and drill into the upstream history. And then there’s also a special thing in Debian packaging when you import the upstream package, and XZ is a great example of that. You can’t necessarily just import the Git history and the Git release of the upstream, because sometimes upstreams do a custom tarball release, and the custom tarball release might have extra documentation or extra makefiles, or they might have removed large amounts of binary test files. I know, for example, some games have a lot of binary stuff which is necessary but then removed from the release, so the tarball is way smaller than if you had done a git checkout. And the way git-buildpackage works is that it first takes the upstream git history, then it has a special branch called upstream, and then it takes the tarball import and it
Josh Bressers (19:44) Yeah. Yeah.
Otto Kekäläinen (20:00) puts that on top of the Git history. And then that branch is merged onto the Debian branch. So if you use this structure with this tool, you can see exactly all the upstream Git commits, exactly what happened in them. And you can see what the difference is between the contents of Git at release time and what was in the tarball. And still it’s the tarball contents that wins and goes into Debian, because if it exists and it’s used, there’s a reason to do it. And in this case it’s very interesting, because we have upstream commits that introduce part of the backdoor. So there are two binary test files that have extra stuff in them, and those were permanently committed to the upstream git repository. And then there is the upstream tarball release, which has all these makefiles, which is…
If you use Automake or Autotools, it’s common to have these extra makefiles. I think it’s kind of stupid, but that’s a historic thing. And some of those makefiles had a custom modification. And when I was checking this, it’s not like it’s one file that has a couple of lines changed. It’s a completely new file. So it’s really hard to notice that that file is not the normal Autotools file at this version, what it’s supposed to be. So I did compare it to other packages with the same file, and I can see from that diff what the modifications are, but to discover that is really, really hard. The only tiny thing, and that’s my conclusion here, that might have caught the eye of a very, very diligent person is that this Autotools makefile had a very high version number. It was saying 33 when the expectation should be like three or four or five. But spotting that, that there’s a 33 instead of a three, is really hard. So I wouldn’t be blaming anyone here. I would say that this was a very well executed and hidden backdoor, and mainly a failure of the upstream project, and what we should focus on here is maybe giving more support to the upstream.
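The branch layout Otto describes maps onto git-buildpackage roughly like this (tag and branch names are the common defaults, not taken from the post):

```shell
# The debian branch carries debian/; the upstream branch carries tarball
# imports. --upstream-vcs-tag ties the tarball import to the upstream git
# tag, so the tarball-vs-git delta becomes a single reviewable merge:
gbp import-orig --uscan --upstream-vcs-tag=v5.6.1

# Afterwards, what shipped in the tarball but was never committed to
# upstream git (e.g. generated Autotools files) is simply:
git diff v5.6.1 upstream
```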
Josh Bressers (22:40) Yes, yes. And in fact, towards the end you have a section called, I like how you say, not detectable with reasonable effort. Because it is theoretically detectable, but I think reasonable effort is especially important in the context of really any open source project, especially Debian, because you’re working with volunteers, fundamentally, right? These aren’t people that are going to spend hundreds and hundreds of hours poring over diffs. They’re like, whatever, it builds, it ships. We’re done here, guys, you know?
Otto Kekäläinen (23:10) Yeah, I wouldn’t blame the volunteers. Actually, I’ve worked in large corporations, and I would say that some of the volunteers in Debian have a way higher sense of quality and higher diligence than somebody who’s just doing it for their paycheck. So I don’t think that’s the part. But the part is that if you are taking a new upstream release, which can have thousands or tens of thousands of lines of code changed, and you put that into Debian, there’s no way you can review all of those. And what if there’s a new release every second month? You’re not going to be reviewing 10,000 lines of code every two months just for the theoretical issue that some of those lines might be bad. And also, when you review that code, there are things that are bad. Then how do you distinguish what is bad because of quality issues, that’s the bucket term I’m using here, and what is bad because it’s actually malicious, like it’s intentionally bad and is being used to do something.
Josh Bressers (24:17) Yeah, yeah, for sure, for sure. And so, okay, I want to jump to kind of the end, because we’re running ourselves out of time on this discussion. But I think the most interesting aspect of everything you wrote, and the most reasonable thing to do for anyone around XZ, is that you point out that Debian doesn’t have as many shared workflows, and as much reusability in some of their build environments, as they maybe could. Obviously, more reproducibility, which is a whole discussion in itself of how practical that might even be in many instances. But obviously, when things are more similar, we can use more of these tools you point out and look at those results. And I think you brought up LLMs at one point in here as a way to review some of this stuff, which I do agree with. I think that’s a fair bit of advice.
Otto Kekäläinen (25:14) Yes, so let me start with the LLM part first. So this Diffoscope tool I spoke about, it has a mode where it can output markdown. And that markdown goes very nicely into an LLM. And you can ask an LLM: do you spot anything malicious in this? And I actually tried doing that, but it didn’t spot anything. But maybe in the future, if they are smarter, they could be used to contribute extra reviews that then help humans to
Josh Bressers (25:24) Yes.
Otto Kekäläinen (25:45) focus on suspicious things. And yes, my conclusion, and what I’m worrying about, is that the basic problem we’re trying to solve here is: how do we make efficient code reviews? How do we scale the code reviews? And if you have a project that has widely different practices for how you can do this one thing, then you’re not going to get efficient code reviews, because every time somebody is code reviewing something, they have to select from like 10 different options what is the style being used here. And then energy goes into deciphering what the differences between the styles are, and things like that, and you have less energy to focus on the actual code changes. And also, I think humans are very good at recognizing patterns. It’s built into our brains. So in places where the workflows are more uniform, it’s way easier for people to spot what is unusual in one case, because it stands out. But if the baseline is too varied, then things won’t stand out, and the reviews are slower and less correct.
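The diffoscope-to-LLM step Otto mentions can be sketched like this (the output file name is illustrative):

```shell
# Render the structural diff as markdown, easy to paste into an LLM
# prompt as a second-opinion review aid:
diffoscope --markdown report.md old.tar.xz new.tar.xz
```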
Josh Bressers (27:04) Yeah, yeah, for sure, for sure. I love how, as humans, sometimes our pattern matching ability is our greatest strength and sometimes it’s our greatest weakness. It is always amusing. Okay. So I guess here’s the thing I also appreciate about your blog post, Otto: instead of making up some crap about how we could have detected this by doing some ridiculous thing that no one’s actually ever going to do,
I think you just have a very level-headed approach of, like, this was a really good attack. No one caught it. And well, I mean, I guess we did catch it, but maybe not for the right reasons. But just the fact that it is, it should be expected that an attack like this gets through, right? I mean, that’s just the sad truth of the whole situation.
Otto Kekäläinen (27:48) Yeah, I think, with the supply chain thing, if the upstream itself has already been taken over by a malicious maintainer, it’s really hard to catch that at the distribution level. So what I would hope to see is Debian developers, and all the distribution people who are packaging some upstream, engaging more with that upstream and collaborating with it.
And then maybe be able to detect cultural shifts or something, like: did the upstream burn out, and now there’s suddenly a new person? And maybe, if the packaging part has very good workflows and we have more and more automation, maybe the Linux distribution packaging person could contribute more to upstream and be reading the upstream commits. And in this case, maybe if you are looking at upstream and you see that new binary files are added to the test suite in a commit, but that commit doesn’t actually add any tests at all, only the binary files, then that should ring bells. But we need more people looking at the upstream commits themselves. And maybe the Linux distribution packagers can go in there, or maybe, if they don’t have the resources to actually do that, then maybe they can kind of assess different upstreams on their maturity and health, and maybe call out if there are very widely used and important upstreams that are not getting maintained properly. And then, I know we have great programs by the Linux Foundation and others that nowadays give out funding to key projects to keep them rolling by the original creators.
Josh Bressers (29:47) Yeah, yeah, I mean, I get it. And I chuckled a little when you talked about under-resourced upstreams, which I think is every upstream at this point. I can’t think of any that I would say are properly resourced.
Otto Kekäläinen (30:03) Yeah, but that’s also a tricky question, because you can also go into any corporate world anywhere and ask any department, and everyone’s going to say that they don’t have enough resources. So it’s not just about adding resources; it’s about defining what is the process we need to run, and is that process running? So it is complex, and well, that’s also one of the reasons why I like doing this, because it’s
Josh Bressers (30:21) That’s fair.
Otto Kekäläinen (30:33) It’s rewarding to be juggling these parts: certain things happen because of technical reasons, but then there’s a lot going on that happens without any technical reason at all, just because of the culture, and how humans are collaborating, and what the background for things is, and how they evolved into being in that state. And then juggling what is actually a change in technology, and what is the change that we need in maybe how we are collaborating, or how the humans are running some part of the process.
Josh Bressers (31:12) Sure, sure, okay. I’m gonna call one thing out at the bottom of this post and then I’m gonna let you take us home. So at the very bottom it says if your organization is using Debian or derivatives, which we kind of all are, and you are interested in sponsoring my work to improve Debian, my being Otto, not Josh, please reach out. So I’ll have Otto’s links to various contact things in the show notes. If anyone’s interested, by all means reach out. But Otto, the floor is yours, take us home. What do you want us all to know? What should we do next?
Like what, what’s going on and, and yeah, give it, give us the inspiring conclusion we yearn for.
Otto Kekäläinen (31:50) I think what’s inspiring here is that all of these commits and everything that happened here is fully transparent. So yes, it happened, and we are looking at it after the fact. But I think it’s really cool that we can, like a year later, still be looking at every single commit and drill into this as much as we want. It’s not hidden by some bigger organization that’s not giving out any details. And also the investigation part is not monopolized by some government authorities doing the investigation; anyone can drill into these things and look at it. And I think, yes, it’s scary what happened and what almost got out there, but at the same time, I feel trust in the system, because we can look at all of that. This happened, and anyone can step up and participate in improving this. And as you said, I’m calling out here that if anyone wants to sponsor this kind of work: I quit my day job back in March, and now I’m focusing quite a lot of my time on Debian development, and this is an area that hasn’t had that much attention. So I’m trying to do something in this space.
Josh Bressers (33:12) Awesome. And thank you so much for the work, because you’re right, this is not a space that has a lot of attention, and it needs it. It desperately needs it. Okay, Otto, I just want to thank you. I mean, thank you so much for the time. I know you had some random weirdo reach out to you over email, and thank you so much for replying. I’ve learned a lot. Your blog post is amazing, your work is great, just thank you so much for everything.
Otto Kekäläinen (33:36) Thanks Josh, and thanks for your blog and your podcast. I’ve been listening to a bunch of episodes and it’s very interesting. Especially, people who are interested in this topic should go and listen to the episode about the Eclipse Foundation and their SBOM and their management of their projects at scale.
Josh Bressers (34:02) For sure, for sure. All right, awesome. Otto, have a good one. Thank you so much.