In this episode I chat with the authors of a recent paper on open source security: Open Source, Open Threats? Investigating Security Challenges in Open-Source Software. I chat with Ali Akhavani and Behzad Ousat about their findings. There are interesting data points in the paper such as a 98% increase in reported vulnerabilities compared to a 25% growth in open source ecosystems. We discuss the challenges of maintaining security in a rapidly expanding digital landscape, and learn about the role of community engagement and automated tools in addressing these discrepancies. It’s a great paper and a fantastic discussion.
Episode Links
- Open Source, Open Threats? Investigating Security Challenges in Open-Source Software
- Ali Akhavani
- Behzad Ousat
This episode is also available as a podcast, search for “Open Source Security” on your favorite podcast player.
Episode Transcript
Josh Bressers (00:00) Today, Open Source Security is talking to Behzad Ousat, a final-year PhD student and researcher in web security at Florida International University, and Ali Akhavani, a final-year PhD student and researcher in web security and privacy at Northeastern University in Boston. And the two of you wrote a paper
about open source security, and I reached out and asked you to come on the show, and so you're both here. So, holy cow, thank you so much. I'll just give you both the floor: tell us what you've done and we'll go from there. I'll also give a warning to the two of you and to the audience: I have so many notes about this paper. I'm so excited. All right, anyway, I'm done for a minute. Tell us what you've done.
Behzad (00:21) Yeah.
Behzad (00:38) You go first, Ali.
Ali (00:40) Yeah, first of all, thank you for the opportunity to talk about this serious matter, because when we started, we didn't know how interesting the results were going to be. But the more we did, we found out, okay, some of these numbers don't sound right to us. We saw some interesting results while doing the research that made us more and more curious about this topic. So, yeah, the whole work is about open source
security and how it has evolved over time. Every time people talk about open source, they have this assumption that if the source is open, it's probably safe to use, because a lot of people can look at the source code, a lot of people have used it, and it sounds good to use. But the reality is not like that. There are a lot of security flaws, intentional and unintentional, across the open source world,
and it got us curious to look into it. We had a couple of interesting examples that made us look into this matter, which we can talk about a little later; that was the motivation for this work. But yeah, Behzad, if you want to say something, you can add to that.
Behzad (01:57) Yeah,
just thank you for reaching out. It's a pleasure being here. As Ali said, we started doing one thing and found there was a lot to talk about in the area, so the paper changed a couple of times during our analysis, along with the direction we were going. But at the end of the day, we tried to get a big picture of what is happening and
what the good practices and bad practices are, and where the points of improvement are in the area.
Josh Bressers (02:32) Awesome, awesome. Okay. I just realized we haven’t said the title of your paper yet. So let’s start with that. It’s titled, Open Source, Open Threats, Investigating Security Challenges in Open Source Software. And I’ll ruin the ending for everyone listening, but kind of the key takeaway here was that the rate at which vulnerabilities are being disclosed in open source software is growing faster than the rate open source software is growing.
And I think it's, what, a 98% increase in vulnerabilities against a 24 or 25% increase in open source, is what the key finding was. Okay. Now let me pull up the bit you said; there's a comment you had exactly on this. I have too many notes here. I don't want to misquote you. Okay, here it is.
Here's the sentence I want the two of you to defend before we go on: you say that the 98% growth in vulnerabilities against the 25% growth in packages indicates a deteriorating security posture in open source ecosystems. Explain that conclusion to me, because I have a feeling most open source people are going to read that and take issue with it, we'll say.
Ali (03:35) Alright.
So the first thing everyone needs to know is that the goal of this paper is not to grill the open source community, because the whole internet and the whole software world did not and cannot exist without open source, and every single person in this world has used I don't know how many open source packages in their life,
Josh Bressers (04:09) Yes.
Yes.
Ali (04:30) knowingly or without even knowing it, because it's been there as a dependency or something. Our aim is to shed light on some of the issues that exist, try to find solutions, and see how we can fix the issues out there. But yes, we looked at vulnerability reports from two different sources, which are
really well known in the community and really useful. One of them is the GitHub Advisory Database, and the other is Snyk, a company that does a really good job reporting advisories and a lot of interesting security work. Looking at the numbers there, we tracked the vulnerability reports we see over time, from 2017 to the current date.
We see a 98% annual increase in the number of reports. But if you look at how the whole open source world is getting bigger each year, that number is not as big; it's around 25%. But it doesn't necessarily mean that everything is becoming less secure and more vulnerabilities are happening. One argument might be that people are more interested in this topic now.
They are putting more effort into finding vulnerabilities, so it's actually a good sign: we see more reports because people are more interested. That's where the other results we looked into come in. We cannot say the open source community is doing a better job just because they really are doing a better job at reporting, because the pace of fixing is not keeping up:
we see that the lifespan of the vulnerabilities is not getting shorter. If there are more people working in this field, and the reports are growing because people are more active, the expectation at some point would be that every vulnerability found also gets fixed faster. But that's not what we saw: the lifespan is also growing. Yeah.
Josh Bressers (06:27) Yeah.
Yes.
Yes. Yes. And I want to echo what you said at the beginning: this isn't a hit piece against open source. Five or ten years ago I would have accused you of being funded by someone like Microsoft, writing a hit piece against open source. At this point, open source has won so thoroughly. It just is what it is. There's nothing anyone could say at this point that would make open source go away. It's over, it won. But okay.
Josh Bressers (07:13) Okay. So, yeah, let's talk about those numbers. In your report, you said there are 31,267 vulnerabilities from GitHub and Snyk. So I'm curious: GitHub is an open dataset, but what about Snyk? As researchers, do they let you work with the data? I have no idea how this works.
Behzad (07:35) I can actually answer that. The alternative to this was to go through the CVEs and the NVD dataset. But the good thing about Snyk was that it provides categorization per package manager, per language, and more detail compared to the other resources we were comparing,
and the problem with Snyk was that it was a subset of everything that is happening. That's where the GitHub advisories came in for us: the amount of content we had from the advisories is larger than Snyk's, and they are also peer-reviewed, so that's an important note we had to take.
Josh Bressers (08:28) Yeah, yeah, for sure. And I'm sure you both know this, but the audience maybe doesn't: you're working with 31,000 vulnerabilities, and right now, so I track open source packages, there's a service called Ecosystems, ecosyste.ms. Andrew Nesbitt runs it; he was a guest on the show a couple of episodes ago. Andrew is tracking over 10 million open source packages. So 31,000 versus 10 million is…
Ali (08:49) Mm-hmm.
Josh Bressers (08:56) a hilariously skewed number. I want to ask, and I'll ask it now: you're talking about a 98% growth, but this also feels like a startup saying "we have 400% growth this year" because revenue went from a thousand dollars to four thousand dollars, versus a couple of years later when revenue needs to go from $10 million to $20 million. That's a significantly different jump. And I'm curious:
do you think that is relevant in this context, just how lopsided these numbers are?
Behzad (09:34) I didn't quite catch the question. I mean, what are we comparing?
Josh Bressers (09:39) Well, okay. So you said there's a 24 or 25% growth in open source, right? And there is a 98% growth in vulnerabilities. But you're talking about a growth of 24% against 10 million packages versus a growth rate of 98% against 31,000 vulnerabilities, right? Those are extremely lopsided numbers, yes?
Behzad (09:43) Yeah, based on this dataset, yeah.
Yes?
Josh Bressers (10:07) Yeah, yeah. And I don't have a point to it; it's just more of an observation that it is easy to grow 98% when you are dealing with numbers that are, what, a thousand times smaller, you know what I mean?
Behzad (10:10) Ah, I see.
Mm-hmm.
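The base-size effect Josh is describing can be made concrete with a quick sketch. The numbers below are illustrative, chosen to roughly mirror the figures discussed in the episode, not taken from the paper itself:

```python
# Toy illustration of the base-size effect: the same percentage growth
# means very different absolute changes against small vs. large bases.
# All numbers are illustrative, not from the paper.

def growth_rate(old: float, new: float) -> float:
    """Percentage growth from old to new."""
    return (new - old) / old * 100

# Small base: tracked vulnerabilities roughly doubling.
vulns_before, vulns_after = 15_800, 31_267
# Large base: ~10M packages growing more slowly.
pkgs_before, pkgs_after = 8_000_000, 10_000_000

vuln_growth = growth_rate(vulns_before, vulns_after)  # ~97.9%
pkg_growth = growth_rate(pkgs_before, pkgs_after)     # 25.0%

# Despite the much higher rate, the absolute change in packages
# dwarfs the absolute change in vulnerabilities.
print(f"vulnerability growth: {vuln_growth:.1f}%")
print(f"package growth:       {pkg_growth:.1f}%")
print(f"absolute deltas: {vulns_after - vulns_before:,} vs {pkgs_after - pkgs_before:,}")
```

This is the whole of Josh's point: a 98% rate on a base of ~31,000 is a far smaller absolute movement than a 25% rate on a base of ~10 million.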
Ali (10:21) Yeah. Something that needs to be mentioned here: our goal was not to focus on only one language or one package manager. The whole idea was to go over the famous ones and try to cover as many platforms as we technically could. So, in order to
Josh Bressers (10:22) That’s all.
Ali (10:50) be able to investigate all those issues. ⁓ Our hands were tied in just the sources that we could gather because we needed to find the advisory reports. First of all, the thing that we say everywhere is that, okay, this is not necessarily all vulnerabilities in the word that have been out there because a lot of them are unreviewed and they do not have enough tags to be able to be investigated in our study.
So there were a few measures that we needed for each report to ⁓ contain. we were able to just, okay, deep dive into that part. So that’s why we chose the GitHub advisory review data sets, which is ⁓ available to the public and everywhere. Also, Snyk also has a lot of tags that we were also looking for to just be able to just do this study over all the platforms in a unified format. And that was really important for us.
Behzad (11:16) Exactly.
Ali (11:45) But for the number of packages in each package manager, it's pretty simple to get that data, because almost all package managers say on their website how many packages exist at the moment. Or there are third-party services that have scraped the package managers and report how many packages exist in each ecosystem at a given time.
The historical data was not easy to find, though, so we had to use the Wayback Machine and go back through the homepages of the different platforms to see how many packages existed in each one at each point in each year. Eventually we were able to find that data. But for the vulnerability reports, finding a unified dataset that contains everything that makes it easy for us
to investigate all the issues, and to collect something useful for everyone to take away from, that's very difficult. So that's where we had to rely on the GitHub reviewed advisory dataset and Snyk. Yeah.
Josh Bressers (12:55) Yeah, yeah, I work with vulnerability data every day and I know exactly what you’re talking about. It is incredibly frustrating and the data is terrible.
Behzad (13:04) Yeah, the
formatting of everything we see is different. Even in Snyk, which was our choice, we have seen some discrepancies here and there, and we tried to normalize things as much as possible.
Josh Bressers (13:21) That's a big job. Like, holy cow, I get it 100%. So, another thing you two did that I was very pleased by is that you covered malware in the ecosystems. Quite often we just talk about vulnerabilities in the sense of "I have a bug in this piece of code" versus "this is a package whose only job is to be malicious," and of course people like npm yank it. So tell us about that.
Behzad (13:22) Yeah.
Josh Bressers (13:50) What does that mean and why did you decide to include that?
Behzad (13:55) I can explain a little bit, and Ali can provide some interesting examples. To be honest, at first we didn't plan to go deep into malware, because we didn't plan to separate the intentional vulnerabilities, which are the intentionally malicious code, from the unintentional ones that just happen. But we saw that, especially for npm and the Python package manager,
in one of the graphs in the paper, a huge fraction of the vulnerabilities are in that category. So we decided it was worth spending time on it and going deeper into what is happening and why this is the case for these two specifically. Yeah, Ali can definitely complement this.
Josh Bressers (14:32) Yeah.
Ali (14:32) Yeah.
Yeah, so if you look at all the malicious package reports, 99% of those reports are covered by just npm and Python packages. There could be a lot of reasons for that, but we didn't want to jump to conclusions, because that was not the goal of our study.
Josh Bressers (15:01) Yeah.
Ali (15:14) One thing that could be assumed, and we tried to check it ourselves, is that publishing a package to npm is very easy. If you know what you're doing, it takes you less than five minutes to publish something, even less. We tried it and it's so simple. You can also get the provenance check mark if you want, which adds another badge next to your package, but it doesn't necessarily mean it's secure, because provenance is for another purpose.
Behzad (15:23) Yeah.
Josh Bressers (15:24) Yep.
Ali (15:43) Yeah, and as Behzad also mentioned, the direction of this work shifted multiple times from the beginning, because we would aim to look at something and then figure out, okay, no one is using this. For example, the idea that started this work was to look into the supply chain security of open source, not into the vulnerabilities themselves. Then we went and checked out npm, because it's one of the popular package managers,
Behzad (15:58) Yeah.
Ali (16:13) and we saw they introduced provenance a few years ago. And we said, okay, let's look into provenance and see if there's anything interesting we can find there. But we figured out that, I think, less than 10% of the total packages out there use provenance. Even in npm's provenance introduction post, they use the examples of three packages that had been targeted by an attack
that could have been prevented by provenance, and if you go check those three packages right now, two of them are still not using provenance, and provenance was introduced three years ago, I guess. So we were like, okay, there's no point in checking that, because people are not really interested in it. But to answer the question about the malicious packages, one important thing to remember here is that
Josh Bressers (16:53) Yeah, yeah.
Ali (17:08) if an advisory report says, okay, this is a malicious package, then that specific version and that specific vulnerability are intentional, but it doesn't mean that package has been out there forever to be malicious. There are a lot of examples where it's a legitimate package, thousands of people use it, but for a specific version, for some reason, the maintainer decided to put something malicious in there, and that specific version has been reported as a malicious package. And there was a very interesting example in our work, the node-ipc package, which right now has about 500,000 weekly downloads on the Node.js platform. For two specific versions, it was geolocating people in Russia and Belarus, corrupting their file systems, and replacing files with a heart emoji.
Josh Bressers (17:59) Yep.
Ali (18:05) But the whole package is pretty useful, because node-ipc is used in neural networks, microservices, but yeah, that's what happened.
Josh Bressers (18:10) Yeah, yeah.
For sure. And I'll add a bit of color to this, maybe. In your paper there's, I believe it's Table 3, I just screenshotted it so I don't remember exactly where it is in the report, but you show the number of packages in each ecosystem tagged with CWE-506, which is the malicious-software CWE. And Maven has one.
And I talked to Brian Fox, I think he's the CTO of Sonatype, and they run Maven Central. This is something he talked about: if you look at Maven, they made it hard to register projects on purpose, for exactly this reason. And he's convinced, and I suspect he's right, that that is why npm and PyPI, I always pronounce it wrong, have all this malware and Maven does not,
just because the bar to entry is high. Obviously attackers are opportunistic and they're going to go for the low-hanging fruit. So yeah, that's just more of a comment.
Behzad (19:17) Definitely,
yeah. I didn't know about that point you mentioned, but the test we did with npm was super easy. I mean, if we put ourselves in the place of the adversary, it's very straightforward: in less than five minutes you have whatever package you want out there.
Josh Bressers (19:35) Yeah, yeah.
Ali (19:37) But it also
makes it easier for people with good purposes too. I mean, if you have a package and you want to publish it, it's so easy. You don't have to go through the trouble of, I don't know, days and weeks to be able to publish something. It's a double-edged sword.
Josh Bressers (19:40) Yes.
Behzad (19:41) Yeah, definitely.
Josh Bressers (19:48) Right, right. Which is all
right. And that could be why npm has literally a hundred times more packages than anyone else, because it is easy. So, good and bad, I guess. But okay. The other data point I thought was crazy interesting: I don't think dwell time is the right word, you call it vulnerability lifespan, and the vulnerability lifespan is increasing.
Ali (19:55) Yeah, yeah, yeah.
Behzad (19:55) Definitely.
Josh Bressers (20:17) Right. You two tell us what that means. I'll let you go from here, because I'd do a bad job.
Behzad (20:23) Sure. So, as Ali mentioned earlier, this increase in the number of reported vulnerabilities does not necessarily mean we are in a bad position. Maybe people are actually using the detection tools out there, because every day you see a paper about tools to detect vulnerabilities, and things are getting better. So, to complement this, we tried to find out how much it
Ali (20:24) Behzad, do you want to talk about it?
Behzad (20:53) takes, how much time it takes for developers, for the maintainers to actually fix the packages. What we decided to do at first was to find like the discovery time, I would say. To find out whenever the package was kind of accidentally or kind of intentionally with a malicious code or kind of general term vulnerability to the point that it was discovered as vulnerable.
However, that was kind of that proved to be like a impossible task at least at the moment. Yeah. What we decided to kind of shift this kind of approach was to measure the time that it has been kind of the first version that has been kind of polluted with the vulnerability until the first version that has been fixed. That data is kind of available. So we define this as the lifespan.
Ali (21:26) Yeah.
Josh Bressers (21:27) Yes.
Behzad (21:48) and try to kind of in a standard form across the ecosystems measure that. That’s basically the kind of how we came up with this life span experiment.
Josh Bressers (22:00) Yeah, yeah. And I don’t know what to think of this one. I’m curious. Do either of you have opinions or thoughts on why this is or what it means or if it’s just an interesting data point we should not dwell on?
Behzad (22:15) The interesting part is definitely about, mean, rather than just taking a look at the number of the days that it has passed from the first volume version to the fixed version, the more interesting was the trend actually, because if you look at the trend, it’s actually increasing. So this is definitely not something that we can say there is that reason, maybe it’s a good thing. It’s definitely not a good thing that the trend for the time, the time that is volume
Josh Bressers (22:41) Yes.
Behzad (22:45) that the package is fixed is being kind of increased over the years.
Ali (22:51) ⁓ And so it’s very hard to just find the reason for the specific reason and pinpoint, okay, this is the reason that the lifespan is being increasing. But there’s a lot of factors out there that could play a major role. As you also mentioned at first, you said, okay, the number of packages that are out there is way more than the number of vulnerabilities.
Josh Bressers (22:58) Yeah.
Ali (23:15) And for a specific person to just go look into a package, look into the source code and figure out, okay, there’s an issue here. And then a lot of packages are not being actively maintained. So yes, there are a lot of open source packages that are very popular in the community and being pretty maintained very fast, but it’s not the same case with everywhere else. So there’s a lot of packages that just one developer developed it and just put it away and people might just…
Use it and every once in a year it might receive an update. And also the incentives that the people have for finding vulnerabilities in open source packages is totally different. For example, in a bug bounty process in big tech companies, they pay you a lot of money for just finding a bug in them. So if someone wants to invest time in just finding bugs or vulnerabilities,
Josh Bressers (24:02) Yeah.
Ali (24:09) If they have financial reasonings, that would make more sense for them because they are going to get a reward. But for the open source community, they only get the honor of just and the good feeling that, I’m helping millions of people out there because this package no longer has that issue.
Josh Bressers (24:30) That’s a good point. That’s fair. Yeah. Yeah. Okay. The other thing I wanted to ask about is one of your data points, you talk about the, what do you call it? Vulnerability concentration, where the number of vulnerabilities like per package on average is what you have. And for example, in, in, from the paper I copy and pasted composer had a 4.92 vulnerabilities per vulnerable package, but NPM is 1.75.
And I didn’t see in your paper anywhere. I might’ve missed it because I’m not going to lie. It’s a long paper. It’s pretty long, but it’s very dense also, which makes it difficult. But NPM is way bigger than Composer. And like, was there any sort of work done to understand like, does the size of the repository also affect the concentration or something like that?
Behzad (25:06) Yeah.
This isn't the concentration across the whole ecosystem, for example all of npm. This is just the number of vulnerabilities versus the number of packages we have seen as vulnerable. At first, as I mentioned earlier, when we were not separating out CWE-506, the intentionally malicious ones, the npm number was basically almost equal to one.
Josh Bressers (25:36) Sure, sure.
Behzad (25:54) yeah this is after kind of separating that and only focusing on the other unintentional vulnerabilities that has happened across the packages if we include that kind of intentional malicious packages the number i think is like 1.1 1.2 something like that
Josh Bressers (26:13) Interesting. Which I guess in by definition, a malicious package would be one, right? A repo of all malicious packages. Right, right. Cause one malicious pack equals, okay. wow. That’s interesting. So what that, so here’s my feeling here. And I would appreciate a gut check is in an ecosystem with a very low number, close to one, you’d be dealing with
Behzad (26:18) Yeah, yeah, Yeah, yeah, Yeah, exactly.
Ali (26:22) Yeah.
Josh Bressers (26:39) I suspect more of a drive-by environment where someone finds a thing, they report the thing, and they move on to doing whatever it is they’re doing versus if the number’s high, that’s probably researchers actually doing work because anyone who’s done this work, and I suspect you both have at some point, it is not that hard, like once you find one vulnerability, to find some more in the thing you’re looking at, because there often are like very, very similar bugs that you just have. I mean, literally grepping for patterns will often find the same bug, right?
Behzad (26:58) Exactly.
Yeah, definitely, it's both, for detection and also for patching the vulnerabilities. This concentration graph, and also the concentration of CWEs, the types of vulnerabilities across the ecosystems: that's where we tried to combine these two experiments to give a takeaway about how much effort it would take to do a best-effort approach.
We probably can't resolve all vulnerabilities, but given that a certain amount of work is going to be put into one of these ecosystems, what should we target? That's basically what we are trying to answer here.
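The concentration metric as Behzad describes it, reported vulnerabilities divided by distinct vulnerable packages per ecosystem, can be sketched like this. The advisory records are made up to mirror the shape of the metric, not real data:

```python
# Sketch of "vulnerability concentration": advisories per distinct
# vulnerable package, computed per ecosystem. Records are made up.
from collections import defaultdict

# (ecosystem, vulnerable package) -- one row per advisory.
advisories = [
    ("composer", "pkg-a"), ("composer", "pkg-a"), ("composer", "pkg-a"),
    ("composer", "pkg-b"),
    ("npm", "pkg-x"), ("npm", "pkg-y"),
]

def concentration(records: list[tuple[str, str]]) -> dict[str, float]:
    """Vulnerabilities per distinct vulnerable package, per ecosystem."""
    counts = defaultdict(int)    # ecosystem -> number of advisories
    packages = defaultdict(set)  # ecosystem -> distinct vulnerable packages
    for eco, pkg in records:
        counts[eco] += 1
        packages[eco].add(pkg)
    return {eco: counts[eco] / len(packages[eco]) for eco in counts}

print(concentration(advisories))  # composer: 4/2 = 2.0, npm: 2/2 = 1.0
```

A number near 1.0 matches Josh's "drive-by" reading (one report per package), while a higher number suggests repeated attention to the same packages.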
Josh Bressers (27:56) That’s, that is interesting. And that’s not something I’d thought of prior to this moment is yes, there are ecosystems that are going to be more susceptible to certain CWEs like cross-state scripting is a great one, right? I would expect Node.js and your research showed this Node.js is full of cross-state scripting problems, right? Obviously versus like Rust has no buffer overflows, you know, like obviously that’s just how it works.
Behzad (28:01) Mm-hmm.
Ali (28:22) Yeah. Yeah.
Josh Bressers (28:26) ⁓ that, yes, and so,
Behzad (28:29) The interesting
part is that everybody knows about the extreme cases, as you mentioned, for example, XSS in Node.js or the buffer is only in C or C++, let’s say. But what is happening in this middle range, for example, for past reversal, we didn’t know the answer and we have tried to shed light into that a little bit.
Ali (28:33) Yeah.
Josh Bressers (28:35) Yeah, yeah, yeah.
Yeah.
Yeah, yeah, yeah. Okay. This is one of those things where, now that someone says it, it's obvious, but you don't see it beforehand. This is why I love papers like this: there are things we assume that we can't prove, and then there are things we didn't even know we should have been assuming that are also true. Okay. So,
Ali (28:57) Yeah.
Josh Bressers (29:11) Let me pull up my notes. had one other thing I wanted to ask about, and then I will turn this over to both of you to take us home. And okay, right. I wanted to, this is more of just a data point than anything else I think. So you mentioned that in the GitHub advisories, there were the unreviewed, which is the things GitHub is not, well, I won’t say they don’t wanna track, but maybe a bunch of it is stuff they just don’t care about. Because obviously GitHub is never going to care about Windows vulnerabilities or things like that.
But what I found interesting is you had, I think, 17,000 GitHub advisories you worked with. CVE is at 300-and-change thousand vulnerabilities in total now, so it's not a gigantic number in that context. But GitHub is also a CNA, and they have assigned 8,000 vulnerabilities just by themselves. So it felt interesting to me that more than half of the data in their dataset
is from somewhere else that they have parsed. Which, again, I don't know if that has any relevance; it just amused me as a data point.
Ali (30:21) ⁓ yeah, the sometimes, ⁓ the unreviewed ones will eventually get reviewed, but because the number of reports is way more than the time that people can spend on, ⁓ just reviewing everything is just not easy. makes it not easy. Also these two obviously are not the only sources of finding out vulnerability reports, but, ⁓ to keep up, to keep the data set non-biased, we had to go.
for the data sets that we’re not only focusing on one specific language. Otherwise, when we are comparing these languages and platforms together, so it was very difficult to just come up with this decision, okay, these are the data sets that we are going to look at ⁓ and just be dumb. Okay, we’re not gonna look into anything else because we don’t wanna have the bias data set, but also we have enough data points to be able to just get conclusions.
Josh Bressers (31:19) That’s a good point. Yeah, yeah. This is the sort of thing I don’t have to worry about as a non-academic person because I can make wild accusations and no one cares. I don’t have an ethics board I have to go talk to. Okay. ⁓
Ali (31:28) Yeah.
Yeah. Unfortunately,
Behzad (31:34) Ha ha!
Ali (31:38) or fortunately, we do. And we have to convince a lot of people for that reason.
Josh Bressers (31:42) Right, right, exactly. Yes. Yes. That’s good though. Right? That’s good. That’s the point. We need more academic researchers looking at this. Okay. I’m going to give you to the floor. We’ve come to the end. I could dwell on, I could nitpick this paper to death on like just silly little things I maybe think I don’t agree with. There’s little things we could talk about forever, but I’m not going to do that. Anyone listening?
Ali (31:46) Yeah, that’s really good.
Josh Bressers (32:07) It’s 16 pages. It’s very consumable. is dense, but it’s a lot of fun. So I’ll make sure there’s a link in the show notes, but I will give Ali and Bezad, like the floor is yours, land this plane for us.
Ali (32:19) First of all, thanks again for the opportunity to ⁓ give us some chance to talk about this matter. It’s a really serious matter and it’s also huge. There could be millions of time spent on this research and just expand it more. And we are happy to just have other people reach out and say, okay, this is something also interesting. This is a good data set that I found or this is an idea of how to expand this.
because as I said, this is only a glimpse and we try to just be as broad as possible and try to compare it between ecosystems. And one thing that I could ⁓ see easily ⁓ implemented and just integrated into a lot of package managers is automatic code review when people are publishing the packages. Now that is the era of AI and everyone is using AI.
And AI for code review and finding vulnerabilities and doing code scanning easily can detect a lot of these, specifically those ⁓ intentionally malicious ones. Some of them are really obvious. So you don’t need to just do a lot of crazy things to figure out those things. Some of them are basic existing attacks, or just very basic. So yeah, if you…
Yes, of course, it makes the publishing of the packages a little slower, the CI CD process a little slower, but it’s not going to be huge with the rise of like AI and it could be used for these purposes too. But yeah, that’s another topic of discussion that could be done later. Yeah. Yep. And thanks again.
Josh Bressers (33:58) All right, Behzad
No, and I will say, AI can find XSS remarkably well, but there are also a lot of XSS bugs. So, all right, Behzad, take us home.
Behzad (34:00) Yeah
Yeah, definitely, just to add to what Ali said: we are seeing a ton of papers on detection, XSS, all kinds of things, using AI methods or non-AI methods, graph methods and all of this stuff. But I don't think we are actually using them, and they could easily be incorporated into these automated pipelines, because at the end of the day we are not going to have the manpower to
review all of open source; people don't have the incentives, and nobody can put that much effort into this matter. But if we can make this automatic, using AI or any of these methods, it's definitely going to help everyone. The other thing is to try to make people more aware that the npm install you are doing
can make a lot happen in the background, so you have to be more careful, even at the company-policy level. However you want to use it, it has to be, how do you say, more intentional, I would say. Yeah.
Josh Bressers (35:20) Yeah, yeah, for sure,
for sure. No, 100%. And I could probably name six guests I've had on in the last couple of months who talked about exactly that point: when you do an npm install, a pip install, whatever, things happen that you don't even know are happening. And this, again, back to Brian Fox with Maven: they purposely didn't allow this.
Behzad (35:28) Mm-hmm.
Yeah, yeah, All of them.
Ali (35:34) Yeah, anything, yeah.
Yeah.
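As an aside on the install-time point: npm does let you turn off packages' lifecycle scripts, either per command with `npm install --ignore-scripts` or globally in a `.npmrc` file. A minimal sketch of the config:

```ini
; .npmrc -- refuse to run packages' preinstall/install/postinstall scripts
ignore-scripts=true
```

This is a blunt instrument, since some packages rely on install scripts to build native addons, but it is one concrete way to make an install more intentional, as Behzad suggests.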
Behzad (35:43) Yeah that’s an interesting
point to be honest, I have to check that point that you mentioned. And thanks for kind of the time and the invitation.
Josh Bressers (35:47) Yeah, yeah, it’s a video.
No, this has been a treat. I mean, this is so much fun, I love it. Thank you for the work; I appreciate anyone doing this kind of research, because I think it's super important. It makes me feel better, because sometimes you feel like a crazy person no one will listen to, and it's like, see, someone actually did the work, I'm right, you know? So I'm really excited about that. But yeah, this is fantastic. Good luck in the future; I can't wait to talk to you both again, because I have high expectations for what comes next. So thank you so much.
Behzad (35:53) Yeah.
Ali (35:55) Yeah.
Yeah, yeah, yeah.
Behzad (36:19) Thank
Ali (36:19) Thank you so much too.
Behzad (36:19) you. Thank you for having us.