Broken vulnerability severities

This blog post originally started out as a way to point out why the NVD CVSS scores are usually wrong. One of the amazing things about having easy access to data is you can ask a lot of questions, questions you didn’t even know you had, and find answers right away. If you haven’t read it yet, I wrote a very long series on security scanners. One of the struggles I have is there are often many “critical” findings in those scan reports that aren’t actually critical. I wanted to write something that explained why that was, but because my data took me somewhere else, this is the post you get. I knew CVSSv3 wasn’t perfect (even the CVSS folks know this), but I found some really interesting patterns in the data. The TL;DR of this post is: It may be time to start talking about CVSSv4.

It’s easy to write a post that makes a lot of assumptions and generally makes up facts to suit whatever argument I’m trying to make (which is what the first draft of this was). I decided to crunch some data to make sure my hypotheses were correct, and because graphs are fun. It turns out I learned a lot of new things, which of course also means it took me way longer to do this work. The scripts I used to build all these graphs can be found here if you want to play along at home. You can save yourself a lot of suffering by using my work instead of trying to start from scratch.

CVSSv3 scores

Firstly, we’re going to do most of our work with whole integers of CVSSv3 scores. The actual scores have one decimal place, for example ‘7.3’. Using the decimal place makes the data much harder to read in this post, and the results using only integers were the same. If you don’t believe me, try it yourself.

So this is the distribution of CVSSv3 scores NVD has logged for CVE IDs. Not every ID has a CVSSv3 score, which is OK. It’s a somewhat bell-curve shape, which should surprise nobody.
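If you want to see how a distribution like this could be counted, here is a minimal Python sketch. It assumes the legacy NVD 1.1 JSON feed layout (CVE_Items → impact → baseMetricV3), which may not match the exact files or scripts used for the graphs, so treat the field names as assumptions.

```python
# Rough sketch: count NVD CVSSv3 base scores bucketed to whole integers.
# Assumes the legacy NVD 1.1 JSON feed layout (an assumption, not the
# author's actual script).
import json
from collections import Counter

def nvd_integer_scores(feed_path):
    """Return a Counter mapping integer CVSSv3 score -> number of CVEs."""
    with open(feed_path) as f:
        feed = json.load(f)

    counts = Counter()
    for item in feed.get("CVE_Items", []):
        metric = item.get("impact", {}).get("baseMetricV3")
        if not metric:
            continue  # not every CVE ID has a CVSSv3 score
        score = metric["cvssV3"]["baseScore"]
        counts[int(score)] += 1  # 7.3 is counted as 7
    return counts

if __name__ == "__main__":
    print(nvd_integer_scores("nvdcve-1.1-2020.json"))
```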

CVSSv2 scores

Just for the sake of completeness, and because someone will ask, here is the CVSSv2 graph. This doesn’t look as nice, which was one of the problems with CVSSv2: it tended to favor certain scores. CVSSv3 was built to fix this. I simply show this graph to point out progress is being made; please don’t assume I’m trying to bash CVSSv3 here (I am a little). I’m using this opportunity to explain some things I see in the CVSSv3 data. We won’t be looking at CVSSv2 again.

Now I wanted something to compare this data to. How can we decide if the NVD data is good, bad, or something in the middle? I decided to use the Red Hat CVE dataset. Red Hat does a fantastic job capturing things like severity and CVSS scores; their data is insanely open, it’s really good, and it’s easy to download. I would like to do this with some other large datasets someday, like Microsoft, but getting access to that data isn’t so simple and I have limited time.
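As a rough idea of what pulling the Red Hat data looks like, here is a small sketch against their public Security Data API. The endpoint and field names ("threat_severity", "cvss3_base_score") are my assumptions from the public API documentation, not a guarantee of how the graphs in this post were built, so check the API before relying on them.

```python
# Minimal sketch of fetching one CVE from Red Hat's Security Data API.
# Endpoint and field names are assumptions; verify against the API docs.
import requests

RH_API = "https://access.redhat.com/labs/securitydataapi/cve"

def redhat_cve(cve_id):
    """Fetch Red Hat's severity and CVSSv3 base score for a single CVE ID."""
    resp = requests.get(f"{RH_API}/{cve_id}.json", timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return {
        "severity": data.get("threat_severity"),  # e.g. "Moderate"
        "cvss3_score": data.get("cvss3", {}).get("cvss3_base_score"),
    }

if __name__ == "__main__":
    print(redhat_cve("CVE-2020-10684"))
```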

Red Hat CVSSv3 scores

Here are the Red Hat CVSSv3 scores. It looks a lot like the NVD CVSSv3 data, which given how CVSSv3 was designed, is basically what anyone would expect.

Except, it turns out, it’s kind of not the same. If we subtract the Red Hat score from the NVD score for every CVE ID and graph the result, we get something that shows NVD likes to score higher than Red Hat does. For example, let’s look at CVE-2020-10684. Red Hat gave it a CVSSv3 score of 7.9, while NVD gave it 7.1. This means in our dataset the difference would be 7.1 – 7.9 = -0.8.
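The calculation itself is simple. Here is a sketch of it, assuming you already have two dicts mapping CVE IDs to CVSSv3 scores (one built from NVD, one from Red Hat); those dicts are a stand-in for however you load the data.

```python
# Sketch of the per-CVE difference (NVD score minus Red Hat score).
def score_differences(nvd_scores, redhat_scores):
    """Return {cve_id: nvd - redhat} for CVEs scored by both sources."""
    diffs = {}
    for cve_id, nvd in nvd_scores.items():
        rh = redhat_scores.get(cve_id)
        if rh is None:
            continue  # only compare CVEs scored by both
        diffs[cve_id] = round(nvd - rh, 1)
    return diffs

# CVE-2020-10684: NVD 7.1, Red Hat 7.9 -> 7.1 - 7.9 = -0.8
print(score_differences({"CVE-2020-10684": 7.1}, {"CVE-2020-10684": 7.9}))
```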

Difference between Red Hat and NVD CVSSv3 scores

This data is more similar than I expected. About 41 percent of the scores are within 1 of each other. The zero bar doesn’t mean they match; very few match exactly. It’s pretty clear from that graph that the NVD scores are generally higher than the Red Hat scores. This shouldn’t surprise anyone, as NVD will generally err on the side of caution, where Red Hat has a deeper understanding of how a particular vulnerability affects their products.
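If you want to check a number like “41 percent within 1” yourself, a tiny sketch on top of the diffs dict from above would look something like this (the function name is just illustrative):

```python
# Fraction of CVEs whose NVD and Red Hat scores are within `window` of
# each other, given the diffs dict from the previous sketch.
def fraction_within(diffs, window=1.0):
    close = sum(1 for d in diffs.values() if abs(d) <= window)
    return close / len(diffs) if diffs else 0.0
```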

Now, by itself, we could write about how NVD scores are often higher than they should be. If you receive security scanner reports you’re no doubt used to a number of “critical” findings that aren’t very critical at all. Those ratings almost always come from this NVD data. I didn’t think this data was compelling enough to stand on its own, so I kept digging: what other relationships existed?

Red Hat severity vs CVSSv3 scores

The graph that really threw me for a loop was the Red Hat CVSSv3 scores graphed against the Red Hat assigned severity. Red Hat doesn’t use the CVSSv3 scores to assign severity; they use something called the Microsoft Security Update Severity Rating System. This rating system predates CVSS and in many ways is superior, as it is very simple to score and simple to understand. If you clicked that link and read the descriptions, you can probably score vulnerabilities using this scale now. Learning how to use CVSSv3 will take a few days to get started and a long time to be good at.

If we look at the graph we can see the low severity issues are generally on the left side, moderate in the middle, and high toward the right, but what’s the deal with those critical flaws? Red Hat’s CVSSv3 scores place them in the moderate to high range, but the Microsoft scale says they’re critical. I looked at some of these; strangely, Flash Player accounts for about 2/3 of those critical issues. That’s a name I thought I would never hear again.

The reality is there shouldn’t be a lot of critical flaws; they are meant to be rare occurrences, and generally are. So I kept digging. What is the relationship between the Red Hat severity and the NVD severity? The NVD severity is based on the CVSSv3 score.
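The mapping NVD uses is the standard CVSS v3.x qualitative rating scale, so the severity label falls straight out of the base score. A small sketch of that mapping, using the published bands:

```python
# CVSS v3.x qualitative severity bands, as used for NVD severity labels.
def nvd_severity(score):
    if score == 0.0:
        return "NONE"
    if score <= 3.9:
        return "LOW"
    if score <= 6.9:
        return "MEDIUM"
    if score <= 8.9:
        return "HIGH"
    return "CRITICAL"

print(nvd_severity(7.1))  # "HIGH"
```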

This is where my research sort of fell off the rails. The ratings provided by NVD and the ratings Red Hat assigns have some substantial differences. I have a few more graphs that help drive this home. If we look at the NVD rating vs the Red Hat ratings, we see the inconsistency.

NVD severity vs Red Hat severity

I think the most telling graph here shows that the Red Hat low vulnerabilities are mostly rated medium, high, and critical by the NVD CVSSv3 scoring. That strikes me as a problem. I could maybe understand a lot of low and moderate issues, but there’s something very wrong with this data. There shouldn’t be this many high and critical findings.
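A comparison like that is just a cross-tabulation of the two severity labels. Here is one way it could be sketched, again assuming dicts keyed by CVE ID rather than any particular script from this post:

```python
# Sketch: count (Red Hat severity, NVD severity) pairs for shared CVE IDs.
from collections import Counter

def severity_crosstab(redhat_sev, nvd_sev):
    """Return a Counter of (red hat severity, nvd severity) pairs."""
    pairs = Counter()
    for cve_id, rh in redhat_sev.items():
        nvd = nvd_sev.get(cve_id)
        if nvd:
            pairs[(rh.lower(), nvd.lower())] += 1
    return pairs
```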

Red Hat severity vs CVSSv3 scores

Even if we graph the Red Hat CVSSv3 scores for just their low severity issues, the graph doesn’t look like it should, in my opinion. There’s a lot of scoring that’s a 4 or higher.
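That last graph is just the integer-score histogram again, restricted to the CVEs Red Hat rates low. A sketch, reusing the same assumed dicts as the earlier examples:

```python
# Sketch: integer CVSSv3 score histogram for only the CVEs Red Hat
# rates "low". Input dicts are assumptions, as in the earlier sketches.
from collections import Counter

def low_severity_score_histogram(redhat_sev, redhat_scores):
    counts = Counter()
    for cve_id, sev in redhat_sev.items():
        score = redhat_scores.get(cve_id)
        if sev.lower() == "low" and score is not None:
            counts[int(score)] += 1
    return counts
```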

Again, I don’t think the problem is Red Hat or NVD; I think they’re using the tools they have the best they can. It should be noted that I only have two sources of data, NVD and Red Hat. I really need to find more data to see if my current hypothesis holds, and to determine whether what we see from Red Hat is repeated, or whether Red Hat is an outlier.

There are also more details that can be dug into. Are there certain CVSSv3 fields where Red Hat and NVD consistently score differently? Are there certain applications and libraries that create the most inconsistency? It will take time to work through this data, and I’m not sure how to start looking at it just yet (if you have ideas or want to try it out yourself, do let me know). I view this post as the start of a journey, not a final explanation. CVSS scoring has helped the entire industry. I have no doubt some sort of CVSS scoring will always exist and should always exist.

The takeaway here was going to be an explanation of why the NVD CVSS scores shouldn’t be used to make decisions about severity. I think the actual takeaway now is that the problem isn’t NVD (well, they sort of are), but the real problem is CVSSv3. CVSSv3 scores shouldn’t be trusted as the only source for calculating vulnerability severity.

Episode 198 – Good advice or bad advice? Hang up, look up, and call back

Josh and Kurt talk about the Krebs blog post titled “When in Doubt: Hang Up, Look Up, & Call Back”. In the world of security there isn’t a lot of actionable advice; it’s worth discussing whether something like this will work, or even if it’s the right way to handle these situations.

Show Notes

Show Tags

  • #phishing
  • #spearphishing

Comment on Twitter with the #osspodcast hashtag

Episode 197 – Beer, security, and consistency; the newer, better, triad

Josh and Kurt talk about what beer and reproducible builds have in common. It’s a lot more than you think, and it mostly comes down to quality control. If you can’t reproduce what you do, you’re not a mature organization and you need maturity to have quality.

Show Notes

Episode 196 – Pounding square solutions into round holes: forced updates from Ubuntu

Josh and Kurt talk about automatic updates. Specifically we discuss a recent decision by Ubuntu to enable forced automatic updates. There are lessons here for the security community. We have a history of jumping to solutions rather than defining and understanding problems. Sometimes our solutions aren’t the best. Also murder bees.

Show Notes

Episode 195 – Is BGP actually insecure?

Josh and Kurt talk about the uproar around Cloudflare’s “Is BGP safe yet” site. It’s always interesting watching how much people will push back on new things, even if the new thing is probably a step in the right direction. The clever thing Cloudflare is doing in this instance is they are making the BGP problem something anyone can understand. Also send us your funny dog stories.

Show Notes

Show Tags

  • #BGP

Episode 194 – Working from home security: resistance is futile

Josh and Kurt talk about the new normal that’s working away from an office. It’s not exactly working from home, as there are some unforeseen challenges around things we just took for granted in the past. There are a lot of new and strange security problems we have to adapt to; everyone is doing amazing work with very little right now.

Show Notes

Episode 193 – Security lessons from space: Apollo 13 edition

Josh and Kurt talk about space. We intended to focus on Apollo 13 but as usual we have no ability to stay on topic. There are a lot of fun space discussions in this one though. Do you think you can hack Voyager 1? Only if you have a big enough satellite dish.

Show Notes

Episode 192 – Work without progress – what Infosec can learn from treadmills

Josh and Kurt talk about Kurt’s recent treadmill purchase and the lessons we can learn in security from the consumer market. The consumer market has learned a lot about how to interact with its customers in the last few decades; the security industry is certainly behind in this space today. Once again we display our ability to tie even the seemingly mundane things back to a discussion about security.

Show Notes

Episode 191 – Security scanners are all terrible

Josh and Kurt talk about security scanners. They’re all pretty bad today, but there are some things we can do to make them better. Step one is to understand the problem. Do you know why you’re running the scanner and what the reports mean?

Show Notes

Who are the experts

These are certainly strange times we are living in. None of us will ever forget what’s happening, and we will all retell stories for the rest of our days. Many of us asked “tell me about the Depression, grandma”; similar questions will be asked of us someday.

The whirlwind of confusion and chaos got me thinking about advice and who we listen to. Most of us know a staggering number of people who are apparently experts in immunology. I have no intention of talking about the politics of the current times; goodness knows nobody in their right mind should care what I think. What all this does have me pondering is what experts are, and how we can decide who we should listen to.

So I’ve been thinking a lot about “experts” lately, especially in the context of security. There have been a ton of expert opinions on how to work from home, how to avoid getting scammed, and which video conferencing software is the best (or worst). There are experts everywhere, but which ones should we listen to? I’m not an expert in anything, but there are some topics I know enough about to question some of these “experts”.

It seems like everyone has something to say about almost everything these days. It feels a bit like the market outside the train station. Whatever you need, someone is selling it, but you better buy it fast because everyone else also wants one!

I have a tweet from a few weeks ago, when I really started to think about all this. I called it “distance to the work”.

The basic idea is: if someone is trying to present themselves as an expert on a topic, how close are they actually to the topic? One of my favorite examples is when I see talks about DevSecOps. I’ve known people who have given DevSecOps talks who have never been developers or system administrators, or worked in the field of security. In my mind you aren’t qualified to impart knowledge you don’t have. There are certain ideas they can grasp and understand, but part of being an expert at something is having done it, often for a long time. Would you let someone operate on you because they thought about the problem really hard and decided they are now a surgeon? Of course not!

So this brings us to a place where we have to start deciding who we should be listening to. I like to break people up into a few groups in my mind when deciding if they should be listened to.

  1. Have they ever done actual work in this space?
  2. Do they have a history of doing work in this space, but aren’t currently?
  3. Are they doing work in this space now?

It’s not hard to see where I’m going with this. I think we all know people who fall into every group. It’s very related to my distance to the work idea. If someone has never done the work, I’m not going to consider them an expert. One of the poster children for this is whenever someone titles themselves a “thought leader”. That’s usually doublespeak for “I have no idea what I’m doing, but I have very nice clothes and speak very well”. For a number of these people, their primary skill is speaking well, so they can sound smart, but they can’t fool the real experts.

There are also groups of people who did a lot of work in a space long ago, but aren’t very active now. An easy example here would be the Apollo astronauts. Are these people experts on going to the moon? Yes. Are they experts on space? Yes. Would I trust them to help build a modern day rocket? Probably not.

There are plenty of parallels here in any industry. There are plenty of people who did amazing things a decade ago, but if you look at what they’ve done recently, a resume of “talking about the awesome thing I did a decade ago” doesn’t make them an expert on modern day problems. Look at what people are doing now, not what they did.

And lastly we have our group of people who are actually doing the work. These are the people who are making a real difference every day. Many of these people rarely talk about what they do; many don’t have time because they’re busy working. I find there are two challenges when trying to listen to the people doing the real work.

Firstly, they’re usually drowned out by others making more noise. If your job is getting attention, your incentive is, well, getting attention. When your job is doing technical tasks, you’re not going to fight for attention. This means it’s up to us as the listeners to decide who is full of gas and who can teach us new things. It’s a really hard problem.

The second problem is finding the people doing the work. They aren’t going to a lot of conferences. They’re usually not publishing blog articles 😎. You won’t find them on social media with millions of followers. A lot actively avoid attention for a variety of reasons. Some don’t have time, some got burnt and don’t want to stick their necks out, some just don’t want to talk to anyone else. The reason is unimportant; it is what it is.

I could end this one with some nonsense about getting outside your comfort zone and making more effort to encourage others to talk about what they’re doing, but I don’t want to. If people don’t want to give talks and write blogs, great; I’m tired of seeing an industry that bases success on how many conferences you attend each year. My suggestion this time is to just look around. You are working with people who are making a real difference. Find them. Talk to them (don’t be a pest). Go learn something new.