Part 2: Scanning the code

If you just showed up here, go back and start at the intro post; you’ll want the missing context before reading this article.

The first type of scanner we’re going to cover is the source code scanner. It seems fitting to start at the bottom with the code that drives everything. Every software project has source code. It doesn’t matter what language you use. Some is compiled, some interpreted, it’s all still source code. The idea behind a source code scanner is to review the code a human wrote and find potential security problems with it. This sounds easy enough in theory, but it’s extremely difficult in practice.

Statically typed languages (often loosely called strongly typed) like C, C++, and Java lend themselves to code scanning. An oversimplified explanation is that a statically typed language is one where a named variable has to be a certain type. For example, if I have a variable named “number” that is a number, I can’t assign a string to it. It can only be a number.

Dynamically typed languages (often loosely called weakly typed), such as JavaScript and Python, are incredibly difficult to scan properly. These are languages where I can assign the string “potato” to my variable named “number”. While dynamically typed languages offer great flexibility to developers, they are a nightmare for code scanners.
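To make the “potato” example concrete, here is a minimal Python sketch. The names number and double are made up for illustration. In Java the equivalent reassignment of an int variable to a string would be rejected by the compiler, while Python accepts it without complaint.

```python
# Python does not tie a name to a single type.
number = 42
print(type(number).__name__)   # int

number = "potato"              # perfectly legal rebinding
print(type(number).__name__)   # str


def double(value):
    # Works for ints and strs alike, with very different meanings.
    return value * 2


print(double(21))        # 42
print(double("potato"))  # potatopotato
```

A scanner reading double() has no declaration to lean on, so it has to consider every type that could reach that multiplication.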

I have software and nobody knows how it works

Software today is infinitely complex. That statement isn’t a joke, it really is infinitely complex. There is no limit to what computers, and by extension software, can do (this is tied to a concept called Turing completeness). An infinitely complex problem will have an infinitely complex solution. It’s important to keep in mind how big infinity is. Since humans can barely solve finite problems, it’s safe to say we can’t actually solve problems that are infinitely complex, even with a scanner. Now, just because you can’t solve a problem doesn’t mean you can’t make things better. There’s a lot of space between “solved” and “do nothing”.

So the real problem is that any software running today, in any environment, is so complex that nobody really knows how it all works. If you write software, you’re going to accidentally include security vulnerabilities. Finding those vulnerabilities is a nearly impossible task in many instances. One way to try to uncover some of them is, you guessed it, scanning the source code for security vulnerabilities.

Trying to scan for those flaws is a really, really hard problem, it turns out.

The only thing harder than writing secure software is writing a code scanner

So if software is infinitely complex, it’s safe to say building a scanner is more complex than infinity. I’m not sure what that is, but I’m comfortable assuming it’s really hard. Being able to scan code that can do anything is an incredibly difficult problem. Now, just because it’s really hard doesn’t mean we should do nothing, but it’s important we have reasonable expectations. When I point out shortcomings in something it doesn’t mean we should throw our hands up and declare the problem too hard to solve. This has been the default reaction in the security industry to many problems. It doesn’t work.

A code scanner isn’t going to catch all your bugs. It’s probably not going to catch half of your bugs. Code scanners are plagued by the problem of very high false positive rates and extremely high false negative rates. Most code scanners can only find a certain subset of security vulnerabilities, and of the subset they can find, they will be wrong a lot.
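Here is a deliberately naive sketch of why both error rates are high. It is a toy regular expression rule I made up for illustration, not how any real scanner works. The scan function and the SAMPLE snippet are hypothetical.

```python
import re

# Toy rule: flag any line containing "eval(".
RULE = re.compile(r"\beval\(")


def scan(source):
    """Return the 1-based line numbers the toy rule flags."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if RULE.search(line)
    ]


SAMPLE = '''\
result = eval(user_input)
# never call eval( on user input
run = eval
result = run(user_input)
'''

print(scan(SAMPLE))  # [1, 2]
```

Line 1 is a real finding. Line 2 is a harmless comment, so that is a false positive. Lines 3 and 4 do exactly the same dangerous thing as line 1 through an alias, and the rule never notices, so that is a false negative. Real scanners are far smarter than this, but they face the same trade-off at a larger scale.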

I mentioned statically and dynamically typed languages (strongly and weakly typed, loosely speaking) earlier in this post. You can imagine that dynamically typed languages are incredibly difficult to scan. The flexibility you gain from not defining types can lead to a lot of complexity. Having a subroutine that can return an integer or a string means your scanner now has to figure out what is getting returned, and hope it can work out whether there will be problems when processing the output.
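A small Python sketch of that situation, with made-up helper names (parse_port and next_port are hypothetical):

```python
def parse_port(raw):
    """Hypothetical helper: returns an int on success, a str on failure."""
    if raw.isdigit():
        return int(raw)
    return "invalid port: " + raw


def next_port(raw):
    # A scanner has to prove which type reaches the "+ 1" below.
    return parse_port(raw) + 1


print(next_port("8080"))  # 8081

try:
    next_port("http")
except TypeError as exc:
    print("runtime failure:", exc)
```

Nothing in the source says which return type the caller will get. The answer depends on input that only exists at runtime, so the scanner has to either explore both paths or guess.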

Scanning a statically (strongly) typed language will have a slightly higher level of success if you structure your code in a way the scanner likes. Some scanners can be augmented with certain comments to help them understand what’s happening. Even if you do everything right, your scanner will have a high number of false positives. Scanning code is hard.

The other important thing to keep in mind is that these scanners generally only pick up a subset of possible security vulnerabilities. Even if you ran a code scanner and it came back clean, you should not assume your code is free from security vulnerabilities. Scanners tend to be good at finding problems like buffer overflows, for example, but not good at finding logic problems.
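As an illustration of a logic problem, consider this hypothetical permission check. The User, Document, and can_delete names are invented for the example, and the bug is intentional.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    is_admin: bool = False


@dataclass
class Document:
    owner: str


def can_delete(user, doc):
    # BUG (intentional): "!=" should be "==", so everyone EXCEPT the owner
    # is allowed to delete the document.
    return user.is_admin or doc.owner != user.name


doc = Document(owner="alice")
print(can_delete(User("alice"), doc))    # False: the owner is locked out
print(can_delete(User("mallory"), doc))  # True: a stranger gets in
```

There is no dangerous function call, no unchecked buffer, and no type confusion here. The code is perfectly valid; it just does the wrong thing. A scanner would need to know what the author intended in order to flag it, and that intent isn’t written down anywhere in the code.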

Every scanner will also have false positives. Some scanners will have a lot of false positives. As mentioned in the last post, make sure you report false positives to the scanner vendors. They are bugs. False negatives are also bugs, but they’re a lot harder to pick out and report.

What can we do?

I would love to tell you security code scanners will get better with time. They’re already about a decade old and the progress we’ve seen is not super impressive. As with most technology, you should understand your return on investment for using a code scanner. If that return is negative, you’re wasting resources scanning the code.

One of the most dangerous traps we can fall into in security is using tools or processes “because that’s the way we do it”. We should be constantly evaluating everything we do and making it better. Because of the arrow of time, a process that isn’t getting better is getting worse. Nothing ever just stays the same. Using this logic, I would probably argue code scanners are mostly staying the same (feel free to draw a conclusion here). Newer, safer languages are likely the future, not better code scanners.

In the next post we will cover composition scanners. Composition scanning is newer and currently shows promise. It’s also a problem that’s a lot easier to understand and solve than code scanning is.