
Improving Bearer CLI's precision and recall

Previously, we talked about the first phase of our battle testing process. If you haven’t already, give it a read for background on this article. After Bearer CLI proved itself solid against a variety of real-world projects, it was time to take things to the next level and compare the quality of results over time, and against the results of other static application security testing (SAST) tools.

The plan

To measure the quality of our tool, we decided to run Bearer and a top SAST tool on the same list of projects and compare the results. Since Bearer currently supports Ruby and JavaScript, we chose Brakeman (Ruby) and Semgrep (JavaScript) as the comparison SAST tools.

From our earlier battle testing, we had a good selection of open source projects that were active, well-maintained, and had a level of code quality comparable to “real world” and proprietary projects.

Selected Ruby Projects

Selected JavaScript Projects

What makes a meaningful comparison?

At this point, we had the tools (Bearer CLI, Brakeman, Semgrep), we had the input (a curated list of open-source projects), and we could get the output (a given tool’s security report for a given project).

The question now became: how do we compare the output in a meaningful way? Specifically:

  1. How is Bearer CLI improving with each new release?
  2. How does Bearer CLI measure up against other SAST tools?

To compare Bearer CLI against itself and against another tool, we decided to track two metrics: precision and recall.

What is precision? What is recall?

When classifying data, precision and recall are metrics we can apply to a set of results, such as a set of findings—like Bearer CLI’s security report.

Generally speaking, precision is the percentage of relevant results out of all returned results. Recall is the percentage of relevant results out of all expected results, whether returned or not. In this way, expected results include any “missed” items: things that were not returned as results, but that should have been.

In the diagram above, circles represent relevant results, while triangles represent irrelevant results. Any results inside the large circle, whether relevant or not, are the retrieved results.

The calculations are as follows:

precision = relevant retrieved results count / total retrieved results count

recall = relevant retrieved results count / total relevant results count
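As a quick sketch, the two formulas translate directly into code (the counts here are made-up, purely for illustration):

```python
def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Fraction of retrieved results that are relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Fraction of all relevant results that were retrieved."""
    return relevant_retrieved / total_relevant

# Example: 8 of the 10 retrieved results are relevant,
# and 12 relevant results exist in total (so 4 were missed).
print(precision(8, 10))  # 0.8
print(recall(8, 12))     # ~0.667
```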

Precision and recall for security tools

For our purposes, the data classified is a security report, and a result is a finding in that report. In Bearer CLI, a finding is a line or piece of code that has been “classified” as a security or privacy risk by a rule.

In this context, we defined a relevant result as a “true positive” (see below for more on this), and so precision is a measure of Bearer CLI’s accuracy: out of all findings in the report, how many were of actual interest to the user?

precision = true positive count / total findings count

We defined a missed item as any true positive that was reported by some other security tool, but not by Bearer CLI (we’ll look at this in more detail below). Therefore, recall is a measure of how we compared to other SAST tools.

recall = true positive count / (true positive count + missed findings count)
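Putting the two security-flavored formulas together, here is a minimal sketch with illustrative counts (the function and field names are our own, not part of Bearer CLI):

```python
def report_metrics(true_positives: int, false_positives: int, missed: int):
    """Precision and recall for one security report, where 'missed' counts
    true positives reported by another tool but not by this one."""
    total_findings = true_positives + false_positives
    precision = true_positives / total_findings
    recall = true_positives / (true_positives + missed)
    return precision, recall

p, r = report_metrics(true_positives=17, false_positives=3, missed=3)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=85% recall=85%
```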

By tracking both precision and recall, we would address our two-fold aim of measuring Bearer CLI’s improvement against itself, and against other security tools.

True versus false positives

To measure both precision and recall, we needed to evaluate the findings in each security report and decide which findings were “true positives” and which were not.

When classifying data into one of only two states, such as red or black, a classification result is either a true positive (classified as red and actually red) or a false positive (classified as red but actually black).

In the context of SAST tooling, a false positive could mean a few different things:

  1. A finding that should not have triggered the rule. In this case, the rule is malfunctioning. For example, if we had a rule that matches on error logs but not warning logs, we would consider it a false positive if this rule alerted us to a warning log in the code base.
  2. A finding that should have triggered the rule, but that isn’t considered (by a human) to be an actual security issue given the wider context. For example, if we had a rule that matches plaintext passwords, and it matched a dummy password in a test file, we might consider this a false positive since it is not an actual security breach.

We decided to go for the second definition of a false positive because it relied on the report (output) and the code (input), but not the inner workings of the tool itself. This was therefore a definition that we could apply consistently to both Bearer CLI (which we know inside and out) and other SAST tools (which we only know as a user).
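To make the second definition concrete, a first-pass triage might flag findings in test directories as candidate false positives, with a human still making the final call. The path heuristic below is a hypothetical illustration, not how we actually evaluated reports:

```python
import re

# Hypothetical heuristic: a finding located under a test/spec directory is a
# *candidate* false positive (e.g. a dummy password in a test fixture),
# pending human review of the surrounding context.
TEST_PATH = re.compile(r"(^|/)(test|tests|spec|specs|__tests__)(/|$)")

def candidate_false_positive(finding: dict) -> bool:
    return bool(TEST_PATH.search(finding["filename"]))

print(candidate_false_positive({"filename": "spec/models/user_spec.rb"}))  # True
print(candidate_false_positive({"filename": "app/models/user.rb"}))        # False
```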

Comparing lines of code

To calculate recall, we also needed a way to compare findings across tools to see which findings Bearer CLI missed, but which another SAST tool had found. Here we decided on the simplest—albeit imprecise—algorithm: we considered two findings equal if they occurred in the same file and on the same line.
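In code, that file-and-line comparison might look like the following sketch (the finding shape is a simplified assumption, not any tool's actual report format):

```python
def missed_findings(our_findings, other_findings):
    """Findings reported by another tool but not by us, where two findings
    are considered 'equal' if they share a filename and line number."""
    ours = {(f["filename"], f["line"]) for f in our_findings}
    return [f for f in other_findings if (f["filename"], f["line"]) not in ours]

ours = [{"filename": "app/user.rb", "line": 10}]
theirs = [
    {"filename": "app/user.rb", "line": 10},  # matched by both tools
    {"filename": "app/auth.rb", "line": 42},  # missed by us
]
print(missed_findings(ours, theirs))  # [{'filename': 'app/auth.rb', 'line': 42}]
```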

The results

The initial metrics were middling. Overall, Bearer CLI’s Ruby rules had around 80% precision and 75% recall. Precision for our JavaScript rules was also sitting at around 80%, with recall at 78%. We were able to use these results—broken down per project—to pinpoint areas for improvement.

Over a few focused sprint cycles, we increased precision for both Ruby and JavaScript rules to 85%. Recall for Ruby rules increased to 80%, and to 85% for JavaScript rules.

Lessons learned

It was insightful to have a numerical representation of Bearer CLI’s accuracy and how it compared to other SAST tooling. As mentioned, it highlighted areas for improvement, and it was also motivating to track these metrics over time. We could see the quantitative effect of our bug fixes and rule engine improvements, and that effect could be easily communicated across the company.

On the level of an individual project, our metrics didn’t provide a wholly comprehensive number. It was still necessary to evaluate a project’s results in the context of how many overall findings were returned.

If a project had only three findings, for example, and one was a false positive, the precision for this project would be around 67%, while another project with, say, 100 findings including one false positive would have a precision of 99%.

Rather than add some kind of weighting here, we took a simpler approach: we combined all project results to give overall precision and recall numbers for a given programming language. This gave a general indication of how we were doing and in which direction we were heading (were we getting better or worse? More or less accurate?) and was enough for our purposes.
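Concretely, combining counts across projects before dividing looks like this sketch (the project numbers are illustrative):

```python
def pooled_precision(projects):
    """Sum the counts across all projects before dividing, so a small
    project with one false positive doesn't skew the overall number."""
    true_positives = sum(p["true_positives"] for p in projects)
    total_findings = sum(p["findings"] for p in projects)
    return true_positives / total_findings

projects = [
    {"findings": 3, "true_positives": 2},     # ~67% precision on its own
    {"findings": 100, "true_positives": 99},  # 99% precision on its own
]
print(f"{pooled_precision(projects):.0%}")  # 98% (101/103 pooled)
```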

Beyond this insight, evaluating the reports of other SAST tools also taught us a bit more about their philosophies, especially around false positives. With security tooling, there is always a balance to strike: too many false positives and the report is too noisy to be useful; too many missed findings and it cannot be trusted. It was insightful to see the different approaches tools take here.

The last reflection is on our process. Our definition of a false positive required us to evaluate each report in detail and dig into the code and its surrounding context to decide whether a given finding was a false positive. This proved, as expected, to be a tedious task. There is likely a balance to be struck between our previous approach to finding false positives (investigating only anomalies) and the more rigorous but painstaking approach here, and that balance is what we should strive for next time.

How about you?

We hope our battle testing adventures have been interesting to you. If you track metrics in your work, what are you tracking? And how are you using these figures? Do you find them motivating? We’d love to hear from you at @trybearer.
