How we battle test Bearer CLI

Applications are usually considered battle tested if they've been around a while and work as expected in most known situations. We want our binary releases to earn that level of confidence. The problem is, when you’re building something new, how can you make up for the time and active-user advantage of established software? In this article we’ll look at the early stage of our battle-testing process and how it influenced our future testing.

In the beginning

Bearer CLI is a command line tool that runs on Linux and macOS on a couple of different architectures and supports various distribution methods like Docker, Homebrew, and direct download. To make things even more complex, we parse unknown inputs from multiple languages to gather the insights that Bearer provides. This means there's a large scope for the unexpected to happen.

Unknown unknowns are a difficult problem to solve. We could wait for our users to file bugs, but that’s not fair to early adopters. This is where battle testing came in. The plan was as follows:

  • For the major languages, we created a list of the top projects based on star count. For context, we had about 1.5k projects for Ruby alone.
  • We created a simple Go worker program that would work through this list, scan each project with Bearer CLI, and upload the report output to S3.
  • We then created a GitHub Action that takes a pre-release binary (we call this canary), creates a new task in our ECS cluster, and spawns a number of workers to run Bearer CLI against all these repositories.
{
  "html_url": "https://github.com/rails/rails",
  "stargazers_count": 51816,
  "size": 252146
},
{
  "html_url": "https://github.com/jekyll/jekyll",
  "stargazers_count": 45626,
  "size": 50089
},
{
  "html_url": "https://github.com/huginn/huginn",
  "stargazers_count": 36945,
  "size": 8260
},
{
  "html_url": "https://github.com/discourse/discourse",
  "stargazers_count": 36877,
  "size": 526999
},
{
  "html_url": "https://github.com/mastodon/mastodon",
  "stargazers_count": 36358,
  "size": 182046
},
{
  "html_url": "https://github.com/fastlane/fastlane",
  "stargazers_count": 36051,
  "size": 81131
},
{
  "html_url": "https://github.com/Homebrew/brew",
  "stargazers_count": 34089,
  "size": 64184
},
// ...

In the snippet above from our 1.0 target projects database, you can see that sorting by star count gave us quite a variety of project types. More on how that variety worked out later.
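
As a rough illustration of the worker loop described above, here is a minimal sketch. Our actual worker was written in Go; Python is used here only for brevity. The names `load_targets`, `run_worker`, `failed`, and the `scan` callable are hypothetical: `scan` stands in for whatever clones a repository, runs the canary binary against it, and uploads the report.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ScanOutcome:
    repo_url: str
    ok: bool
    error: str = ""


def load_targets(raw_json: str) -> List[dict]:
    """Parse the target list and order it by star count, highest first."""
    projects = json.loads(raw_json)
    return sorted(projects, key=lambda p: p["stargazers_count"], reverse=True)


def run_worker(projects: Iterable[dict], scan: Callable[[str], None]) -> List[ScanOutcome]:
    """Scan every project, recording failures instead of stopping at the first one."""
    outcomes = []
    for project in projects:
        url = project["html_url"]
        try:
            # Hypothetical callable: clone, run the canary binary, upload the report.
            scan(url)
            outcomes.append(ScanOutcome(url, ok=True))
        except Exception as exc:
            # A failure in one repository must not end the whole run.
            outcomes.append(ScanOutcome(url, ok=False, error=str(exc)))
    return outcomes


def failed(outcomes: Iterable[ScanOutcome]) -> List[str]:
    """Repositories whose scan failed and needs manual inspection."""
    return [o.repo_url for o in outcomes if not o.ok]
```

Passing the scan step in as a callable keeps the loop independent of how the binary is invoked, which also makes it easy to exercise with a fake scanner.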

At this stage we were only interested in the stability of the binary: could it run consistently across different projects without crashing? We inspected any failed scans manually to identify issues with specific platforms, individual rules, or general problems with the scanner. We found a lot of bugs this way. It's a wild world out there, and by not restricting ourselves to standard projects we encountered tons of unique use cases. You name it, we saw it.

Beyond not crashing

Not crashing is great and all, but how good is the actual output? The next step was to assess how well the scan performed at finding security issues in real codebases. False positives are a big problem in SAST tools, and we didn't want to waste our users' time on things that were obviously of no value. We reused the same test setup as before, but this time we did the following:

  • Download all the JSON reports from scans on languages we support.
  • Merge all the data together in a Ruby script and do some basic analysis such as comparing similar findings across different projects.
  • On subsequent runs, we can just find the difference between the old results and the new ones to verify we are moving in the right direction.
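
The merge-and-diff steps above can be sketched roughly as follows. Our script was in Ruby; this Python version is only an illustration. The finding fields used here (`rule_id`, `filename`, `line`) are a simplified, hypothetical shape, not the actual Bearer report schema.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (project, rule_id, filename, line): a simplified, hypothetical finding shape.
Finding = Tuple[str, str, str, int]


def merge_reports(reports: Dict[str, List[dict]]) -> Set[Finding]:
    """Flatten per-project reports into one set of comparable findings."""
    merged = set()
    for project, findings in reports.items():
        for f in findings:
            merged.add((project, f["rule_id"], f["filename"], f["line"]))
    return merged


def findings_per_rule(findings: Iterable[Finding]) -> Counter:
    """How often each rule matched across all projects."""
    return Counter(rule_id for _, rule_id, _, _ in findings)


def projects_per_rule(findings: Iterable[Finding]) -> Dict[str, int]:
    """In how many distinct projects each rule matched."""
    seen = defaultdict(set)
    for project, rule_id, _, _ in findings:
        seen[rule_id].add(project)
    return {rule: len(projects) for rule, projects in seen.items()}


def diff_runs(old: Set[Finding], new: Set[Finding]) -> Tuple[Set[Finding], Set[Finding]]:
    """Return (findings added in the new run, findings removed since the old run)."""
    return new - old, old - new
```

A rule with a very high count in `findings_per_rule` is a candidate for manual review, and `diff_runs` shows whether a rule change removed the matches it was meant to remove.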

The early results were valuable. We were able to identify rules that matched too aggressively (false-positive heavy) and dial them in. Along the way we learned a lot about writing good rules and what to avoid, which helped our team improve the way they write new rules as well.

At this stage, we also looked at the scanner’s performance. We assessed how long scans were taking across various projects and investigated accordingly. In the end, however, this turned out to be labor-intensive. We were checking a lot of results and this is when the quality of the projects involved really started to slow us down.

Remember that we created a list of top projects based on GitHub star count? We noticed the following issues with this list of testable projects:

  • Applications and libraries are not created equal. Libraries often didn’t provide much in the way of reasonable results.
  • Some were just very strange. Shoutout to the “I will accept any PR” JavaScript repo we found.
  • Even amongst top projects, there are quite a lot of toy projects. We didn't want to start optimizing for codebases that are nothing like what you would produce on a real-world commercial or open-source project.

All this meant we were getting diminishing returns from this form of battle testing. By now our rules and parsing engines were proven solid on the menagerie of GitHub projects and we could move on to something more focused.

Interested in the next step in our battle-testing journey? Stay tuned, as next time we will look at testing projects in detail and working out our precision and recall. You can follow us on Twitter or LinkedIn to keep up to date.
