Over the past two decades, security has become increasingly granular, moving deeper into the stack with each generation: from hardware to network, server, container, and now, increasingly, code.
The next frontier of this evolution is data, especially sensitive data. Sensitive data is what organizations don’t want to see leaked or breached. This includes PHI, PII, PD, and financial data. Sensitive data, if breached, carries real penalties, both tangible, such as GDPR fines (up to €10m or 2% of annual revenue), FTC fines (e.g. $150m against Twitter), and legal fees, and intangible, such as loss of customer trust (e.g. Chegg exposed data belonging to 40 million users) and restructuring pain.
Today’s data protection technology leans too heavily on bolt-on approaches, such as identity management to verify who’s who. In reality, these approaches contain inevitable points of failure. For example, once authorized by identity management, users have carte blanche to access important data with minimal constraints. What would happen if you made data the center of the security universe?
First, let’s acknowledge that data is a weird concept by itself in security. Data doesn’t exist in a vacuum. Contrary to what EU lawmakers may think, if you’ve struggled to comprehend and abide by GDPR, you know that data is tightly coupled to many systems. Data is processed (essentially stored, copied, modified, and transferred) by and between systems. At every step, the potential for vulnerability increases. That’s because the systems handling the data are vulnerable, not because the data itself is.
At a time when one of the most precious assets organizations want to protect is data, when massive data breaches and data leaks are reported too often, maybe it is a good time for a new evolution of cybersecurity. One with a data focus. This is what data-first security is all about.
Data-first security: Making data the center of the security universe
The concept is simple: instead of focusing on every system individually – without any knowledge of the data and links between them – we start with data, and then pull the thread. Is sensitive data involved in chatty loggers? Is data shared with non-authorized third-parties? Is data stored in S3 buckets missing security controls? Is data missing encryption? The potential vulnerabilities list is long.
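To make the first of those questions concrete, here is a minimal sketch of how a chatty logger could be spotted from source code alone. It is an illustration rather than a description of any particular product: it assumes Python source, and the hint list, `is_sensitive`, and `find_logger_leaks` are hypothetical names chosen for the example.

```python
import ast

# Hypothetical name fragments treated as sensitive; a real classifier would be richer.
SENSITIVE_HINTS = ("ssn", "email", "password", "diagnosis", "card_number")
# Method names that commonly denote a log call. Any object exposing these
# methods is treated as a logger, which is a deliberate simplification.
LOG_METHODS = {"debug", "info", "warning", "error", "critical", "exception"}


def is_sensitive(name: str) -> bool:
    lowered = name.lower()
    return any(hint in lowered for hint in SENSITIVE_HINTS)


def find_logger_leaks(source: str) -> list[tuple[int, str]]:
    """Return (line, identifier) pairs where a sensitive-looking name reaches a log call."""
    leaks = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr not in LOG_METHODS:
            continue
        # Inspect every variable or attribute used inside the call's arguments.
        for arg in list(node.args) + [kw.value for kw in node.keywords]:
            for sub in ast.walk(arg):
                if isinstance(sub, ast.Name) and is_sensitive(sub.id):
                    leaks.append((node.lineno, sub.id))
                elif isinstance(sub, ast.Attribute) and is_sensitive(sub.attr):
                    leaks.append((node.lineno, sub.attr))
    return leaks


SAMPLE = '''
import logging
logger = logging.getLogger(__name__)

def register(user):
    logger.info("new user %s", user.email)
    logger.debug("created account %s", user.id)
'''

print(find_logger_leaks(SAMPLE))
```

Only the first log call is flagged, because `user.email` matches a hint while `user.id` does not. No data values are read at any point; the check works purely on names found in the code.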
The challenge with data security is that data flows almost infinitely across systems, especially in a cloud native infrastructure. In an ideal world, we should be able to follow the data and its associated risks and vulnerabilities across every system, at any time. In reality, we are far from this.
At Bearer, we strongly believe the best way to approach data-first security is to start at the beginning of the journey, following the shift-left security trend. Data-first security should start in the code.
According to GitLab, 57% of security teams have shifted security left already or are planning to this year.
Why start with code? Ownership and business logic.
The advantages of implementing security in the code today also apply to a data-first approach, but there are three additional and very important ones: not being yet another security liability, the ability to understand ownership context, and support for custom business logic.
Not yet another security liability
Security is about mitigating risk. Adding a new tool or vendor goes against this basic principle. We all have SolarWinds in mind, but others emerge daily. Having a new tool integrating with your production environment is a big ask. Not only for the security team, but also for the SRE/Ops team.
Moreover, when it comes to data discovery, doing this on production infrastructure means looking at actual values, potentially customers’ data, which is essentially what we are trying to protect in the first place. Maybe the best way to not become yet another risk is to simply not access sensitive infrastructure and data?
Since a data-first security approach relies on knowing where sensitive data lives, it might be surprising that this discovery can be performed from the codebase alone, especially when we’re used to DLP and DSPM solutions that perform discovery on production data. It’s true that in the codebase we don’t have access to actual data (values), only metadata. Interestingly, discovering sensitive data this way is also very accurate. Indeed, the lack of access to values is counterbalanced by access to a massive amount of context, which is key for classification.
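As a rough illustration of how context can stand in for values, the sketch below classifies a field using only its name and the name of the model it belongs to. The keyword tables, the ambiguous-field list, and the `classify` function are assumptions made up for this example; real classification would draw on far more context than two identifiers.

```python
import re
from typing import Optional

# Hypothetical keyword rules; the categories follow the ones named in this article.
FIELD_RULES = {
    "PHI": ("diagnosis", "prescription", "blood_type"),
    "Financial": ("iban", "card_number", "account_number"),
    "PII": ("email", "phone", "first_name", "last_name", "birth"),
}
# Context can resolve an otherwise ambiguous field, e.g. "number" on a card model.
CONTEXT_RULES = {
    "Financial": ("card", "payment", "bank"),
    "PHI": ("patient", "medical"),
}
AMBIGUOUS_FIELDS = {"number", "code", "notes", "status"}


def to_snake(name: str) -> str:
    """Normalize camelCase or PascalCase identifiers to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def classify(model: str, field: str) -> Optional[str]:
    """Guess a sensitivity category from metadata only; None means no match."""
    field_name = to_snake(field)
    for category, keywords in FIELD_RULES.items():
        if any(keyword in field_name for keyword in keywords):
            return category
    if field_name in AMBIGUOUS_FIELDS:
        # The field name alone says nothing, so fall back to the surrounding model.
        model_name = to_snake(model)
        for category, keywords in CONTEXT_RULES.items():
            if any(keyword in model_name for keyword in keywords):
                return category
    return None


for model, field in [("CreditCard", "number"), ("Invoice", "number"),
                     ("User", "emailAddress"), ("Patient", "notes")]:
    print(model, field, classify(model, field))
```

The same field name, `number`, is classified differently depending on whether it sits on a `CreditCard` or an `Invoice` model. A scan of production values would not have that information.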
This is true of traditional shift-left security, but a data-first security approach provides even more value when it comes to not being yet another risk for the organization.
Ownership context
When it comes to data security and data protection, not everything is black or white. Some risks and vulnerabilities are extremely easy to identify, such as a logger leaking PHI or an SQL injection exposing PD. Others require a certain level of discussion to assess risk and ultimately decide on the best remediation. Here we enter the borderline territory of compliance, which is never very far away when we are talking about data security.
Why are we storing this data? What’s the business reason for sharing this data with this third party? These are questions that organizations must answer at a certain point. Today they are increasingly handled by security teams, especially in cloud-native environments. Answering those questions, and identifying associated risks, is nearly impossible without unveiling the “ownership.”
By doing data-first security from the point of view of the code, we have direct access to massive contextual information, in particular when something was introduced and by whom. This lets us ask those questions, and eventually resolve any associated issues. DSPM solutions simply can’t provide this context by looking exclusively at production data stores.
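One place this "when and by whom" context already lives is version control. As a minimal sketch, assuming the code is in a Git repository, the function below parses the output of `git blame --line-porcelain <file>` into ownership records. The `Ownership` class, the function name, and the sample author are hypothetical, and the sample is trimmed to the headers the parser reads.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Ownership:
    commit: str
    author: str
    email: str
    introduced_at: datetime
    line: str


def parse_line_porcelain(output: str) -> list[Ownership]:
    """Parse `git blame --line-porcelain` output into one record per source line."""
    records = []
    current: dict[str, str] = {}
    for raw in output.split("\n"):
        if not raw:
            continue
        if raw.startswith("\t"):
            # A tab-prefixed line carries the source text and closes the record.
            records.append(Ownership(
                commit=current["commit"],
                author=current.get("author", ""),
                email=current.get("author-mail", "").strip("<>"),
                introduced_at=datetime.fromtimestamp(
                    int(current["author-time"]), tz=timezone.utc),
                line=raw[1:],
            ))
            current = {}
        elif not current:
            # First line of a record: "<sha> <orig-line> <final-line> [<count>]".
            current["commit"] = raw.split(" ", 1)[0]
        else:
            # Remaining header lines are "<key> <value>" pairs.
            key, _, value = raw.partition(" ")
            current[key] = value
    return records


# Trimmed, made-up sample; real output also includes committer headers.
SAMPLE = (
    "1f" * 20 + " 12 12 1\n"
    "author Jane Doe\n"
    "author-mail <jane@example.com>\n"
    "author-time 1700000000\n"
    "author-tz +0000\n"
    "summary Add signup logging\n"
    "filename app/signup.py\n"
    "\tlogger.info('new user %s', user.email)\n"
)

for record in parse_line_porcelain(SAMPLE):
    print(record.author, record.email, record.introduced_at.date(), record.line)
```

Attaching a record like this to a finding tells the security team who to ask and which change introduced the data flow.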
In the field we have seen many organizations performing “manual assessment.” They send questionnaires to the entire engineering team to understand which sensitive data is processed, why and how. Developers loathe these questionnaires and often don’t understand many of the questions. The poor data security results are predictable.
As with most “technical” things, if you are serious about data security, especially at scale, the most effective approach is to automate tedious tasks with a process that drops into existing workflows with minimal or no friction.
Custom business logic
As every organization is different, coding practices and associated policies differ, especially for larger engineering teams. When it comes to security and especially securing data and access-control, this is particularly the case. We’ve seen many companies doing application level encryption, end-to-end encryption or connecting to their data warehouse in very specific ways. Most of these logic flows are extremely difficult to detect outside the code, resulting in a lack of monitoring and security gaps.
Let’s take Airbnb as an example. They famously built their own data protection platform, and what’s interesting to look at here is the custom logic they implemented to encrypt their sensitive data. Instead of relying on a third-party encryption service or library (there are dozens), they built their own, Cipher. Cipher provides libraries in different languages that allow developers to encrypt and decrypt sensitive data on the fly. Detecting this encryption logic, or more importantly the lack of it on certain sensitive data, from outside the codebase would prove very difficult.
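To show why this kind of check belongs in the code, here is a small sketch of a rule that knows about an organization’s own encryption wrapper. Everything in it is hypothetical: `cipher_encrypt` is a made-up function name and not Cipher’s real API, and the field list and `find_unencrypted_fields` are assumptions for the example. It assumes Python source in which sensitive fields are passed as keyword arguments.

```python
import ast
from typing import Optional

# Hypothetical configuration an organization would supply for its own codebase.
SENSITIVE_FIELDS = {"ssn", "diagnosis", "card_number"}
ENCRYPTION_WRAPPERS = {"cipher_encrypt"}  # made-up name, not a real library call


def call_name(node: ast.AST) -> Optional[str]:
    """Return the called function's name if node is a call, else None."""
    if isinstance(node, ast.Call):
        func = node.func
        if isinstance(func, ast.Name):
            return func.id
        if isinstance(func, ast.Attribute):
            return func.attr
    return None


def find_unencrypted_fields(source: str) -> list[tuple[int, str]]:
    """Flag sensitive keyword arguments whose value is not wrapped in a known encryptor."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for keyword in node.keywords:
            if keyword.arg in SENSITIVE_FIELDS and \
                    call_name(keyword.value) not in ENCRYPTION_WRAPPERS:
                findings.append((keyword.value.lineno, keyword.arg))
    return findings


SAMPLE = '''
Patient.create(
    name=form.name,
    ssn=cipher_encrypt(form.ssn),
    diagnosis=form.diagnosis,
)
'''

print(find_unencrypted_fields(SAMPLE))
```

The rule flags `diagnosis` but not `ssn`, because it has been told which wrapper counts as encryption in this codebase. A tool looking only at the data store has no equivalent way to learn that convention.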
But is code enough?
Starting a data-first security journey from code makes a lot of sense, especially since many insights found there are not accessible anywhere else. It’s true, though, that some information might be missing from the code and only found at the infrastructure or production level.
Reconciling information between code and production is extremely difficult, especially with data assets flowing everywhere. Airbnb shows how complex it can be. The good news is that with the shift to infrastructure as code (IaC), we can make the connections at the code level and avoid dealing with painful reconciliation.
Finally, a data-first security model provides massive value when it comes to prioritization.
Alert fatigue, due to the flood of false positives reported by many tools across the stack, is a major pain point in security today. Prioritization is difficult and resources are scarce. Nothing will change that.
But with a data-first security approach, issues are natively tied to data sensitivity, making prioritization much simpler. For example, given one issue reporting PHI missing encryption and another reporting a leaky S3 bucket that contains no sensitive data, a prioritization algorithm that weights data sensitivity heavily (e.g. PHI > Financial > PD > PII) would easily put the first one at the top.
Without a data-first approach, this is just not possible.
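The weighting described above can be sketched in a few lines. The ordering PHI > Financial > PD > PII comes from the example; the numeric weights, the `Issue` fields, and the choice to let severity act only as a tie-breaker are assumptions made for illustration.

```python
from dataclasses import dataclass

# Ordering taken from the example above: PHI > Financial > PD > PII.
# The numeric weights themselves are illustrative assumptions.
SENSITIVITY_WEIGHT = {"PHI": 4, "Financial": 3, "PD": 2, "PII": 1}


@dataclass
class Issue:
    title: str
    severity: int                       # 1 (low) to 5 (critical), as reported by a scanner
    data_types: tuple[str, ...] = ()    # sensitive data categories the issue touches


def priority(issue: Issue) -> tuple[int, int]:
    """Rank by the most sensitive data type involved, then by severity."""
    sensitivity = max(
        (SENSITIVITY_WEIGHT.get(data_type, 0) for data_type in issue.data_types),
        default=0,  # issues touching no sensitive data sort last
    )
    return (sensitivity, issue.severity)


def prioritize(issues: list[Issue]) -> list[Issue]:
    return sorted(issues, key=priority, reverse=True)


issues = [
    Issue("Leaky S3 bucket without sensitive data", severity=5),
    Issue("PHI missing encryption", severity=3, data_types=("PHI",)),
    Issue("PII written to application logs", severity=4, data_types=("PII",)),
]

for issue in prioritize(issues):
    print(issue.title)
```

Under these assumptions the PHI issue ranks first even though the S3 finding has a higher raw severity, which is the behavior the example calls for.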
The path forward
Considering the challenges associated with security and data, every security solution will have to become at least “data aware” and possibly “data-first,” at whatever layer of the stack it exists. We can already see cloud security posture management (CSPM) solutions blending with data security posture management (DSPM), but will it be enough?
At Bearer, we think a data-first approach comes with a drastic change in how security teams operate, thanks to DevSecOps, and with an extension of their scope of responsibility to compliance-related activities, which requires more than just a “data coating”.
Will the ability to follow data and its associated risks and vulnerabilities across every system remain just a vision? We can’t say, but one thing is for sure: we can do better, and we are working on it.