How to discover sensitive data across your products

Mapping data flows is the first step toward ensuring your data security and privacy policies are properly implemented.

Technology companies need to understand how the software products they build process sensitive data, so they can prevent risks such as data breaches and non-compliance with data regulations.

By “sensitive data” here, we mean any type of data that you want to keep safe, whatever the reasons. It may be personal data because you’re subject to GDPR, health or cardholder data because they are regulated, or any proprietary information that is valuable to your company. Learn more about sensitive data in our article "What is sensitive data?"

Tools and processes to discover and classify data flows vary from one organization to another. Here is a quick overview of the main ones and their characteristics.

Data Protection Impact Assessments (DPIAs)

A DPIA is a process for systematically reviewing new projects or products to minimize personal data protection risks. The GDPR requires one when processing is likely to result in a high risk to individuals, and it is often the responsibility of the Data Protection Officer (DPO). When a product manager designs a new product, the DPO asks them to fill out a questionnaire that describes the nature of the data processing, assesses privacy risks for individuals and identifies mitigation measures.

Pros: 

  • DPIAs are easy to conduct.
  • DPIAs are effective at identifying major privacy risks early in the product development cycle.

Cons: 

  • DPIAs focus on personal data only.
  • DPIAs are done on early product specifications only, not on your actual products. 
  • DPIAs are high-level and don’t provide a detailed view of data processing by product components such as services and databases.
  • DPIAs cover the needs of privacy teams, but not those of the security team.

Tools: ICO template, DPIA software (OneTrust, TrustArc).

Surveys

Manual surveys

Manual surveys are a cheap, effective and customizable way to collect information at a small scale. You can send a survey to your product or engineering team to understand what type of data is processed, and what security and privacy controls are implemented. It is the most straightforward way to collect all the information you need to assess the data security and privacy risks of your products and features.

Pros:

  • Manual surveys are effective at a small scale.
  • Manual surveys allow you to document information that you could never retrieve automatically, like security and privacy controls.

Cons:

  • Manual surveys don’t scale well when your engineering team grows beyond 50 developers. 
  • Manual surveys lack engineering context. It is difficult to know when and to whom to send them as you lack visibility over product developments.

Tools: survey software (Typeform, SurveyMonkey).

Automated surveys

As your engineering organization grows, you will want to automate surveys and send them whenever specific events happen, for instance when a new database is deployed or when a new integration with a third party is built. The hard part is detecting these events proactively.

Some companies build automated workflows on top of their database of assets (CMDB file, Jira Asset Management) so that whenever a new asset is added or updated, a survey is sent. Other companies ask their developers to add parameters to their code whenever it involves new engineering components or sensitive data processing. They can then retrieve this information by scanning the code, notify the security team and send a survey to the engineering owner.
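The code-scanning variant can be sketched as follows. This is a minimal illustration, not a standard: the comment convention "# sensitive-data: <types> owner=<team>" and the function names are hypothetical, chosen only to show how annotations left by developers could be collected and routed to an owner.

```python
import re
from pathlib import Path

# Hypothetical convention (an assumption for this sketch): developers tag
# sensitive processing with a comment such as
#   # sensitive-data: email, phone owner=team-payments
ANNOTATION = re.compile(
    r"#\s*sensitive-data:\s*(?P<types>[\w\s,-]+?)\s+owner=(?P<owner>[\w.@-]+)"
)

def find_annotations(source: str, filename: str = "<memory>") -> list[dict]:
    """Return one record per annotated line found in the source text."""
    records = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if match:
            types = [t.strip() for t in match.group("types").split(",") if t.strip()]
            records.append(
                {"file": filename, "line": lineno, "types": types, "owner": match.group("owner")}
            )
    return records

def scan_repository(root: str) -> list[dict]:
    """Collect annotations from every Python file under a directory."""
    records = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        records.extend(find_annotations(text, str(path)))
    return records
```

Each record carries the owner, so a workflow could use it to decide who receives the survey. The sketch only finds what developers remembered to annotate, which is why such guidelines need enforcing.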

Pros:

  • Automated surveys are effective at scale.
  • Automated surveys are 100% customizable.

Cons:

  • Automated surveys are very resource-consuming to develop and maintain.
  • Additional code guidelines can be difficult to enforce on developers.

Tools: Jira workflows, GitHub workflows.

Database scanning

Open-source software

Database scanning software finds personal and sensitive information in your databases. Open-source software like PIICatcher connects to your production databases, scans column names and content, and matches them against regular expressions to detect and catalog critical data.
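The general technique can be sketched in a few lines. The example below is an illustration under stated assumptions, not PIICatcher's implementation or API: it uses an SQLite connection, two deliberately simple patterns, and a hypothetical function name scan_table.

```python
import re
import sqlite3

# Illustrative patterns only; real scanners ship much larger rule sets.
COLUMN_NAME_PATTERNS = {
    "email": re.compile(r"e?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,}$"),
}

def scan_table(conn: sqlite3.Connection, table: str, sample_size: int = 100) -> dict:
    """Map each column to the data types detected by name or by sampled content."""
    findings = {}
    # A zero-row SELECT is enough to read the column names from the cursor.
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    columns = [description[0] for description in cursor.description]
    for column in columns:
        detected = {label for label, rx in COLUMN_NAME_PATTERNS.items() if rx.search(column)}
        rows = conn.execute(
            f'SELECT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL LIMIT ?',
            (sample_size,),
        )
        for (value,) in rows:
            for label, rx in VALUE_PATTERNS.items():
                if rx.match(str(value)):
                    detected.add(label)
        if detected:
            findings[column] = detected
    return findings
```

Pattern matching of this kind is easy to fool: a long numeric identifier would match the phone pattern, and an email stored inside free text would be missed by the anchored email pattern.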

Pros: 

  • Database scanning open-source software is free and easy to set up in a monolithic software environment.

Cons: 

  • Data discovery and classification methods are rudimentary. 
  • The output, often a raw table, can be difficult for security and privacy teams to use.
  • Database scanning requires direct access to production data.  

Tools: PIICatcher.

Data cataloging software

Data cataloging software also scans your data storage systems to discover and classify data. Its data discovery and classification methods are much more advanced than those of the open-source software mentioned above. These tools were primarily built for Business Intelligence teams at Enterprise companies to help them find and use data scattered across numerous IT systems. Security and privacy features were added later to help their customers mitigate data risks.

Pros: 

  • Data cataloging software provides advanced data discovery and classification methods across numerous data sources (applications, data warehouse, SaaS, etc.). 
  • Data cataloging software covers a wide range of needs and use cases: digital transformation, analytics, data operations, governance, security, and privacy.

Cons:

  • Data cataloging software is costly and complex to set up, maintain and use. It is built for Enterprise companies and does not fit the pace of fast-growing technology companies.
  • Data cataloging software only covers data storage systems. For instance, it does not give you visibility over how you share sensitive data with third parties.
  • Data cataloging software requires direct access to all your production databases, so it adds a security liability.
  • Data cataloging software allows you to identify risks in production only. That is late in the development cycle, and risks can go unnoticed for weeks before being mitigated.

Tools: BigID, Collibra, Alation.

Data privacy management software

Data privacy management software offers data discovery capabilities as well. An accurate data map is a prerequisite for answering Data Subject Access Requests (DSARs) and building the Record of Processing Activities (ROPA) required by GDPR and most major privacy regulations. Most solutions also integrate with your data storage technologies to discover and classify data.

Pros:

  • It answers privacy use cases very well: it helps build your ROPA, automate DSARs, and ease privacy-driven operations.

Cons:

  • Like data cataloging tools, data privacy management software requires connecting to your production databases. It is costly to set up and maintain, and it adds a security liability. Plus, it only brings partial visibility over your data flows as you can’t monitor data sharing with third parties.
  • It does not cover the needs of security teams. It focuses on privacy use cases only, so security teams can’t use its data mapping capabilities to monitor the implementation of data security policies and identify risks like data leaks.

Tools: DataGrail, Osano, Ethyca, Immuta.

Traffic monitoring

An alternative to database scanning is traffic monitoring. It consists of analyzing traffic between your engineering systems in real time to detect data flows. There are several technical approaches:

  • Proxies: you can implement proxies between your services to catch traffic at the application level.
  • Agents: you can implement agents within your applications to see incoming and outgoing traffic at the application level.
  • eBPF: a technology that allows you to monitor traffic at the network level.
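Whatever the capture mechanism, the inspection step is similar: look inside each observed payload and record which fields carry sensitive values and where they are going. The sketch below illustrates that step only, assuming JSON request bodies and a single email pattern; the function names and the record format are hypothetical and not taken from any of the tools listed in this article.

```python
import json
import re

# Illustrative pattern; a real classifier would cover many more data types.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def sensitive_paths(payload, prefix=""):
    """Yield dotted paths of JSON fields whose string value looks like an email."""
    if isinstance(payload, dict):
        for key, value in payload.items():
            yield from sensitive_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            yield from sensitive_paths(value, f"{prefix}[{index}]")
    elif isinstance(payload, str) and EMAIL.fullmatch(payload):
        yield prefix

def record_flow(source: str, destination: str, body: str) -> dict:
    """Describe one observed request as a data-flow record."""
    return {
        "source": source,
        "destination": destination,
        "sensitive_fields": sorted(sensitive_paths(json.loads(body))),
    }
```

Because the record includes the destination, the same logic applies to calls leaving your infrastructure, which is what makes transfers to third parties visible.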

Pros:

  • Traffic analysis is the most exhaustive and accurate way to map data flows since it observes actual data flows. For instance, you can monitor data transfers to third parties, unlike with database scanning.
  • Traffic analysis allows you to proactively detect data leaks since you monitor data flows in real-time. For instance you can detect sensitive data leaks in your logging environment.

Cons: 

  • Proxies are costly to configure in complex software environments since you may have to deploy and maintain dozens or hundreds of them. Plus, they can cause latency issues.
  • Agents are also costly to deploy and maintain. They run in production and may conflict with other libraries, so they can cause downtime and latency issues. They can also be a security liability, as the Log4j vulnerability showed.
  • Proxies, agents and eBPF are not yet widely used for data mapping. Few solutions relying on those technologies provide mature data discovery and classification features.

Tools: Soveren (proxy), Laminar (eBPF).

Static Code Analysis (SCA)

Static Code Analysis is an approach used by Static Application Security Testing (SAST) tools to identify security flaws in the source code. A static code analyzer integrates with your Git repositories to examine the source code without actually running it. It can catalog your engineering components (repositories, databases, third-party dependencies) and identify sensitive data processing by scanning the codebase, including data structure files like OpenAPI, GraphQL, and SQL files.
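To make the idea concrete, here is a minimal sketch that parses Python source with the standard ast module and flags identifiers whose names suggest sensitive data, without executing anything. The term list and the function name are assumptions made for illustration; commercial analyzers go well beyond name matching.

```python
import ast

# Illustrative term list (an assumption for this sketch).
SENSITIVE_TERMS = ("email", "password", "ssn", "card_number")

def find_sensitive_identifiers(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, identifier) pairs for names that suggest sensitive data."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            name = node.id          # variable reads and writes
        elif isinstance(node, ast.Attribute):
            name = node.attr        # attribute access such as user.email
        elif isinstance(node, ast.arg):
            name = node.arg         # function parameters
        else:
            continue
        if any(term in name.lower() for term in SENSITIVE_TERMS):
            hits.add((node.lineno, name))
    return sorted(hits)
```

Name matching like this sits at the basic end of the spectrum: it reports where sensitive-looking identifiers appear, but not where the data flows next.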

Pros:

  • SCA is easy and quick to set up. It only requires integrating with your Git repository software like GitHub or GitLab.
  • SCA adapts very well to large and complex codebases, unlike the approaches described above.
  • SCA automatically checks your code as you write it. It allows you to incorporate data mapping into the early stages of development, so you can significantly reduce the cost of downstream data security and privacy risks.
  • SCA streamlines data mapping processes, reduces code reviews and frees up developers’ time for other important tasks.

Cons:

  • SCA is still a recent technical approach to data mapping, and its quality can vary considerably from one vendor to another. Semantic analysis capabilities range from basic (regular expression matching only) to very advanced (using machine learning techniques). Make sure you evaluate the quality of data discovery and classification on your own codebase.
  • SCA does not allow monitoring traffic in real-time. 

Tools: Bearer, Privado.

Start small, automate as you grow.  

Are you a startup with fewer than 50 developers? Manual surveys are probably the way to go.

Above this number, engineering organizations tend to be so complex that manual surveys can’t keep up with the pace of product and engineering changes. Automated surveys, database scanning, traffic monitoring and code scanning all have their pros and cons, and they can be complementary depending on your needs. We are happy to discuss them with you to guide you towards the best-suited solution(s) for your business.

