Sensitive data is a minefield for big data. If a single record containing Personally Identifiable Information (PII) or Personal Health Information (PHI) finds its way into your enterprise data lake, you have a problem on your hands.

The presence of PII/PHI may trigger a compliance issue with local privacy regulations, or it may simply increase your security risk beyond what is acceptable to your organization. Either way, it is important to recognize that sensitive information has the potential to poison your data lake. 

For some organizations, the answer is simple: they just avoid PII/PHI data altogether. But this means ignoring the potential value these data could bring to the organization. A better approach is to embrace sensitive data, but also make preserving privacy a primary objective of your data strategy.

What’s the Problem with PII?

Emerging regulations like GDPR in Europe and CCPA in California are game-changers for organizations operating in these jurisdictions. They recognize the rights of individuals to own their data, and they make significant demands of any organization collecting PII. Failure to meet regulatory expectations may result in significant penalties; in the case of GDPR, fines of up to €20 million or 4% of global annual turnover, whichever is higher, can level an unprepared company.

PHI has long been regulated by HIPAA (and indeed, healthcare records also fall under the jurisdiction of GDPR). This healthcare regulatory framework is well entrenched, though recent changes appear to soften its bite in response to violations. Nevertheless, PII and PHI aren’t like anonymous log files. They demand special treatment. 

This creates a conundrum for many organizations. On the one hand, leaders want to squeeze every bit of value from all the data they possess, sensitive and non-sensitive alike. On the other hand, the minute sensitive data such as PII finds its way into a data set, technical leaders must ask tough questions about who can access the data. Too often, these questions lead directly to reactive decisions, such as locking sensitive data out of any kind of analysis. If analysts can’t access data, they can’t produce insights.

A better approach is to build a data strategy around the proper handling of sensitive information. This doesn’t mean simply bolting on access control and enabling encryption; instead, what is needed is a holistic perspective on the overall lifecycle of sensitive data in your organization. Privacy by Design provides valuable guidance on how to proceed. 

The Need for Privacy by Design

Privacy by Design is a framework for engineering information systems, first proposed by Dr. Ann Cavoukian in 1995. Her work was recognized internationally by the International Assembly of Privacy Commissioners and Data Protection Authorities in 2010. It has since seen widespread support, and it informed the EU’s GDPR.

Privacy by Design advocates being proactive, anticipating risks, and building effective countermeasures into the design of an information system. What makes Privacy by Design so effective is that rather than dropping an onerous list of SHOULDs and MUSTs in the laps of developers, it offers a simple framework to focus the attention of system designers on protecting every individual’s privacy. It avoids the trap of focusing on technology alone, and instead promotes the idea that systems, infrastructure, and business practices all have a role to play in preserving privacy.

It is based on seven foundational principles:

  1. Proactive not reactive; preventative not remedial 
  2. Privacy as the default setting 
  3. Privacy embedded into design 
  4. Full functionality – positive-sum, not zero-sum 
  5. End-to-end security – full lifecycle protection 
  6. Visibility and transparency – keep it open 
  7. Respect for user privacy – keep it user-centric 

Simple ideas, but very powerful when applied consistently throughout the design process and on through execution. 

Here at PHEMI, we are big fans of Privacy by Design. We think it is the right way to approach Big Data, and we use its methodology in our own product design.

Written by: K. Scott Morrison, PHEMI CTO

In part two of this series, I will explore each of the seven foundational principles and discuss the impact they have on big data containing sensitive information.