Data protection
In this blog post I will give a high-level overview of the topics relevant for data protection. The objective is to make sure that any initiative to review a data protection approach starts with a broad enough scope, avoiding common oversights and issues.
The well-known triad of Confidentiality, Integrity and Availability (CIA) may serve as a general framework for data security.
We add the process – store – transmit triad as another structuring factor.
The basics of data protection: confidentiality, integrity and availability.
The CIA triad is always a viable starting point for security. For data protection, more than in other areas, confidentiality tends to be the main focal point. One goal of this blog post is to break open that sometimes too narrow focus.
Data confidentiality
Data must only be accessed by authorized actors.
This basic statement is actually about access control. Yet the common reflex is to respond with encryption as the solution. Encryption does provide access control: to access the data you need to be able to use the right decryption keys. However, when you use an application that accesses encrypted data, it is the application's access control that provides confidentiality; you may be entirely unaware of the underlying encryption of the data.
Encrypting the data counters other threats, for instance rogue access to the stored data or interception of data in transit.
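To make the distinction concrete, here is a minimal sketch of encrypting data at rest, assuming Python's cryptography package and its Fernet recipe; the record content and the key handling are purely illustrative.

```python
# Minimal sketch of symmetric encryption at rest, assuming the Python
# "cryptography" package; key handling is simplified for illustration.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice: held in a key management system
cipher = Fernet(key)

record = b"IBAN: DE00 0000 0000 0000 0000 00"
token = cipher.encrypt(record)       # what ends up on disk or on the wire

# Whoever steals the stored blob without the key sees only ciphertext;
# whoever holds the key (or uses an application that holds it) sees the data.
print(cipher.decrypt(token))
```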
The "classic" fundament for data confidentiality: data classification
The classic main concern for securing data, and indeed a solid starting point, is looking into the confidentiality requirements. The common approach is to identify a limited set of data classes with increasing confidentiality requirements. For instance, Public, Internal, Secret, and Top secret could be the four classes assigned to all data. More elaborate schemes may be used, such as adding organization-specific classes: internal sales, internal HR, …
Data classification serves to support risk management: it must make clear what the risk level is if the confidentiality of the data is breached. Notice that the classification typically does not capture differences in integrity-related risks. Those risks are independent of the confidentiality risks, and we must separately consider the risk level if the integrity of the data is compromised.
The data security policy would define which data belongs in which class, and would spell out the security requirements for data storage for each class.
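As an illustration, here is a minimal sketch of such a scheme, assuming the four classes from the example above; the per-class storage requirements shown are invented for the purpose of the example, not taken from any particular policy.

```python
# A four-class scheme with per-class storage controls (illustrative values).
from enum import IntEnum

class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SECRET = 2
    TOP_SECRET = 3

STORAGE_REQUIREMENTS = {
    Classification.PUBLIC:     {"encryption_at_rest": False, "access_logging": False},
    Classification.INTERNAL:   {"encryption_at_rest": False, "access_logging": True},
    Classification.SECRET:     {"encryption_at_rest": True,  "access_logging": True},
    Classification.TOP_SECRET: {"encryption_at_rest": True,  "access_logging": True,
                                "dedicated_storage": True},
}

def storage_rules(label: Classification) -> dict:
    """Return the storage controls the policy mandates for a given label."""
    return STORAGE_REQUIREMENTS[label]
```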
Extra: the classification may be time-dependent
Data can go from secret to public over time, because the risks change over time. In general, lowering a data classification should be handled carefully. However, there are cases where the secrecy of the data disappears naturally at a well-defined point. The financial reports of publicly traded companies are (top) secret until they are released at a very specific point in time. From that moment on, they are effectively and intentionally public, with high integrity requirements.
Transaction logs are probably not public, but if an audit requests specific logs, that extract becomes very sensitive, both for confidentiality and for integrity.
Handle with care: what is "unclassified data"?
Unclassified data can mean two things: either it has not yet received a label, or it is worthless. These are very distinct situations. In the end, all data should have a label, with the lowest risk typically indicated by the label "public". Defaulting to treat unlabelled data as "public" is an unsafe choice.
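A hedged sketch of that safer default, reusing the Classification enum from the earlier sketch: unlabelled data is handled as sensitive until someone actually classifies it.

```python
# Reuses the Classification enum from the earlier sketch (assumption).
from typing import Optional

def effective_classification(label: Optional[Classification]) -> Classification:
    if label is None:
        # Unlabelled is a policy gap, not a statement of low value:
        # handle it as at least internal until someone classifies it.
        return Classification.INTERNAL
    return label
```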
The risks associated with data confidentiality or integrity loss, or outright data loss, exist regardless of whether the data has been labelled with a class or not.
In a recent incident, data was stored and transferred over an insecure channel. The claim that the data was unclassified may be correct, in the sense that it was likely never assigned a security classification. Yet not classifying data is in itself a policy violation, especially if the data, once classified, would not have been public, let alone if it would have been secret.
Address all risks: include integrity
What does integrity mean?
Only authorized users must be able to change the data.
This, too, is a statement about access control. Cryptographic means to detect unauthorized changes address rogue access to the storage or communication of the data. Recovering from unauthorized modifications depends on back-ups and possibly on transaction recording.
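As an illustration of such cryptographic detection, here is a minimal sketch using a keyed hash (HMAC) from the Python standard library; the key value and the record are purely illustrative.

```python
# Detecting unauthorized changes with a keyed MAC (HMAC), standard library only.
import hmac
import hashlib

integrity_key = b"kept-separately-from-the-data"   # illustrative key handling

def seal(data: bytes) -> bytes:
    """Compute a tag that only holders of the key can produce."""
    return hmac.new(integrity_key, data, hashlib.sha256).digest()

def verify(data: bytes, tag: bytes) -> bool:
    """True if the data still matches the tag, i.e. was not tampered with."""
    return hmac.compare_digest(seal(data), tag)

record = b"exchange rate EUR/USD = 1.08"
tag = seal(record)
assert verify(record, tag)
assert not verify(b"exchange rate EUR/USD = 2.08", tag)  # modification is detected
```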
Public data may need integrity protection
A common misconception is that public data requires no protection. True, there is no risk of a confidentiality compromise. Yet for a lot of data, integrity protection is required to reduce risks. Exchange rates, tariffs, tax percentages, (some) account numbers, … are all public, yet they should be correct and their integrity guaranteed.
One could argue that integrity is the most fundamental property of data. If your bank account data is exposed, it has an impact; if it is modified, it matters a lot more. Loss of integrity means the data is unreliable, and unreliable data can be worse than no data at all. For that reason data may also be labelled with reliability scores.
An area where dealing with the reliability of data is part of normal business is customer relationship management. When people visit your site, you may be able to get some information through tracker data with low reliability. If users self-register for information, the data they enter can often be completely fictitious. When they buy your products, you get more, and more reliable, data. Recurring customers may add even more details, on top of the implicit profiling you can do based on their purchases. For accountability, records must be provably correct.
Data availability
Back-up
The prime control to ensure data remains available is the use of back-up systems. This does not only mean complete snapshots of everything; today it is possible to have fine-grained back-ups of individual document versions.
Archives
Archives provide the historic view of data. In principle, archived items do not include previous versions, as there is only one truth at a certain point in time. It is wise to have back-ups of your archives.
Resilience
Resilience is much more on the radar than it used to be, with new regulations putting pressure on organizations to improve it. This is a good evolution: it leaves us better prepared to deal with new and complex threats to our data.
Data retention: Some data you shall keep, other data you must not keep!
Data retention has two sides: there are rules that oblige you to keep data, and there are rules that require you to delete data.
Data that has been requested by auditors, law enforcement or similar parties must not be deleted during the defined retention periods. Failing to provide the data may lead to a failed audit, to legal accountability and to financial losses.
You may need to keep business records until they can no longer be challenged, which means the expected storage period might be extended by many years.
In the public services area you may have to keep records to enable transparency of governance. When using third-party services, the retention policy and especially the retention periods must be carefully examined: what happens, for instance, after the contract is terminated?
Personnel information, medical records, and other personally identifiable information may be subject to inquiries by the data subjects, who can exercise their right to obtain full extracts. Likewise, their data must be removed when there is no longer a justified reason to keep it, or upon an explicit request for removal.
Archives and back-up: not immutable?
A serious complication of the requirement to erase some data completely from your systems is that archiving and back-up may be severely impacted. An archive is a snapshot of all data at some point in time, and is expected to be immutable. Now, if some data must be kept and some data must be removed, the archive becomes a living data source.
The same holds for back-ups: the back-up data must also be adapted if data has to be removed from all storage media. For back-ups the problem is less complex, as today's back-ups have relatively short lifetimes.
When providing security services, for instance, the contractual agreements may stipulate complete removal of all data that was shared with the customer, except for what is needed as proof of delivering the services. These retention periods vary between contracts and also depend on contract termination.
Side note: it is funny to see stipulations that say all material has to be returned at the end of the contract. The result is that electronic copies are sent around needlessly, creating an unnecessary risk. Complying with destruction alone is already difficult in an environment with back-ups and archiving. It is best to avoid such issues and plan for this future event up front.
Segregate high-risk data from other data
Practical limits on classification granularity
Not all data requires the same protection. As protecting top secret documents is much more costly and cumbersome than protecting public documents, classifying data is a must, if only for cost reasons.
In practice it may not help much to distinguish classifications of data that are actually stored or processed together: the highest classification determines what measures must be in place. In a certain sense, the highly classified data condemns all the other data around it.
Segregation
Solutions to diminish this contamination effect have been around for a while. One of the early examples of this contamination was the protection of payment card data. The Payment Card Industry Data Security Standard (PCI DSS) took data protection seriously, and rightfully led to many systems being included in its scope. The most important part of becoming compliant was reducing the footprint of in-scope systems by segregating the core data causing the trouble. Something similar happened with the data privacy laws and personally identifiable information: keep as many systems as possible outside the scope of the regulation through data segregation.
Masking and tokenization and variations
Instead of segregation, or in combination with it, tokenization is a common technique. It replaces the critical data elements with unrelated tokens, while still allowing the actual data to be reconnected in specific, secured systems if so required.
There are many ways to tokenize data (using the term generically here); the major ones are encryption, hashing, and substitution. They have different properties and issues.
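To make this concrete, here is a sketch of two of these techniques, hashing and substitution via a token vault; the in-memory dictionary standing in for the vault and the sample card number are purely illustrative.

```python
# Two tokenization styles: one-way hashing and reversible substitution via a vault.
import hashlib
import secrets

TOKEN_VAULT: dict[str, str] = {}   # illustrative; in practice a hardened, audited store

def tokenize_by_hash(value: str, salt: bytes) -> str:
    """One-way token: usable for matching/joining, never for recovering the value."""
    return hashlib.sha256(salt + value.encode()).hexdigest()

def tokenize_by_substitution(value: str) -> str:
    """Reversible token: the mapping lives only inside the secured token vault."""
    token = secrets.token_urlsafe(16)
    TOKEN_VAULT[token] = value
    return token

def detokenize(token: str) -> str:
    return TOKEN_VAULT[token]

pan = "4111111111111111"
print(tokenize_by_hash(pan, salt=b"per-dataset-salt"))
t = tokenize_by_substitution(pan)
print(t, "->", detokenize(t))
```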
Data protection: storing, transferring, processing
Applying CIA to data processing steps
Once data is classified, the rules for storing, handling, and distributing the data are defined. Observing these rules should ensure the identified risks are kept at the desired level.
Data is produced everywhere, by almost anyone in the country, the organization or the family. Understanding and applying the data classification policy is everyone's responsibility. Everyone should be trained on the applicable classification policies and must handle classified data correctly.
Confidentiality
Data must be protected according to the policy at all times. The most common protection mechanism is data encryption providing confidentiality: encrypted storage and encrypted communication. While processing data, the prevalent solutions require the data to be decrypted, and this decryption reduces the protection.
For instance, access to encrypted disks makes the encryption transparent to higher layers, and therefore also to attacks at those higher layers. The primary benefit of disk encryption is protecting portable and removable media.
The same weakness holds for encrypted databases and the like: a SQL injection attack does not even have to know or care about this "protection".
While there are initiatives to make processing of encrypted data possible without decryption, these are still very limited.
When applying encryption for communication protection, there is a protected "tunnel" from the encryption point to the decryption point. Outside of that tunnel the data is potentially exposed. When claiming end-to-end encryption, the end points matter a lot. In common cases, data is fetched from storage, decrypted with internal key support, and re-encrypted for communication with specific communication keys. The natural consequence is a gap in the encryption. This gap may occur more than once: after getting the data from a datastore (encrypted database access), the data moves internally to another system (for instance over MQ), which transfers it to a communication system (maybe SCP), which sends it to its destination (likely over TLS).
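A small sketch of that gap, assuming two Fernet keys from the cryptography package standing in for the storage key and the communication key; all names are illustrative.

```python
# The decrypt-then-re-encrypt gap: data is briefly in the clear between the
# storage encryption and the transport encryption (names are illustrative).
from cryptography.fernet import Fernet

storage_cipher = Fernet(Fernet.generate_key())      # protects data at rest
transport_cipher = Fernet(Fernet.generate_key())    # protects data in transit

stored_blob = storage_cipher.encrypt(b"quarterly figures, draft")

def forward_to_partner(blob: bytes) -> bytes:
    plaintext = storage_cipher.decrypt(blob)        # <-- here the data is in the clear
    # ... any processing, queuing or staging happens on the plaintext ...
    return transport_cipher.encrypt(plaintext)      # re-encrypted with a different key

wire_message = forward_to_partner(stored_blob)
```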
Integrity and availability
All this encryption introduces a potential availability risk. If the keys are lost, the data is lost forever (barring quantum computing for now). There is a balance to strike here: a long-term key makes it easier to keep safe copies of that key; on the other hand, the longer a key is used, the higher the risk of it being compromised. With storage and database encryption, changing keys frequently is a challenge.
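One way to ease that tension is key rotation, sketched here under the assumption of the cryptography package's MultiFernet: new data is encrypted with the newest key while old ciphertexts remain readable and can be re-encrypted in the background.

```python
# Key rotation with MultiFernet: encrypt with the newest key, still decrypt
# tokens produced under older keys, and re-encrypt old data over time.
from cryptography.fernet import Fernet, MultiFernet

old_key = Fernet(Fernet.generate_key())
new_key = Fernet(Fernet.generate_key())

ciphertext = old_key.encrypt(b"long-lived record")

# Newest key first: encrypts with new_key, decrypts tokens made with either key.
keyring = MultiFernet([new_key, old_key])
assert keyring.decrypt(ciphertext) == b"long-lived record"

# Re-encrypt existing data under the new key so the old key can eventually be retired.
ciphertext = keyring.rotate(ciphertext)
```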
Data protection: data type dependencies
Documents – unstructured data
There are two rather different types of data protection requirements. The first is the protection of documents, in a broad sense. This type includes all typical "office" documents, letters, emails, contracts, memos, design documents, pictures, … The confidentiality classification is mostly based on the type of the document and applies to the whole document, even if some parts on their own would be considered public. Such documents should carry a classification label.
Databases – structured data
Databases usually structure data in great detail. A database can support multiple views that expose different parts of the data. The typical composition is a set of tables containing rows of data, each with a defined set of fields.
There are two primary access types: read-write and read-only, applicable to the full database or parts of it.
Labelling for data protection must not be attempted at the field or table level. That level is way too low to attach a data classification. The right level is much higher, at the business object level. Remember, all access control fundamentally has a business motivation, so the rules are set, and must be clear, at the business level.
Examples of failures of classification at the lower level are de-anonymization of data and clever queries that bypass restrictions on individual record requests.
Business communication records
Business communication is a specific form of data. In many jurisdictions, electronic communication is as valid as paper-based communication, and can serve as grounds for business decisions such as purchases, contractual agreements, … The communication does not have to be digitally signed to be assumed valid; an email sent from the organization's domain is strong evidence.
It is essential to keep good track of all these communications, in case there is ever a dispute where the communication is essential evidence.
Business communication security
Accidental exposure of data via unsafe communication is a frequent issue. Not using encryption, not using sanctioned devices, unsafe portable data media, … are typical problems.
Data leaks are always an issue, even if they are sometimes the result of well-intended actions. Using social media, free email services or free document sharing solutions is attractive because of their simplicity. Plainly forbidding access to such public services sounds draconian, yet it is a common approach that makes sense on corporate computing infrastructure.
Data leak prevention and detection must be in place to mitigate the associated risks.
Data mining, data aggregation, and AI
Value discovery: impact on classification
The very purpose of data mining, statistical processing, data aggregation, and AI is in many cases to produce new data that has a higher value than the data from which it starts. That implies that the classification of the derived data is likely higher than that of the base data. For processing purposes, the classification of the results defines the controls to be put in place.
While the above observation seems clear, it is not a priori obvious or known what value the results will have and what classification would be appropriate. In a better safe than sorry approach, assume your data monetization approach will be successful and classify the results accordingly.
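A hedged sketch of that rule, reusing the Classification enum from the earlier sketch: the derived dataset is classified at least as high as its most sensitive input, with an optional uplift for the added value.

```python
# Reuses the Classification enum from the earlier sketch (assumption).
def derived_classification(inputs: list[Classification], uplift: int = 1) -> Classification:
    """Derived data is at least as sensitive as its most sensitive input."""
    base = max(inputs)
    return Classification(min(base + uplift, Classification.TOP_SECRET))

# Aggregating internal and public data may well yield a secret result.
print(derived_classification([Classification.INTERNAL, Classification.PUBLIC]))
```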
High level schema with the major data protection policies and controls
The schema hereafter gives a bird's-eye view of the key elements of data protection.
