“Security is always excessive until it’s not enough.”
– Robbie Sinclair
You may not already know this about me, but I kinda like data masking.
Scratch that, I LOVE data masking.
Increasingly both around Redgate and in general I seem to be getting a bit of a reputation as “the data masking guy” but for good reason – to include yet another quote, from Joe Kaeser this time: “Data is the oil, some say the gold, of the 21st century…”, more and more I hear stories about people leaving their oil/gold out for everyone to see, opening up the widest attack surface area by doing things like copying backups down into non-Production environments or exposing test systems to the internet – the list goes on.
This means that people turn to all of the protective methods they have available to them: encryption (TDE, row and column level etc.), static and dynamic data masking, access control… and many combinations of them.
One of the big points I always have to cover when it comes to static data masking though, is something called “deterministic masking”, so let’s start with 2 definitions of my own to make sure we’re on the same page:
Static data masking is the process of de-identifying sensitive data-at-rest within the tables of your Database. It is typically used to provide realistic, Production-like data into non-Production environments like Dev and Test, and even sets that are given to 3rd parties. This relies on retaining non-sensitive business specific fields within rows and taking anything considered PII (Personally Identifiable Information) or PHI (Protected Health Information) and either scrambling or replacing it with similar but ultimately false data.
Deterministic data masking is the process of masking data with values in a repeatable way, such that it will give the same value when masked in any and all future runs on any value that matches and will create a new record for values which have not been previously masked. An example of this would be if you were to mask “Chris Unwin” to “Brad Pitt”, it should appear as “Brad Pitt” not only in our (for example) dbo.Contacts table but also all associated tables (regardless of PKFK relationships at the DB level) and every single run should provide the same output. This is useful for building up familiarity with the data and utilizing for future test runs.
Now. I should caveat this blog post (you’ll find I’m always caveating my posts) with the fact that deterministic masking does have it’s benefits but the very idea that one thing should always become another, in my eyes, is inherently less secure than something that always gets de-identified to a different value. As such, I will always recommend that where possible, masking runs should produce differing values. Deterministic masking is also compute heavy because instead of simply randomizing in values, it will have to check up front the value to be replaced and then replace it with the corresponding value, another potential downside if speed is a key driver in the process you’re trying to put in place.
Most masking tools either support deterministic masking directly (*cough* dbatools.io *cough*) or require a little bit of configuration to get started, but it is a workflow that should be catered for, for those who need it. So here is a quick getting started guide for deterministic masking, if you’re writing your own scripts or (as I will in this example) you’re using Data Masker for SQL Server from Redgate.
Step 0 – Figure out where you’re going to do the masking
There are lots of different ways to move data around and get things masked before exposing it to development and testing teams (or handing it to 3rd parties). In this example, I’m going to assume that the only thing I have available to me is a backup and restore process for moving data around, nothing fancy like SQL Clone to help.
So for this I will assume I have a staging instance somewhere that I can use as part of this process, lets call it WIN2019 and the process I’m going to follow is:
- Restore a full .bak file to WIN2019
- Carry out the data masking process on this instance
- Backup the masked DB
- Move the new .bak into lower environments (restore / make available to Devs)
Step 1 – Map it for multiple future runs
The first problem you have to contend with is needing to maintain a record of how we want to mask things. If we always want to mask the credit card number 3422-567157-24797 to be 3488-904546-46471 then we need to have a place this is stored. The question to ask ourselves here though is WHAT needs to be recorded. There is a huge difference between:
The latter is obviously far preferable because it does not contain any sensitive PII – it is purely the masked value and a non-sensitive CustomerID which only really makes sense within the company or is a system identifier.
So we should get a mapping location set up on WIN2019, I don’t want to make my tables too wide so I’m going to keep this fairly small and atomic – we’ll create a Database on the server with a mapping table for the credit cards:
This is going to be the basis for our repeated masking. The reason for having this as a separate DB/Table though?
1) The mapping should persist – if it exists in the same DB then we will just overwrite it every time, rendering the mapping useless.
2) Devs/Testers don’t need the mapping – just the end result.
Step 2 – Set up the masking to cover the tables, regardless if there is a mapped value
One of the most important phrases in the GDPR is “Data Protection by Design and Default“, and it’s one of my favorites. In this context I am going to interpret this in a very specific way, and that is: “we must mask everything, before trying to map it back to a value that exists, just in case the link to the MaskingMapper DB were to fail for any reason.
I first restore a copy of the Database I’m going to mask (the DMDatabase) and then setup a Data Masker substitution rule to process the DM_Customer table, de-identifying the credit card numbers:
Step 3 – Copy the distinct values across into the mapping table
This step is going to be as simple as writing a single tSQL statement to copy the values across – in Data Masker I will wrap this into a Command Rule and make it dependent on the previous substitution rule:
INSERT INTO MaskingMapper.dbo.CreditCardMap ( CustomerID , CreditCard ) SELECT DISTINCT customer_id , customer_credit_card_number FROM dbo.DM_CUSTOMER WHERE ( customer_credit_card_number <> '' AND customer_credit_card_number IS NOT NULL AND customer_id NOT IN (SELECT customer_id FROM MaskingMapper.dbo.CreditCardMap) );
Step 4 – Sync everything back together
Finally – we need to bring any information back from the table if it had values written to it in previous runs. In tSQL we could write an UPDATE with an appropriate WHERE clause but I’m going to use an additional controller and Table-to-Table Sync rule in Data Masker to handle this:
If we now run this we will have achieved deterministic masking, because we have the following before and afters – first for the DM_Customer table:
and for the CreditCardMap table:
The mapping table now has 77 rows and if we repeat the masking step by step without changing anything we can see that the credit card numbers change in the first instance, but then synchronize back to the values that should persist, the images below represent just running the first two steps in isolation (i.e. masking everything regardless – left) and then the synchronization job restoring the predetermined values (right) and the mapping table still has 77 rows.
Now if in the next run one of the NULL/blank fields has a real credit card number, or we add any additional customer IDs (i.e. with a more recent backup with fresh data) they can be masked, accounted for and persisted between each run.
Deterministic masking is hard, but it is possible. You can use a number of methods to achieve it, such as the above, but the first question you need to ask yourself (after “do I feel lucky”) is:
“Do I NEED deterministically masked data, or is it a nice to have?”
9 times out of 10, I’m pretty sure the answer will be that it is not essential, and therefore you should focus on making sure the masking of the data is random, static and fast. Adding compute to this process will only slow it down and at the end of the day, we just need to make sure our customers data is protected.