“Where do I even begin with data masking?” Getting started in 3 steps.

“I like to begin where winds shake the first branch.”
Odysseus Elytis

I’ve already covered data masking a little in a few of my first posts on this blog, and for many it will be familiar ground. Performance tweaks and determinism are just par for the course in masking data: approaches to solving a well-defined problem. But for many, the challenge is not how they mask their data. You’re free to write scripts, use PowerShell or use a COTS software solution like Data Masker, and the end result is going to be a set of data which you can securely work on, knowing that if you do suffer a breach of that data, it will be useless to those who get their hands on it.

The biggest problem here, though, is getting started.

That phrase in itself suggests what kind of problem data masking is – most things in life are at least easy to get started with, only becoming more complex the deeper you go. One can write a Hello World program quite simply in any major programming language, but to start creating rich, functional graphical software you’re going to have to learn, write and test (and repeat infinitely).

But data masking is like playing ‘capture the flag’ without an opposing team. Capturing the flag is a straightforward enough objective, but if you don’t know where the flag is, how big it is, what it looks like, how many flags there are etc. then your job is going to be impossible if you just jump straight in. Like all things in this life, data masking takes careful thought and planning to understand the scale you’re dealing with – and you need to make sure you don’t narrow your scope too soon by focusing on one system or set of data.

So here are the 3 broad steps I would recommend when starting to protect data in non-production environments:

1 – Catalog your structured data

I cannot emphasize this point enough. Before you do anything, catalog the heck out of your estate. Use anything you like – whilst I’d recommend approaches such as this or this, even an Excel sheet is a start. You must know what you hold, everywhere, before you can begin to make decisions about your data. Perhaps the goal was to mask the data in pre-production copies of your CRM system? Well, data doesn’t just exist there in isolation… or maybe it does! You don’t know until you have a record of everything. Every column, every table, every database, every server. There should be a record, tagged with a reasonable value to indicate at the very least the following 4 things:

  1. What system it is used in
  2. Who is in charge of the stewardship of it (i.e. who should be keeping it up to date / ensuring compliance)
  3. How sensitive it is
  4. What kind of data it is

Obviously other values can be specified too, like your treatment intent (masking, encryption, both etc.), your retention policy for specific fields, or even something like how the data moves in, out and around your estate (data lineage). But at the very least the catalog needs to highlight where the hot-spots of sensitive data exist and what is most at risk. From this you can derive the insight needed to start masking data.
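If you’re rolling your own rather than using a dedicated cataloging tool, even a single table can hold this record. Here’s a minimal T-SQL sketch – every object name and classification value in it is hypothetical, purely to show the shape of the four attributes above:

-- Hypothetical minimal catalog: one row per column in the estate.
-- All object names and classification values are illustrative only.
CREATE TABLE dbo.DataCatalog
(
    CatalogId       int IDENTITY(1,1) PRIMARY KEY,
    ServerName      sysname        NOT NULL,
    DatabaseName    sysname        NOT NULL,
    TableName       sysname        NOT NULL,
    ColumnName      sysname        NOT NULL,
    SystemName      nvarchar(128)  NOT NULL,  -- 1. what system it is used in
    DataSteward     nvarchar(128)  NOT NULL,  -- 2. who keeps it up to date / compliant
    Sensitivity     nvarchar(50)   NOT NULL,  -- 3. e.g. Public / Internal / Highly Sensitive
    DataCategory    nvarchar(50)   NOT NULL,  -- 4. e.g. Name, Email, Date of Birth, Payment
    TreatmentIntent nvarchar(50)   NULL       -- optional: Masking / Encryption / Both
);

-- An example entry for a CRM email column
INSERT INTO dbo.DataCatalog
    (ServerName, DatabaseName, TableName, ColumnName,
     SystemName, DataSteward, Sensitivity, DataCategory, TreatmentIntent)
VALUES
    (N'PRODSQL01', N'CRM', N'dbo.Contacts', N'EmailAddress',
     N'CRM', N'Jane Doe', N'Highly Sensitive', N'Email', N'Masking');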

Notice this step comes before any specific masking steps; that is because we should have a company-wide resource for referencing our structured data regardless.

2 – Scope your attack, identify the ‘hit points’

Once you have a record of everything, you can identify the systems that are going to be the most at risk and which databases you should start with. As part of the new project you’re working on, you may need to mask multiple tables or even databases, but the size and scale of this process is currently stopping you from proceeding.

Identify the key tables or even databases you will need to start with. It is so rare to come across databases where sensitive information is perfectly evenly spread across every table (I’m sure there are some, of course, but most will have concentrations of data) – these ‘usual suspects’ will be contact, account, address and payment info tables etc., and where we have multiple tables with duplicated information, we want to select the biggest behemoths for this task. The reason is that instead of trying to mask all tables equally, you’ll want to mask the biggest sources of truth and then “fan out” your masking, linking the other tables back to the masked ones. Not only is this process quicker, it also keeps our masking consistent and believable. Draw out diagrams so we can keep the actual masking to a minimum (i.e. just these hit points) and then identify where you can fan it out to. Use these almost as you would stories in development.

This approach keeps the effort of writing scripts or configuring masking rules to a minimum but keeps the impact high – preventing us from wasting time trying to mask a table which, for instance, may only hold Personally Identifiable Information by correlation with other data. Ultimately, once you have this (now hopefully smaller, more focused) list, it’ll be easier to define how you want the masked data to appear.
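To make the “fan out” idea concrete, here’s a minimal T-SQL sketch against a hypothetical schema (dbo.Contacts as the hit point, with dbo.Orders duplicating the customer name). A real masking set would substitute believable values rather than this trivial pattern, but the shape is the same: mask the source of truth once, then join everything else back to it:

-- 1) Mask the hit point. (Trivial substitution here; Data Masker or a
--    lookup of realistic values would normally produce something believable.)
UPDATE dbo.Contacts
SET    FirstName = N'First' + CAST(ContactId AS nvarchar(10)),
       LastName  = N'Last'  + CAST(ContactId AS nvarchar(10));

-- 2) Fan out: overwrite the duplicated copies by joining back to the
--    already-masked source of truth, rather than masking them independently.
UPDATE o
SET    o.CustomerName = c.FirstName + N' ' + c.LastName
FROM   dbo.Orders   AS o
JOIN   dbo.Contacts AS c ON c.ContactId = o.ContactId;

Because the secondary tables inherit their values from the masked hit point, the result stays consistent across the whole database with only one set of masking rules to maintain.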

3 – Write. Test. Validate. Record.

Once you have an understanding of your estate and of what you’re going to be masking, it’s time to start getting some indicative masked values mapped across (names for names, dates for dates of birth etc.) – but this is not a big bang approach.

Just like software development, this process is best done iteratively. If you try to write one monolithic masking script or rule set in a single pass, you will gain 2 things:

  1. A thing that only you know how it works
  2. A thing that only you can maintain

Start with a copy of the database. Restore it to somewhere locked down that only a very select few have access to. Build the first part of your script to reflect the first of the ‘stories’ you identified in 2), i.e. “Mask the names to be something believable”. Does this work?

No, why not?
Yes, perfect.
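As a sketch of what that first story might look like in T-SQL – assuming a hypothetical dbo.NameSeed table pre-loaded with made-up but realistic names, numbered contiguously from 1 – a deterministic mapping like this keeps the masking repeatable between runs:

-- Hypothetical first story: mask names to something believable.
-- dbo.NameSeed holds made-up names with contiguous SeedId values 1..N.
UPDATE c
SET    c.FirstName = s.FirstName,
       c.LastName  = s.LastName
FROM   dbo.Contacts AS c
JOIN   dbo.NameSeed AS s
  ON   s.SeedId = 1 + (c.ContactId % (SELECT COUNT(*) FROM dbo.NameSeed));

-- Quick test: nothing should be left that didn't come from the seed list.
SELECT COUNT(*) AS UnmaskedRows
FROM   dbo.Contacts AS c
WHERE  NOT EXISTS (SELECT 1 FROM dbo.NameSeed AS s
                   WHERE  s.FirstName = c.FirstName);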

You should even ask developers for feedback on the masked data; they will be working with it, so it makes sense to collaborate with them on what they would like to see.

So it’s time to record what you’ve done, just as anyone would do with any other software process. If only there was some way for us to do this… oh wait!

PUT IT IN SOURCE CONTROL.

It doesn’t matter if you’ve written a script or created part of a Data Masker masking set, it should now be put into source control, properly commented, and stored so that it is easily accessed, updated and implemented. Grant Fritchey actually wrote a great article about doing this here!

Now build up those rules, tackle your stories and gradually create enough to mask each of the ‘hit points’, then fan them out to the secondary tables. By keeping this effort minimal but high impact, you’ll be able to try it in earnest. Once you have tackled your high-risk targets, you can add stories for the specific test cases or oddities required in Dev and Test.
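A quick automated check after each fan-out helps with the ‘Validate’ part. For example, against the same hypothetical schema as above, you can confirm the secondary tables now agree with the masked source of truth:

-- Hypothetical validation: after fanning out, duplicated columns should
-- match the masked hit point exactly. A non-zero count means a rule was missed.
SELECT COUNT(*) AS InconsistentRows
FROM   dbo.Orders   AS o
JOIN   dbo.Contacts AS c ON c.ContactId = o.ContactId
WHERE  o.CustomerName <> c.FirstName + N' ' + c.LastName;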

The point is to start somewhere. Like I said, getting started is the hard part, but once you know what you have, where it is and how it’s structured, actually masking the data is a breeze.
