“How long does data masking take?” Part 1

“Exploring the unknown requires tolerating uncertainty.”
Brian Greene

Another post in the data masking series, and I thought I’d go with an easy one to answer. It’s right up there alongside the other easy questions we all ponder, like:

“What is the meaning of life?”
“How do I build a fusion reactor using common household objects?”

and the classic “Can vegans eat Animal Crackers™?”

The answer to any of the above questions is, of course, it depends.

If you’re content to ‘live life to the fullest, travel and be happy’, if you happen to have spare protium and boron-11 just around the house, and if you’re content to bypass the ethical conundrum that, although the crackers look like animals, they don’t actually contain any animal products, then the above are all theoretically answerable. But data masking… whooo, that’s a doozy!

Honestly, there’s no easy way to benchmark this process. How long it takes is subject to many things you may wish to think about prior to the deed:

  • How many columns are on the tables you want to mask?
  • How many rows are in the tables you want to mask? (Which are the biggest tables?)
  • How many indexes are on the tables? Will these get in the way of a static masking process?
  • Is data duplicated across multiple tables and does it need to be synchronized to maintain referential integrity?
  • How powerful is the machine you’re running the process on?
  • WHERE is the Database you’re running this process on? (Same machine, Network, PaaS…)
  • Will anyone be trying to connect to, and run queries against, the database being masked?
  • Do you need to mask across Databases for consistency?

All of these things will affect the process, both in terms of how you configure it and how long it takes.
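
Answering the size and index questions is usually a two-minute job before you start. Here’s a minimal catalog-view sketch (plain T-SQL, nothing Data Masker specific) that lists row counts per table alongside the nonclustered indexes a masking UPDATE would have to maintain:

```sql
-- Pre-masking inventory: rows per table, plus the nonclustered indexes
-- that every masking UPDATE will have to maintain (or you can disable).
SELECT
    t.name AS table_name,
    (SELECT SUM(p.row_count)
       FROM sys.dm_db_partition_stats AS p
      WHERE p.object_id = t.object_id
        AND p.index_id IN (0, 1)) AS [rows],   -- heap or clustered index
    (SELECT COUNT(*)
       FROM sys.indexes AS i
      WHERE i.object_id = t.object_id
        AND i.index_id > 1) AS nonclustered_indexes
FROM sys.tables AS t
ORDER BY [rows] DESC;
```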

I will endeavor, though, to give SOME idea. At the very least it’s a good starting point, and hopefully, if you decide to run any testing of your own, you’ll see similar or even better performance. For this test I will use Data Masker for SQL Server, but in future blogs I’m hoping to test out a few other methods as well, so stay tuned!

For this test, I will be using the following setup. Note this isn’t what I would expect a real setup to be – I would expect the staging server being used for masking (and potentially image creation in SQL Clone) to be a teensy bit more powerful!

  • Microsoft SQL Server Developer 2016 (64-bit)
  • Microsoft Windows NT 6.3 Server hosted in AWS EC2 (c5.2xlarge)
  • 4 Processors (vCPU), 16 GB Memory
  • Redgate Data Masker version 6.3.20.5034 installed
  • DMDatabase (tables below in Data Masker) – this is my demonstration database in simple recovery mode with:
    – 1,000,000 rows on the DM_Customer table
    – 50,000 rows on the DM_Customer_Notes table
[Image: Data Masker Tables tab showing row counts for the large tables]
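
On that simple recovery mode: masking is one long stream of UPDATEs, and simple recovery lets SQL Server reuse log space between commits rather than growing the log for the whole run. If you want to mirror the setup, it’s one statement (don’t do this to a copy that still needs log backups!):

```sql
-- Switch the demo database to SIMPLE recovery so the transaction log
-- can truncate between commits during the bulk masking UPDATEs.
ALTER DATABASE DMDatabase SET RECOVERY SIMPLE;

-- Verify the current recovery model:
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'DMDatabase';
```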

For the purposes of this test I will mask: firstname, lastname, company name, street address, region, country, telephone number and zip code on the DM_Customer table and the notes field on DM_Customer_Notes (a free text field). Once I have completed this, I will then create an email address on DM_Customer using the masked first and last name values, and then finally I will synchronize the masked names across to the DM_Customer_Notes table, where they are duplicated.
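
To make those last two steps concrete, here’s roughly what the rules amount to in plain T-SQL. This is a conceptual sketch, not what Data Masker actually generates, and the column names (customer_id, customer_firstname and so on) are illustrative:

```sql
-- Row-internal synchronization, conceptually: rebuild the email address
-- from the already-masked name columns on the same row.
UPDATE dbo.DM_Customer
SET customer_email = LOWER(customer_firstname) + N'.'
                   + LOWER(customer_lastname) + N'@example.com';

-- Table-to-table synchronization, conceptually: copy the masked names to
-- the table where they are duplicated, joining on the shared key.
UPDATE n
SET n.customer_firstname = c.customer_firstname,
    n.customer_lastname  = c.customer_lastname
FROM dbo.DM_Customer_Notes AS n
JOIN dbo.DM_Customer AS c
    ON c.customer_id = n.customer_id;
```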

The output masking set is as below. It uses:

  • a substitution rule for the columns on DM_Customer,
  • a row-internal synchronization rule for the email address,
  • a search-replace rule to randomly replace all of the characters in the free text field, and
  • a table-to-table synchronization rule to copy the masks across to DM_Customer_Notes.

For the rules I have enabled Bulk Substitution and set the commit frequency to 100,000 to speed up building and committing the statements (hence Simple Recovery mode):
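
That commit frequency is doing real work here. Committing every 100,000 rows keeps each transaction, and therefore the active portion of the log, small. A hand-rolled version of the same batching pattern looks something like this (column name illustrative again; Data Masker batches its own generated statements):

```sql
-- Batched update pattern: touch 100,000 rows per transaction so the log
-- can truncate between commits instead of growing for one giant UPDATE.
DECLARE @batch int = 100000;

WHILE 1 = 1
BEGIN
    UPDATE TOP (@batch) dbo.DM_Customer
    SET customer_region = N'MASKED'
    WHERE customer_region <> N'MASKED';   -- only rows not yet masked

    IF @@ROWCOUNT < @batch BREAK;         -- final partial batch processed
END;
```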

When run, this process takes… well, actually I stopped it. It’s paaaaaaainfully slow. This is probably because I’ve not utilized Data Masker’s worker threads to their best and I’m reliant on a single connection to carry out the process – we’ll be here for a while!

So, I’m going to make some performance improvements! I’m going to split the substitution and row-internal rules into ‘Split Range Manager’ rules (video on how to do this here), so I can utilize concurrent connections to break this big table up. This is what it looks like now (rules expanded):
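
If you’re wondering what the split looks like at the data level: each worker connection gets a contiguous slice of the table’s integer key. Assuming four workers and an integer customer_id (both illustrative), the boundaries would fall out like this:

```sql
-- Divide DM_Customer into 4 roughly equal key ranges, one per connection,
-- so each worker can mask its own slice concurrently.
SELECT
    rng AS worker,
    MIN(customer_id) AS range_start,
    MAX(customer_id) AS range_end,
    COUNT(*)         AS rows_in_range
FROM (
    SELECT customer_id,
           NTILE(4) OVER (ORDER BY customer_id) AS rng
    FROM dbo.DM_Customer
) AS slices
GROUP BY rng
ORDER BY rng;
```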

…and guess what?

[Image: Masking stats report]

Total run time to mask and synchronize all of these rows was 1 minute 26 seconds. I would say that’s pretty good, and the result set looks great too!

[Image: Result set from masking: Contacts]

So in conclusion – how long does data masking take? Well, it still depends.

I didn’t have lots of non-clustered indexes over these tables that I had to take care of first, and I didn’t have very complex operations being performed as command rules or table-to-table rules matching on free-text fields; but I did have an integer field I was able to split the range across, so things went my way… this time.
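
Had those indexes been there, the usual trick is to take the nonclustered ones out of the picture before the run and rebuild them afterwards, so the masking UPDATEs aren’t maintaining them row by row. The index name below is hypothetical:

```sql
-- Disable a nonclustered index before masking (hypothetical name)...
ALTER INDEX IX_DM_Customer_LastName ON dbo.DM_Customer DISABLE;

-- ...run the masking set, then rebuild to bring the index back.
ALTER INDEX IX_DM_Customer_LastName ON dbo.DM_Customer REBUILD;
```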

But it can go fast and you can make improvements, as shown above.

If you have any questions on how to improve the performance of your Data Masker masking sets feel free to let me know! Tweet me or get in touch!

Happy masking!

Oh, and Happy Holidays / Merry Christmas / Happy Tuesday! 🙂
