SQL Data Catalog, Data Masker and your DevOps pipeline: How do I make sure everything is being masked?

“However fast regulation moves, technology moves faster. Especially as far as data is concerned.”
Elizabeth Denham

You’re probably sick of me constantly talking about how the cataloging of columns should be part of the DevOps upstream deployment process. I’ve blogged about it. I’ve even produced a video demonstrating this in action. But one question that this always throws up is:

If we include cataloging in the upstream process, how do we make sure our masking sets are also staying up to date?

The benefits of including the classifications in the upstream pipeline is that nothing ever gets to Production that hasn’t been classified – we constantly have a perfectly up to date idea of the nature of our structured data estate, how data is distributed, what risk is associated with which systems? etc. but one of the biggest wins is that we constantly know which fields need to be masked when we’re pulling copies back into non-Production, constantly.

Add a table? Add columns? We know about them, they’re classified, they’re deployed… so now they need to be masked on our next refresh. But how? Well it all depends on which approach we want to take:

  • Automated
  • Manual

Doesn’t it always boil down to those options? What I mean is that either we have an appetite to completely generate our masking set afresh every single time based on our classifications (Automated) or do we want to ensure that we configure each rule ourselves (Manual). They both have benefits and drawbacks.

Automated

How: We can generate a masking set using the SQL Data Catalog PowerShell each and every time as part of our pipeline. Add a column, tag it in the pipeline and then simply wait for the necessary rulesets to be generated in the pipeline (perhaps as part of your build) for you to run when you bring a copy back. Check out my walkthrough for how to set this up.

Benefits: The process is automatic. Its headless and you don’t need to think about it at all. As long as the classifications are provided (and if you follow the steps from the blog post and video you should be providing them) then you’re always generating rules for every classified column.

Drawbacks: This process can be fragile. If we don’t classify correctly we can end up masking in the wrong way or trying to mask the wrong field (e.g. a Primary Key, Constraint, Identity etc.) which can cause masking failures and then you have to spend time fixing the pipeline/masking set. This also means that the nature of the masking is dependent on your classifications, and the values you will get masked into the columns will be less realistic as a result (i.e. you can’t generate Row-Internal Sync Rules using the integration).

Drawback Mitigation: To avoid the process breaking, be sure to really focus on how you set up your API settings / how you pick which categories and tags are used to generate the masking rules (like I discussed here). This will at least help you make sure you map common data sets into columns (and don’t hit columns that have constraints or keys).

Only columns marked with Static Masking as the Treatment Intent will get a rule created for them
An Information Type is given to every column we intend on masking – these are then mapped to templates in masker to ensure more realistic data

Manual

How: Either rely on developers to check in masking set changes along side their code changes or build in a manual intervention step to your upstream process to ensure that someone opens and updates the schema and rules within the masking set, and then put this back into the pipeline.

Benefits: This results in more accurate, more likely to run sets that generate significantly more reliable and realistic data as an end result. Masked DB copies can be more easily used for anything beyond simple development changes, including analytics etc. You can include any specialist rules you need and apply your own understanding and knowledge of the database.

Drawbacks: This is obviously a manual process – less than ideal. Anything that involves a human can invariably go wrong because we are humans and we make mistakes (which we learn from of course). This also takes significantly more time as part of the process.

Drawback Mitigation: This is a harder one to mitigate as we’re reliant on manual intervention, however the best way to check this could be with another team member checking (maybe as part of a pull request) or you could include an automated PowerShell script to effectively rationalize the columns to be masked from Catalog vs the columns currently in the Data Masker masking set – this would help you understand if all the necessary columns have been updated or if any were missed. I have included an example of this PowerShell below.

### CHANGE THIS ###

$MaskingSet = "yourmaskingset.DMSMaskSet" # Your masking set including the DMSMaskSet file extension
$instance = "yourinstance" # The Instance as it is shown in Data Catalog that hosts the database
$DatabaseName = "yourdatabase" # The DB you want classification info for
$CatalogServer="http://yourmachine:15156" # The lcoation of your catalog server, ending on :15156
$authToken="redacted" # Your Data Catalog Auth token from the Settings page
$tagName = "Static Masking" # The tag you're using to identify which columns need to be masked

### DONT CHANGE THIS ###

Invoke-WebRequest -Uri "$CatalogServer/powershell" -OutFile 'data-catalog.psm1' -Headers @{"Authorization"="Bearer $authToken"}
Import-Module .\data-catalog.psm1 -Force
Connect-SqlDataCatalog -ServerUrl $CatalogServer -AuthToken $authToken 
$ColumnsMarkedForMasking = Get-ClassificationColumn `
    -InstanceName $instance `
    -DatabaseName $DatabaseName | Where-Object {$_.tags.name -eq $tagName} 
$MaskingSetXML = [xml](Get-Content -Path $MaskingSet)
$subrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSubstitution')
$internalrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleRowInternal')
$shufflerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleShuffle')
$searchreplacerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSearchReplace')
$TablesAndColumns = @()

$subrules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnAndDataSetCollection.DMSPickedColumnAndDataSet.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$internalrules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$shufflerules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnCollection.DMSPickedColumn.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$searchreplacerules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$result = $TablesAndColumns | Sort -Unique
$ColumnsNeedingRules = $ColumnsMarkedForMasking | Where-Object {($_.tableName + "." + $_.columnName) -notin $result}

"`nThere are " + $ColumnsMarkedForMasking.count + " columns that require masking for database " + $DatabaseName + "in SQL Data Catalog."
"You are masking " + $result.count + " distinct columns in masking set: " + $MaskingSet
"`nThe columns that do not currently have a mask configured are:`n"

$ColumnsNeedingRules | ForEach-Object {$_.tableName + "." + $_.columnName + "     (" + $_.dataType + ")"}

$next = Read-Host -Prompt "`nWould you like to see the columns currently in your masking set? (Y/N)"
if ($next -in ("Y", "y")) {$result}

This code can also be found on my GitHub here.

The output of running the script – 13 columns in the masking set, 14 columns outstanding to be masked

You can technically even use this same check approach for the automated masking set generation option, to ensure that everything has been tagged correctly.

Configuring Dynamic Data Masking in Azure SQL Database from SQL Data Catalog using PowerShell

Horror is the removal of masks.”
Robert Bloch

I spend a great deal of my time talking about Data Masking, don’t believe me? Checkout here, here, here and even here. I talk about it a LOT, but I’m always talking about Static Data Masking (SDM), which is the process of masking the data at the file level itself so it is irreversibly altered – this is fabulous for non-Production environments like Dev and Test, especially when you pair it with a good cloning technology.

But what about Staging / Production environments?

I often work with teams to implement SDM and one of the fastest routes to successfully generating your masking sets for cloned environments is, of course, SQL Data Catalog (or cataloging solution of your choice) – you’ve already put the effort in to classify your columns and figure out where sensitive information exists within your databases and instances… so doesn’t it make sense that we can just use THAT as a source of truth and generate masking from there?

Note: I actually produced an end-to-end video showing the process of Cataloging, Masking and Cloning in under an hour here: https://www.red-gate.com/hub/university/courses/sql-data-catalog/end-to-end-data-protection-with-sql-data-catalog-and-sql-provision – if you watch/try it let me know how you get on!

Funnily enough though we wouldn’t want to use SDM for Production (and potentially Staging) environments though – as it irreversibly changes the data, it’s just going to completely mess up all of our Prod data. To tackle this then, many people I work with turn to Dynamic Data Masking.

Dynamic Data Masking (DDM)

DDM is a method of masking the data based on your access rights to the data. As far as customers see they have access to their data through our site or application no issues, but if anyone else needs to query that data, or different people need to see different results when querying environments, DDM has been their way to go.

Whilst a lot of people like to pick up on some of the well known downsides of DDM, it’s not like you’re entrusting the entire security of an environment to it alone – there are a ton of measures we can put in place and DDM is just one; like an ex-colleague of mine (someone very wise whom I admired greatly and am still sad to this day I no longer get to work with them) used to say: “It’s about building a defensible position. The more you do the easier it is to prove you’re doing something and the more likely you are to BE protected.

So when a customer asked this week if it was possible to configure Dynamic Data Masking from SQL Data Catalog (because they’d seen the “Treatment Intent” category and the tag that clearly states “Dynamic Data Masking”), just like we’re able to configure Static Data Masking, well now that was a challenge I couldn’t turn down!

The SQL Data Catalog Taxonomy Page – Treatment Intent Category showing Dynamic Data Masking

DDM in Azure SQL DB

Configuring Dynamic Masking in Azure SQL DB is fairly straight forward through the Azure portal, you can go to your Azure SQL DB, click Dynamic Data Masking and it gives you the option to simply pick and save columns to apply Data Masking to, and to whom these rules apply / don’t apply:

DDM in the Azure Portal for the DMDatabase_Dev, with masks configured on customer_firstname and customer_email

However when we potentially have a lot of columns or DBs to configure masks for this is really going to get very old very fast. As with all things, I turned to PowerShell for the answer and fortunately I found it: https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview?#set-up-dynamic-data-masking-for-your-database-using-powershell-cmdlets – basically I can get existing DDM configurations and set new configurations for columns directly from my Azure SQL DB using PowerShell.

Now I’m not expert on DDM, and Redgate Data Masker for SQL Server is not a DDM solution (so I’ve only ever needed to know SDM really) I don’t pretend to be, but it seemed that I had everything I needed to tie Catalog into DDM.

PowerShell time!

I’ve written so much PowerShell to get classifications out of Data Catalog at this point it’s become second nature, but if you’re using the SDC PowerShell module and you need a reference you can view it here: https://documentation.red-gate.com/sql-data-catalog/automation-with-powershell but the standard “stuff” goes:

  • Pull down the PoSh module
  • Connect to catalog where it’s installed using an Auth token
  • Grab out the classifications with Get-ClassificationColumn
  • Shrink this down to just the columns we care about based on the tags

But the Az PowerShell cmdlets were honestly just as easy to use! I was surprised how easy it was to get up and running:

  • Connect to my Azure subscription
  • Get the current list of columns already with DDM masks
  • Remove these from the Catalog list
  • Update the remaining columns to use the default Mask

This was the full code I ended up using:

# This script is intended to be used with Azure SQL Database and Redgate SQL Data Catalog, however you are welcome to adapt and edit as required
# It will pull columns out of azure that are already being masked, and a list of columns that need to be masked with DDM
# It will then rationalise these, and configure Default DDM masks for any columns not already being masked on that Azure SQL DB

#Variables for Azure SQL DB & Catalog
$ResourceGroup = "DMDb"
$ServerName = "dmnonproduction" # Your instance minus .database.windows.net
$instance = "dmnonproduction.database.windows.net" # The instance or logical SQL Server as displayed in SQL Data Catalog
$DatabaseName = "DMDatabase_Dev"
$CatalogServer="http://pse-lt-chrisu:15156" # Your SQL Data Catalog location, leave off the trailing "/"
$authToken="REDACTED" # Your SQL Data Catalog Auth Token
$AzureSub = "Redacted" # Your Sub ID

# Get the SQL Data Catalog PowerShell Module & Connect
Invoke-WebRequest -Uri "$CatalogServer/powershell" -OutFile 'data-catalog.psm1' -Headers @{"Authorization"="Bearer $authToken"}
Import-Module .\data-catalog.psm1 -Force
Connect-SqlDataCatalog -ServerUrl $CatalogServer -AuthToken $authToken 

#Connect to your Azure Subscription
Connect-AzAccount -Subscription $AzureSub

#Get current active DDM Masks from Azure
$DdmMasks = Get-AzSqlDatabaseDataMaskingRule `
    -ResourceGroupName $ResourceGroup `
    -ServerName $ServerName `
    -DatabaseName $DatabaseName
$ListOfDDMColumns = $DdmMasks | ForEach-Object {$_.SchemaName + '.' + $_.TableName + '.' + $_.ColumnName}

#Get columns from Catalog currently marked with "Dynamic Data Masking" as a treatment intent
$CatalogColumns = Get-ClassificationColumn `
    -InstanceName $instance `
    -DatabaseName $DatabaseName | Where-Object {$_.tags.name -eq "Dynamic data masking"} 

#Filter down to a list of columns that need to be masked, that currently aren't configured with DDM
$ColumnsToDDM = $CatalogColumns | Where-Object {($_.SchemaName + '.' + $_.TableName + '.' + $_.ColumnName) -notin $ListOfDDMColumns }


#Set default DDM Masks for identified columns
$ColumnsToDDM | ForEach-Object { `
    New-AzSqlDatabaseDataMaskingRule -ResourceGroupName $ResourceGroup `
                                     -ServerName $ServerName `
                                     -DatabaseName $DatabaseName  `
                                     -SchemaName $_.schemaName `
                                     -TableName $_.tableName `
                                     -ColumnName $_.columnName `
                                     -MaskingFunction "Default"

}

But I have also uploaded it to my GitHub here in case anyone would like to take and adapt as they see fit: https://github.com/ChrisUnwin/PowerShell/blob/master/Demos/Redgate%20Demos/DDMFromCatalog.ps1

And this was the result – here were the two columns I had already being masked:

Customer Firstname and Customer Email with DDM Masks Configured

These were the columns I had marked as Dynamic Data Masking in Data Catalog:

Customer firstname, lastname, street addres and email all marked for DDM in Catalog

and after running the PowerShell it deduced that the delta was street_address and lastname and created the default DDM mask for them in Azure:

All columns now being masked dynamically

Considerations

1 – I have used the default mask in this process, however if you wanted to configure the mask (as per the link to the docs above) to be specific numbers or format you could absolutely do this, simply by modifying the PowerShell to look at the Data Type and then just passing into a different New-AzSqlDatabaseDataMaskingRule for each of those types.

2 – This only applies to Azure SQL DB and does not take into account the considerations when using DDM on say, a 2017 SQL Server Instance running on a VM – however you could use the same approach to pass the columns into some dynamic T-SQL which would in turn run the correct command to add DDM to that/those column(s)

3 – I would still use Static Data Masking (SDM) for non-Production environments, because if anyone bypasses the DDM they will have access to the full data, which we don’t really NEED in less secure non-Prod environments anyway, so Static might well be the way to go!

“But I don’t wanna INSTALL it!”: Data Masker on the fly in Azure DevOps (with an Azure SQL DB)

“There is always a way to go if you look for it.”
Ernest A. Fitzgerald

As many of you know, I really enjoy talking about Data Masking. I fundamentally believe it is an absolutely ESSENTIAL part of Test Data Management and specifically the provisioning of Pre-Production environments. If you hold sensitive PII/PHI/PCI in your Production environments, you have no excuses for porting any of that back into Dev and Test.

I also stand firmly behind the belief (and it is just that, my belief) that masking is much safer than anything you can achieve with encryption or limiting access alone. Static masking, when done correctly, means that even if all other security measures are bypassed, or we accidentally expose data somehow, it doesn’t matter because the PII/PHI has been wiped, but it is still fundamentally useful for development and testing environments.

I’m also a huge proponent of using classification and data masking as part of a DevOps process, and often you’ll find me using Azure DevOps to actually kick off my masking process – but one of the most frequent questions I get is “do I have to install Data Masker somewhere?” and the answer I always have to give is… yes.

When we use SQL Provision to spin up our environments in non-Prod, generally we know they have to be in IaaS VMs or On-Prem, due to the nature of the technology (keep your eyes open on THIS because big changes coming soon *squeals in excitement*) but sometimes we’re working entirely with PaaS DBs and VMs don’t even come into our vocabulary – but we still need them to be masked for non-Prod use.

I have some data in an Azure SQL DB – a copy of DMDatabase (get it from my GitHub):

DM_CUSTOMER table in the DMDatabase

and I would like to get this masked before I create copies of it. Now I have written scripts in the past to make sure that Azure SQL DBs can be masked and then copied into non-Prod environments and you can get those scripts here, and you could easily wrap something like the below INTO a script like that – but much of the work I’ve done on this always involves having Data Masker installed somewhere and invoked on that machine – a VM, my laptop, whatever.

So, how do we avoid having to install Data Masker each time?

Data Masker for SQL Server does not (at time of writing) have publicly available a docker container or method for installing using Choco or something like that – once we do, trust me, I’ll be blogging about it A LOT as I will be very excited. But there is an install file available: https://download.red-gate.com/installers/DataMaskerforSQLServer/ and this might be enough.

The Process

Initially I created a Masking Set locally for the DMDatabase copy in Azure, it was nice and simple just masking a few of the fields on the CUSTOMER table:

It relies on SQL auth for the connection and I’m remembering the credentials, though these could be subbed in later on using the PARFILE (documentation on that here).

Next I put this masking set into a newly created Azure DevOps Git repo, which I cloned down onto my local machine – and then committed and pushed my changes up into Azure DevOps:

Now that this was all in ADO, it was time to set up a pipeline for it – so let’s jump into some YAML! Now, Data Masker currently needs to run on a Windows machine so we’ll set the pool to Windows-Latest:

trigger:
- main

pool:
  vmImage: windows-latest

The next step is to grab the installer – which I know I can easily do with PowerShell. I’m sure you could be more clever about this, but with limited time I chose the most recent version and hard coded that in to a PowerShell Invoke-WebRequest cmdlet:

- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      $source = 'https://download.red-gate.com/installers/DataMaskerforSQLServer/2021-03-15/DataMaskerforSQLServer.exe'
      New-Item -ItemType directory -Path C:\download
      $destination = 'C:\download\DataMaskerforSQLServer.exe'
      Invoke-WebRequest -Uri $source -OutFile $destination
      New-Item -ItemType directory -Path C:\download\DMlogs
  displayName: 'Download Data Masker'

This PowerShell task is going to grab the most recent exe for Data Masker and pull it down into C:\download on the hosted VM we’re using for the pipeline and it’s also going to create a directory for the Data Masker logs as well (if you wanted to extend the YAML at the end to wrap these logs up and publish them as a result of the pipeline masking, then go for it and tweet me to let me know!)

Next we have to extract and install Data Masker from that download, which is fairly easy to do with a cmd call:

- task: CmdLine@2
  inputs:
    script: |
      "C:\download\DataMaskerforSQLServer.exe" products "Data Masker for SQL Server" log c:\download temp c:\download /IAgreeToTheEula RG_LICESE=%RGLICENSE%
  displayName: 'Install DMS Headlessly'

Note I’m using the guidance from this page, making sure to accept the EULA and I’m passing in my Redgate License as a variable, that I have specified for the pipeline and kept secret. This will put Data Masker in the default location in C:\Program Files\… and means we will then be able to call it. I do however need to make sure that this now works. So I saved my pipeline and ran it to see what happened:

Fabulous, that all works nicely. Now to pass the DMSMaskSet file to Data Masker and get it to run – ah but I forgot, I’m going to need a PARFILE as per the cmdline documentation that specifies where the files etc. are for the run. So I create my PARFILE.txt as such:

MASKINGSET=C:\download\DMDB_MaskingTime.DMSMaskSet
LOGFILEDIR=C:\download\DMlogs
DATASETSDIR=C:\Program Files\Red Gate\Data Masker for SQL Server 7\DataSets 
REPORTSDIR=C:\download\DMlogs
INTERIM_REPORTS=false

As you can see, very simple. But to make sure these files are “simply” in the right folder, and because I don’t have time to explore how I could pass an environment variable into the .txt file itself, I’m going to add a quick file copy task to make sure that my masking set and PARFILE both make it into that location:

- task: CopyFiles@2
  inputs:
    SourceFolder: '$(Build.Repository.LocalPath)'
    Contents: '**'
    TargetFolder: 'C:\download'
  displayName: 'Copy files from working directory'

Of course now that we have Data Masker installed on the pipeline VM, the masking set AND the PARFILE… let’s get masking!

- task: CmdLine@2
  inputs:
    script: '"C:\Program Files\Red Gate\Data Masker for SQL Server 7\DataMaskerCmdLine.exe" PARFILE=C:\download\PARFILE.txt'
  displayName: 'Run Data Masker'

I agree the command is now less than impressive given all the prep work, but when you run all of this in it’s entirety…

Big success! But let’s check the data of course and make sure it is as we expect:

It all works, Lynne has been masked to Muna (and the rest has been masked too!), and I didn’t need to have Data Masker installed on a VM in my environment, with an Azure DevOps self hosted agent to run it – I could just do it programmatically. #Winning.

I’ll put the full YAML below however some caveats:

  • You cannot run this pipeline a LOT in a short space of time, I found that out. The RedGate downloads page is not particularly designed for this process, so it should probably be run sparingly otherwise the pipeline times out because the server, by way of balancing, will prevent you from pulling the file too often.
  • You will need to update the PowerShell step reasonably often if you want the DMS most recent installer, or make it more futureproof to grab the latest version – I just didn’t investigate that.
  • Data Masker is dependent on the power of the machine it is running on in MANY ways and Azure DevOps pipeline VMs are not particularly the most powerful beasts in the world – so if you have a lot of masking that needs doing, I would be weary of this method and might stick to an Application Server VM you’ve got hanging around, just in case.
  • This is likely to change a lot in the future so may not be relevant when you’re reading it after a few months – so before implementing anything like this, if it’s been a few months, contact me or Redgate and just confirm that there isn’t some better way of doing this, if I haven’t already blogged it!

Thanks for stopping by and have a great week!

Full YAML:

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      $source = 'https://download.red-gate.com/installers/DataMaskerforSQLServer/2021-03-15/DataMaskerforSQLServer.exe'
      New-Item -ItemType directory -Path C:\download
      $destination = 'C:\download\DataMaskerforSQLServer.exe'
      Invoke-WebRequest -Uri $source -OutFile $destination
      New-Item -ItemType directory -Path C:\download\DMlogs
  displayName: 'Download Data Masker'

- task: CmdLine@2
  inputs:
    script: |
      "C:\download\DataMaskerforSQLServer.exe" products "Data Masker for SQL Server" log c:\download temp c:\download /IAgreeToTheEula RG_LICESE=%RGLICENSE%
  displayName: 'Install DMS Headlessly'

- task: CopyFiles@2
  inputs:
    SourceFolder: '$(Build.Repository.LocalPath)'
    Contents: '**'
    TargetFolder: 'C:\download'
  displayName: 'Copy files from working directory'

- task: CmdLine@2
  inputs:
    script: '"C:\Program Files\Red Gate\Data Masker for SQL Server 7\DataMaskerCmdLine.exe" PARFILE=C:\download\PARFILE.txt'
  displayName: 'Run Data Masker'

5 (Noticeable) Business Benefits of Secure Database Provisioning

“Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives.”
William A. Foster

I know what you’re thinking.

Chris. Your title looks like it was written to be a corporate whitepaper that I see ads for when I’m browsing social media; it should have a CLICK HERE button, a boilerplate photo of a smiling person holding a pen and it should say something like ‘executives hate them, find out their secret here!’

But something has become immediately obvious to me in the last few months, I still speak to people daily who are forced to:

  • Work in shared development models
  • Work on empty (schema-only) / heavily subset databases
  • Work on old, out of date and/or irrelevant data
  • Make decisions without knowing enough about their data or what they hold

When speaking to them though it becomes immediately obvious that the reason there is no dedicated option available for developers is actually not related to the “traditional” problems that one would expect. You would naturally assume that the reason for not refreshing these environments is because of the large amounts of space or time taken to refresh often enough, or even that ‘we simply cannot due to sensitive PII and regulatory concerns‘.

No. In fact it comes down to, as all things do, time and money.

paid make it rain GIF by Thalia de Jong

In the most recent State of Database DevOps report (2020 that is), a whopping 70% of 2000+ respondents replied that they were using a shared development database and this comes with a whole heap of associated problems, like poor code quality, looser controls around sensitive data and defective deployments. Just these figures alone already point to the solution being to spin up copies for developers on demand and it’s not like we can’t do that. There is SO much technology in the world, across almost all database platforms, that will allow us to virtualize, containerize, sanitize… (effectively all of the ‘izes‘) our databases so that we can have full, safe, realistic copies as frequently as we like. So what is stopping us?

From experience, it’s justification*. People going to senior stakeholders and saying “we need this technology” and hearing a cacophony of classic business challenges back: “but is it broken?”, “do we really NEED it?“, “it costs HOW much!?!“, “how much time will it take to implement?” etc… It’s dev and test hygiene, not a sexy major modernization project like using Azure Arc, using Blockchain or creating Artificial Intelligence. Who cares that developers have to share a database? We’ve got bigger Tofish to fry!

*Sometimes, but much less frequently, it’s down to complexity of implementation – but we’ll leave that one aside for now!

As you will know from my last post on why now is the time to adopt better working practices, it’s important for us to highlight the gains that can be made from newer, updated practices, and why now is not the time to be closing our minds off to a better way of life. It’s not going to be easy to sum this up in 5 points, and there are many other benefits to solid database provisioning but these are in my opinion, the ones that will revolutionize the way you develop.

Very important side note for this blog post: there are lots of subjective key practices, processes and tools that can form part of the “database provisioning process” specifically and they will vary wildly by experience, opinion and company – so for the purposes of the below I will be describing the benefits of a process that involves 3 primary components / steps, given these are the three I tackle most often:

  • Data Identification and Classification / Cataloging
  • Data De-Identification i.e. Data Masking
  • Data Provisioning i.e. Real Time Database Cloning / Provisioning

1 – Increase developer happiness / contentedness

Developers are employed to do 1 thing: innovate. It’s even in the name! Developers are on the cutting edge and are focused on providing value to end users as quickly and efficiently as possible, with shortened release cycles, incremental stories and optimized workflows they can produce this innovation. But a big part of the story is the setup.

Even if you’re working to a more agile methodology it is hard to deliver and test changes which are, in development environments, fundamentally destructive and experimental if you are sharing a workspace with multiple colleagues. Writing on shared Word documents can be frustrating at the best of times, so how can developers be expected to produce high-quality, rigorously tested, game-changing code when at any minute another developer can take the environment down, cause it to run slowly, or overwrite those changes with their own? When you cannot produce changes in an isolated, sandbox environment where they can be individually assessed, re-worked and improved then you have no guarantee that it should be promoted.

All of these sound like arguments that are focused around the production of code, but in fact these issues can all have a huge impact on something that is widely under regarded and scrutinized: developer happiness.

Developers are the people who make stuff go, and without them feeling content and valued in their roles, we can’t expect our productivity and product quality to reflect that – so when developers witness the poor management of their code, something they have worked so hard on as it goes sliding down the priority list or gets rolled back or overwritten etc. they don’t feel motivated to continue doing the best that they can do.

With dedicated environments for dev and test, for different branches, pull requests etc. developers can finally work on innovative and exciting projects, and optimize the code that goes out the door to end users.

2 – Develop a common language about data & make better decisions

It’s very hard to speak about things when you use different language to describe the same thing. That much is obvious. In the United Kingdom alone we have many different words for bread rolls. So when someone comes into a sandwich shop in London and asks for a “Stotty”, can you guarantee that the person serving will know exactly what they mean, exactly when they say it?

The Office Reaction GIF

No. There will be a gap where some translation will be required: some “down-time“, if you will. Now imagine taking something as simple as a bread roll and applying it to an enterprise data estate… you’re going to have a very bad time.

As I talked about in my blog posts here (importance of database classification) and here (classifications role in DevOps) before you can really make a fully informed decision about your data, you must know 2 simple things:

  1. What data you hold
  2. Where your data is

I should hurriedly add that I don’t just mean sensitive data now – all data deserves to be classified because whether you’re a full stack developer adding a column to a table you’ve never used before, an auditor trying to carry out a Data Protection Impact Assessment (DPIA) and trying desperately to include the database, or you’re a BI developer setting up some new reports or processes, you’re going to need to know about the data. This is where people have questions, and this is where you shouldn’t have to reply on anecdotal knowledge or being pushed around from one person to another at the company who supposedly “might be able to help“.

Better insights into data leads to better practices, less waiting (waste reduction) and greater insight. When we then act on this insight we move faster and deliver greater value in our pipelines.

Have you picked up on the trend yet? How all of these are going to end? Well don’t spoil the ending for those who haven’t, they’ll have to wait fort he conclusion!

3 – Move faster and better enable the DevOps pipeline

It’s apt that I’m listening to an amazing EDM remix of the Green Hill Zone from Sonic when writing this section, but isn’t this just what we need as a business? We want to be able to move faster, or to put it in more ‘agile’ terms, we need to be able to pivot and adapt with only a moments notice. Until now, the database has been a monolithic and difficult to steer behemoth, and it shows in our processes.

Yank Tug Of War GIF by BEERLAND

A tangible example of what I mean when I say “move faster”, is branching. It’s fairly commonplace now for a developer to be able to clone a repository and checkout a specific branch, create new branches etc. without fear of switching between those branches and what it might entail. On a dev environment, especially when one is working database-first with your changes (it does make sense to know how the changes will impact the database first – that’s all I’m saying) it is, without a reasonable process in place, exceptionally difficult to easily switch between branches and keep work separate.

This often forces developers to stick to one environment when changes are all made in tandem and can play havoc when it comes to capturing those changes in the right place – a manual state-based comparison of a dev database with multiple branches of work on it to a target upstream could be disastrous.

This is why taking advantage of something like database virtualization, allowing you to spin up copies of databases in seconds, could be the answer. You can automate the provisioning of environments as githooks, during Pull Request automation or as release candidates and the experience will be exactly the same across the board – boom *code base*, fresh and ready to go. When developers can move fast, value comes through a whole lot faster.

4 – Minimize space constraints on new copies, on premise or in the cloud

Space is always a big player in these conversations, and for some it’s enough to boil it down to “well just how much space can we save??” and that’s enough to put a dollar value on the ROI, and people storm ahead with a solution (that’s not always right for them).

But space is a very real problem, much as we (as technology professionals) like to believe that in these modern times of cloud-native solutions, easily scaled serverless-compute VMs and Big Data Clusters, we know there are still a LOT of people out there firefighting legacy, necessary technology and wrestling with what they CAN get out of backups or their SAN tech.

Even using cloud providers costs money, data egress and ingress costs $, BLOB storage costs $, additional security measures cost $. So it’s really not ideal when our databases, for historical reasons or by virtue of the sheer AMOUNT of data we hold and process, are 5, 10, 50, 100TB+, because we’re going to be struggling with this Dev/Test issue still for years to come.

As before with point 3, database virtualization has come of age and has now we have a lot of different solutions from containerization through DBaaS that can aid us in minimizing the amount of space that we ACTUALLY require, meaning we have less money that we need to pour into maintaining large, unwieldy Dev/Test environments or paying a large bill for the privilege of doing so in the cloud (and when developers will be using their dev machines anyway it just makes sense to see what we can do to leverage this existing hardware).

Whilst this one doesn’t directly add specific value to the end of the pipeline, or speed up this delivery, it can help reduce overhead costs associated with the infrastructure needed when providing this value.

5 – Work on realistic data without worrying about data breaches

This is probably one of the most obvious reasons given that I tend to blog about data regulations and compliance ALL THE TIME but I feel like I need to keep saying this.

If you remove all of the data from development and test database copies, this will not help with development and developers will have nothing meaningful to go on, nor any testing that isn’t limited to pre-defined values.

If you leave all of the data in development and test database copies, all you’re doing is duplicating your attack surface area and creating a lot of potential risks for that data to be surfaced where it shouldn’t be – on the internet, in screenshots, emails and of course, hacked.

So there needs to be a happy medium where we can have both the useful data that gives us insight and intelligence of a full data set, the business logic, trends, demographics etc. that we need during testing or analytics – but it should also be sanitized so that data subjects contained therein cannot be re-identified. Static masking, applied to lower environments allows us to retain the data with none of the data.

The Next Generation Data GIF by Star Trek

Protective measures can be built into the DevOps process from the very beginning as you’ve already seen right here on my blog; so as long as it is a part of the process, and we have multiple controls (or guard rails) that allow us to operate safely and quickly without fearing that same speed will cause us to release any sensitive information, allowing us to focus on one thing, value.

Conclusion

As you’ve seen above, it all comes down to time and money but there are many ways to save and speed up within a DevOps process by means of a good, solid database provisioning process. Whilst none of these reasons comes with a fixed ROI (unless you have ALL of your pre-prod database storage costs to hand) they contribute to something far better than that:

The ease of delivering value.

In a world where we can be concerned about everything, and where it’s hard to keep up with emerging technologies – it makes sense to start pruning away blockers to the process, the problems that are stopping us from delivering value faster – THAT is the theme and point of this blog post; our end users. We’re already delivering excellent value to them, we trust our developers and teams, but what’s stopping them from moving faster with database changes? Adopting a good provisioning process will mean you start to notice all of the above become true of your database development lifecycle.

Azure DevOps Masking a.k.a “point, no click”

“[My] kids haven’t responded to my GDPR requests so I don’t think I’m legally allowed to tell them when dinner’s on the table.”
@mrdaveturner

Ah masking. You would have thought I’d be sick of it by now, no? No, fortunately now, more so than ever, I find myself answering question after question and tackling use-case after use-case. So when I was asked this week:

“Chris, is there a way for us to call Data Masker for SQL Server directly from Azure DevOps?”

I thought to myself, well that sounds easy enough… and it was! I know what you’re thinking, c’mon Chris, surely there is more to it? But no, it’s actually pretty straight forward!

I pointed them at the PowerShell module and cmdlets for SQL Provision and the Azure DevOps plugin to automate all of their Provisioning and Masking process, thinking all the while “pffft, they could have made this harder!” and then…

“No sorry Chris, is there a way for us to call JUST Data Masker for SQL Server directly from Azure DevOps?”

Ah! Now that’s an interesting one!

#1 Figure out where you want Data Masking to run in your process

This empty Azure deployment stage looks good enough for now! If you wanted to chain other processes either side of it, that’s cool too! Maybe you have your own provisioning process in place and you want to point Data Masker at it to sanitize it? Makes sense to me! For now I’m going to stick with a single agent job for simplicity.

#2 Figure out what is actually going to run Data Masker

Data Masker is a client install and as such will need to be installed on a *gasp* actual machine!

No but seriously, any server you have lying around, physical or VM will do the trick as long as it meets these requirements. Now this Server/VM will need to have an Azure DevOps agent on it already, which of course is the ideal candidate for being the “thing” that calls Data Masker – this could be the Staging/Non-Functional/Pre-Prod environment also of course, so you could copy down PROD and then immediately invoke masking.

#3 Call the command line from Azure DevOps

In your pipeline steps you can specify the calling of an executable on the machine where the agent resides. Fortunately Data Masker has a wonderful command line available that you can call, you can read all about it here: https://documentation.red-gate.com/dms/data-masker-help/general-topics/about-command-line-automation

The PARFILE you could of course dynamically replace with variables so that it only calls the relevant parameter file for that particular database as well, a nice benefit!

My PARFILE just simply looked like this:

It was calling a local Data Masker set “AzureFun” – now the thing to bear in mind is that Data Masker will run with the Windows authentication credentials that are being run as by the Azure DevOps agent, unless you specify otherwise. In this case because the Azure DevOps agent has the correct permissions to update the databases on this instance anyway I’m fine to use Windows Authentication:

Conclusion

It’s very easy to simply call the command line of Data Masker for SQL Server directly from Azure DevOps, does this same approach work from other CI/CD tools? If they can call executables on the target server then absolutely! So it’s very easily included in the process – you just have to think about where Data Masker is installed and what credentials you’re using for it!

Bonus Point – what about if it’s all Azure SQL Database?

You had to do it didn’t you, you had to say it!

“But Chris, now we know we can call this all from Azure DevOps, what if we wanted to mask and copy Azure SQL Databases into Dev/Test etc.?”

Well actually the good thing is, it’s also pretty similar! When you’re connecting Data Masker to an Azure SQL DB you only need to specify this in the connections in the controller. Again, authentication will likely have to be SQL Auth at this point, and you need to be in Cloud mode, and I’d recommend setting the connection timeout to 10s rather than the standard 5s, but it can still be called as normal from the PARFILE:

So the Data Masker element is reasonably straight forward – that’s the good news. But the thing you REALLY need to stop and think about is:

Where are our Dev and Test copies going to BE?

Option #1: If they’re going to be on VMs or local dev and test servers / developer machines then you could follow a similar approach to one I laid out in this blog post for Redgate in which you create a BACPAC file and split it out on premise before importing it and then provisioning from there. And you could use this code in my Github to achieve something very similar. Caveat: I am no PowerShell guru, who do you think I am? Rob Sewell? Chrissy LeMaire? No. Sadly not. So you can build your own logic around my code though, have at it, I don’t mind! ^_^

Option #2: Keeping everything in Azure. You can copy databases around in Azure and it seems to work pretty well! So I wrote this PowerShell (also in my GitHub for y’all) to effectively copy a PROD DB into the same resource group, mask it and then copy it across to a Dev/Test resource group, dropping the temp copy so as not to incur lots of extra Azure costs (this is just one of the methods I’ve seen people use, again it’s up to you!) – again, see the caveat in option #1 above for my statement on PowerShell! The good thing is, you can use the ‘&’ simply from PowerShell to call Data Masker’s command line.

Either of these options can be run from Azure DevOps also as part of your provisioning or working processes, but instead of including a call to the command line, you can run a fun PowerShell script instead:

Second Conclusion *sigh*

There are lots of ways to get what you need into Dev and Test, but these copies should be masked if they contain personal, identifying information. There are some methods above but there are plenty of others out there on the internet and if you’re not sure about getting started with data masking; try my post here – happy masking!

“Where do I even begin with data masking?” Getting started in 3 steps.

“I like to begin where winds shake the first branch.”
Odysseus Elytis

I’ve already covered data masking a little bit with a few of my first posts on this blog, and for many it will not be new to them. Performance tweaks and determinism are just par for the course in masking data, approaches to solving a well defined problem. But for many, the challenge is not how they mask their data. You’re free to write scripts, PowerShell or use a COTS software solution like Data Masker and the end result is going to be a set of data which you can securely work on; knowing that if you are going to suffer a breach of that data, it will be useless to those who get their hands on it.

The biggest problem here though, is getting started.

That phrase in itself suggests what kind of problem data masking is – anything in this life should be easy to at least get started at, with it becoming more and more complex the more you get into it. One can write a Hello World program quite simply in any major programming language but to start creating rich, functional graphical software, you’re just going to have to learn, write and test (repeat infinitely).

But data masking is like playing ‘capture the flag’ without an opponent team. Capturing the flag is a straight-forward enough objective, but if you don’t know where the flag is, how big it is, what it looks like, how many flags their are etc. then your job is going to be impossible if you just jump straight into it. Like all things in this life, data masking takes careful thought and planning to understand the scale you’re dealing with – and you need to make sure you don’t close your scope too soon by focusing on one system or set of data.

So here are the 3 broad steps I would recommend when starting to protect data in non-production environments:

1 – Catalog your structured data

I cannot emphasize this point enough. Before you do anything, catalog the heck out of your estate. Use anything you like and whilst i’d recommend approaches such as this or this, even an excel sheet is a start. You must know what you hold, everywhere, before you can begin to make decisions about your data. Perhaps the goal of this was to mask the data in pre-Production copies of your CRM system? Well, data doesn’t just exist there in isolation… or maybe it does! You don’t know until you have a record of everything. Every column, every table, every database, every server. There should be a record, tagged with a reasonable value to indicate at the very least the following 4 things:

  1. What system it is used in
  2. Who is in charge of the stewardship of it (i.e. who should be keeping it up to date / ensuring compliance)
  3. How sensitive it is
  4. What kind of data it is

Obviously other values can be specified, like your treatment intent (masking, encryption, both etc.) or your retention policy for specific fields, even something like identifying how the data moves in-out-and-around your estate (data lineage) but at the very least it needs to highlight where the hot-spots of sensitive data exist and what is most at risk. From this you can derive the insight needed to start masking data.

Notice this step is before any specific masking steps, that is because we should have a company wide resource to reference our structured data regardless.

2 – Scope your attack, identify the ‘hit points’

Once you have a record of everything you can begin with the systems that are going to be the most at risk / which databases you should start with. As part of the new project you’re working on, you may need to mask multiple tables or even databases but the size and scale of this process is currently stopping you proceeding.

Identify the key tables or even databases you will need to start with – it is so rare to come across databases where sensitive information is perfectly evenly spread across every table (I’m sure there are some, of course, but many will have concentrations of data) – these ‘usual suspects’ will be in contact, account, address, payment info tables etc. and where we have multiple tables with duplicated information, we want to select the biggest behemoths for this task. The reason for this is that instead of trying to mask all tables equally, you’ll want to mask the biggest sources of truth and then “fan out” your masking, linking them back to the masked tables. Not only is this process quicker but it also allows our masking to be consistent and believable. Draw out diagrams top we can keep the actual masking to a minimum (i.e. just these hit points) and then identify where you can fan this out to. Use these almost as you would stories in development.

This approach keeps the effort of writing scripts of configuring masking rules to a minimum but keeps the impact high – preventing us from wasting time trying to mask a table which, for instance, may only hold Personally Identifiable Information by correlation with other data. Ultimately, once you have this (now hopefully smaller, more focused) list, it’ll be easier to define how you want the masked data to appear.

3 – Write. Test. Validate. Record.

Once you have an understanding of your estate and an understanding of what you’re going to be masking, it’s time to start getting some indicative masked values mapped across (names for names, dates for date of birth etc.) but this is not a big bang approach.

Just like software development this process is best done iteratively. If you try to write one masking script or rule set you will gain 2 things:

  1. A thing that only you know how it works
  2. A thing that only you can maintain

Start with a copy of the database. Restore it to somewhere locked down that only a very select few have access to. Build the first part of your script to reflect the first of the ‘stories’ you identified in 2). i.e. “Mask the names to be something believable”. Does this work?

No, why not?
Yes, perfect.

You should even ask for feedback on the masked data from developers, they will be working with it so it makes sense to ensure you collaborate with them on what they would like to see.

So it’s time to record what you’ve done, just as anyone would do with any other software process. If only there was some way for us to do this… oh wait!

PUT IT IN SOURCE CONTROL.

It doesn’t matter if you’ve written a script or created part of a Data Masker masking set, it should now be put into source control, properly commented, and stored so that it is easily accessed, updated and implemented. Grant Fritchey actually wrote a great article about doing this here!

Now build up those rules, tackle your stories and gradually try to create enough to mask each of the ‘hit points’ and then fan them out to the secondary tables. Once you have kept this effort minimal but high impact, you’ll be able to try this in earnest. Once you have tackled your high risk targets then you can add stories to add specific test cases or oddities required in Dev and Test.

The point is to start somewhere, like I said, getting started is the hard part, but once you know what you have, where it is and how it’s structured, how you actually mask the data is a breeze.

“How long does data masking take?” Part 1

“Exploring the unknown requires tolerating uncertainty.”
Brian Greene

Another in the data masking series and I thought I’d go with an easy one to answer. It’s right up there alongside the other easy questions we all ponder like:

“What is the meaning of life?”
“How do I build a fusion reactor using common household objects?”

and the classic “Can Vegans eat Animal Crackers™?”

The answer to any of the above questions is, of course, it depends.

If you’re content to ‘live life to the fullest, travel and be happy’, you happen to have spare protium and boron-11 just around the house and you’re content to bypass the ethical conundrum that although the crackers look like animals, they do not actually contain any animal products then the above are theoretically answerable, but data masking… whooo that’s a doozy!

There’s no easy way to benchmark this process, honestly. It is something that will be subject to many things you may wish to think about prior to the deed:

  • How many columns are on the tables you want to mask?
  • How many rows are in the tables you want to mask? (Which are the biggest tables?)
  • How many indexes are on the tables? Will these get in the way of a static masking process?
  • Is data duplicated across multiple tables and does it need to be synchronized to maintain referential integrity?
  • How powerful is the machine you’re running the process on?
  • WHERE is the Database you’re running this process on? (Same machine, Network, PaaS…)
  • Will anyone be trying to connect to and running queries on against the Database being masked?
  • Do you need to mask across Databases for consistency?

All of these things will affect the process, both in terms of how you configure it and how long it takes.

I will endeavor though, to give SOME idea. At the very least it’s a good starting point and hopefully if you decide to run any testing of your own you will see similar, or even better performance. For this test I will use Data Masker for SQL Server but in future blogs I’m hoping to test out a few other methods as well, so stay tuned!

For this test, I will be using the following setup – note this isn’t what I would expect a real setup to be, I would expect the staging server being used for masking (and potentially Image creation in SQL Clone) would be a teensy bit more powerful!:

  • Microsoft SQL Server Developer 2016 (64-bit)
  • Microsoft Windows NT 6.3 Server hosted in AWS EC2 (c5.2xlarge)
  • 4 Processors (vCPU), 16 GB Memory
  • Redgate Data Masker version 6.3.20.5034 installed
  • DMDatabase (tables below in Data Masker) – this is my demonstration database in simple recovery mode with:
    – 1,000,000 rows on the DM_Customer table
    – 50,000 rows on the DM_Customer_Notes table
Data Masker Tables Tab showing row counts for large tables

For the purposes of this test I will mask: firstname, lastname, company name, street address, region, country, telephone number and zip code on the DM_Customer table and the notes field on DM_Customer_Notes (a free text field). Once I have completed this, I will then create an email address on DM_Customer using the masked first and last name values, and then finally I will synchronize the masked names across to the DM_Customer_Notes table, where they are duplicated.

The output masking set is as below. It uses a substitution rule for the columns on DM_Customer, a row-internal synchronization rule for the email address, a search-replace rule to randomly replace all of the characters in the free text field and then finally a table to table synchronization rule to copy the masks across to DM_Customer_Notes – for the rules I have enabled Bulk Substitution and set the commit frequency to 100,000 to speed up building and committing the statements (hence Simple Recovery mode):

When run this process takes… well actually I stopped it. It’s paaaaaaainfully slow. This is probably because I’ve not utilized Data Masker’s worker threads to their best and I’m reliant on a single connection to carry out the process – we’ll be here for a while!

So, I’m going to make some performance improvements! I’m going to split the substitution and row-internal rules into ‘Split Range Manager’ rules (video on how to do this here) – This is what it looks like now (rules expanded) so i can utilize concurrent connections to break this big table up:

…and guess what?

Masking stats report

Total run time to mask and synchronize all of these rows was 1 minute 26 seconds. I would say that’s pretty good, and the results set looks great too!

Results set from masking: Contacts

So in conclusion – how long does data masking take? Well, it still depends.

I didn’t have lots of non-clustered indexes over these tables that I had to take care of first, I didn’t have very complex operations being performed as command rules or table to table rules matching on free-text fields, but I did have an integer field I was able to split the range across, so things went my way… this time.

But it can go fast and you can make improvements, as shown above.

If you have any questions on how to improve the performance of your Data Masker masking sets feel free to let me know! Tweet me or get in touch!

Happy masking!

Oh, and Happy Holidays / Merry Christmas / Happy Tuesday! 🙂

I got to watch a team work fast and it was awesome!

“Unless someone like you cares a whole awful lot, nothing is going to get better. It’s not.”
Dr Seuss

I care a great deal about anything I work on, as I mentioned in my previous post here (ironically about not working for once), so it goes without saying that I am super invested in any customer I help to get up and running with any of the Redgate solutions, and none more so than Data Masker.

Ever since it became part of the Redgate family Data Masker has been an integral part of my workday – there aren’t many days where I don’t interact with the tool or it’s concepts in some way and when things go wrong, the tool breaks, something doesn’t work like it should, well not only is it less than ideal for me (showing me up) but it’s not delivering value to the customer.

Now that. I hate with a passion.

However, I’m lucky enough that once that happens, that is not the end of it – we don’t pack everything up and say “well… sorry all, that’s your lot.” No – I get to speak directly to our fantastic support team and the equally as fabulous and helpful development team directly behind the tool, and guess what? They care too. Immensely.

I can think of 2 key examples of this team working in the most incredible way, you wouldn’t even believe (well maybe you would), but it goes to show you what is possible, especially when you break down the silos in your organisation. I never became just “a ticket from a sales engineer”, and this is how they helped me fix 2 problems:

1 – UTF-8 encoding of strings for substitution rules

I was working closely with one of the Business Development folk (little side nod there to Kendra for saying folk so often I’ve started saying it) in Redgate’s sales team who were working with a potential customer in a country where Arabic is the primary language. As such, you would expect them to want to use Data Masker to mask Arabic names like اَمير‎ (“Amir” in English) into data sets, instead of something like “Frank”, which just doesn’t have the same ring to it.

It turns out that in the port across of Data Masker from it’s older v5.5 to the swanky new v6.0 (yes this was a little while back) the ability to change the encoding of strings from user defined data sets had been broken, which meant that the values from Data Masker weren’t being inserted correctly in the table, rendering all of their Arabic sets useless. This was a huge blocker to their trial, which was under time constraints anyway.

I reported this to the Data Masker team on the 7th February 2019 at 12:53pm, created a support ticket for reference at 2.03pm and had personally spoken to them by 3.00pm. The lead developer, support rep, product designer and myself quickly met up to discuss it and agreed that as this was a bug, broken functionality that should exist (and which could block not just this customer but any customer requiring it for other language sets) that they would down tools and work on a fix immediately.

By 9.00am the following day a fix went out the door. Built, tested, deployed. Who did they involve in the testing? Me, initially. Then they waited on feedback from the potential customer, who also confirmed it worked after upgrading.

Wow. That’s what I call fast!

2 – Time-out tennis

More recently (think November 2019) I was working with a customer of ours who I had built up a great relationship with – they were super friendly, super responsive and all round great to work with. Unfortunately as we were getting their masking sets set-up (pun definitely intended) we started encountering time out issues when waiting for masking stats to return.

This was a little irksome, as it was a slight dent in my relationship and credibility and was slowing them down, as it was causing sets to not complete at all.

The problem was though, it wasn’t such an easy fix as with Number 1 above, as it wasn’t exactly clear what was causing the timeouts and I wasn’t really sure what were the best places to check! Fortunately the “Masketeers” as they are more commonly known around Redgate towers, did have an idea of where to look. A nominated member of the development team (and to him I will be forever grateful) almost became a little subdivision of the team – it didn’t require their full might, just someone who knew even more intimately what was happening than me!

Through a few ‘back and forth’s with the customer, experimenting with timeouts and making some tweaks we were able to establish what was going wrong, and ultimately provide a fix. This work became a new branch which was merged into the main base after testing once again and was released the very next day. Finally, the customer let us know it was all working again and sets were completing as they should.

Conclusion?

Sometimes you have the pleasure of working with some unsung heroes where you work, I do it on a regular basis – from the facilities and cleaning team here in the building who do the most incredible work to look after us, to the Sales management team who are constantly looking at ways to make us the best possible company to deal with – Redgate is definitely a place where people can do the best work of their lives.

But on these occasions, I got to witness something special. Cross-functional collaboration. Communication. Empathy. Passion.

And just getting the work done. When it’s about delivering value to customers, feedback, development and testing are everybody’s job, and I’m lucky enough to work with people who put that theory into practice.

Deterministic data masking – the who, who and who? (and how?)

“Security is always excessive until it’s not enough.”
– Robbie Sinclair

You may not already know this about me, but I kinda like data masking.
Scratch that, I LOVE data masking.

Increasingly both around Redgate and in general I seem to be getting a bit of a reputation as “the data masking guy” but for good reason – to include yet another quote, from Joe Kaeser this time: “Data is the oil, some say the gold, of the 21st century…”, more and more I hear stories about people leaving their oil/gold out for everyone to see, opening up the widest attack surface area by doing things like copying backups down into non-Production environments or exposing test systems to the internet – the list goes on.

This means that people turn to all of the protective methods they have available to them: encryption (TDE, row and column level etc.), static and dynamic data masking, access control… and many combinations of them.

One of the big points I always have to cover when it comes to static data masking though, is something called “deterministic masking”, so let’s start with 2 definitions of my own to make sure we’re on the same page:

Static data masking is the process of de-identifying sensitive data-at-rest within the tables of your Database. It is typically used to provide realistic, Production-like data into non-Production environments like Dev and Test, and even sets that are given to 3rd parties. This relies on retaining non-sensitive business specific fields within rows and taking anything considered PII (Personally Identifiable Information) or PHI (Protected Health Information) and either scrambling or replacing it with similar but ultimately false data.

Deterministic data masking is the process of masking data with values in a repeatable way, such that it will give the same value when masked in any and all future runs on any value that matches and will create a new record for values which have not been previously masked. An example of this would be if you were to mask “Chris Unwin” to “Brad Pitt”, it should appear as “Brad Pitt” not only in our (for example) dbo.Contacts table but also all associated tables (regardless of PKFK relationships at the DB level) and every single run should provide the same output. This is useful for building up familiarity with the data and utilizing for future test runs.

Now. I should caveat this blog post (you’ll find I’m always caveating my posts) with the fact that deterministic masking does have it’s benefits but the very idea that one thing should always become another, in my eyes, is inherently less secure than something that always gets de-identified to a different value. As such, I will always recommend that where possible, masking runs should produce differing values. Deterministic masking is also compute heavy because instead of simply randomizing in values, it will have to check up front the value to be replaced and then replace it with the corresponding value, another potential downside if speed is a key driver in the process you’re trying to put in place.

Most masking tools either support deterministic masking directly (*cough* dbatools.io *cough*) or require a little bit of configuration to get started, but it is a workflow that should be catered for, for those who need it. So here is a quick getting started guide for deterministic masking, if you’re writing your own scripts or (as I will in this example) you’re using Data Masker for SQL Server from Redgate.

Step 0 – Figure out where you’re going to do the masking

There are lots of different ways to move data around and get things masked before exposing it to development and testing teams (or handing it to 3rd parties). In this example, I’m going to assume that the only thing I have available to me is a backup and restore process for moving data around, nothing fancy like SQL Clone to help.

So for this I will assume I have a staging instance somewhere that I can use as part of this process, lets call it WIN2019 and the process I’m going to follow is:

  • Restore a full .bak file to WIN2019
  • Carry out the data masking process on this instance
  • Backup the masked DB
  • Move the new .bak into lower environments (restore / make available to Devs)

Step 1 – Map it for multiple future runs

The first problem you have to contend with is needing to maintain a record of how we want to mask things. If we always want to mask the credit card number 3422-567157-24797 to be 3488-904546-46471 then we need to have a place this is stored. The question to ask ourselves here though is WHAT needs to be recorded. There is a huge difference between:

CreditCardBeforeCreditCardAfter
3422-567157-24797 3488-904546-46471

And

CustomerIDCreditCardAfter
1000000001 3488-904546-46471

The latter is obviously far preferable because it does not contain any sensitive PII – it is purely the masked value and a non-sensitive CustomerID which only really makes sense within the company or is a system identifier.

So we should get a mapping location set up on WIN2019, I don’t want to make my tables too wide so I’m going to keep this fairly small and atomic – we’ll create a Database on the server with a mapping table for the credit cards:

CreditCardMap table in SSMS - CustomerID as INT (PK) and CreditCard as nvarchar(60)

This is going to be the basis for our repeated masking. The reason for having this as a separate DB/Table though?

1) The mapping should persist – if it exists in the same DB then we will just overwrite it every time, rendering the mapping useless.
2) Devs/Testers don’t need the mapping – just the end result.

Step 2 – Set up the masking to cover the tables, regardless if there is a mapped value

One of the most important phrases in the GDPR is “Data Protection by Design and Default“, and it’s one of my favorites. In this context I am going to interpret this in a very specific way, and that is: “we must mask everything, before trying to map it back to a value that exists, just in case the link to the MaskingMapper DB were to fail for any reason.

I first restore a copy of the Database I’m going to mask (the DMDatabase) and then setup a Data Masker substitution rule to process the DM_Customer table, de-identifying the credit card numbers:

Data Masker substitution rule to mask Credit Card Numbers with invalid AMEX CC numbers, also (fake) customer credit card Nos. displayed in SSMS

Step 3 – Copy the distinct values across into the mapping table

This step is going to be as simple as writing a single tSQL statement to copy the values across – in Data Masker I will wrap this into a Command Rule and make it dependent on the previous substitution rule:

INSERT INTO MaskingMapper.dbo.CreditCardMap
(
    CustomerID
  , CreditCard
)
SELECT DISTINCT
       customer_id
     , customer_credit_card_number
FROM dbo.DM_CUSTOMER
WHERE (
          customer_credit_card_number <> ''
          AND customer_credit_card_number IS NOT NULL
		  AND customer_id NOT IN (SELECT customer_id FROM MaskingMapper.dbo.CreditCardMap)
      );

Step 4 – Sync everything back together

Finally – we need to bring any information back from the table if it had values written to it in previous runs. In tSQL we could write an UPDATE with an appropriate WHERE clause but I’m going to use an additional controller and Table-to-Table Sync rule in Data Masker to handle this:

Rules in Data Masker - Substitution to mask data, Command rule to update the mapping table and a Table to Table rule to sync back into the table

Result

If we now run this we will have achieved deterministic masking, because we have the following before and afters – first for the DM_Customer table:

DM_Customer Credit cards before
DM_Customer credit cards after masking

and for the CreditCardMap table:

Mapping table prior to masking
mapping table after masking

The mapping table now has 77 rows and if we repeat the masking step by step without changing anything we can see that the credit card numbers change in the first instance, but then synchronize back to the values that should persist, the images below represent just running the first two steps in isolation (i.e. masking everything regardless – left) and then the synchronization job restoring the predetermined values (right) and the mapping table still has 77 rows.

Now if in the next run one of the NULL/blank fields has a real credit card number, or we add any additional customer IDs (i.e. with a more recent backup with fresh data) they can be masked, accounted for and persisted between each run.

Conclusion

Deterministic masking is hard, but it is possible. You can use a number of methods to achieve it, such as the above, but the first question you need to ask yourself (after “do I feel lucky”) is:

“Do I NEED deterministically masked data, or is it a nice to have?”

9 times out of 10, I’m pretty sure the answer will be that it is not essential, and therefore you should focus on making sure the masking of the data is random, static and fast. Adding compute to this process will only slow it down and at the end of the day, we just need to make sure our customers data is protected.