When the unused become used: Using SQL Data Catalog to spot the pitfalls of a data classification process

“The surprise is that you continue to be surprised.”
Jill A. Davis

But what surprise am I alluding to? Of course it is the surprise that “all of a sudden” you have sensitive data in tables where there ought not be!

What I mean is; we have a fantastic process in place: we have completely classified our entire environments, developers move changes to classification up-stream just like schema changes and we have an automatic process to check and update classifications as part of our pipeline.

But as part of this process we “descoped” everything that was non-sensitive, including (crucially) empty tables. Now empty tables are a common occurrence and most people I speak to have a bunch of them, for a multitude of reasons; they’re using an ERP/CRM and aren’t using some of the modules, some of them are older and are gradually being phased out etc.

What happens over time however, is that some tables which we originally weren’t using suddenly start to populate with data, and others which were in-scope now suddenly aren’t. But because this isn’t technically a schema change and we can’t necessarily rely on the developers to know when this change will happen – how do we catch it?

The easiest thing to do in this instance is an automated check – write once, run many times. As we’ve seen from the many, many blog posts before this the answer is pretty straight forward – SQL Data Catalog’s PowerShell module once again saves the day!

(You can find the full documentation on the PowerShell module and a cmdlet reference here in the documentation.)

The Process

The most obvious thing to do is to re-scan our DB so Catalog is up to date and then fetch all of our columns first into a collection:

# Connect to SQL Data Catalog
Connect-SqlDataCatalog -AuthToken $authToken -ServerUrl $serverUrl

# Quickly refresh the scan of the Instance to get accurate row counts
Start-ClassificationDatabaseScan -FullyQualifiedInstanceName $instanceName -DatabaseName $databaseName | Wait-SqlDataCatalogOperation

# Get all columns into a collection 
$allColumns = Get-ClassificationColumn -instanceName $instanceName -databaseName $databaseName

Each of the columns returned by Data Catalog here has a number of properties that we can use:

The properties of each column object returned by the SQL Data Catalog PowerShell

All of these properties can be useful for something (and have been in the past) – in this case though the properties we care about are the id, the tags collection (name property) and the tableRowCount. The reason being that id uniquely identifies our columns so can be used to easily sort and filter on without having to match on text, the tags collection means we can search for any columns that do or do not have the Out of Scope – Unused tag and the tableRowCount lets us know if the table is or is not populated with data.

Its then fairly easy to filter these down:

# Collection for columns that are on empty tables

$emptyTableColumns = $allColumns | Where-Object { $_.tableRowCount -eq 0 }

# Collection for columns that are tagged as empty

$ColumnstaggedAsEmpty = $allColumns | Where-Object { $_.tags.name -eq $unusedTag }

# Collections for columns both tagged as empty but not, and empty columns not tagged as such

$ColumnsNowInUse = $ColumnstaggedAsEmpty | Where-Object { $_.id -notin $emptyTableColumns.id}

$ColumnsMissingClassification = $emptyTableColumns | Where-Object { $_.id -notin $ColumnstaggedAsEmpty.id}

You can easily visualize these in Data Catalog itself too:

Columns that are on empty tables, regardless of tag
Columns that are tagged as Out of Scope Unused regardless of if they have data or not

We can already see a difference in the lists – if we run the entire PowerShell however (which you can find right here in my GitHub, and I’ll include it at the end) then we actually find those differences out:

The PowerShell result: 2 classified as empty columns now with data, 3 columns not classified as out of scope, now on empty tables

So we can correct these by classifying the 3 now empty columns as now Out Of Scope – Unused, and we can change the 2 now-populated columns to be In Scope:

Suppliers is now in use, and corrected to be In-Scope
The missing Out Of Scope – Unused tags are corrected on the empty columns

…and now when we run the PowerShell again we should find everything is rectified:

Everything has been corrected, now no columns are incorrectly labelled or missing labels

Ongoing

Now we have the script we can run to identify these issues, you could implement it on the back-end of the DevOps process described by the blog post or video linked at the start of this blog post – and this would allow you to constantly be sure that your view of the Production database is accurate and up to date.

The key to a good classification process as part of on-going governance and an up-to-date view of your structured data estate is minimizing the manual effort that has to go into maintaining this process, which with the help from the Data Catalog PowerShell module – we can do easily (if we wanted a process that would go out of date after 2 weeks, we could have just used MS Excel…)

Happy Classifying!

SQL Data Catalog, Data Masker and your DevOps pipeline: How do I make sure everything is being masked?

“However fast regulation moves, technology moves faster. Especially as far as data is concerned.”
Elizabeth Denham

You’re probably sick of me constantly talking about how the cataloging of columns should be part of the DevOps upstream deployment process. I’ve blogged about it. I’ve even produced a video demonstrating this in action. But one question that this always throws up is:

If we include cataloging in the upstream process, how do we make sure our masking sets are also staying up to date?

The benefits of including the classifications in the upstream pipeline is that nothing ever gets to Production that hasn’t been classified – we constantly have a perfectly up to date idea of the nature of our structured data estate, how data is distributed, what risk is associated with which systems? etc. but one of the biggest wins is that we constantly know which fields need to be masked when we’re pulling copies back into non-Production, constantly.

Add a table? Add columns? We know about them, they’re classified, they’re deployed… so now they need to be masked on our next refresh. But how? Well it all depends on which approach we want to take:

  • Automated
  • Manual

Doesn’t it always boil down to those options? What I mean is that either we have an appetite to completely generate our masking set afresh every single time based on our classifications (Automated) or do we want to ensure that we configure each rule ourselves (Manual). They both have benefits and drawbacks.

Automated

How: We can generate a masking set using the SQL Data Catalog PowerShell each and every time as part of our pipeline. Add a column, tag it in the pipeline and then simply wait for the necessary rulesets to be generated in the pipeline (perhaps as part of your build) for you to run when you bring a copy back. Check out my walkthrough for how to set this up.

Benefits: The process is automatic. Its headless and you don’t need to think about it at all. As long as the classifications are provided (and if you follow the steps from the blog post and video you should be providing them) then you’re always generating rules for every classified column.

Drawbacks: This process can be fragile. If we don’t classify correctly we can end up masking in the wrong way or trying to mask the wrong field (e.g. a Primary Key, Constraint, Identity etc.) which can cause masking failures and then you have to spend time fixing the pipeline/masking set. This also means that the nature of the masking is dependent on your classifications, and the values you will get masked into the columns will be less realistic as a result (i.e. you can’t generate Row-Internal Sync Rules using the integration).

Drawback Mitigation: To avoid the process breaking, be sure to really focus on how you set up your API settings / how you pick which categories and tags are used to generate the masking rules (like I discussed here). This will at least help you make sure you map common data sets into columns (and don’t hit columns that have constraints or keys).

Only columns marked with Static Masking as the Treatment Intent will get a rule created for them
An Information Type is given to every column we intend on masking – these are then mapped to templates in masker to ensure more realistic data

Manual

How: Either rely on developers to check in masking set changes along side their code changes or build in a manual intervention step to your upstream process to ensure that someone opens and updates the schema and rules within the masking set, and then put this back into the pipeline.

Benefits: This results in more accurate, more likely to run sets that generate significantly more reliable and realistic data as an end result. Masked DB copies can be more easily used for anything beyond simple development changes, including analytics etc. You can include any specialist rules you need and apply your own understanding and knowledge of the database.

Drawbacks: This is obviously a manual process – less than ideal. Anything that involves a human can invariably go wrong because we are humans and we make mistakes (which we learn from of course). This also takes significantly more time as part of the process.

Drawback Mitigation: This is a harder one to mitigate as we’re reliant on manual intervention, however the best way to check this could be with another team member checking (maybe as part of a pull request) or you could include an automated PowerShell script to effectively rationalize the columns to be masked from Catalog vs the columns currently in the Data Masker masking set – this would help you understand if all the necessary columns have been updated or if any were missed. I have included an example of this PowerShell below.

### CHANGE THIS ###

$MaskingSet = "yourmaskingset.DMSMaskSet" # Your masking set including the DMSMaskSet file extension
$instance = "yourinstance" # The Instance as it is shown in Data Catalog that hosts the database
$DatabaseName = "yourdatabase" # The DB you want classification info for
$CatalogServer="http://yourmachine:15156" # The lcoation of your catalog server, ending on :15156
$authToken="redacted" # Your Data Catalog Auth token from the Settings page
$tagName = "Static Masking" # The tag you're using to identify which columns need to be masked

### DONT CHANGE THIS ###

Invoke-WebRequest -Uri "$CatalogServer/powershell" -OutFile 'data-catalog.psm1' -Headers @{"Authorization"="Bearer $authToken"}
Import-Module .\data-catalog.psm1 -Force
Connect-SqlDataCatalog -ServerUrl $CatalogServer -AuthToken $authToken 
$ColumnsMarkedForMasking = Get-ClassificationColumn `
    -InstanceName $instance `
    -DatabaseName $DatabaseName | Where-Object {$_.tags.name -eq $tagName} 
$MaskingSetXML = [xml](Get-Content -Path $MaskingSet)
$subrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSubstitution')
$internalrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleRowInternal')
$shufflerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleShuffle')
$searchreplacerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSearchReplace')
$TablesAndColumns = @()

$subrules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnAndDataSetCollection.DMSPickedColumnAndDataSet.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$internalrules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$shufflerules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnCollection.DMSPickedColumn.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$searchreplacerules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$result = $TablesAndColumns | Sort -Unique
$ColumnsNeedingRules = $ColumnsMarkedForMasking | Where-Object {($_.tableName + "." + $_.columnName) -notin $result}

"`nThere are " + $ColumnsMarkedForMasking.count + " columns that require masking for database " + $DatabaseName + "in SQL Data Catalog."
"You are masking " + $result.count + " distinct columns in masking set: " + $MaskingSet
"`nThe columns that do not currently have a mask configured are:`n"

$ColumnsNeedingRules | ForEach-Object {$_.tableName + "." + $_.columnName + "     (" + $_.dataType + ")"}

$next = Read-Host -Prompt "`nWould you like to see the columns currently in your masking set? (Y/N)"
if ($next -in ("Y", "y")) {$result}

This code can also be found on my GitHub here.

The output of running the script – 13 columns in the masking set, 14 columns outstanding to be masked

You can technically even use this same check approach for the automated masking set generation option, to ensure that everything has been tagged correctly.

Configuring Dynamic Data Masking in Azure SQL Database from SQL Data Catalog using PowerShell

Horror is the removal of masks.”
Robert Bloch

I spend a great deal of my time talking about Data Masking, don’t believe me? Checkout here, here, here and even here. I talk about it a LOT, but I’m always talking about Static Data Masking (SDM), which is the process of masking the data at the file level itself so it is irreversibly altered – this is fabulous for non-Production environments like Dev and Test, especially when you pair it with a good cloning technology.

But what about Staging / Production environments?

I often work with teams to implement SDM and one of the fastest routes to successfully generating your masking sets for cloned environments is, of course, SQL Data Catalog (or cataloging solution of your choice) – you’ve already put the effort in to classify your columns and figure out where sensitive information exists within your databases and instances… so doesn’t it make sense that we can just use THAT as a source of truth and generate masking from there?

Note: I actually produced an end-to-end video showing the process of Cataloging, Masking and Cloning in under an hour here: https://www.red-gate.com/hub/university/courses/sql-data-catalog/end-to-end-data-protection-with-sql-data-catalog-and-sql-provision – if you watch/try it let me know how you get on!

Funnily enough though we wouldn’t want to use SDM for Production (and potentially Staging) environments though – as it irreversibly changes the data, it’s just going to completely mess up all of our Prod data. To tackle this then, many people I work with turn to Dynamic Data Masking.

Dynamic Data Masking (DDM)

DDM is a method of masking the data based on your access rights to the data. As far as customers see they have access to their data through our site or application no issues, but if anyone else needs to query that data, or different people need to see different results when querying environments, DDM has been their way to go.

Whilst a lot of people like to pick up on some of the well known downsides of DDM, it’s not like you’re entrusting the entire security of an environment to it alone – there are a ton of measures we can put in place and DDM is just one; like an ex-colleague of mine (someone very wise whom I admired greatly and am still sad to this day I no longer get to work with them) used to say: “It’s about building a defensible position. The more you do the easier it is to prove you’re doing something and the more likely you are to BE protected.

So when a customer asked this week if it was possible to configure Dynamic Data Masking from SQL Data Catalog (because they’d seen the “Treatment Intent” category and the tag that clearly states “Dynamic Data Masking”), just like we’re able to configure Static Data Masking, well now that was a challenge I couldn’t turn down!

The SQL Data Catalog Taxonomy Page – Treatment Intent Category showing Dynamic Data Masking

DDM in Azure SQL DB

Configuring Dynamic Masking in Azure SQL DB is fairly straight forward through the Azure portal, you can go to your Azure SQL DB, click Dynamic Data Masking and it gives you the option to simply pick and save columns to apply Data Masking to, and to whom these rules apply / don’t apply:

DDM in the Azure Portal for the DMDatabase_Dev, with masks configured on customer_firstname and customer_email

However when we potentially have a lot of columns or DBs to configure masks for this is really going to get very old very fast. As with all things, I turned to PowerShell for the answer and fortunately I found it: https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview?#set-up-dynamic-data-masking-for-your-database-using-powershell-cmdlets – basically I can get existing DDM configurations and set new configurations for columns directly from my Azure SQL DB using PowerShell.

Now I’m not expert on DDM, and Redgate Data Masker for SQL Server is not a DDM solution (so I’ve only ever needed to know SDM really) I don’t pretend to be, but it seemed that I had everything I needed to tie Catalog into DDM.

PowerShell time!

I’ve written so much PowerShell to get classifications out of Data Catalog at this point it’s become second nature, but if you’re using the SDC PowerShell module and you need a reference you can view it here: https://documentation.red-gate.com/sql-data-catalog/automation-with-powershell but the standard “stuff” goes:

  • Pull down the PoSh module
  • Connect to catalog where it’s installed using an Auth token
  • Grab out the classifications with Get-ClassificationColumn
  • Shrink this down to just the columns we care about based on the tags

But the Az PowerShell cmdlets were honestly just as easy to use! I was surprised how easy it was to get up and running:

  • Connect to my Azure subscription
  • Get the current list of columns already with DDM masks
  • Remove these from the Catalog list
  • Update the remaining columns to use the default Mask

This was the full code I ended up using:

# This script is intended to be used with Azure SQL Database and Redgate SQL Data Catalog, however you are welcome to adapt and edit as required
# It will pull columns out of azure that are already being masked, and a list of columns that need to be masked with DDM
# It will then rationalise these, and configure Default DDM masks for any columns not already being masked on that Azure SQL DB

#Variables for Azure SQL DB & Catalog
$ResourceGroup = "DMDb"
$ServerName = "dmnonproduction" # Your instance minus .database.windows.net
$instance = "dmnonproduction.database.windows.net" # The instance or logical SQL Server as displayed in SQL Data Catalog
$DatabaseName = "DMDatabase_Dev"
$CatalogServer="http://pse-lt-chrisu:15156" # Your SQL Data Catalog location, leave off the trailing "/"
$authToken="REDACTED" # Your SQL Data Catalog Auth Token
$AzureSub = "Redacted" # Your Sub ID

# Get the SQL Data Catalog PowerShell Module & Connect
Invoke-WebRequest -Uri "$CatalogServer/powershell" -OutFile 'data-catalog.psm1' -Headers @{"Authorization"="Bearer $authToken"}
Import-Module .\data-catalog.psm1 -Force
Connect-SqlDataCatalog -ServerUrl $CatalogServer -AuthToken $authToken 

#Connect to your Azure Subscription
Connect-AzAccount -Subscription $AzureSub

#Get current active DDM Masks from Azure
$DdmMasks = Get-AzSqlDatabaseDataMaskingRule `
    -ResourceGroupName $ResourceGroup `
    -ServerName $ServerName `
    -DatabaseName $DatabaseName
$ListOfDDMColumns = $DdmMasks | ForEach-Object {$_.SchemaName + '.' + $_.TableName + '.' + $_.ColumnName}

#Get columns from Catalog currently marked with "Dynamic Data Masking" as a treatment intent
$CatalogColumns = Get-ClassificationColumn `
    -InstanceName $instance `
    -DatabaseName $DatabaseName | Where-Object {$_.tags.name -eq "Dynamic data masking"} 

#Filter down to a list of columns that need to be masked, that currently aren't configured with DDM
$ColumnsToDDM = $CatalogColumns | Where-Object {($_.SchemaName + '.' + $_.TableName + '.' + $_.ColumnName) -notin $ListOfDDMColumns }


#Set default DDM Masks for identified columns
$ColumnsToDDM | ForEach-Object { `
    New-AzSqlDatabaseDataMaskingRule -ResourceGroupName $ResourceGroup `
                                     -ServerName $ServerName `
                                     -DatabaseName $DatabaseName  `
                                     -SchemaName $_.schemaName `
                                     -TableName $_.tableName `
                                     -ColumnName $_.columnName `
                                     -MaskingFunction "Default"

}

But I have also uploaded it to my GitHub here in case anyone would like to take and adapt as they see fit: https://github.com/ChrisUnwin/PowerShell/blob/master/Demos/Redgate%20Demos/DDMFromCatalog.ps1

And this was the result – here were the two columns I had already being masked:

Customer Firstname and Customer Email with DDM Masks Configured

These were the columns I had marked as Dynamic Data Masking in Data Catalog:

Customer firstname, lastname, street addres and email all marked for DDM in Catalog

and after running the PowerShell it deduced that the delta was street_address and lastname and created the default DDM mask for them in Azure:

All columns now being masked dynamically

Considerations

1 – I have used the default mask in this process, however if you wanted to configure the mask (as per the link to the docs above) to be specific numbers or format you could absolutely do this, simply by modifying the PowerShell to look at the Data Type and then just passing into a different New-AzSqlDatabaseDataMaskingRule for each of those types.

2 – This only applies to Azure SQL DB and does not take into account the considerations when using DDM on say, a 2017 SQL Server Instance running on a VM – however you could use the same approach to pass the columns into some dynamic T-SQL which would in turn run the correct command to add DDM to that/those column(s)

3 – I would still use Static Data Masking (SDM) for non-Production environments, because if anyone bypasses the DDM they will have access to the full data, which we don’t really NEED in less secure non-Prod environments anyway, so Static might well be the way to go!

“But I don’t wanna INSTALL it!”: Data Masker on the fly in Azure DevOps (with an Azure SQL DB)

“There is always a way to go if you look for it.”
Ernest A. Fitzgerald

As many of you know, I really enjoy talking about Data Masking. I fundamentally believe it is an absolutely ESSENTIAL part of Test Data Management and specifically the provisioning of Pre-Production environments. If you hold sensitive PII/PHI/PCI in your Production environments, you have no excuses for porting any of that back into Dev and Test.

I also stand firmly behind the belief (and it is just that, my belief) that masking is much safer than anything you can achieve with encryption or limiting access alone. Static masking, when done correctly, means that even if all other security measures are bypassed, or we accidentally expose data somehow, it doesn’t matter because the PII/PHI has been wiped, but it is still fundamentally useful for development and testing environments.

I’m also a huge proponent of using classification and data masking as part of a DevOps process, and often you’ll find me using Azure DevOps to actually kick off my masking process – but one of the most frequent questions I get is “do I have to install Data Masker somewhere?” and the answer I always have to give is… yes.

When we use SQL Provision to spin up our environments in non-Prod, generally we know they have to be in IaaS VMs or On-Prem, due to the nature of the technology (keep your eyes open on THIS because big changes coming soon *squeals in excitement*) but sometimes we’re working entirely with PaaS DBs and VMs don’t even come into our vocabulary – but we still need them to be masked for non-Prod use.

I have some data in an Azure SQL DB – a copy of DMDatabase (get it from my GitHub):

DM_CUSTOMER table in the DMDatabase

and I would like to get this masked before I create copies of it. Now I have written scripts in the past to make sure that Azure SQL DBs can be masked and then copied into non-Prod environments and you can get those scripts here, and you could easily wrap something like the below INTO a script like that – but much of the work I’ve done on this always involves having Data Masker installed somewhere and invoked on that machine – a VM, my laptop, whatever.

So, how do we avoid having to install Data Masker each time?

Data Masker for SQL Server does not (at time of writing) have publicly available a docker container or method for installing using Choco or something like that – once we do, trust me, I’ll be blogging about it A LOT as I will be very excited. But there is an install file available: https://download.red-gate.com/installers/DataMaskerforSQLServer/ and this might be enough.

The Process

Initially I created a Masking Set locally for the DMDatabase copy in Azure, it was nice and simple just masking a few of the fields on the CUSTOMER table:

It relies on SQL auth for the connection and I’m remembering the credentials, though these could be subbed in later on using the PARFILE (documentation on that here).

Next I put this masking set into a newly created Azure DevOps Git repo, which I cloned down onto my local machine – and then committed and pushed my changes up into Azure DevOps:

Now that this was all in ADO, it was time to set up a pipeline for it – so let’s jump into some YAML! Now, Data Masker currently needs to run on a Windows machine so we’ll set the pool to Windows-Latest:

trigger:
- main

pool:
  vmImage: windows-latest

The next step is to grab the installer – which I know I can easily do with PowerShell. I’m sure you could be more clever about this, but with limited time I chose the most recent version and hard coded that in to a PowerShell Invoke-WebRequest cmdlet:

- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      $source = 'https://download.red-gate.com/installers/DataMaskerforSQLServer/2021-03-15/DataMaskerforSQLServer.exe'
      New-Item -ItemType directory -Path C:\download
      $destination = 'C:\download\DataMaskerforSQLServer.exe'
      Invoke-WebRequest -Uri $source -OutFile $destination
      New-Item -ItemType directory -Path C:\download\DMlogs
  displayName: 'Download Data Masker'

This PowerShell task is going to grab the most recent exe for Data Masker and pull it down into C:\download on the hosted VM we’re using for the pipeline and it’s also going to create a directory for the Data Masker logs as well (if you wanted to extend the YAML at the end to wrap these logs up and publish them as a result of the pipeline masking, then go for it and tweet me to let me know!)

Next we have to extract and install Data Masker from that download, which is fairly easy to do with a cmd call:

- task: CmdLine@2
  inputs:
    script: |
      "C:\download\DataMaskerforSQLServer.exe" products "Data Masker for SQL Server" log c:\download temp c:\download /IAgreeToTheEula RG_LICESE=%RGLICENSE%
  displayName: 'Install DMS Headlessly'

Note I’m using the guidance from this page, making sure to accept the EULA and I’m passing in my Redgate License as a variable, that I have specified for the pipeline and kept secret. This will put Data Masker in the default location in C:\Program Files\… and means we will then be able to call it. I do however need to make sure that this now works. So I saved my pipeline and ran it to see what happened:

Fabulous, that all works nicely. Now to pass the DMSMaskSet file to Data Masker and get it to run – ah but I forgot, I’m going to need a PARFILE as per the cmdline documentation that specifies where the files etc. are for the run. So I create my PARFILE.txt as such:

MASKINGSET=C:\download\DMDB_MaskingTime.DMSMaskSet
LOGFILEDIR=C:\download\DMlogs
DATASETSDIR=C:\Program Files\Red Gate\Data Masker for SQL Server 7\DataSets 
REPORTSDIR=C:\download\DMlogs
INTERIM_REPORTS=false

As you can see, very simple. But to make sure these files are “simply” in the right folder, and because I don’t have time to explore how I could pass an environment variable into the .txt file itself, I’m going to add a quick file copy task to make sure that my masking set and PARFILE both make it into that location:

- task: CopyFiles@2
  inputs:
    SourceFolder: '$(Build.Repository.LocalPath)'
    Contents: '**'
    TargetFolder: 'C:\download'
  displayName: 'Copy files from working directory'

Of course now that we have Data Masker installed on the pipeline VM, the masking set AND the PARFILE… let’s get masking!

- task: CmdLine@2
  inputs:
    script: '"C:\Program Files\Red Gate\Data Masker for SQL Server 7\DataMaskerCmdLine.exe" PARFILE=C:\download\PARFILE.txt'
  displayName: 'Run Data Masker'

I agree the command is now less than impressive given all the prep work, but when you run all of this in it’s entirety…

Big success! But let’s check the data of course and make sure it is as we expect:

It all works, Lynne has been masked to Muna (and the rest has been masked too!), and I didn’t need to have Data Masker installed on a VM in my environment, with an Azure DevOps self hosted agent to run it – I could just do it programmatically. #Winning.

I’ll put the full YAML below however some caveats:

  • You cannot run this pipeline a LOT in a short space of time, I found that out. The RedGate downloads page is not particularly designed for this process, so it should probably be run sparingly otherwise the pipeline times out because the server, by way of balancing, will prevent you from pulling the file too often.
  • You will need to update the PowerShell step reasonably often if you want the DMS most recent installer, or make it more futureproof to grab the latest version – I just didn’t investigate that.
  • Data Masker is dependent on the power of the machine it is running on in MANY ways and Azure DevOps pipeline VMs are not particularly the most powerful beasts in the world – so if you have a lot of masking that needs doing, I would be weary of this method and might stick to an Application Server VM you’ve got hanging around, just in case.
  • This is likely to change a lot in the future so may not be relevant when you’re reading it after a few months – so before implementing anything like this, if it’s been a few months, contact me or Redgate and just confirm that there isn’t some better way of doing this, if I haven’t already blogged it!

Thanks for stopping by and have a great week!

Full YAML:

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      $source = 'https://download.red-gate.com/installers/DataMaskerforSQLServer/2021-03-15/DataMaskerforSQLServer.exe'
      New-Item -ItemType directory -Path C:\download
      $destination = 'C:\download\DataMaskerforSQLServer.exe'
      Invoke-WebRequest -Uri $source -OutFile $destination
      New-Item -ItemType directory -Path C:\download\DMlogs
  displayName: 'Download Data Masker'

- task: CmdLine@2
  inputs:
    script: |
      "C:\download\DataMaskerforSQLServer.exe" products "Data Masker for SQL Server" log c:\download temp c:\download /IAgreeToTheEula RG_LICESE=%RGLICENSE%
  displayName: 'Install DMS Headlessly'

- task: CopyFiles@2
  inputs:
    SourceFolder: '$(Build.Repository.LocalPath)'
    Contents: '**'
    TargetFolder: 'C:\download'
  displayName: 'Copy files from working directory'

- task: CmdLine@2
  inputs:
    script: '"C:\Program Files\Red Gate\Data Masker for SQL Server 7\DataMaskerCmdLine.exe" PARFILE=C:\download\PARFILE.txt'
  displayName: 'Run Data Masker'

Classification by design: Including data cataloging in a DevOps process

“The most powerful tool we have as developers is automation.”
Scott Hanselman

It is no secret that I love to talk about data protection, specifically from the perspective of structured data. When we talk about database development practices, we often find ourselves talking about 3 things most often:

  • Source Control
  • Continuous Integration and Continuous Delivery/Deployment (CI/CD)
  • Estate Monitoring

Some people refer to this as “DataOps“, others refer to it as “DevDataOps” but in reality, it’s all DevOps guys. This may be an unpopular opinion (and if it clashes with yours please forgive me, it’s just my opinion) but just because a certain niche area hasn’t been specifically called out within a subset of DevOps doesn’t mean you have to invent your own term for it!

Now this leads me on to DevSecOps, or as I like to call it… More secure DevOps.

rimshot GIF

No but seriously this is a slightly different case – DevSecOps is like DevOps but fortified with security from the ground up. There’s a fantastic article and diagram of this on Plutora from Mark Robinson of how this looks (below) and if you haven’t read his article I would definitely go and give it a read!

DevSecOps Diagram

Good DevOps practice is a combination of different things working together, bringing the right mentality, the principles, processes and amazing tools at our disposal like automation but this all includes security from the ground up too. DevOps is about putting those principles and practices in place to strengthen the pipeline, so why don’t we treat security in the same way?

Take, for example, 3 pieces of legislation that have been very much in the spotlight:

The controller shall implement appropriate technical and organisational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed.
GDPR (Europe) Art. 25 “Data protection by design and by default”

Processing agents shall adopt security, technical and administrative
measures
able to protect personal data from unauthorized accesses and accidental or unlawful situations of destruction, loss, alteration, communication or any type of improper or unlawful processing.

– LGPD (Brazil) Chapter VII, Art. 46 “Security and Secrecy of Data”

“A Controller or Processor is required to implement appropriate technical and organisational measures to demonstrate that Processing is performed in accordance with this Law…”
DIFC LAW NO. 5 OF 2020 (Dubai) Part 2D, Art. 14 (2) “Accountability and notification”

There’s a common running theme here and although lots of global legislation will either allude to, or directly tell you ways you can be compliant and what some of these “organizational” and “technical” measures are, it’s still pretty blurry.

How do we know what we can do? How do we know what “default” and “design” mean in this context? Well, we build it into the DevOps process.

Now I could sit here forever and talk about why transforming your database development, deployment and provisioning processes allows us to be more secure, but that’s a lot of material and it might have to come in chunks! So what we’re going to focus on today is as the title suggests: Data Classification and Cataloging.

Why is Cataloging important?

Cataloging structured data is incredibly important because it can be one of the first steps we take to securing sensitive Personally Identifiable Information (PII) or Protected Health Information (PHI) wherever it exists across our database environments. It allows us to make strengthened, contextual decisions about the data we hold including how we treat it in pre-Production, how long we retain it for and which systems and processes consume it.

But the most important part of this is simply: it tells us where the risk is.

Read through any of the most recent data protection laws and you will notice that a few things come up quite a lot including “Data Protection Impact Assessment“, or DPIA. Effectively if you can assess the risk of processing activities you can more readily answer the data protection questions and challenges you may face.

Knowing where your data resides can be the first step to helping you assess this risk, and to more readily answer your own data questions. If you want to read more about Cataloging specifically and why it is useful, you can read more about it on my previous blog here.

Where does Cataloging fit into DevOps?

This one is simple to answer. Once you have fully classified your entire estate, you’re not done. No, if you’re a development house or indeed even a single developer – if you are making any schema changes to the tables holding that sensitive data, you’re never done.

The reason for this is that Cataloging is an evergreen activity – if you update tables by removing columns, adding columns, splitting tables, adding tables… anything! Well then you need to be ready to make sure that you are:

a) Prepared and equipped with knowledge of the tables you’re working on and if this is a high risk activity.

b) Updating classification information to reflect the new “truth”, i.e. if you’re adding a column that will collect people’s Twitter handles, then that column should be classified as sensitive, and this should be reflected the moment it is deployed to Production.

So it is important to have the correct people working on this, with the right knowledge, preparation and processes and using the correct tools ensuring that those updates are persisted properly and securely through your deployment pipelines.

Huh… people, processes and tools… That sounds familiar!

The Process: SQL Data Catalog, SQL Change Automation and Azure DevOps

For this little experiment of mine I used Redgate’s SQL Change Automation (Migrations First approach in SQL Server Management Studio) and SQL Data Catalog to both develop & deploy and classify/categorize respectively, and for simple version control and orchestration of this pipeline I opted for Azure DevOps (with SQL Change Automation CI/CD plugins):

NOTE: Heads up, all of the code I used for this can be found right here in my GitHub, feel free to have at it!

Step 1) Setup SQL Change Automation project with an Azure DevOps Git Repo and then create the YAML file to build it, and create a deployment process to Acceptance and the Production.

Ok – that’s a bit of a mouthful and a massive ask. There’s a lot of things there you have to be familiar with, but we don’t have time to go into right here. BUT fortunately if you checkout Redgate University right here, my blog post on using Change Automation with Azure DB and Redgate’s Product Learning section, you’ll be a Database DevOps ‘Whizz-Kid’ in no time!

Fast forward a little and I had my example databases, VCS and pipeline all up and running:

Step 2)The “Theory”: This is where things get interesting. So we have an example pipeline set up and we are able to completely deploy all the way through to “Production” so let’s talk theory.

In SQL Data Catalog I have covered both my Production and Acceptance Databases:

Now, in development we don’t make changes directly to Production, so why should Classification be any different? Now how you adapt the above code is up to you, feel free to split it, move it around, incorporate it into Pull Requests if you want to… But I’m going with a bit more of a simple situation.

Situation: Developer makes a change in Development, which gets committed, reviewed and merged o the main branch, resulting in a build and a deployment, in this case to Acceptance and then it is later deployed to Production.

Now, by Acceptance we should only have the “good work”, i.e. all of our testing is shifted left within DevOps so Acceptance is basically the last stop before Production. Therefore we should classify the work we have done on Acceptance, crucially, before it gets to Production and starts gathering sensitive data, and then copy this classification up on deployment.

Ideal: We should have no columns on Production that have not been classified.

Step 3) – In Practice: Fortunately it’s very easy to automate a lot of these steps with SQL Data Catalog utilizing it’s PowerShell cmdlets and REST API. The cmdlets are fully documented and very easy to use (docs here). This allows us to easily scan, classify and copy classifications up to other databases, but we’ll also need to do some checks and report if there are discrepancies, as part of the deployment pipeline that can be investigated.

  • Are there any columns on Acceptance that aren’t classified but have been deployed to Production? (failure to comply with process)
  • Are there any columns on Production that have not been classified? (classification drift)
  • Are there any unclassified columns on Acceptance that have not yet been deployed to Production (for pipeline hygiene purposes)

The other part of this ‘fun’ is reporting what has been changed in the same process. Now fortunately SQL Change Automation spits out a Changes.json file with its Release Artifacts and we can steal that away and find out how many tables have been created or changed in this release and report that back so we can correlate what has been done and what is missing:

So actually getting this up and running is just going to require 3 things:

  1. The PowerShell script from GitHub (or your own personalized variant) as a step in your production deployment
  2. Data Catalog available and pointed at Acceptance and Production (or your versions of these environments)
  3. Variables set in Azure DevOps to fill the gaps (e.g. Where is Data Catalog? Whats my PowerShell Auth token? What are my Acceptance and PROD DBs called? etc.)

3 is the last step there so you’ll need something like this to run the script:

  • DatabaseDeploymentJSON – where the JSON file will be with the latest changes in the Prod release
  • DataCatalogAuthToken – Your PowerShell Auth token from Settings in Data Catalog
  • DataCatalogUrl – The full URL to your Data Catalog installation, missing the “\” at the end (ending :15156)
  • ExportPath (Optional) – I specified the path for my Database Deployment Resources to save typing it out in the Redgate plugins
  • ProdDB / StageDB – As you would expect, the Production and Acceptance/Staging DBs you’re deploying to/from
  • ProdInstance / StageInstance – As above, except the instance the Database are located on

In the variables above the Instance and DB names are purely used within Data Catalog, so there’s no need to worry about anything happening to the actual databases themselves!

Once you’ve run through the deployment pipeline a couple of times and the changes.json file is being produced, you can go ahead and copy the script into an inline PowerShell script step in your release and you should find it will fire to life! I simulated an example by modifying my Contacts table and my Articles table, adding 1 column each and deploying both to Acceptance. I then classified just 1 of these in Acceptance in Data Catalog:

and then approved the deployment to Production and tada!

Ok you probably can’t make all that out, but it effectively says:

(Information) Table dbo.Articles was modified in this deployment.
(Information) Table dbo.Contacts was modified in this deployment.

That much we knew!

1 column(s) with classifications were discovered on VoiceOfTheDBA Acceptance that are not classified in VoiceOfTheDBA Production:
dbo.Articles.TestingPineapple

Excellent, we classified that one so it gets copied up and we can verify that in data catalog against Production:

and finally, we get a warning about Production now containing unclassified columns:

(Alert) The following columns have been discovered on VoiceOfTheDBA Production that require classification:

dbo.Contacts.TestingPineapple

You should classify these columns in VoiceOfTheDBA Acceptance prior to the next deployment.

Just as we expected. Success!

Happy Tom And Jerry GIF

Conclusion

Classification and categorization belongs as part of DevOps, if you expect the context for your business decisions around data to remain evergreen and informed then it cannot sit on the shoulders of one or two people to support it, and it cannot live in a manually updated Excel sheet or document.

By including it within the DevOps process, not only do you add an additional layer of security but you also make it an automated, team activity that can be audited, checked and easily kept up to date.

Is this DevSecOps? Well… not really no. Is this a more secure approach to Database DevOps? Absolutely! Happy DevOpsing!

Shall we begin? (With data classification)

“If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”
Albert Einstein

So I just spent about 20 minutes trying to come up with a suitable title for this blog post, and then it struck me – one of my favorite movies of all time (which I will soon be ensuring my wife and I watch again) is Star Trek into Darkness, featuring the magnificent Benedict Cumberbatch, in which he masterfully growls the phrase “shall we begin?”.

shall we begin benedict cumberbatch GIF

This sums up perfectly where people find themselves at the beginning of their classification activities. “Where do I start?” is usually the first question I get asked – regardless if you’re using an excel sheet, Azure Data Catalog or Redgate SQL Data Catalog to carry out this process, you will find yourself in the same place, asking the same question.

Classifying your SQL Server Tables and Columns is not straight forward, I’ll say that up front – you have to be prepared for whatever may come your way – but give yourself a fighting chance! Whether you’re looking to better understand your data, protect it, or you’re just hoping to prepare your business to be more able to deal with things such as Subject Access Requests (SARs), the Right to be Forgotten *cough* I’m looking at _you_ GDPR *cough* – or even just key development in systems containing sensitive information; this is the ultimate starting point. As per my blog post on data masking here, you can’t protect what you don’t know you have.

This is my effort to give you the best possible start with your classification process, whether this feeds into a wider data lineage process, data retention or, of course, data masking. So… shall we begin?

Get a taxonomy set up

This is perhaps the most crucial part of your success. Even if you have the best classification process in the world it really means nothing if you’ve basically described your data in one of say, 3 possible ways. The thing to bear in mind before getting started is that the data cataloging process is not specific to one job.

You may think at this point in time that you’re going to use it to highlight what you want to mask for Dev/Test environments, or maybe it’s your hit list for implementing TDE or column level encryption – but this _thing_ you’re building is going to be useful for everyone.

  • DBAs will be able to use this to help them prioritize systems they look after and being more proactive when it comes to security checks or updates, backups etc.
  • Developers will be able to use this to better understand the tables and environments they are working on, helping them contextualize their work and therefore engage and work with any other teams or individuals who may be affected or who may need to be involved.
  • Governance teams and auditors will be able to use this to better understand what information is held by the business, who is responsible for keeping it up to date and how it is classified and protected.

The list goes on.

So all of the above will need to be engaged in a first run to help actually describe the data you’re working with. What do you actually care about? What do you want to know about data at a first glance? Below is the standard taxonomy that comes out of the box with Redgate’s Data Catalog:

Some of my favorites are in here, which I would encourage you to include as well! If nothing else, having Classification Scope as a category is an absolute must – but I’ll come to this soon. You can see though, how being able to include tags such as who owns the data (and is therefore in charge of keeping it up to date), what regulation(s) it falls under and even what our treatment policy is in line with any of those regulations is, gives us so much more to go on. We can be sure we are appropriately building out our defensible position.

Having a robust Taxonomy will enable you to not only know more about your data but to easily communicate and collaborate with others on the data you hold and the structure of your tables.

Decide who is in charge

This seems like an odd one, but actually one of the most common questions I get is about who will be carrying out the classification process, and this is where the true nature of collaboration within a company is going to be absolutely critical.

Some people believe that a DBA or a couple of developers will suffice but as you’ll see later on, this is not a simple process that only 1 or 2 people can handle by themselves. Be prepared to spend hours on this and actually the implementation of classification means by nature you are going to need a team effort in the first instance.

You will need representation from people who know the database structure, people who know the function of the various tables and people who know the business and how data should be protected. You will require representation on this team and the collaboration between Dev, DBAs, Testers, Governance and DevOps, and you will need someone central to coordinate this effort. When you have key representation from these teams, it will make it easier to identify and collaborate on hot spots of data, so ensure you have this knowledge up front.

Get rid of what doesn’t matter

You may be surprised that the next step is technically an execution step, but it is an important point nonetheless and will absolutely help with the classification effort. This is where the Classification Scope category comes in, and this is why it’s my favorite.

One of the biggest problems that people face when actually executing on their classification is the sheer enormity of the task. There is no “average” measure we can rely on unfortunately but even small schemas can be not insubstantial – recently, some work I did with a customer meant they provided me with just ONE of their database schemas which had well in advance of 1800 columns across dozens of tables. When you scale that same amount to potentially hundreds of databases, it will become rapidly clear that going over every single column is going to be unmanageable.

To start then, the knowledge brought by the team mentioned above will be invaluable because we’re going to need to “de-scope” everything that is not relevant to this process. It is very rare to find a company with more than 50% of columns per database which contain PII/PHI and even if you are one of those companies, this process can help you too.

There could be many reasons that something shouldn’t be included in this process. Perhaps it is an empty table that exists as part of a 3rd party database schema, such as in an ERP or CRM solution. It could be a purely system specific table that holds static/reference data or gathers application specific information. Regardless what the table is, use the knowledge the team has to quickly identify these and then assign them all with the necessary “Out of Scope” tag.

This will not only help you reduce the number of columns you’re going to need to process significantly, but will give you greater focus on what does need to be processed. One of the greatest quotes I’ve heard about this process comes from @DataMacas (a full on genius, wonderful person and someone who over the years I have aspired to learn as much from as possible) who referred to it as “moving from a battleships style approach to one more akin to minesweeper“. Which is just so incredibly accurate.

In my example database below with only 150 odd columns, using the “Empty Tables” filter, and then filtering down to system tables I know about, I was able to de-scope just under half of the database, just as a starting point:

Figure out patterns and speed up

Many of the people carrying out this process will already have some anecdotal knowledge of the database as I’ve already mentioned, but now it’s time to turn this from what _isn’t_ important, to what is.

The fastest way to do this is to build up some examples of column naming conventions you already have in place across multiple databases – there will likely be columns with names that contain things like Name, Email or SSN in some format. Both SQL Data Catalog from Redgate and Microsoft’s Azure Data Catalog have suggestions out of the box that will look at your column names and make suggestions as to what might be sensitive for you to check and accept the classification tags.

Now these suggestions are incredibly helpful but they do both have a reduced scope because they’re just matching against common types of PII, so it’s important to customize them to better reflect your own environments. You can do this fairly easily, one or both of the following ways:

1 – Customize the suggestions

In Redgate’s SQL Data Catalog you can actually, prior to even looking at the suggestions and accepting them, customize the regular expressions that are being run over the column naming convention to actually check that they are more indicative of your own schemas – either by editing existing rules or by creating your own rules and choosing which tags should be associated with the columns as a result:

You can then go through and accept these suggestions if you so wish, obviously making sure to give them a sense check first:

2 – POWER ALL THE SHELL

In both of the aforementioned solutions you can call the PowerShell API to actually carry out mass classification against columns with known formats – this will allow you to rapidly hit any known targets to further reduce the amount of time spent looking directly at columns, an example of the SQL Data Catalog PowerShell in action is the below, which will classify any columns it finds where the name is like Email but not like ID (as Primary and Foreign keys may, in most cases, fall under the de-scoping work we did above) with a tag for sensitivity and information type (full worked example here):

Finally – get classifying

This is the last stage, or the “hunt” stage. It’s time for us to get going with classifying what’s left i.e. anything that wasn’t de-scoped and wasn’t caught by your default suggestions or PowerShell rules.

You can obviously start going through each column one by one, but it makes the most sense to start by filtering down by tables which have the highest concentration of columns (i.e. the widest tables) or the columns that are specifically known as containing sensitive information (anecdotally or by # of tags) and classifying those as in or out of scope and what information they hold, who owns it and what the treatment intent is at the very least.

The approach i take in this instance is to use filtering to it’s utmost – in SQL Data Catalog we can filter by table and column names but also by Data Type. Common offenders can be found with certain NVARCHAR, XML or VARBINARY types, such as NVARCHAR(MAX) – to me, that sounds like an XML, JSON, document or free-text field which will likely contain some kind of difficult to identify but ultimately sensitive information.

Following the NVARCHAR classification I move on and look at DATETIME and INT/DECIMAL fields for any key candidates like dates when someone attended an event or even a Date of Birth field. This helps especially when the naming conventions don’t necessarily reflect the information that is stored in the column.

Finally, one thing to add is that you will need access to the databases or tables at some point. You can’t truly carry out a full-on data classification process purely against the schema, especially for the reason above. Data will often exist in places you weren’t aware of, and knowing the contents, format and sensitivity of the data can only reasonably be found and tagged if you have the support of the data team to do so.

Conclusion?

This is not a one time thing. The initial classification is going to be the hardest part but that can be mitigated if you follow some of the processes in this post and ultimately work as a team.

Classification though, is ongoing. It is there to be an evergreen solution, something that provides context for any data governance or DevOps processes that you have in place and therefore should be maintained and treated as what it is. The place that everyone is able to use to gather information about what they will be using, and what the company might have at risk.