When the unused become used: Using SQL Data Catalog to spot the pitfalls of a data classification process

“The surprise is that you continue to be surprised.”
Jill A. Davis

But what surprise am I alluding to? Of course it is the surprise that “all of a sudden” you have sensitive data in tables where there ought not be!

What I mean is; we have a fantastic process in place: we have completely classified our entire environments, developers move changes to classification up-stream just like schema changes and we have an automatic process to check and update classifications as part of our pipeline.

But as part of this process we “descoped” everything that was non-sensitive, including (crucially) empty tables. Now empty tables are a common occurrence and most people I speak to have a bunch of them, for a multitude of reasons; they’re using an ERP/CRM and aren’t using some of the modules, some of them are older and are gradually being phased out etc.

What happens over time however, is that some tables which we originally weren’t using suddenly start to populate with data, and others which were in-scope now suddenly aren’t. But because this isn’t technically a schema change and we can’t necessarily rely on the developers to know when this change will happen – how do we catch it?

The easiest thing to do in this instance is an automated check – write once, run many times. As we’ve seen from the many, many blog posts before this the answer is pretty straight forward – SQL Data Catalog’s PowerShell module once again saves the day!

(You can find the full documentation on the PowerShell module and a cmdlet reference here in the documentation.)

The Process

The most obvious thing to do is to re-scan our DB so Catalog is up to date and then fetch all of our columns first into a collection:

# Connect to SQL Data Catalog
Connect-SqlDataCatalog -AuthToken $authToken -ServerUrl $serverUrl

# Quickly refresh the scan of the Instance to get accurate row counts
Start-ClassificationDatabaseScan -FullyQualifiedInstanceName $instanceName -DatabaseName $databaseName | Wait-SqlDataCatalogOperation

# Get all columns into a collection 
$allColumns = Get-ClassificationColumn -instanceName $instanceName -databaseName $databaseName

Each of the columns returned by Data Catalog here has a number of properties that we can use:

The properties of each column object returned by the SQL Data Catalog PowerShell

All of these properties can be useful for something (and have been in the past) – in this case though the properties we care about are the id, the tags collection (name property) and the tableRowCount. The reason being that id uniquely identifies our columns so can be used to easily sort and filter on without having to match on text, the tags collection means we can search for any columns that do or do not have the Out of Scope – Unused tag and the tableRowCount lets us know if the table is or is not populated with data.

Its then fairly easy to filter these down:

# Collection for columns that are on empty tables

$emptyTableColumns = $allColumns | Where-Object { $_.tableRowCount -eq 0 }

# Collection for columns that are tagged as empty

$ColumnstaggedAsEmpty = $allColumns | Where-Object { $_.tags.name -eq $unusedTag }

# Collections for columns both tagged as empty but not, and empty columns not tagged as such

$ColumnsNowInUse = $ColumnstaggedAsEmpty | Where-Object { $_.id -notin $emptyTableColumns.id}

$ColumnsMissingClassification = $emptyTableColumns | Where-Object { $_.id -notin $ColumnstaggedAsEmpty.id}

You can easily visualize these in Data Catalog itself too:

Columns that are on empty tables, regardless of tag
Columns that are tagged as Out of Scope Unused regardless of if they have data or not

We can already see a difference in the lists – if we run the entire PowerShell however (which you can find right here in my GitHub, and I’ll include it at the end) then we actually find those differences out:

The PowerShell result: 2 classified as empty columns now with data, 3 columns not classified as out of scope, now on empty tables

So we can correct these by classifying the 3 now empty columns as now Out Of Scope – Unused, and we can change the 2 now-populated columns to be In Scope:

Suppliers is now in use, and corrected to be In-Scope
The missing Out Of Scope – Unused tags are corrected on the empty columns

…and now when we run the PowerShell again we should find everything is rectified:

Everything has been corrected, now no columns are incorrectly labelled or missing labels

Ongoing

Now we have the script we can run to identify these issues, you could implement it on the back-end of the DevOps process described by the blog post or video linked at the start of this blog post – and this would allow you to constantly be sure that your view of the Production database is accurate and up to date.

The key to a good classification process as part of on-going governance and an up-to-date view of your structured data estate is minimizing the manual effort that has to go into maintaining this process, which with the help from the Data Catalog PowerShell module – we can do easily (if we wanted a process that would go out of date after 2 weeks, we could have just used MS Excel…)

Happy Classifying!

SQL Data Catalog, Data Masker and your DevOps pipeline: How do I make sure everything is being masked?

“However fast regulation moves, technology moves faster. Especially as far as data is concerned.”
Elizabeth Denham

You’re probably sick of me constantly talking about how the cataloging of columns should be part of the DevOps upstream deployment process. I’ve blogged about it. I’ve even produced a video demonstrating this in action. But one question that this always throws up is:

If we include cataloging in the upstream process, how do we make sure our masking sets are also staying up to date?

The benefits of including the classifications in the upstream pipeline is that nothing ever gets to Production that hasn’t been classified – we constantly have a perfectly up to date idea of the nature of our structured data estate, how data is distributed, what risk is associated with which systems? etc. but one of the biggest wins is that we constantly know which fields need to be masked when we’re pulling copies back into non-Production, constantly.

Add a table? Add columns? We know about them, they’re classified, they’re deployed… so now they need to be masked on our next refresh. But how? Well it all depends on which approach we want to take:

  • Automated
  • Manual

Doesn’t it always boil down to those options? What I mean is that either we have an appetite to completely generate our masking set afresh every single time based on our classifications (Automated) or do we want to ensure that we configure each rule ourselves (Manual). They both have benefits and drawbacks.

Automated

How: We can generate a masking set using the SQL Data Catalog PowerShell each and every time as part of our pipeline. Add a column, tag it in the pipeline and then simply wait for the necessary rulesets to be generated in the pipeline (perhaps as part of your build) for you to run when you bring a copy back. Check out my walkthrough for how to set this up.

Benefits: The process is automatic. Its headless and you don’t need to think about it at all. As long as the classifications are provided (and if you follow the steps from the blog post and video you should be providing them) then you’re always generating rules for every classified column.

Drawbacks: This process can be fragile. If we don’t classify correctly we can end up masking in the wrong way or trying to mask the wrong field (e.g. a Primary Key, Constraint, Identity etc.) which can cause masking failures and then you have to spend time fixing the pipeline/masking set. This also means that the nature of the masking is dependent on your classifications, and the values you will get masked into the columns will be less realistic as a result (i.e. you can’t generate Row-Internal Sync Rules using the integration).

Drawback Mitigation: To avoid the process breaking, be sure to really focus on how you set up your API settings / how you pick which categories and tags are used to generate the masking rules (like I discussed here). This will at least help you make sure you map common data sets into columns (and don’t hit columns that have constraints or keys).

Only columns marked with Static Masking as the Treatment Intent will get a rule created for them
An Information Type is given to every column we intend on masking – these are then mapped to templates in masker to ensure more realistic data

Manual

How: Either rely on developers to check in masking set changes along side their code changes or build in a manual intervention step to your upstream process to ensure that someone opens and updates the schema and rules within the masking set, and then put this back into the pipeline.

Benefits: This results in more accurate, more likely to run sets that generate significantly more reliable and realistic data as an end result. Masked DB copies can be more easily used for anything beyond simple development changes, including analytics etc. You can include any specialist rules you need and apply your own understanding and knowledge of the database.

Drawbacks: This is obviously a manual process – less than ideal. Anything that involves a human can invariably go wrong because we are humans and we make mistakes (which we learn from of course). This also takes significantly more time as part of the process.

Drawback Mitigation: This is a harder one to mitigate as we’re reliant on manual intervention, however the best way to check this could be with another team member checking (maybe as part of a pull request) or you could include an automated PowerShell script to effectively rationalize the columns to be masked from Catalog vs the columns currently in the Data Masker masking set – this would help you understand if all the necessary columns have been updated or if any were missed. I have included an example of this PowerShell below.

### CHANGE THIS ###

$MaskingSet = "yourmaskingset.DMSMaskSet" # Your masking set including the DMSMaskSet file extension
$instance = "yourinstance" # The Instance as it is shown in Data Catalog that hosts the database
$DatabaseName = "yourdatabase" # The DB you want classification info for
$CatalogServer="http://yourmachine:15156" # The lcoation of your catalog server, ending on :15156
$authToken="redacted" # Your Data Catalog Auth token from the Settings page
$tagName = "Static Masking" # The tag you're using to identify which columns need to be masked

### DONT CHANGE THIS ###

Invoke-WebRequest -Uri "$CatalogServer/powershell" -OutFile 'data-catalog.psm1' -Headers @{"Authorization"="Bearer $authToken"}
Import-Module .\data-catalog.psm1 -Force
Connect-SqlDataCatalog -ServerUrl $CatalogServer -AuthToken $authToken 
$ColumnsMarkedForMasking = Get-ClassificationColumn `
    -InstanceName $instance `
    -DatabaseName $DatabaseName | Where-Object {$_.tags.name -eq $tagName} 
$MaskingSetXML = [xml](Get-Content -Path $MaskingSet)
$subrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSubstitution')
$internalrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleRowInternal')
$shufflerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleShuffle')
$searchreplacerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSearchReplace')
$TablesAndColumns = @()

$subrules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnAndDataSetCollection.DMSPickedColumnAndDataSet.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$internalrules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$shufflerules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnCollection.DMSPickedColumn.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$searchreplacerules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$result = $TablesAndColumns | Sort -Unique
$ColumnsNeedingRules = $ColumnsMarkedForMasking | Where-Object {($_.tableName + "." + $_.columnName) -notin $result}

"`nThere are " + $ColumnsMarkedForMasking.count + " columns that require masking for database " + $DatabaseName + "in SQL Data Catalog."
"You are masking " + $result.count + " distinct columns in masking set: " + $MaskingSet
"`nThe columns that do not currently have a mask configured are:`n"

$ColumnsNeedingRules | ForEach-Object {$_.tableName + "." + $_.columnName + "     (" + $_.dataType + ")"}

$next = Read-Host -Prompt "`nWould you like to see the columns currently in your masking set? (Y/N)"
if ($next -in ("Y", "y")) {$result}

This code can also be found on my GitHub here.

The output of running the script – 13 columns in the masking set, 14 columns outstanding to be masked

You can technically even use this same check approach for the automated masking set generation option, to ensure that everything has been tagged correctly.