When the unused become used: Using SQL Data Catalog to spot the pitfalls of a data classification process

“The surprise is that you continue to be surprised.”
Jill A. Davis

But what surprise am I alluding to? Of course it is the surprise that “all of a sudden” you have sensitive data in tables where there ought not be!

What I mean is; we have a fantastic process in place: we have completely classified our entire environments, developers move changes to classification up-stream just like schema changes and we have an automatic process to check and update classifications as part of our pipeline.

But as part of this process we “descoped” everything that was non-sensitive, including (crucially) empty tables. Now empty tables are a common occurrence and most people I speak to have a bunch of them, for a multitude of reasons; they’re using an ERP/CRM and aren’t using some of the modules, some of them are older and are gradually being phased out etc.

What happens over time however, is that some tables which we originally weren’t using suddenly start to populate with data, and others which were in-scope now suddenly aren’t. But because this isn’t technically a schema change and we can’t necessarily rely on the developers to know when this change will happen – how do we catch it?

The easiest thing to do in this instance is an automated check – write once, run many times. As we’ve seen from the many, many blog posts before this the answer is pretty straight forward – SQL Data Catalog’s PowerShell module once again saves the day!

(You can find the full documentation on the PowerShell module and a cmdlet reference here in the documentation.)

The Process

The most obvious thing to do is to re-scan our DB so Catalog is up to date and then fetch all of our columns first into a collection:

# Connect to SQL Data Catalog
Connect-SqlDataCatalog -AuthToken $authToken -ServerUrl $serverUrl

# Quickly refresh the scan of the Instance to get accurate row counts
Start-ClassificationDatabaseScan -FullyQualifiedInstanceName $instanceName -DatabaseName $databaseName | Wait-SqlDataCatalogOperation

# Get all columns into a collection 
$allColumns = Get-ClassificationColumn -instanceName $instanceName -databaseName $databaseName

Each of the columns returned by Data Catalog here has a number of properties that we can use:

The properties of each column object returned by the SQL Data Catalog PowerShell

All of these properties can be useful for something (and have been in the past) – in this case though the properties we care about are the id, the tags collection (name property) and the tableRowCount. The reason being that id uniquely identifies our columns so can be used to easily sort and filter on without having to match on text, the tags collection means we can search for any columns that do or do not have the Out of Scope – Unused tag and the tableRowCount lets us know if the table is or is not populated with data.

Its then fairly easy to filter these down:

# Collection for columns that are on empty tables

$emptyTableColumns = $allColumns | Where-Object { $_.tableRowCount -eq 0 }

# Collection for columns that are tagged as empty

$ColumnstaggedAsEmpty = $allColumns | Where-Object { $_.tags.name -eq $unusedTag }

# Collections for columns both tagged as empty but not, and empty columns not tagged as such

$ColumnsNowInUse = $ColumnstaggedAsEmpty | Where-Object { $_.id -notin $emptyTableColumns.id}

$ColumnsMissingClassification = $emptyTableColumns | Where-Object { $_.id -notin $ColumnstaggedAsEmpty.id}

You can easily visualize these in Data Catalog itself too:

Columns that are on empty tables, regardless of tag
Columns that are tagged as Out of Scope Unused regardless of if they have data or not

We can already see a difference in the lists – if we run the entire PowerShell however (which you can find right here in my GitHub, and I’ll include it at the end) then we actually find those differences out:

The PowerShell result: 2 classified as empty columns now with data, 3 columns not classified as out of scope, now on empty tables

So we can correct these by classifying the 3 now empty columns as now Out Of Scope – Unused, and we can change the 2 now-populated columns to be In Scope:

Suppliers is now in use, and corrected to be In-Scope
The missing Out Of Scope – Unused tags are corrected on the empty columns

…and now when we run the PowerShell again we should find everything is rectified:

Everything has been corrected, now no columns are incorrectly labelled or missing labels

Ongoing

Now we have the script we can run to identify these issues, you could implement it on the back-end of the DevOps process described by the blog post or video linked at the start of this blog post – and this would allow you to constantly be sure that your view of the Production database is accurate and up to date.

The key to a good classification process as part of on-going governance and an up-to-date view of your structured data estate is minimizing the manual effort that has to go into maintaining this process, which with the help from the Data Catalog PowerShell module – we can do easily (if we wanted a process that would go out of date after 2 weeks, we could have just used MS Excel…)

Happy Classifying!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s