“Doing the best at this moment puts you in the best place for the next moment.” – Oprah Winfrey
One of the things that comes up the most when I’m talking to people about database classification is multi language support. Often people are resigned to:
Waiting on additional language support from their vendor of choice (defer)
Writing their own software to do it in their native tongue (shift effort)
Going the long way around i.e. manual classification (brute force)
Do nothing (abandon)
There are currently over 7000 languages in the world and when it comes down to classification, you can’t expect that whatever your choice of technology for this classification workflow will have packs or native support for your languages and/or the languages your databases have been configured in.
Now, deferring and abandoning as above really aren’t options any more and if you’ve read my post here then you’ll know that database classification is a key concern to address before trying to base any business decisions on data. But how then do we tackle t he above issues of shifting effort and brute force?
Sometimes it’s about finding ways to still be effective and still take advantage of the benefits such a classification technology affords you.
With this in mind – I wanted to share with you 3 tips and tricks in SQL Data Catalog to make it work better for your language(s):
1 – Customize your taxonomy to be relevant / Passen Sie Ihre Taxonomie an, um relevant zu sein / Ajustați-vă taxonomia pentru a fi relevante
The great thing about SQL Data Catalog is that it is incredibly easy to alter the taxonomy to best reflect your view of your data. Not all standard fields will be relevant to you and some fields that are ABSOLUTELY necessary will be missing.
So add them in!
Let’s say, in my country of Christopia, it is mandatory that people identify a certain government specified “Classification Category” from 1-5, where 1 is the most sensitive and 5 is basically public knowledge. I can add that as a category and tag:
That goes equally for fields that we will not be making use of. If it is irrelevant then remove it. Simple.
I will say this for the above step though – don’t think of this as being the easiest and most obvious part, in many ways this is actually the hardest step because you will need buy in. Guess who also cares how you describe data stored in databases? Data Governance, Legal, InfoSec, Database Admins… the list goes on. Just make sure you do the 2 most fundamental parts that are required for success here – communicate & collaborate. If you have buy in from other concerned parties, you will establish a ‘common language’ with which data can be described and consumed, and based on which complex data-driven business decisions can be made.
2 – Customize your search rules / Passen Sie Ihre Suchregeln an / Personalizați-vă regulile de căutare
Going through every single column is a colossal waste of time, especially when we have more than about 6 databases. I’ve spoken with people who have spent day, weeks and even months working on classifications in SSMS or Excel, only to find out that 90% of their columns matched common regular expressions, and moreover that 50-60% of their columns were out of scope anyway!
Fortunately within SQL Data Catalog, you can configure the rules that the tool runs against columns and tables to identify any clearly sensitive columns which it will then suggest to you each time it finds something that matches.
In my experience, between the analytical filters and the suggestions, you can make excellent progress very very quickly, and do you know the best part?
Data Catalog is evergreen.
It constantly keeps track of your ever evolving schema and checks the suggestions against them and that means if somebody makes a change to the schema (like in my post from last week) you’ll get the suggestions coming through and you can easily catch anything that might put your sensitive data in harms way.
3 – Use PowerShell / Verwenden Sie PowerShell / Utilizați PowerShell
Ok I get it – this is my catch all for everything whenever I talk about using Data Catalog – but did you know, using PowerShell with Data Catalog is really easy.
Like even I can do it easy.
But the good thing is, when you use PowerShell you’re not constrained by particular functions or waiting for things to happen, rather you can take control of the process and fill it with your own logic! There’s even a full on guide that shows you:
How to generate an authorization token
A full cmdlet reference
Complete worked examples
and the best part is it’s just part of the documentation! Don’t you love it when people keep documentation up to date and useful? Yeah me too!
Bonus Tip / Bonus-Tipp / Sfaturi Bonus
If you work in an environment that is multi-lingual and has a mix of languages associated with different databases which haven’t been standardized, that should just highlight further that this is a team game.
Pairing and mobbing, just like in coding, can be vital and powerful tools at your disposal – so it’s important to have the right people in place to handle this.
Classification is not infallible, much as you may wish to think it is. Data is data, and if there’s one thing I have learned about data it’s that it always manages to surprise you, so a second pair of eyes (or more) can be invaluable in helping you weed out some of the problems you may be facing.
“The most powerful tool we have as developers is automation.” – Scott Hanselman
It is no secret that I love to talk about data protection, specifically from the perspective of structured data. When we talk about database development practices, we often find ourselves talking about 3 things most often:
Continuous Integration and Continuous Delivery/Deployment (CI/CD)
Some people refer to this as “DataOps“, others refer to it as “DevDataOps” but in reality, it’s all DevOps guys. This may be an unpopular opinion (and if it clashes with yours please forgive me, it’s just my opinion) but just because a certain niche area hasn’t been specifically called out within a subset of DevOps doesn’t mean you have to invent your own term for it!
Now this leads me on to DevSecOps, or as I like to call it… More secure DevOps.
No but seriously this is a slightly different case – DevSecOps is like DevOps but fortified with security from the ground up. There’s a fantastic article and diagram of this on Plutora from Mark Robinson of how this looks (below) and if you haven’t read his article I would definitely go and give it a read!
Good DevOps practice is a combination of different things working together, bringing the right mentality, the principles, processes and amazing tools at our disposal like automation but this all includes security from the ground up too. DevOps is about putting those principles and practices in place to strengthen the pipeline, so why don’t we treat security in the same way?
Take, for example, 3 pieces of legislation that have been very much in the spotlight:
“The controller shall implement appropriate technical and organisational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed.“ – GDPR (Europe) Art. 25 “Data protection by design and by default”
“Processing agents shall adopt security, technical and administrative measures able to protect personal data from unauthorized accesses and accidental or unlawful situations of destruction, loss, alteration, communication or any type of improper or unlawful processing.“ – LGPD (Brazil) Chapter VII, Art. 46 “Security and Secrecy of Data”
“A Controller or Processor is required to implement appropriate technical and organisational measures to demonstrate that Processing is performed in accordance with this Law…” – DIFC LAW NO. 5 OF 2020 (Dubai) Part 2D, Art. 14 (2) “Accountability and notification”
There’s a common running theme here and although lots of global legislation will either allude to, or directly tell you ways you can be compliant and what some of these “organizational” and “technical” measures are, it’s still pretty blurry.
How do we know what we can do? How do we know what “default” and “design” mean in this context? Well, we build it into the DevOps process.
Now I could sit here forever and talk about why transforming your database development, deployment and provisioning processes allows us to be more secure, but that’s a lot of material and it might have to come in chunks! So what we’re going to focus on today is as the title suggests: Data Classification and Cataloging.
Why is Cataloging important?
Cataloging structured data is incredibly important because it can be one of the first steps we take to securing sensitive Personally Identifiable Information (PII) or Protected Health Information (PHI) wherever it exists across our database environments. It allows us to make strengthened, contextual decisions about the data we hold including how we treat it in pre-Production, how long we retain it for and which systems and processes consume it.
But the most important part of this is simply: it tells us where the risk is.
Read through any of the most recent data protection laws and you will notice that a few things come up quite a lot including “Data Protection Impact Assessment“, or DPIA. Effectively if you can assess the risk of processing activities you can more readily answer the data protection questions and challenges you may face.
Knowing where your data resides can be the first step to helping you assess this risk, and to more readily answer your own data questions. If you want to read more about Cataloging specifically and why it is useful, you can read more about it on my previous blog here.
Where does Cataloging fit into DevOps?
This one is simple to answer. Once you have fully classified your entire estate, you’re not done. No, if you’re a development house or indeed even a single developer – if you are making any schema changes to the tables holding that sensitive data, you’re never done.
The reason for this is that Cataloging is an evergreen activity – if you update tables by removing columns, adding columns, splitting tables, adding tables… anything! Well then you need to be ready to make sure that you are:
a) Prepared and equipped with knowledge of the tables you’re working on and if this is a high risk activity.
b) Updating classification information to reflect the new “truth”, i.e. if you’re adding a column that will collect people’s Twitter handles, then that column should be classified as sensitive, and this should be reflected the moment it is deployed to Production.
So it is important to have the correct people working on this, with the right knowledge, preparation and processes and using the correct tools ensuring that those updates are persisted properly and securely through your deployment pipelines.
Huh… people, processes and tools… That sounds familiar!
The Process: SQL Data Catalog, SQL Change Automation and Azure DevOps
For this little experiment of mine I used Redgate’s SQL Change Automation (Migrations First approach in SQL Server Management Studio) and SQL Data Catalog to both develop & deploy and classify/categorize respectively, and for simple version control and orchestration of this pipeline I opted for Azure DevOps (with SQL Change Automation CI/CD plugins):
Fast forward a little and I had my example databases, VCS and pipeline all up and running:
Step 2) – The “Theory”: This is where things get interesting. So we have an example pipeline set up and we are able to completely deploy all the way through to “Production” so let’s talk theory.
In SQL Data Catalog I have covered both my Production and Acceptance Databases:
Now, in development we don’t make changes directly to Production, so why should Classification be any different? Now how you adapt the above code is up to you, feel free to split it, move it around, incorporate it into Pull Requests if you want to… But I’m going with a bit more of a simple situation.
Situation: Developer makes a change in Development, which gets committed, reviewed and merged o the main branch, resulting in a build and a deployment, in this case to Acceptance and then it is later deployed to Production.
Now, by Acceptance we should only have the “good work”, i.e. all of our testing is shifted left within DevOps so Acceptance is basically the last stop before Production. Therefore we should classify the work we have done on Acceptance, crucially, before it gets to Production and starts gathering sensitive data, and then copy this classification up on deployment.
Ideal: We should have no columns on Production that have not been classified.
Step 3) – In Practice: Fortunately it’s very easy to automate a lot of these steps with SQL Data Catalog utilizing it’s PowerShell cmdlets and REST API. The cmdlets are fully documented and very easy to use (docs here). This allows us to easily scan, classify and copy classifications up to other databases, but we’ll also need to do some checks and report if there are discrepancies, as part of the deployment pipeline that can be investigated.
Are there any columns on Acceptance that aren’t classified but have been deployed to Production? (failure to comply with process)
Are there any columns on Production that have not been classified? (classification drift)
Are there any unclassified columns on Acceptance that have not yet been deployed to Production (for pipeline hygiene purposes)
The other part of this ‘fun’ is reporting what has been changed in the same process. Now fortunately SQL Change Automation spits out a Changes.json file with its Release Artifacts and we can steal that away and find out how many tables have been created or changed in this release and report that back so we can correlate what has been done and what is missing:
So actually getting this up and running is just going to require 3 things:
Data Catalog available and pointed at Acceptance and Production (or your versions of these environments)
Variables set in Azure DevOps to fill the gaps (e.g. Where is Data Catalog? Whats my PowerShell Auth token? What are my Acceptance and PROD DBs called? etc.)
3 is the last step there so you’ll need something like this to run the script:
DatabaseDeploymentJSON – where the JSON file will be with the latest changes in the Prod release
DataCatalogAuthToken – Your PowerShell Auth token from Settings in Data Catalog
DataCatalogUrl – The full URL to your Data Catalog installation, missing the “\” at the end (ending :15156)
ExportPath (Optional) – I specified the path for my Database Deployment Resources to save typing it out in the Redgate plugins
ProdDB / StageDB – As you would expect, the Production and Acceptance/Staging DBs you’re deploying to/from
ProdInstance / StageInstance – As above, except the instance the Database are located on
In the variables above the Instance and DB names are purely used within Data Catalog, so there’s no need to worry about anything happening to the actual databases themselves!
Once you’ve run through the deployment pipeline a couple of times and the changes.json file is being produced, you can go ahead and copy the script into an inline PowerShell script step in your release and you should find it will fire to life! I simulated an example by modifying my Contacts table and my Articles table, adding 1 column each and deploying both to Acceptance. I then classified just 1 of these in Acceptance in Data Catalog:
and then approved the deployment to Production and tada!
Ok you probably can’t make all that out, but it effectively says:
(Information) Table dbo.Articles was modified in this deployment. (Information) Table dbo.Contacts was modified in this deployment.
That much we knew!
1 column(s) with classifications were discovered on VoiceOfTheDBA Acceptance that are not classified in VoiceOfTheDBA Production: dbo.Articles.TestingPineapple
Excellent, we classified that one so it gets copied up and we can verify that in data catalog against Production:
and finally, we get a warning about Production now containing unclassified columns:
(Alert) The following columns have been discovered on VoiceOfTheDBA Production that require classification: … dbo.Contacts.TestingPineapple … You should classify these columns in VoiceOfTheDBA Acceptance prior to the next deployment.
Just as we expected. Success!
Classification and categorization belongs as part of DevOps, if you expect the context for your business decisions around data to remain evergreen and informed then it cannot sit on the shoulders of one or two people to support it, and it cannot live in a manually updated Excel sheet or document.
By including it within the DevOps process, not only do you add an additional layer of security but you also make it an automated, team activity that can be audited, checked and easily kept up to date.
Is this DevSecOps? Well… not really no. Is this a more secure approach to Database DevOps? Absolutely! Happy DevOpsing!