When the unused become used: Using SQL Data Catalog to spot the pitfalls of a data classification process

“The surprise is that you continue to be surprised.”
Jill A. Davis

But what surprise am I alluding to? Of course it is the surprise that “all of a sudden” you have sensitive data in tables where there ought not to be any!

What I mean is: we have a fantastic process in place. We have completely classified our entire environments, developers move classification changes upstream just like schema changes, and we have an automatic process to check and update classifications as part of our pipeline.

But as part of this process we “descoped” everything that was non-sensitive, including (crucially) empty tables. Empty tables are a common occurrence and most people I speak to have a bunch of them, for a multitude of reasons: they’re using an ERP/CRM and aren’t using some of the modules, some tables are older and are gradually being phased out, and so on.

What happens over time, however, is that some tables we originally weren’t using start to populate with data, and others that were in scope suddenly aren’t. But because this isn’t technically a schema change, and we can’t necessarily rely on the developers to know when it will happen, how do we catch it?

The easiest thing to do in this instance is an automated check – write once, run many times. As we’ve seen from the many, many blog posts before this one, the answer is pretty straightforward – SQL Data Catalog’s PowerShell module once again saves the day!

(You can find the full documentation on the PowerShell module and a cmdlet reference here in the documentation.)

The Process

The most obvious thing to do is to re-scan our DB so Catalog is up to date, and then fetch all of our columns into a collection:

# Connect to SQL Data Catalog
Connect-SqlDataCatalog -AuthToken $authToken -ServerUrl $serverUrl

# Quickly refresh the scan of the Instance to get accurate row counts
Start-ClassificationDatabaseScan -FullyQualifiedInstanceName $instanceName -DatabaseName $databaseName | Wait-SqlDataCatalogOperation

# Get all columns into a collection 
$allColumns = Get-ClassificationColumn -instanceName $instanceName -databaseName $databaseName

Each of the columns returned by Data Catalog here has a number of properties that we can use:

The properties of each column object returned by the SQL Data Catalog PowerShell

All of these properties can be useful for something (and have been in the past), but in this case the properties we care about are the id, the tags collection (specifically its name property) and the tableRowCount. The id uniquely identifies our columns, so it can be used to sort and filter without having to match on text; the tags collection means we can search for any columns that do or do not have the Out of Scope – Unused tag; and the tableRowCount tells us whether or not the table is populated with data.
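
If you want to eyeball these properties for yourself, a quick line like the one below will dump them for the first column returned (the property names here are the ones relied on throughout this post, but do check them against your own output):

# Inspect one column object - id, tags and tableRowCount are the properties we use below
$allColumns | Select-Object -First 1 | Format-List id, columnName, tableName, tableRowCount, tags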

It’s then fairly easy to filter these down:

# The tag used for unused / empty tables - adjust this to match the tag name in your own taxonomy
$unusedTag = "Out of Scope - Unused"

# Collection for columns that are on empty tables
$emptyTableColumns = $allColumns | Where-Object { $_.tableRowCount -eq 0 }

# Collection for columns that are tagged as unused
$ColumnstaggedAsEmpty = $allColumns | Where-Object { $_.tags.name -eq $unusedTag }

# Columns tagged as unused that now hold data, and columns on empty tables missing the tag
$ColumnsNowInUse = $ColumnstaggedAsEmpty | Where-Object { $_.id -notin $emptyTableColumns.id }

$ColumnsMissingClassification = $emptyTableColumns | Where-Object { $_.id -notin $ColumnstaggedAsEmpty.id }

You can easily visualize these in Data Catalog itself too:

Columns that are on empty tables, regardless of tag
Columns that are tagged as Out of Scope Unused regardless of if they have data or not

We can already see a difference in the lists – but if we run the full PowerShell script (which you can find right here in my GitHub, and I’ll include it at the end) then we actually find out what those differences are:

The PowerShell result: 2 classified as empty columns now with data, 3 columns not classified as out of scope, now on empty tables

So we can correct these by classifying the 3 now-empty columns as Out Of Scope – Unused, and changing the 2 now-populated columns to be In Scope:

Suppliers is now in use, and corrected to be In-Scope
The missing Out Of Scope – Unused tags are corrected on the empty columns

…and now when we run the PowerShell again we should find everything is rectified:

Everything has been corrected, now no columns are incorrectly labelled or missing labels

Ongoing

Now that we have a script we can run to identify these issues, you could implement it on the back end of the DevOps process described in the blog post or video linked at the start of this post – this would allow you to be constantly sure that your view of the Production database is accurate and up to date.
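
If you want the check to actively gate that process rather than just report on it, a minimal sketch (re-using the collections built by the script above, and assuming it runs as a pipeline step that honors exit codes) could be as simple as:

# Fail the pipeline step if either mismatch collection has any entries
if ($ColumnsNowInUse.count -gt 0 -or $ColumnsMissingClassification.count -gt 0) {
    "Found " + $ColumnsNowInUse.count + " columns tagged as unused that now contain data."
    "Found " + $ColumnsMissingClassification.count + " columns on empty tables that are missing the tag."
    exit 1
}
"Classification of unused tables is up to date."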

The key to a good classification process, as part of ongoing governance and an up-to-date view of your structured data estate, is minimizing the manual effort that goes into maintaining it – which, with help from the Data Catalog PowerShell module, we can do easily (if we wanted a process that would go out of date after 2 weeks, we could have just used MS Excel…)

Happy Classifying!

SQL Data Catalog, Data Masker and your DevOps pipeline: How do I make sure everything is being masked?

“However fast regulation moves, technology moves faster. Especially as far as data is concerned.”
Elizabeth Denham

You’re probably sick of me constantly talking about how the cataloging of columns should be part of the DevOps upstream deployment process. I’ve blogged about it. I’ve even produced a video demonstrating this in action. But one question that this always throws up is:

If we include cataloging in the upstream process, how do we make sure our masking sets are also staying up to date?

The benefit of including the classifications in the upstream pipeline is that nothing ever gets to Production that hasn’t been classified – we always have a perfectly up-to-date idea of the nature of our structured data estate, how data is distributed, what risk is associated with which systems, and so on. But one of the biggest wins is that we constantly know which fields need to be masked when we’re pulling copies back into non-Production.

Add a table? Add columns? We know about them, they’re classified, they’re deployed… so now they need to be masked on our next refresh. But how? Well it all depends on which approach we want to take:

  • Automated
  • Manual

Doesn’t it always boil down to those options? What I mean is that either we have an appetite to generate our masking set afresh every single time based on our classifications (Automated), or we want to ensure that we configure each rule ourselves (Manual). Both have benefits and drawbacks.

Automated

How: We can generate a masking set using the SQL Data Catalog PowerShell each and every time as part of our pipeline. Add a column, tag it, and then simply wait for the necessary rulesets to be generated in the pipeline (perhaps as part of your build) for you to run when you bring a copy back. Check out my walkthrough for how to set this up.

Benefits: The process is automatic. It’s headless and you don’t need to think about it at all. As long as the classifications are provided (and if you follow the steps from the blog post and video you should be providing them) then you’re always generating rules for every classified column.

Drawbacks: This process can be fragile. If we don’t classify correctly we can end up masking in the wrong way or trying to mask the wrong field (e.g. a Primary Key, Constraint, Identity etc.) which can cause masking failures and then you have to spend time fixing the pipeline/masking set. This also means that the nature of the masking is dependent on your classifications, and the values you will get masked into the columns will be less realistic as a result (i.e. you can’t generate Row-Internal Sync Rules using the integration).

Drawback Mitigation: To avoid the process breaking, be sure to really focus on how you set up your API settings / how you pick which categories and tags are used to generate the masking rules (like I discussed here). This will at least help you make sure you map common data sets into columns (and don’t hit columns that have constraints or keys).

Only columns marked with Static Masking as the Treatment Intent will get a rule created for them
An Information Type is given to every column we intend on masking – these are then mapped to templates in masker to ensure more realistic data

Manual

How: Either rely on developers to check in masking set changes alongside their code changes, or build a manual intervention step into your upstream process to ensure that someone opens and updates the schema and rules within the masking set, and then puts this back into the pipeline.

Benefits: This results in more accurate masking sets that are more likely to run, and that generate significantly more reliable and realistic data as an end result. Masked DB copies can more easily be used for anything beyond simple development changes, including analytics and the like. You can include any specialist rules you need and apply your own understanding and knowledge of the database.

Drawbacks: This is obviously a manual process – less than ideal. Anything that involves a human can invariably go wrong because we are humans and we make mistakes (which we learn from of course). This also takes significantly more time as part of the process.

Drawback Mitigation: This is a harder one to mitigate as we’re reliant on manual intervention. However, the best check could be another team member reviewing the changes (maybe as part of a pull request), or you could include an automated PowerShell script to reconcile the columns to be masked according to Catalog against the columns currently in the Data Masker masking set – this would help you understand whether all the necessary columns have been updated or whether any were missed. I have included an example of this PowerShell below.

### CHANGE THIS ###

$MaskingSet = "yourmaskingset.DMSMaskSet" # Your masking set including the DMSMaskSet file extension
$instance = "yourinstance" # The Instance as it is shown in Data Catalog that hosts the database
$DatabaseName = "yourdatabase" # The DB you want classification info for
$CatalogServer="http://yourmachine:15156" # The location of your catalog server, ending on :15156
$authToken="redacted" # Your Data Catalog Auth token from the Settings page
$tagName = "Static Masking" # The tag you're using to identify which columns need to be masked

### DON'T CHANGE THIS ###

Invoke-WebRequest -Uri "$CatalogServer/powershell" -OutFile 'data-catalog.psm1' -Headers @{"Authorization"="Bearer $authToken"}
Import-Module .\data-catalog.psm1 -Force
Connect-SqlDataCatalog -ServerUrl $CatalogServer -AuthToken $authToken 
$ColumnsMarkedForMasking = Get-ClassificationColumn `
    -InstanceName $instance `
    -DatabaseName $DatabaseName | Where-Object {$_.tags.name -eq $tagName} 
$MaskingSetXML = [xml](Get-Content -Path $MaskingSet)
$subrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSubstitution')
$internalrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleRowInternal')
$shufflerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleShuffle')
$searchreplacerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSearchReplace')
$TablesAndColumns = @()

$subrules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnAndDataSetCollection.DMSPickedColumnAndDataSet.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$internalrules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$shufflerules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnCollection.DMSPickedColumn.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$searchreplacerules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$result = $TablesAndColumns | Sort -Unique
$ColumnsNeedingRules = $ColumnsMarkedForMasking | Where-Object {($_.tableName + "." + $_.columnName) -notin $result}

"`nThere are " + $ColumnsMarkedForMasking.count + " columns that require masking for database " + $DatabaseName + " in SQL Data Catalog."
"You are masking " + $result.count + " distinct columns in masking set: " + $MaskingSet
"`nThe columns that do not currently have a mask configured are:`n"

$ColumnsNeedingRules | ForEach-Object {$_.tableName + "." + $_.columnName + "     (" + $_.dataType + ")"}

$next = Read-Host -Prompt "`nWould you like to see the columns currently in your masking set? (Y/N)"
if ($next -in ("Y", "y")) {$result}

This code can also be found on my GitHub here.

The output of running the script – 13 columns in the masking set, 14 columns outstanding to be masked

You can technically even use this same check approach for the automated masking set generation option, to ensure that everything has been tagged correctly.

PASS Data Community Summit 2021

“Education is the kindling of a flame, not the filling of a vessel.”
Socrates

I will be speaking at PASS Data Community Summit 2021

I have spoken at previous PASS Summits, both by virtue of working for Redgate and off my own back, through dedication and passion for the subject matter I speak about: Data Privacy and Protection.

In 2018 I stood on stage with Microsoft to speak about the nature of Static Data Masking, how it differs from Dynamic Masking and what challenges need to be considered for a successful static masking rollout.

In 2019 I stood on stage alone to talk about creating a strategy for masking non-Production environments, including a walkthrough of the dbatools.io masking functionality utilized alongside Azure SQL Database classifications. PASS Summit 2019 was also when Kendra Little encouraged me to set up this blog, for which I’m forever grateful.

In 2020… well. You know what happened.

In 2021, Summit gets a new lease of life. Data Community Summit will be entirely online (no surprises there), but one big surprise you might not know is that it is completely free to attend. Never before has there been SUCH a swathe of incredible speakers, with such a huge variety of topics and learning pathways, for free and available on demand afterwards.

The dates for your diary? November 8-12, 2021

As it happens, I will also be speaking about setting up an end-to-end deployment pipeline using the Flyway Community Edition, Azure SQL Database and Azure DevOps. It would be great to see you there, but with so much on offer I could absolutely understand if you watched on catch-up!

You can see all the speakers here, but here’s a short list of some of the sessions I will definitely be tuning in to!

  • Erin Stellato – Demystifying Statistics in SQL Server
  • Grant Fritchey – Identify Poorly Performing Queries – Three Tools You Already Own
  • Tracy Boggiano – Azure SQL Fundamentals
  • Angela Tidwell – Azure Devops Dashboards EZ as pie-charts!
  • Indira Bandari – Getting started with Python for Data professionals
  • Jess Pomfret – Azure SQL & Our Toolbox To Manage It
  • Taiob Ali – Think like the Cardinality Estimator
  • Neil Hambly – Azure Notebooks – Data Science fundamentals
  • and many more!

So please go check it out & register, support the community and do a bunch of learning in the process – it will be amazing to see you there and hopefully I’ll even get to see some of you in person in the not-so-far future!

Configuring Dynamic Data Masking in Azure SQL Database from SQL Data Catalog using PowerShell

“Horror is the removal of masks.”
Robert Bloch

I spend a great deal of my time talking about Data Masking. Don’t believe me? Check out here, here, here and even here. I talk about it a LOT, but I’m always talking about Static Data Masking (SDM), which is the process of masking the data at the file level itself so that it is irreversibly altered – this is fabulous for non-Production environments like Dev and Test, especially when you pair it with a good cloning technology.

But what about Staging / Production environments?

I often work with teams to implement SDM and one of the fastest routes to successfully generating your masking sets for cloned environments is, of course, SQL Data Catalog (or cataloging solution of your choice) – you’ve already put the effort in to classify your columns and figure out where sensitive information exists within your databases and instances… so doesn’t it make sense that we can just use THAT as a source of truth and generate masking from there?

Note: I actually produced an end-to-end video showing the process of Cataloging, Masking and Cloning in under an hour here: https://www.red-gate.com/hub/university/courses/sql-data-catalog/end-to-end-data-protection-with-sql-data-catalog-and-sql-provision – if you watch/try it let me know how you get on!

Funnily enough though, we wouldn’t want to use SDM for Production (and potentially Staging) environments – as it irreversibly changes the data, it’s just going to completely mess up all of our Prod data. To tackle this, many people I work with turn to Dynamic Data Masking.

Dynamic Data Masking (DDM)

DDM is a method of masking data based on your access rights to it. As far as customers are concerned, they can access their data through our site or application with no issues; but if anyone else needs to query that data, or different people need to see different results when querying environments, DDM has been the way to go.

Whilst a lot of people like to pick up on some of the well-known downsides of DDM, it’s not like you’re entrusting the entire security of an environment to it alone – there are a ton of measures we can put in place and DDM is just one. As an ex-colleague of mine (someone very wise whom I admired greatly, and whom I am still sad to this day I no longer get to work with) used to say: “It’s about building a defensible position. The more you do, the easier it is to prove you’re doing something and the more likely you are to BE protected.”

So when a customer asked this week if it was possible to configure Dynamic Data Masking from SQL Data Catalog (because they’d seen the “Treatment Intent” category and the tag that clearly states “Dynamic Data Masking”), just like we’re able to configure Static Data Masking, well now that was a challenge I couldn’t turn down!

The SQL Data Catalog Taxonomy Page – Treatment Intent Category showing Dynamic Data Masking

DDM in Azure SQL DB

Configuring Dynamic Data Masking in Azure SQL DB is fairly straightforward through the Azure portal: you can go to your Azure SQL DB, click Dynamic Data Masking, and it gives you the option to simply pick and save the columns to apply masking to, and to choose to whom these rules do / don’t apply:

DDM in the Azure Portal for the DMDatabase_Dev, with masks configured on customer_firstname and customer_email

However when we potentially have a lot of columns or DBs to configure masks for this is really going to get very old very fast. As with all things, I turned to PowerShell for the answer and fortunately I found it: https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview?#set-up-dynamic-data-masking-for-your-database-using-powershell-cmdlets – basically I can get existing DDM configurations and set new configurations for columns directly from my Azure SQL DB using PowerShell.

Now, I’m no expert on DDM – and Redgate Data Masker for SQL Server is not a DDM solution (so I’ve only ever really needed to know SDM), so I don’t pretend to be – but it seemed that I had everything I needed to tie Catalog into DDM.

PowerShell time!

I’ve written so much PowerShell to get classifications out of Data Catalog at this point it’s become second nature, but if you’re using the SDC PowerShell module and you need a reference you can view it here: https://documentation.red-gate.com/sql-data-catalog/automation-with-powershell but the standard “stuff” goes:

  • Pull down the PoSh module
  • Connect to catalog where it’s installed using an Auth token
  • Grab out the classifications with Get-ClassificationColumn
  • Shrink this down to just the columns we care about based on the tags

But the Az PowerShell cmdlets were honestly just as easy to use! I was surprised how easy it was to get up and running:

  • Connect to my Azure subscription
  • Get the current list of columns already with DDM masks
  • Remove these from the Catalog list
  • Update the remaining columns to use the default Mask

This was the full code I ended up using:

# This script is intended to be used with Azure SQL Database and Redgate SQL Data Catalog, however you are welcome to adapt and edit as required
# It will pull columns out of azure that are already being masked, and a list of columns that need to be masked with DDM
# It will then rationalise these, and configure Default DDM masks for any columns not already being masked on that Azure SQL DB

#Variables for Azure SQL DB & Catalog
$ResourceGroup = "DMDb"
$ServerName = "dmnonproduction" # Your instance minus .database.windows.net
$instance = "dmnonproduction.database.windows.net" # The instance or logical SQL Server as displayed in SQL Data Catalog
$DatabaseName = "DMDatabase_Dev"
$CatalogServer="http://pse-lt-chrisu:15156" # Your SQL Data Catalog location, leave off the trailing "/"
$authToken="REDACTED" # Your SQL Data Catalog Auth Token
$AzureSub = "Redacted" # Your Sub ID

# Get the SQL Data Catalog PowerShell Module & Connect
Invoke-WebRequest -Uri "$CatalogServer/powershell" -OutFile 'data-catalog.psm1' -Headers @{"Authorization"="Bearer $authToken"}
Import-Module .\data-catalog.psm1 -Force
Connect-SqlDataCatalog -ServerUrl $CatalogServer -AuthToken $authToken 

#Connect to your Azure Subscription
Connect-AzAccount -Subscription $AzureSub

#Get current active DDM Masks from Azure
$DdmMasks = Get-AzSqlDatabaseDataMaskingRule `
    -ResourceGroupName $ResourceGroup `
    -ServerName $ServerName `
    -DatabaseName $DatabaseName
$ListOfDDMColumns = $DdmMasks | ForEach-Object {$_.SchemaName + '.' + $_.TableName + '.' + $_.ColumnName}

#Get columns from Catalog currently marked with "Dynamic Data Masking" as a treatment intent
$CatalogColumns = Get-ClassificationColumn `
    -InstanceName $instance `
    -DatabaseName $DatabaseName | Where-Object {$_.tags.name -eq "Dynamic data masking"} 

#Filter down to a list of columns that need to be masked, that currently aren't configured with DDM
$ColumnsToDDM = $CatalogColumns | Where-Object {($_.SchemaName + '.' + $_.TableName + '.' + $_.ColumnName) -notin $ListOfDDMColumns }


#Set default DDM Masks for identified columns
$ColumnsToDDM | ForEach-Object { `
    New-AzSqlDatabaseDataMaskingRule -ResourceGroupName $ResourceGroup `
                                     -ServerName $ServerName `
                                     -DatabaseName $DatabaseName  `
                                     -SchemaName $_.schemaName `
                                     -TableName $_.tableName `
                                     -ColumnName $_.columnName `
                                     -MaskingFunction "Default"

}

But I have also uploaded it to my GitHub here in case anyone would like to take and adapt as they see fit: https://github.com/ChrisUnwin/PowerShell/blob/master/Demos/Redgate%20Demos/DDMFromCatalog.ps1

And this was the result – here are the two columns that were already being masked:

Customer Firstname and Customer Email with DDM Masks Configured

These were the columns I had marked as Dynamic Data Masking in Data Catalog:

Customer firstname, lastname, street address and email all marked for DDM in Catalog

and after running the PowerShell it deduced that the delta was street_address and lastname and created the default DDM mask for them in Azure:

All columns now being masked dynamically

Considerations

1 – I have used the default mask in this process; however, if you wanted to configure the mask (as per the link to the docs above) to use specific numbers or a specific format you could absolutely do this, simply by modifying the PowerShell to look at the data type and pass a different masking function into New-AzSqlDatabaseDataMaskingRule for each of those types.
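
As a rough sketch of that first point (and only a sketch – the data type checks and mask choices below are illustrative, though Default, Email and Number with -NumberFrom / -NumberTo are documented options for New-AzSqlDatabaseDataMaskingRule; do double-check against the current Az.Sql docs), you could do something like:

# Choose a masking function per column instead of always using "Default"
$ColumnsToDDM | ForEach-Object {
    $params = @{
        ResourceGroupName = $ResourceGroup
        ServerName        = $ServerName
        DatabaseName      = $DatabaseName
        SchemaName        = $_.schemaName
        TableName         = $_.tableName
        ColumnName        = $_.columnName
        MaskingFunction   = "Default"
    }
    if ($_.columnName -like "*email*") {
        # email-like columns keep a recognisable aXX@XXXX.com style mask
        $params.MaskingFunction = "Email"
    }
    elseif ($_.dataType -like "*int*" -or $_.dataType -like "*money*") {
        # numeric columns get a random number mask instead
        $params.MaskingFunction = "Number"
        $params.NumberFrom      = 0
        $params.NumberTo        = 9
    }
    New-AzSqlDatabaseDataMaskingRule @params
}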

2 – This only applies to Azure SQL DB and does not take into account the considerations of using DDM on, say, a SQL Server 2017 instance running on a VM – however, you could use the same approach to pass the columns into some dynamic T-SQL, which would in turn run the correct command to add DDM to the column(s) in question.
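
For that second point, a minimal sketch of the “dynamic T-SQL” idea might look like this – taking the same $ColumnsToDDM collection and emitting ALTER TABLE statements using the standard SQL Server DDM syntax, for you to review and run against the instance however you prefer:

# Generate the equivalent DDM statements for a VM-hosted SQL Server instance
$ddmScript = $ColumnsToDDM | ForEach-Object {
    "ALTER TABLE [$($_.schemaName)].[$($_.tableName)] ALTER COLUMN [$($_.columnName)] ADD MASKED WITH (FUNCTION = 'default()');"
}
$ddmScript | Set-Content -Path ".\AddDDMMasks.sql" # review first, then run against the target instance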

3 – I would still use Static Data Masking (SDM) for non-Production environments, because if anyone bypasses the DDM they will have access to the full data, which we don’t really NEED in less secure non-Prod environments anyway, so Static might well be the way to go!

Automating best practice checks at build time using the SQL Code Analysis cmdline (and failing the build)

“Quality is not an act, it is a habit.”
Aristotle

I’ve always been thoroughly impressed with the static analysis code report that you can get from SQL Change Automation when it creates a Release Artifact prior to deploying upstream, and with the fact that we can use SQL Prompt to carry out on-the-fly static analysis as we write our T-SQL code in SSMS. But it has always struck me as odd that there doesn’t appear to be a way to include these checks at build time.

Build time seems like the perfect opportunity to build the DB from scratch (check), run unit tests (check) and check that no poor coding practices have been checked in onto our branch (uh… not check?).

Enter SQL Code Analysis!

Don’t know what I’m talking about? I got you: https://documentation.red-gate.com/scg/sql-code-analysis-documentation/code-analysis-for-sql-server-command-line but to be fair I didn’t know this was a thing myself until yesterday!

One of my esteemed colleagues asked this question yesterday and, being me, I just couldn’t wait to have a go – is there a way to include this cmdline as part of an Azure DevOps build and FAIL the build if issues are found?

First things first: how does it work?

I downloaded the command line and it was pretty simple to get my head around. You can use Windows or SQL auth, you can point it at a scripts folder or a live DB, and you can output the results to XML, HTML or the console if you’d like. Simple.

SqlCodeGuard.Cmd.exe /s:localhost\TOOLS /d:DMDatabase2019 /out:helloworld.html
SQL Code Analysis Console Output: 4 issues with the DMDatabase2019 found
Code Analysis HTML Output: 4 issues found with the DMDatabase2019, but much easier to read this time

Next: Wrap it in some PowerShell

I am neither a Windows command line nor a PowerShell guru, but my first instinct when I’m going to include something like this in a build or deployment is to use PowerShell. It’s easy and mostly non-confusing to pass variables through the pipeline to PowerShell, easy to customize scripts and include if/else logic, and easy to capture exit codes.

So I wrapped the cmdline call in some PowerShell (take it, it’s all yours!) that made it easier to:

  • Import the XML output generated by Code Analysis
  • Count the number of issues generated
  • Exit with error code 1 (failure) if any issues were found
  • Exit clean with 0 if no issues are found

#Set Path for Code Guard, server/instance, database and output location for XML
$codeGuardPath = "C:\Users\chris.unwin\Downloads\SCG-2019-10-17-11-40-22-46"
$server = "REDACTED"
$database = "REDACTED"
$outLocation = "$codeGuardPath\myoutput.xml"
#$user = "REDACTED"
#$password = "REDACTED"

#Invoke SQL Code Guard against the DB (could be the Build Database)
& "$codeGuardPath\SqlCodeGuard.Cmd.exe" /s:$server /d:$database /out:$outLocation #/u:$user /p:$password

#Import output xml file and count contents
$blah = [xml](Get-Content -Path $outLocation)
$files = $blah.SelectNodes('//file') #Objects with issues
$issues = $blah.SelectNodes('//file/issue') #Total issues themselves

#If number of issues > zero, exit with non-zero exit code and output list of affected objects
if ( $issues.count -gt 0 ) {

    "You have: " + $files.count + " objects, containing a total of: " + $issues.count + " issues."
    $files.fullname
    "Please review the xml output for more information."

    exit 1

}

#Else continue with no issues
else {

    "No code issues discovered."

}

This works like a charm:

Static Analysis Output in PowerShell: 2 objects with 30 issues

Finally: Put it in a pipeline

Unsurprisingly, putting it in a pipeline was the easiest part. I took a pipeline I had that was already running a local Azure DevOps agent in my default pool, made sure SQL Code Analysis was present in the correct directory on that machine and voilà! The build fails if it finds any issues.

N.B. I just stuck the raw PowerShell in the pipeline; you would be better off passing connection and location variables to the PowerShell script using custom Azure DevOps environment (and secret) variables. Oh, and having a better install directory for Code Analysis than Downloads – my bad…
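
For what it’s worth, swapping the hard-coded values out is only a few lines – here is a sketch of the top of the script, assuming you’ve defined pipeline variables and mapped them into the task’s environment (the variable names below are purely illustrative):

# Read server, database and Code Analysis location from environment variables supplied by the pipeline
$codeGuardPath = $env:CODEGUARD_PATH
$server        = $env:CA_SERVER
$database      = $env:CA_DATABASE
$outLocation   = Join-Path $codeGuardPath "myoutput.xml"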

Example YAML containing the PowerShell step
Code Analysis finds issues, so the PowerShell exits with Code 1, causing the build to fail

Time to choose.

There are a couple of things I’ve assumed here – I’m running it locally on a server and against a database, and that database could absolutely be one that I’ve just built during my CI pipeline – but you could also run this against a scripts folder / set of scripts. So even if you don’t yet have a full build / deploy process, or you have a different process that works for you, you can still include SQL Code Analysis with fairly minimal overhead! Enjoy!

Creating rollback scripts automatically with SQL Compare and SQL Data Compare in Azure DevOps (migrations approach)

“Life’s under no obligation to give us what we expect.”
Margaret Mitchell

Hello everyone and happy 2021! Wait, what? It’s FEBRUARY!?! Oops.

You may well have noticed I have not been around for a couple of months. That is because I’ve been struggling. A lot. We’ve had some home worries, and with the state of the world playing on our minds constantly, it has never felt right to sit down and blog anything other than some of the recipes I’m working on (purely to get out of my own head!). But this week I was asked about rollbacks when working in the Migrations-based approach with SQL Change Automation, and I couldn’t resist writing something.

Before I get into this though, I want to be very clear about my preferences. I don’t believe database changes SHOULD ever be rolled back. Always forward. If for any reason you need to undo a change – whether for a disaster, just because you want to, or because other code is being rolled back – then this should be a task. A rollback branch should be created, the respective changes should be made in dev, and then, using your fabulous, flexible and repeatable pipeline, they should be merged, built, checked, tested and deployed upstream to Prod. This rolls the database forwards and does so in a way that is controlled, auditable and prevents data loss.

However, there MAY be some scenarios where a rollback is unavoidable – for example, if you are required to do so immediately, or you’re not sure which of the 3 changes that were just deployed actually caused something to go bang, so you want a blanket undo until you can investigate properly.

With this in mind, there are a couple of ways of rolling back in the migrations approach that can absolutely be a part of your pipeline, one or two of which I want to try and flesh out in this post. This post would of course be much shorter if we were working with the State-Based approach, as we would simply need to compare a previous artifact with our target and generate a down script – but as we know, the state-based method does indeed come with its own challenges.

  1. Generate down scripts up front – this is an option that is described fairly well in the SQL Change Automation documentation, and may be preferable to some; however, it is limited to Visual Studio, definitely involves some additional work for developers, and potentially generates quite a lot of scripts that may have to be rationalized.
  2. Ad-hoc rollback with SQL Compare – again, this is described in the documentation, but these will have to be changes that are then recaptured in the development pipeline in some way; plus, this method is ideal for programmable objects but not so much for table changes.

But I think we’re onto something there with a combination of these 2 approaches. Is there a way to use SQL Compare and SQL Data Compare alongside SQL Change Automation, to automatically generate down scripts as artifacts, before each Production deployment? This is what I want to investigate.

Set Up

As usual, I set up an Azure DevOps repo (this time called RollBackPipe) and created myself a set of local databases that I’ve been using to record videos lately so I now have the MaskMePlease database in Dev, Test, Staging and Production flavors. I’m doing this locally on my machine with an Azure DevOps agent I had hanging around anyway – it saved me firing up a bunch of Azure SQL DBs, and I just used SQL Clone.

Implementation

I got the pipeline running straight through. I was simply using YAML for this pipeline, as I didn’t want to “faff” around with the Release stages etc., but this could also have been achieved with Classic Pipelines of course.

Ignore the terrible stage names, essentially I was doing 3 things with SQL Change Automation:

  • Database Build: Producing a Nuget Package to represent the successfully building files stored in my Git repo, to consume later in the pipeline
  • Create Release Artifact: This stage looks at the target environments (Test, STG and Production) and creates a number of reports in my local DatabaseDeploymentResources folder for me to review
  • Deploy From Release Artifact: Take the patch script that has been produced and run this against the target DB

So it was time to try and build in the SQL Compare Command Lines. Note if I was doing this on an Azure Pipelines agent I might need to run this as a container or programmatically install the command lines instead, but fortunately they’re just installed on my machine so I could just add a command line stage.

There were actually a couple of ways to do this: I could either pass in the details of the 2 databases entirely via the command line (full documentation here), or generate an XML config file to pass in and store that in version control, or pre-construct a SQL Compare / Data Compare “Project” and just pass that in. The latter seemed easier, so of course I did that!
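
(For completeness, the fully command-line-driven version of that first option would look something like the sketch below – using the same Staging and Production databases that appear in the pipeline YAML at the end of this post, and the documented /Server1, /Database1, /Server2, /Database2 and /ScriptFile switches; check the SQL Compare command line documentation for the exact options you need.)

# Sketch: generate the down script without a saved project, passing both databases directly
& "C:\Program Files (x86)\Red Gate\SQL Compare 14\sqlcompare.exe" `
    /Server1:".\TOOLS" /Database1:"MaskMePlease_Production" `
    /Server2:".\TOOLS" /Database2:"MaskMePlease_STG" `
    /ScriptFile:"C:\DatabaseDeploymentResources\MaskMePlease\SchemaDowngradeScript.sql" `
    /include:Identical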

I opened up SQL Compare and did a Database to Database comparison for Staging and Production, but in the reverse direction to ensure we create a down script and not an up script!

I then saved this as a project and, using the red SQL Compare logo in the top left, I was able to launch SQL Data Compare with the same project so that it loads the settings for you. The reason for doing this is that if we undo anything that was created as part of a migration and then try to redeploy it, it won’t actually be recognized as missing, because the Migration Log table tracks all the migrations and will still say it has been deployed. So we need to compare the MigrationLog tables too, to re-sync back to the state before we ran the migrations.

However, I DID make sure that before comparing I included only the __MigrationLog table in the comparison, in Tables and Mappings:

I then had to change the Comparison Key being used for the migration log table to [Custom], because the default comparison key includes the date/time executed, so this is what I went for:

I saved this project down with the Compare script:

and then following the rather helpful documentation on the command lines, I was able to make the following calls as part of my pipeline (YAML included):

- task: CmdLine@2
  inputs: 
    script: '"C:\Program Files (x86)\Red Gate\SQL Compare 14\sqlcompare.exe" /project:"C:\DatabaseDeploymentResources\MaskMePlease\_ComparisonResources\ProdToStagingDownScript.scp" /scriptfile:"C:\DatabaseDeploymentResources\MaskMePlease\%BuildId%\SchemaDowngradeScript.sql" /include:Identical'
  env:
    BuildId: $(Build.BuildId)
- task: CmdLine@2
  inputs:
    script: '"C:\Program Files (x86)\Red Gate\SQL Data Compare 14\sqldatacompare.exe" /project:"C:\DatabaseDeploymentResources\MaskMePlease\_ComparisonResources\ProdToStagingDataDownScript.sdc" /scriptfile:"C:\DatabaseDeploymentResources\MaskMePlease\%BuildId%\DataDowngradeScript.sql" /include:Identical'
  env:
    BuildId: $(Build.BuildId)

and sure enough, having this run after the deployment to Staging but before the deployment to Production, I was able to get the reports produced into the artifacts folder with the relevant changes documented:

So it was time to try it out! I created a BadTable on Dev and generated the migration for it, and committed it to source control:

This then triggered the pipeline to run:

and produced the correct reports and .sql scripts:

Then I ran these scripts to roll back Production (of course I could have modified them to preserve any data I needed to retain – this actually works best with programmable objects like sprocs). Then I re-ran the deployment pipeline and the BadTable lives once again!

So does this work with multiple changes too? I created a stored procedure called “deletemeplease” and modified our BadTable to add a “randomthing” column, to see if we could roll those back:

Production after deployment:

Production after rollback:

All I have to do to roll forwards again now is run my pipeline:

and they’re back:

Conclusions

Is it perfectly possible to have a version specific script output as an artifact as part of your Azure DevOps pipelines? Yes it absolutely is.

Is it perfectly possible to rollback changes that you’ve made when taking a Migrations-First approach to database development? Yes it absolutely is.

Would I recommend this approach? No. Absolutely not.

Just because this is possible doesn’t mean it is a good process. I’ve been asked about it enough times that I was more curious than anything – however, after actually IMPLEMENTING this process, it is clear to see there are many aspects of it that can fall down, and it is clearer now, more so than ever, that the best approach would be either to roll forwards from Dev, through the pipeline, in an automated, auditable and controllable way, or, if we just need to roll back temporarily because other code is being rolled back, to use SQL Compare ad-hoc. This pipeline was very simple, with very few changes, and when I made mistakes I could rectify them easily myself – but with a larger dev team, under pressure to fix things, and with many changes to consider, this approach can very quickly fall apart.

Long story short, (I believe) we should follow and trust the documentation.

YAML from Azure DevOps:

trigger:
- main

pool: 'default'

steps:
- task: RedgateSqlChangeAutomationBuild@4
  inputs:
    operation: 'Build SQL Change Automation Project'
    sqlProj: 'RollBackPipe\RollBackPipe.sqlproj'
    packageName: 'MaskMePlease.Database'
    tempServerTypeBuild: 'localDB'
    nugetPackageVersionSelector: 'Specific'
    nugetPackageVersion: '1.0'
    nugetPackageVersionUseBuildId: true
    requiredVersionOfDlma: 'latest'

- task: RedgateSqlChangeAutomationRelease@4
  inputs:
    Operation: 'Create'
    NuGetFile: 'MaskMePlease.Database.1.0.$(Build.BuildID).nupkg'
    ExportPath: 'C:\DatabaseDeploymentResources\MaskMePlease\$(Build.BuildID)\Test'
    AppendEnvironment: false
    TargetDatabaseConnectionType: 'explicit'
    TargetDatabaseServer: '.\TOOLS'
    TargetDatabaseName: 'MaskMePlease_Test'
    TargetAuthMethod: 'windowsAuth'
    TransactionIsolationLevel: 'serializable'
    AbortOnWarning: 'none'
    requiredVersionOfDlma: 'latest'


- task: RedgateSqlChangeAutomationRelease@4
  inputs:
    Operation: 'DeployFromResources'
    ImportPath: 'C:\DatabaseDeploymentResources\MaskMePlease\$(Build.BuildID)\Test'
    AppendEnvironment: false
    TargetDatabaseConnectionType: 'explicit'
    TargetDatabaseServer: '.\TOOLS'
    TargetDatabaseName: 'MaskMePlease_Test'
    TargetAuthMethod: 'windowsAuth'
    requiredVersionOfDlma: 'latest'

- task: RedgateSqlChangeAutomationRelease@4
  inputs:
    Operation: 'Create'
    NuGetFile: 'MaskMePlease.Database.1.0.$(Build.BuildID).nupkg'
    ExportPath: 'C:\DatabaseDeploymentResources\MaskMePlease\$(Build.BuildID)\Staging'
    AppendEnvironment: false
    TargetDatabaseConnectionType: 'explicit'
    TargetDatabaseServer: '.\TOOLS'
    TargetDatabaseName: 'MaskMePlease_STG'
    TargetAuthMethod: 'windowsAuth'
    TransactionIsolationLevel: 'serializable'
    AbortOnWarning: 'none'
    requiredVersionOfDlma: 'latest'


- task: RedgateSqlChangeAutomationRelease@4
  inputs:
    Operation: 'DeployFromResources'
    ImportPath: 'C:\DatabaseDeploymentResources\MaskMePlease\$(Build.BuildID)\Staging'
    AppendEnvironment: false
    TargetDatabaseConnectionType: 'explicit'
    TargetDatabaseServer: '.\TOOLS'
    TargetDatabaseName: 'MaskMePlease_STG'
    TargetAuthMethod: 'windowsAuth'
    requiredVersionOfDlma: 'latest'

- task: CmdLine@2
  inputs: 
    script: '"C:\Program Files (x86)\Red Gate\SQL Compare 14\sqlcompare.exe" /project:"C:\DatabaseDeploymentResources\MaskMePlease\_ComparisonResources\ProdToStagingDownScript.scp" /scriptfile:"C:\DatabaseDeploymentResources\MaskMePlease\%BuildId%\SchemaDowngradeScript.sql" /include:Identical'
  env:
    BuildId: $(Build.BuildId)
- task: CmdLine@2
  inputs:
    script: '"C:\Program Files (x86)\Red Gate\SQL Data Compare 14\sqldatacompare.exe" /project:"C:\DatabaseDeploymentResources\MaskMePlease\_ComparisonResources\ProdToStagingDataDownScript.sdc" /scriptfile:"C:\DatabaseDeploymentResources\MaskMePlease\%BuildId%\DataDowngradeScript.sql" /include:Identical'
  env:
    BuildId: $(Build.BuildId)
- task: RedgateSqlChangeAutomationRelease@4
  inputs:
    Operation: 'Create'
    NuGetFile: 'MaskMePlease.Database.1.0.$(Build.BuildID).nupkg'
    ExportPath: 'C:\DatabaseDeploymentResources\MaskMePlease\$(Build.BuildID)\Production'
    AppendEnvironment: false
    TargetDatabaseConnectionType: 'explicit'
    TargetDatabaseServer: '.\TOOLS'
    TargetDatabaseName: 'MaskMePlease_Production'
    TargetAuthMethod: 'windowsAuth'
    TransactionIsolationLevel: 'serializable'
    AbortOnWarning: 'none'
    requiredVersionOfDlma: 'latest'


- task: RedgateSqlChangeAutomationRelease@4
  inputs:
    Operation: 'DeployFromResources'
    ImportPath: 'C:\DatabaseDeploymentResources\MaskMePlease\$(Build.BuildID)\Production'
    AppendEnvironment: false
    TargetDatabaseConnectionType: 'explicit'
    TargetDatabaseServer: '.\TOOLS'
    TargetDatabaseName: 'MaskMePlease_Production'
    TargetAuthMethod: 'windowsAuth'
    requiredVersionOfDlma: 'latest'

SQL Change Automation and GitLab CI/CD (a.k.a. Oh this is fun on Windows)

“You never know what you can do until you try, and very few try unless they have to.”
C.S. Lewis

Well I don’t have to, but many of the people I speak to on a daily basis are moving into GitLab, so it’s about time I tried it! You will find here a testament to the mistakes I make as I try to set up a full end-to-end database change management process with SQL Change Automation and GitLab.

Will it all work perfectly? I don’t doubt that everything will fall over at some point, but let’s see how we get on all the same, and hopefully if you’re setting up this same pipeline, you’ll be able to avoid the errors and failings I inevitably cause! So here we go!

Jan 2021 Edit: Hey guess what!? There’s a video of this too! Check it out on YouTube.

Let’s set up a GitLab Project (and rename the default branch)

Naturally, I didn’t have a GitLab account, so I had to set one up. I’m assuming that if you’re using it already, or you’ve just started using it, you’re taking advantage of the more business-focused features, but I’ve just stuck with the good ol’ free account for now! It was remarkably simple: sign up, email address, confirm, and here we are:

Ok, there is something very cool I like about setting up a new project – can you tell what it is?

You can set up a completely blank new project, but they also have templates, you can import projects OR – and I love this – you can set up a full CI/CD pipeline from another repo! Having done this before in Azure DevOps, it was not easy, let me tell you. It really seems like Azure DevOps hates you for setting up CI/CD from an external repo, even though it has plenty of helpful ways of doing so!

So I initialized my repository with a README and updated it:

Don’t ever say I’m not descriptive enough!

The first thing I did was rename my default branch to ‘trunk’ by going to Branches, creating the new branch, then in Settings > Repository changing it to the default, and then swapping the protected status over from the outdated master:

Then finally delete the old default in Repository > Branches:

Excellent. Now it’s time to clone trunk onto my machine as we will need the local repository to put our change automation project in!

I created a folder called GitLab test and cloned the mostly empty repository into it:

Simple!

Create a new SQL Change Automation project and push it to trunk

In SSMS I opened up the most recent version of SQL Change Automation and created a new project called “DoggosAreCoolDB” using a copy of a Dev database I had lying around from a previous demonstration (BlogsDotRedgate):

Then I created my baseline as a migration script against the up-stream copy, BlogsDotRedgate_Integration, because who has access to Prod for this? Am I right? *cough & shifty eyes* not me!

I successfully generated my baseline and a change script (I added a column to a table, nice and simple) and then committed them to my local repo, and pushed! Forget branching, today isn’t about that, we’re just PUSHING TO TRUNK, WOO-HOO!

Setting up the CI/CD Pipeline

Now that we have our project and migrations in GitLab we can build out a pipeline! So, first stop, I went straight to CI/CD > Pipelines and hit “Get Started”:

They throw you straight into a Quick Start “Help”-style guide which is immediately a little unintuitive, but surprisingly helpful if you read the whole thing. Effectively we need a YAML file called .gitlab-ci.yml that will store our pipeline as code, telling it how and where to build, and we need a runner to actually fire up and execute these steps.

In my experience with some other CI/CD tools, it’s been advantageous to actually create the Runner / Agent first on the machines you’re going to be using, so as I just have my laptop to do this on, I will set one up on there! I found the full documentation for a Windows Runner here, and followed it just so I would have it available.

The GitLab Runner was up and running in my services, but I’ll be darned if I could see it anywhere in GitLab…

Aha! So it turns out, after a bit of digging, that you need to register the runner specifically using the CI/CD section of the project settings. That was probably my bad for not reading the documentation thoroughly, but my counter-argument… who actually does? So I issued the register command, applied tags and a description, and chose my runner type – I chose shell because I need to be able to run PowerShell on the machine (I’ll need the SQL Change Automation PowerShell components available on the machine where the Build and Deployment are happening, of course):
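
(If you’re following along, the register step itself boils down to something like the sketch below, run from the runner’s install folder – the URL and registration token come from the project’s Settings > CI/CD > Runners page, and the token shown is obviously a placeholder.)

# Register the runner against the GitLab project with a "sql" tag and the shell executor
.\gitlab-runner.exe register `
    --non-interactive `
    --url "https://gitlab.com/" `
    --registration-token "YOUR-PROJECT-REGISTRATION-TOKEN" `
    --description "Laptop runner" `
    --tag-list "sql" `
    --executor "shell" `
    --shell "powershell"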

I’ve got the runner on the machine and I’m using an instance of SQL Server to build against, so now I just need the YAML file (fortunately GitLab has full documentation for how to structure this as well!)

To build my project I’m going to need to know where the repo is cloned to during the process (i.e. to find the .sqlproj file), so I dug around and found a list of environment variables that can be used in the YAML file. Just to be sure though, I created and committed the most basic YAML file that would just echo back the location of the cloned files:

stages:
  - DatabaseBuild

Build:
  stage: DatabaseBuild
  tags: 
   - sql
  script: 
   - echo $env:CI_PROJECT_DIR

After this confirmed that the environment variable worked and that the build pipeline was being fired up correctly on my private runner, I tried something a little more ambitious: building the .sqlproj file, using the cmdlet reference from the SQL Change Automation documentation for help:

stages:
  - DatabaseBuild

Build:
  stage: DatabaseBuild
  tags: 
   - sql
  script: 
   - $projectPath = $env:CI_PROJECT_DIR
   - $project = "$projectPath\DoggosAreCoolDB\DoggosAreCoolDB.sqlproj"
   - $validatedProject = $project | Invoke-DatabaseBuild -TemporaryDatabaseServer "Data Source=PSE-LT-CHRISU\"

and we successfully built a database!

All that’s left to do now is 2 things:

  1. Create a NuGet package as part of the CI build
  2. Release the database changes to the target DB

I’m still using the same machine for the release portion too, so naturally I can use the same runner for this; if you have other servers you’re deploying to, you will of course need additional runners.

We can very easily extend what we already have in our YAML file by just telling the process to create and export a new build artifact – I’m going to name it the same as everything else, and then append the BuildId to the end of the file so we always get something unique:

stages:
  - DatabaseBuild

Build:
  stage: DatabaseBuild
  tags: 
   - sql
  script: 
   - $projectPath = $env:CI_PROJECT_DIR
   - $project = "$projectPath\DoggosAreCoolDB\DoggosAreCoolDB.sqlproj"
   - $validatedProject = $project | Invoke-DatabaseBuild -TemporaryDatabaseServer "Data Source=PSE-LT-CHRISU\"
   - $buildArtifact = $validatedProject | New-DatabaseBuildArtifact -PackageId DoggosAreCool.Database -PackageVersion 1.$env:CI_JOB_ID
   - $buildArtifact | Export-DatabaseBuildArtifact -Path "$env:CI_PROJECT_DIR\Export"
  artifacts:
    paths:
     - $env:CI_PROJECT_DIR\Export\DoggosAreCool.Database.1.$env:CI_JOB_ID.nupkg
    expire_in: 1 week

You’ll notice how I’m exporting the NuGet package to the project directory and then uploading it; this is so that we’ll have access to it for the release, but also so that we can use the artifacts argument in our YAML to upload the file and make it a downloadable package through the GitLab interface (if you go to that SPECIFIC job):

Whilst we’re on a roll here (and things haven’t gone wrong for a while) I’m going to add 2 additional stages ALL AT ONCE – “Create a Database Release Artifact” and “Deploy from a Database Release Artifact” – using, once again, the SQL Change Automation PowerShell cmdlets.

Woo-Hoo! I’m invincible!

I broke it.

Can you see what I did wrong? The error is:

New-DatabaseReleaseArtifact : The specified value for the Source parameter is neither a valid
database connection string nor a path to an existing NuGet package file or scripts folder:
'CI_PROJECT_DIR\Export\DoggosAreCool.Database.1.725147351.nupkg'

So, 2 fun things. 1) I forgot to reference the environment variable properly at one point, so it was just treating the variable name as a literal part of the path, and 2) it keeps erroring out saying my NuGet file isn’t a NuGet file – weird, right?

On further inspection it is yet another mistake I made. I’m using the job ID to name the NuGet package, which means that when a later stage tries to find the file, the ID is 2 steps ahead – because each stage is counted as a different job! Duh!

A few quick changes should hopefully sort this out! I’m going to use the pipeline ID instead ($env:CI_PIPELINE_ID) and see if that makes a difference!

Wait. Did it just say the pipeline ran? SUCCESSFULLY? That’s exactly what it said! We can verify that this actually happened as well by checking the DatabaseDeploymentResources folder for the Release Artifact to Integration:

And everything is there! Note that you won’t have a changes.html report just yet because this is the first time we’ve successfully deployed to Integration; however, if we run 1 more change through (I’ll add a stored procedure):

Boom.

Now of course we can add additional stages to this, for manual intervention or to promote to other environments, but I’m going to call it a win here and retire (until the next post) gracefully. I’m sure you’re all wondering what my final YAML file looked like too – well (counterintuitively) I’ve popped it all into GitHub for you and pasted it below. Enjoy!

stages:
  - DatabaseBuild
  - CreateRelease
  - DeployToIntegration

Build:
  stage: DatabaseBuild
  tags: 
   - sql
  script: 
   - $projectPath = $env:CI_PROJECT_DIR
   - $project = "$projectPath\DoggosAreCoolDB\DoggosAreCoolDB.sqlproj"
   - echo "Building project $project"
   - $validatedProject = $project | Invoke-DatabaseBuild -TemporaryDatabaseServer "Data Source=PSE-LT-CHRISU\"
   - $buildArtifact = $validatedProject | New-DatabaseBuildArtifact -PackageId DoggosAreCool.Database -PackageVersion 1.$env:CI_PIPELINE_ID
   - echo "Exporting artifact to $env:CI_PROJECT_DIR\Export"
   - $buildArtifact | Export-DatabaseBuildArtifact -Path "$env:CI_PROJECT_DIR\Export"
  artifacts:
    paths:
     - $env:CI_PROJECT_DIR\Export\DoggosAreCool.Database.1.$env:CI_PIPELINE_ID.nupkg
    expire_in: 1 week

CreateRelease:
  stage: CreateRelease
  tags: 
   - sql
  script: 
   - $integrationDB = New-DatabaseConnection -ServerInstance "PSE-LT-CHRISU\" -Database "BlogsDotRedgate_Integration"
   - $buildArtifact = "$env:CI_PROJECT_DIR\Export\DoggosAreCool.Database.1.$env:CI_PIPELINE_ID.nupkg"
   - echo "Creating Release Artifact for DoggosAreCuteDB - check C:\DatabaseDeploymentResources\DoggosAreCuteInc\ReleaseArtifacts\$env:CI_PIPELINE_ID\Integration for more information"
   - $releaseArtifact = New-DatabaseReleaseArtifact -Source $buildArtifact -Target $integrationDB
   - $releaseArtifact | Export-DatabaseReleaseArtifact -Path "C:\DatabaseDeploymentResources\DoggosAreCuteInc\ReleaseArtifacts\$env:CI_PIPELINE_ID\Integration" -Format Folder

Integration:
  stage: DeployToIntegration
  tags: 
   - sql
  script: 
   - $integrationDB = New-DatabaseConnection -ServerInstance "PSE-LT-CHRISU\" -Database "BlogsDotRedgate_Integration"
   - echo "Deploying changes to Integration"
   - Import-DatabaseReleaseArtifact -Path "C:\DatabaseDeploymentResources\DoggosAreCuteInc\ReleaseArtifacts\$env:CI_PIPELINE_ID\Integration" | Use-DatabaseReleaseArtifact -DeployTo $integrationDB

5 (Noticeable) Business Benefits of Secure Database Provisioning

“Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives.”
William A. Foster

I know what you’re thinking.

Chris. Your title looks like it was written to be a corporate whitepaper that I see ads for when I’m browsing social media; it should have a CLICK HERE button, a boilerplate photo of a smiling person holding a pen and it should say something like ‘executives hate them, find out their secret here!’

But something has become immediately obvious to me in the last few months – I still speak to people daily who are forced to:

  • Work in shared development models
  • Work on empty (schema-only) / heavily subset databases
  • Work on old, out of date and/or irrelevant data
  • Make decisions without knowing enough about their data or what they hold

When speaking to them, though, it becomes immediately obvious that the reason there is no dedicated option available for developers is actually not related to the “traditional” problems one would expect. You would naturally assume that the reason for not refreshing these environments is the large amount of space or time required to refresh often enough, or even that ‘we simply cannot due to sensitive PII and regulatory concerns’.

No. In fact it comes down to, as all things do, time and money.

In the most recent State of Database DevOps report (2020, that is), a whopping 70% of 2000+ respondents replied that they were using a shared development database, and this comes with a whole heap of associated problems, like poor code quality, looser controls around sensitive data and defective deployments. These figures alone already point to the solution being to spin up copies for developers on demand – and it’s not like we can’t do that. There is SO much technology in the world, across almost all database platforms, that will allow us to virtualize, containerize, sanitize… (effectively all of the ‘izes’) our databases so that we can have full, safe, realistic copies as frequently as we like. So what is stopping us?

From experience, it’s justification*. People go to senior stakeholders and say “we need this technology”, and hear a cacophony of classic business challenges back: “but is it broken?”, “do we really NEED it?“, “it costs HOW much!?!“, “how much time will it take to implement?” etc… It’s dev and test hygiene, not a sexy major modernization project like using Azure Arc, using Blockchain or creating Artificial Intelligence. Who cares that developers have to share a database? We’ve got bigger Tofish to fry!

*Sometimes, but much less frequently, it’s down to complexity of implementation – but we’ll leave that one aside for now!

As you will know from my last post on why now is the time to adopt better working practices, it’s important for us to highlight the gains that can be made from newer, updated practices, and why now is not the time to be closing our minds off to a better way of life. It’s not going to be easy to sum this up in 5 points, and there are many other benefits to solid database provisioning, but these are, in my opinion, the ones that will revolutionize the way you develop.

Very important side note for this blog post: there are lots of practices, processes and tools that can form part of a “database provisioning process”, and they will vary wildly by experience, opinion and company – so for the purposes of the below I will be describing the benefits of a process that involves 3 primary components / steps, given these are the three I tackle most often:

  • Data Identification and Classification / Cataloging
  • Data De-Identification i.e. Data Masking
  • Data Provisioning i.e. Real Time Database Cloning / Provisioning

1 – Increase developer happiness / contentedness

Developers are employed to do 1 thing: innovate. It’s even in the name! Developers are on the cutting edge, focused on providing value to end users as quickly and efficiently as possible; with shortened release cycles, incremental stories and optimized workflows they can deliver that innovation. But a big part of the story is the setup.

Even if you’re working to a more agile methodology, it is hard to deliver and test changes which are, in development environments, fundamentally destructive and experimental, if you are sharing a workspace with multiple colleagues. Writing in shared Word documents can be frustrating at the best of times, so how can developers be expected to produce high-quality, rigorously tested, game-changing code when at any minute another developer can take the environment down, cause it to run slowly, or overwrite their changes? When you cannot produce changes in an isolated, sandbox environment where they can be individually assessed, re-worked and improved, you have no guarantee that a change is fit to be promoted.

All of these sound like arguments focused on the production of code, but in fact these issues can all have a huge impact on something that is widely overlooked and rarely scrutinized: developer happiness.

Developers are the people who make stuff go, and if they don’t feel content and valued in their roles, we can’t expect productivity and product quality to hold up – so when developers witness the poor management of their code, watching something they have worked so hard on slide down the priority list, get rolled back or get overwritten, they don’t feel motivated to continue doing the best that they can do.

With dedicated environments for dev and test, for different branches, pull requests etc. developers can finally work on innovative and exciting projects, and optimize the code that goes out the door to end users.

2 – Develop a common language about data & make better decisions

It’s very hard to speak about things when you use different language to describe the same thing. That much is obvious. In the United Kingdom alone we have many different words for bread rolls. So when someone comes into a sandwich shop in London and asks for a “Stotty”, can you guarantee that the person serving will know exactly what they mean, exactly when they say it?

No. There will be a gap where some translation will be required: some “down-time“, if you will. Now imagine taking something as simple as a bread roll and applying it to an enterprise data estate… you’re going to have a very bad time.

As I talked about in my blog posts here (the importance of database classification) and here (classification’s role in DevOps), before you can really make a fully informed decision about your data, you must know 2 simple things:

  1. What data you hold
  2. Where your data is

I should hurriedly add that I don’t just mean sensitive data here – all data deserves to be classified, because whether you’re a full stack developer adding a column to a table you’ve never used before, an auditor carrying out a Data Protection Impact Assessment (DPIA) and trying desperately to include the database, or a BI developer setting up some new reports or processes, you’re going to need to know about the data. This is where people have questions, and this is where you shouldn’t have to rely on anecdotal knowledge or be pushed around from one person to another at the company who supposedly “might be able to help“.

Better insight into data leads to better practices, less waiting (waste reduction) and better decisions. When we then act on that insight we move faster and deliver greater value through our pipelines – and answering those two questions can be as simple as the sketch below.
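
As a minimal sketch only (the schemaName, tableName and columnName property names are assumptions – check them against the column objects your own Data Catalog instance returns), the SQL Data Catalog PowerShell module can knock together a “what do we hold, and where?” summary in a few lines:

# Sketch: summarize what classified data we hold and where it lives.
# Property names (schemaName, tableName, columnName) are assumptions -
# inspect the objects returned by Get-ClassificationColumn in your estate.
Connect-SqlDataCatalog -ServerUrl $serverUrl -AuthToken $authToken

Get-ClassificationColumn -InstanceName $instanceName -DatabaseName $databaseName |
    Where-Object { $_.tags.Count -gt 0 } |
    Select-Object schemaName, tableName, columnName,
                  @{ Name = 'Tags'; Expression = { $_.tags.name -join ', ' } } |
    Sort-Object schemaName, tableName |
    Format-Table -AutoSize

Even a rough report like this means the next person with a question about the data has somewhere to look other than “ask around and hope”.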

Have you picked up on the trend yet? How all of these are going to end? Well don’t spoil the ending for those who haven’t, they’ll have to wait for the conclusion!

3 – Move faster and better enable the DevOps pipeline

It’s apt that I’m listening to an amazing EDM remix of the Green Hill Zone from Sonic while writing this section, but isn’t this just what we need as a business? We want to be able to move faster, or to put it in more ‘agile’ terms, we need to be able to pivot and adapt at a moment’s notice. Until now, the database has been a monolithic, difficult-to-steer behemoth, and it shows in our processes.

A tangible example of what I mean when I say “move faster” is branching. It’s fairly commonplace now for a developer to be able to clone a repository, check out a specific branch and create new branches without fearing what switching between those branches might entail. On a dev environment, especially when you are working database-first with your changes (it does make sense to know how the changes will impact the database first – that’s all I’m saying), it is, without a reasonable process in place, exceptionally difficult to switch between branches and keep work separate.

This often forces developers to stick to one environment where changes are all made in tandem, which can play havoc when it comes to capturing those changes in the right place – a manual, state-based comparison of a dev database carrying multiple branches of work against an upstream target could be disastrous.

This is why taking advantage of something like database virtualization, allowing you to spin up copies of databases in seconds, could be the answer. You can automate the provisioning of environments from git hooks, during Pull Request automation or for release candidates, and the experience will be exactly the same across the board – boom, *code base*, fresh and ready to go. When developers can move fast, value comes through a whole lot faster.
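
As a rough sketch of what that can look like (assuming, purely for illustration, Redgate SQL Clone as the virtualization layer – the clone server URL, image, machine and instance names below are all made up), provisioning a dedicated copy per branch is only a handful of lines in a hook or pipeline step:

# Sketch: one clone per branch, created in seconds from a (masked) image.
# The clone server URL, image name, machine and instance names are all
# illustrative assumptions - substitute your own.
Connect-SqlClone -ServerUrl 'http://sqlclone-server:14145'

$image    = Get-SqlCloneImage -Name 'DoggosAreCool_Masked_Latest'
$instance = Get-SqlCloneSqlServerInstance -MachineName 'PSE-LT-CHRISU' -InstanceName 'SQL2019'

# Name the clone after the current git branch so each branch gets its own sandbox
$branch = (git rev-parse --abbrev-ref HEAD) -replace '[\\/]', '_'

$image | New-SqlClone -Name "DoggosAreCool_$branch" -Location $instance | Wait-SqlCloneOperation

Drop something like that into a post-checkout hook or a Pull Request pipeline and every branch, PR or release candidate starts from the same fresh, isolated copy.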

4 – Minimize space constraints on new copies, on premise or in the cloud

Space is always a big player in these conversations; for some it’s enough to boil it down to “well, just how much space can we save??”, put a dollar value on the ROI and storm ahead with a solution (that’s not always right for them).

But space is a very real problem. Much as we (as technology professionals) like to believe that these are modern times of cloud-native solutions, easily scaled serverless compute and Big Data Clusters, there are still a LOT of people out there firefighting legacy (but necessary) technology and wrestling with what they CAN get out of backups or their SAN tech.

Even using cloud providers costs money: data egress and ingress cost $, BLOB storage costs $, additional security measures cost $. So it’s really not ideal when our databases – for historical reasons, or by virtue of the sheer AMOUNT of data we hold and process – are 5, 10, 50, 100TB+, because we’re going to be struggling with this Dev/Test issue for years to come.

As with point 3, database virtualization has come of age, and now we have a lot of different solutions, from containerization through to DBaaS, that can help minimize the amount of space we ACTUALLY require. That means less money poured into maintaining large, unwieldy Dev/Test environments, or into paying a large bill for the privilege of doing so in the cloud (and when developers will be using their dev machines anyway, it just makes sense to see what we can do to leverage that existing hardware).

Whilst this one doesn’t directly add value to the end of the pipeline, or speed up delivery, it can help reduce the overhead costs of the infrastructure needed to provide that value.

5 – Work on realistic data without worrying about data breaches

This is probably one of the most obvious reasons given that I tend to blog about data regulations and compliance ALL THE TIME but I feel like I need to keep saying this.

If you remove all of the data from development and test database copies, developers will have nothing meaningful to work with, and no testing beyond pre-defined values will be possible.

If you leave all of the data in development and test database copies, all you’re doing is duplicating your attack surface area and creating many more opportunities for that data to surface where it shouldn’t be – on the internet, in screenshots, in emails and, of course, in a breach.

So there needs to be a happy medium: we keep the useful characteristics that give us the insight and intelligence of a full data set – the business logic, trends, demographics etc. that we need during testing or analytics – but sanitize it so that the data subjects contained therein cannot be re-identified. Static data masking, applied to lower environments, allows us to retain the usefulness of the data with none of the risk.
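
The specifics belong to whichever masking tool you use, but as a minimal sketch of the principle only (the server, database, table and column names here are entirely made up, and a dedicated masking tool will give you far more realistic, rule-driven substitution), static masking on a non-production copy can be as blunt as a deterministic rewrite of the identifying columns before anyone gets access:

# Sketch: crude static masking applied to a non-production copy BEFORE it is
# handed to dev/test. All object names are hypothetical.
Import-Module SqlServer

$maskQuery = @"
UPDATE dbo.Customer
SET    FirstName = 'Sample',
       LastName  = CONCAT('Customer', CustomerId),               -- deterministic but non-identifying
       Email     = CONCAT('customer', CustomerId, '@example.com'),
       Phone     = '01234 567890';
"@

Invoke-Sqlcmd -ServerInstance 'DEV-SQL01' -Database 'DoggosAreCool_Masked' -Query $maskQuery

The important part is the ordering: mask once on the copy (or image) that feeds dev and test, and every environment downstream inherits data that is still useful but no longer personal.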

Protective measures can be built into the DevOps process from the very beginning, as you’ve already seen right here on my blog. As long as they are part of the process, and we have multiple controls (or guard rails) that let us operate safely and quickly without fearing that same speed will cause us to release any sensitive information, we can focus on one thing: value.

Conclusion

As you’ve seen above, it all comes down to time and money, but there are many ways to save and to speed up within a DevOps process by means of a good, solid database provisioning process. Whilst none of these reasons comes with a fixed ROI (unless you have ALL of your pre-prod database storage costs to hand), they contribute to something far better than that:

The ease of delivering value.

In a world where we can be concerned about everything, and where it’s hard to keep up with emerging technologies, it makes sense to start pruning away the blockers in the process – the problems that are stopping us from delivering value faster. THAT is the theme and point of this blog post: our end users. We’re already delivering excellent value to them, and we trust our developers and teams, but what’s stopping them from moving faster with database changes? Adopt a good provisioning process and you’ll start to notice all of the above become true of your database development lifecycle.

Using things weirdly – Part 2: Static Data in the Hybrid Model (w/ Redgate SQL Source Control and Change Automation)

“Where’s your will to be weird?”
Jim Morrison

I had some really positive feedback on my last “Using things weirdly” post, and truth be told, I love to use things weirdly. The number of times I’ve heard: “Oh, well, sure yeah I guess it works that way too…” is just too many to count. So imagine how my eyes lit up when I realized that you can do something weird with one of my favorite things to talk about at the moment, The Hybrid Model.

Now if you don’t know about the Hybrid Model then I would suggest you check out my post here that’s all about the different source control models available for your databases!

The Problem

Across both the State based and Migrations based offerings within the SQL Toolbelt, you have access to something very cool: the ability to easily (alongside the schema) source control your static data. Now don’t ‘@’ me because you think I should be referring to it as “Semi-Static” because it might change occasionally and ‘that’s not truly static then is it?‘; I could easily also refer to it as ‘Lookup Data’ or ‘Reference Data’, basically whatever you class things like “Country Codes” and “Currency Codes” as.

Whatever you call it, it can go into your VCS like any part of the database schema:

SQL Source Control: State Based
SQL Change Automation: Migrations Based

But. One thing that – as of writing this RIGHT NOW (23/07/2020 10:09am BST) – is not officially available in the Hybrid Model combining these two methods… is Static Data. The Data tab in SQL Change Automation even disappears when you set it up as a Hybrid workflow:

And this gives me a sad.

The Solution?

Got your Hybrid Model setup and ready to go? Let’s use it weirdly!

1 – Use SQL Source Control to commit some static data to your state repository. This is as easy as right clicking on your (highlighted green) source control linked database and selecting “Other SQL Source Control Tasks” > “Link/Unlink Static Data“, and picking your tables.

Nothing should be showing in SQL Change Automation:

2 – Unlink SQL Change Automation from the state repository for a moment and link it instead directly to the development database. This will cause it to go into Migrations-First mode. You can do this by clicking the blue source name in SSMS and picking Existing Database instead:

Because it’s technically the same database as the one you’re source controlling in the state repository, it should all just work™ and should tell you there are no dev changes to the source. Then you’ll see the “Data” tab has been enabled:

3 – Select the same tables to source control as you did with SQL Source Control by using the Add Tables wizard:

BUT WAIT!!

Isn’t it now going to generate a migration for our static data?? This isn’t included in the baseline or anything at all so far, so is it going to try and insert all of my static data into tables later on that already have it?

No. Actually we’re fine!

4 – Generate the static data migration script (and look at it for peace of mind). Notice that the script actually has checks in there – because we’re newly adding these tables, the migration will check that the tables are empty before inserting the data, and only AFTER this migration will SQL Change Automation start generating the incremental static data migrations:

Commit this migration script to your migrations repository, and that’s all we need to do here!

5 – All that’s left now is to re-hook-up the Hybrid pipeline: follow the same steps you did in Step 2, but this time, instead of an existing database, link it back to the state repository like it was before. If you’ve done this right, you should see no changes pending for migrations:

BUT you will notice that if you change any static data with SQL Source Control, it should now show up in SQL Change Automation!

Change to static data prior to commit in SQL Source Control
Change before migration script generation in SQL Change Automation
Generated migration script to be committed

Conclusion

Is it an intended use? No, absolutely not. The reason it’s disabled is that, as with all things at Redgate, features are considered heavily to ensure users are offered the best possible user experience, functionality and, essentially, something that meets requirements across the board.

But. We can use it weirdly to, as I say, just make it work™.

What have you used weirdly lately? Let me know!

I will be running for the Ridge, will you?

“Until one has loved an animal, a part of one’s soul remains unawakened.”
Anatole France

TL;DR: If you love to walk, jog or run – sign up to a fantastic cause here to support a shelter in need!

The Full Story: This is going to be a reasonably short post, because I’m hoping if you’ve made it this far, you’ll read to the end.

This August, I will be “running for the Ridge“. This is an event taking place (virtually, of course) where you can run, jog or walk any distance you like – as long as it is above 5km (3.11 miles) for adults or 2km (1.25 miles) for ages 12 and below – wherever you like, anytime throughout the months of August and September.

For the last 2 months, I have aspired to be a healthier version of myself – I have managed to adopt healthier eating habits (removing a lot of the processed sugars and oils I was eating, not drinking alcohol during the week, always exercising at least once a day), and as a result I have lost just over 12kg (26.5 lbs) and I feel so much better for it.

Running has been an instrumental part of the improvement in my quality of life and now I’m happy to be supporting something close to my heart with it.

So why this? What makes “running for the Ridge” special?

Well, because it’s in aid of Jacob’s Ridge – an animal shelter in Spain that, thanks to the impact of one nasty global pandemic, is sorely in need of donations in order to stay open and safeguard the well-being of the animals that it looks after, a cause deserving of support all in itself.

So that’s why this is special – because you can directly do something to:

  • Positively impact the lives of rescued animals
  • Positively impact your own health and mental well-being (in the form of exercise)
  • Do something nice with friends or family (at a responsible social distance where necessary)

To me, this just sounds like the easiest decision ever; run/jog/walk a distance of my own choosing, on my own comfortable running route and animals benefit from that? Amazing.

You will even get an awesome sustainable wooden medal for participating! And the cost? 12 GBP per person. That’s it. You can even get sponsored by friends, family and/or colleagues too if you want to contribute more!

So that’s it. That’s all I wanted to say. I’ll be running for the Ridge this August and you know – if you have a spare couple of hours, or you enjoy walking/jogging/running alone or in a group, think about putting it to good use: let’s get out, have some fun and do some good!

You can sign up here: https://www.jacobsridge.com/product-page/run-for-the-ridge – just sign up on the site and they’ll send you all the instructions direct to your email.

If you do sign up, tweet me @PlantBasedSQL with your entry number and I’ll give it a big ol’ like! I’m number 21 🙂