SQL Data Catalog, Data Masker and your DevOps pipeline: How do I make sure everything is being masked?

“However fast regulation moves, technology moves faster. Especially as far as data is concerned.”
Elizabeth Denham

You’re probably sick of me constantly talking about how the cataloging of columns should be part of the DevOps upstream deployment process. I’ve blogged about it. I’ve even produced a video demonstrating this in action. But one question that this always throws up is:

If we include cataloging in the upstream process, how do we make sure our masking sets are also staying up to date?

The benefits of including the classifications in the upstream pipeline is that nothing ever gets to Production that hasn’t been classified – we constantly have a perfectly up to date idea of the nature of our structured data estate, how data is distributed, what risk is associated with which systems? etc. but one of the biggest wins is that we constantly know which fields need to be masked when we’re pulling copies back into non-Production, constantly.

Add a table? Add columns? We know about them, they’re classified, they’re deployed… so now they need to be masked on our next refresh. But how? Well it all depends on which approach we want to take:

  • Automated
  • Manual

Doesn’t it always boil down to those options? What I mean is that either we have an appetite to completely generate our masking set afresh every single time based on our classifications (Automated) or do we want to ensure that we configure each rule ourselves (Manual). They both have benefits and drawbacks.

Automated

How: We can generate a masking set using the SQL Data Catalog PowerShell each and every time as part of our pipeline. Add a column, tag it in the pipeline and then simply wait for the necessary rulesets to be generated in the pipeline (perhaps as part of your build) for you to run when you bring a copy back. Check out my walkthrough for how to set this up.

Benefits: The process is automatic. Its headless and you don’t need to think about it at all. As long as the classifications are provided (and if you follow the steps from the blog post and video you should be providing them) then you’re always generating rules for every classified column.

Drawbacks: This process can be fragile. If we don’t classify correctly we can end up masking in the wrong way or trying to mask the wrong field (e.g. a Primary Key, Constraint, Identity etc.) which can cause masking failures and then you have to spend time fixing the pipeline/masking set. This also means that the nature of the masking is dependent on your classifications, and the values you will get masked into the columns will be less realistic as a result (i.e. you can’t generate Row-Internal Sync Rules using the integration).

Drawback Mitigation: To avoid the process breaking, be sure to really focus on how you set up your API settings / how you pick which categories and tags are used to generate the masking rules (like I discussed here). This will at least help you make sure you map common data sets into columns (and don’t hit columns that have constraints or keys).

Only columns marked with Static Masking as the Treatment Intent will get a rule created for them
An Information Type is given to every column we intend on masking – these are then mapped to templates in masker to ensure more realistic data

Manual

How: Either rely on developers to check in masking set changes along side their code changes or build in a manual intervention step to your upstream process to ensure that someone opens and updates the schema and rules within the masking set, and then put this back into the pipeline.

Benefits: This results in more accurate, more likely to run sets that generate significantly more reliable and realistic data as an end result. Masked DB copies can be more easily used for anything beyond simple development changes, including analytics etc. You can include any specialist rules you need and apply your own understanding and knowledge of the database.

Drawbacks: This is obviously a manual process – less than ideal. Anything that involves a human can invariably go wrong because we are humans and we make mistakes (which we learn from of course). This also takes significantly more time as part of the process.

Drawback Mitigation: This is a harder one to mitigate as we’re reliant on manual intervention, however the best way to check this could be with another team member checking (maybe as part of a pull request) or you could include an automated PowerShell script to effectively rationalize the columns to be masked from Catalog vs the columns currently in the Data Masker masking set – this would help you understand if all the necessary columns have been updated or if any were missed. I have included an example of this PowerShell below.

### CHANGE THIS ###

$MaskingSet = "yourmaskingset.DMSMaskSet" # Your masking set including the DMSMaskSet file extension
$instance = "yourinstance" # The Instance as it is shown in Data Catalog that hosts the database
$DatabaseName = "yourdatabase" # The DB you want classification info for
$CatalogServer="http://yourmachine:15156" # The lcoation of your catalog server, ending on :15156
$authToken="redacted" # Your Data Catalog Auth token from the Settings page
$tagName = "Static Masking" # The tag you're using to identify which columns need to be masked

### DONT CHANGE THIS ###

Invoke-WebRequest -Uri "$CatalogServer/powershell" -OutFile 'data-catalog.psm1' -Headers @{"Authorization"="Bearer $authToken"}
Import-Module .\data-catalog.psm1 -Force
Connect-SqlDataCatalog -ServerUrl $CatalogServer -AuthToken $authToken 
$ColumnsMarkedForMasking = Get-ClassificationColumn `
    -InstanceName $instance `
    -DatabaseName $DatabaseName | Where-Object {$_.tags.name -eq $tagName} 
$MaskingSetXML = [xml](Get-Content -Path $MaskingSet)
$subrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSubstitution')
$internalrules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleRowInternal')
$shufflerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleShuffle')
$searchreplacerules = $MaskingSetXML.SelectNodes('//DMSSetContainer_MaskingSet/DMSSetContainer/DMSRuleBindingList/RuleSearchReplace')
$TablesAndColumns = @()

$subrules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnAndDataSetCollection.DMSPickedColumnAndDataSet.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$internalrules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$shufflerules | ForEach-Object {`
    $CurrentTable = $_.TargetTableName.value
    $_.DMSPickedColumnCollection.DMSPickedColumn.N2KSQLServerEntity_PickedColumn.ColumnName.value | ForEach-Object {$TablesAndColumns+= $CurrentTable + "." + $_ }
}

$searchreplacerules | ForEach-Object {`
    $TablesAndColumns+= $_.TargetTableName.value + "." + $_.TargetColumnName.value
}

$result = $TablesAndColumns | Sort -Unique
$ColumnsNeedingRules = $ColumnsMarkedForMasking | Where-Object {($_.tableName + "." + $_.columnName) -notin $result}

"`nThere are " + $ColumnsMarkedForMasking.count + " columns that require masking for database " + $DatabaseName + "in SQL Data Catalog."
"You are masking " + $result.count + " distinct columns in masking set: " + $MaskingSet
"`nThe columns that do not currently have a mask configured are:`n"

$ColumnsNeedingRules | ForEach-Object {$_.tableName + "." + $_.columnName + "     (" + $_.dataType + ")"}

$next = Read-Host -Prompt "`nWould you like to see the columns currently in your masking set? (Y/N)"
if ($next -in ("Y", "y")) {$result}

This code can also be found on my GitHub here.

The output of running the script – 13 columns in the masking set, 14 columns outstanding to be masked

You can technically even use this same check approach for the automated masking set generation option, to ensure that everything has been tagged correctly.

PASS Data Community Summit 2021

“Education is the kindling of a flame, not the filling of a vessel.”
Socrates

I will be speaking at PASS Data Community Summit 2021

I have spoken at previous PASS Summits; both through the virtue of working for Redgate, and off my own back through dedication and passion to the subject matter I speak about: Data Privacy and Protection.

In 2018 I stood on stage with Microsoft to speak about the nature of Static Data Masking, how it differs from Dynamic Masking and what challenges need to be considered for a successful static masking rollout.

In 2019 I stood on stage alone to talk about creating a strategy for masking non-Production environments, including a walkthrough of the dbatools.io masking functionality utilized alongside Azure SQL Database classifications. PASS Summit 2019 was also when Kendra Little encouraged me to set up this blog, for which I’m forever grateful.

In 2020… well. You know what happened.

In 2021 Summit sees a new lease of life. Data Community Summit will be entirely online (no surprises there) but one big surprise you might not know is that it is completely free to attend. Never before will there have been SUCH a swathe of incredible speakers, with such a huge variety of topics and learning pathways for free and available on demand afterwards.

The dates for your diary? November 8-12, 2021

As it happens, I will also be speaking about setting up an end to end deployment pipeline using the Flyway Community Edition, Azure SQL Database and Azure DevOps it would be great to see you but with so much on offer I could absolutely understand if you watched on catch up!

You can see all the speakers here, but here’s a short list of some oft he sessions I will definitely be tuning in to!

  • Erin Stellato – Demystifying Statistics in SQL Server
  • Grant Fritchey – Identify Poorly Performing Queries – Three Tools You Already Own
  • Tracy Boggiano – Azure SQL Fundamentals
  • Angela Tidwell – Azure Devops Dashboards EZ as pie-charts!
  • Indira Bandari – Getting started with Python for Data professionals
  • Jess Pomfret – Azure SQL & Our Toolbox To Manage It
  • Taiob Ali – Think like the Cardinality Estimator
  • Neil Hambly – Azure Notebooks – Data Science fundamentals
  • and many more!

So please go check it out & register, support the community and do a bunch of learning in the process – it will be amazing to see you there and hopefully I’ll even get to see some of you in person in the not-so-far future!

“But I don’t wanna INSTALL it!”: Data Masker on the fly in Azure DevOps (with an Azure SQL DB)

“There is always a way to go if you look for it.”
Ernest A. Fitzgerald

As many of you know, I really enjoy talking about Data Masking. I fundamentally believe it is an absolutely ESSENTIAL part of Test Data Management and specifically the provisioning of Pre-Production environments. If you hold sensitive PII/PHI/PCI in your Production environments, you have no excuses for porting any of that back into Dev and Test.

I also stand firmly behind the belief (and it is just that, my belief) that masking is much safer than anything you can achieve with encryption or limiting access alone. Static masking, when done correctly, means that even if all other security measures are bypassed, or we accidentally expose data somehow, it doesn’t matter because the PII/PHI has been wiped, but it is still fundamentally useful for development and testing environments.

I’m also a huge proponent of using classification and data masking as part of a DevOps process, and often you’ll find me using Azure DevOps to actually kick off my masking process – but one of the most frequent questions I get is “do I have to install Data Masker somewhere?” and the answer I always have to give is… yes.

When we use SQL Provision to spin up our environments in non-Prod, generally we know they have to be in IaaS VMs or On-Prem, due to the nature of the technology (keep your eyes open on THIS because big changes coming soon *squeals in excitement*) but sometimes we’re working entirely with PaaS DBs and VMs don’t even come into our vocabulary – but we still need them to be masked for non-Prod use.

I have some data in an Azure SQL DB – a copy of DMDatabase (get it from my GitHub):

DM_CUSTOMER table in the DMDatabase

and I would like to get this masked before I create copies of it. Now I have written scripts in the past to make sure that Azure SQL DBs can be masked and then copied into non-Prod environments and you can get those scripts here, and you could easily wrap something like the below INTO a script like that – but much of the work I’ve done on this always involves having Data Masker installed somewhere and invoked on that machine – a VM, my laptop, whatever.

So, how do we avoid having to install Data Masker each time?

Data Masker for SQL Server does not (at time of writing) have publicly available a docker container or method for installing using Choco or something like that – once we do, trust me, I’ll be blogging about it A LOT as I will be very excited. But there is an install file available: https://download.red-gate.com/installers/DataMaskerforSQLServer/ and this might be enough.

The Process

Initially I created a Masking Set locally for the DMDatabase copy in Azure, it was nice and simple just masking a few of the fields on the CUSTOMER table:

It relies on SQL auth for the connection and I’m remembering the credentials, though these could be subbed in later on using the PARFILE (documentation on that here).

Next I put this masking set into a newly created Azure DevOps Git repo, which I cloned down onto my local machine – and then committed and pushed my changes up into Azure DevOps:

Now that this was all in ADO, it was time to set up a pipeline for it – so let’s jump into some YAML! Now, Data Masker currently needs to run on a Windows machine so we’ll set the pool to Windows-Latest:

trigger:
- main

pool:
  vmImage: windows-latest

The next step is to grab the installer – which I know I can easily do with PowerShell. I’m sure you could be more clever about this, but with limited time I chose the most recent version and hard coded that in to a PowerShell Invoke-WebRequest cmdlet:

- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      $source = 'https://download.red-gate.com/installers/DataMaskerforSQLServer/2021-03-15/DataMaskerforSQLServer.exe'
      New-Item -ItemType directory -Path C:\download
      $destination = 'C:\download\DataMaskerforSQLServer.exe'
      Invoke-WebRequest -Uri $source -OutFile $destination
      New-Item -ItemType directory -Path C:\download\DMlogs
  displayName: 'Download Data Masker'

This PowerShell task is going to grab the most recent exe for Data Masker and pull it down into C:\download on the hosted VM we’re using for the pipeline and it’s also going to create a directory for the Data Masker logs as well (if you wanted to extend the YAML at the end to wrap these logs up and publish them as a result of the pipeline masking, then go for it and tweet me to let me know!)

Next we have to extract and install Data Masker from that download, which is fairly easy to do with a cmd call:

- task: CmdLine@2
  inputs:
    script: |
      "C:\download\DataMaskerforSQLServer.exe" products "Data Masker for SQL Server" log c:\download temp c:\download /IAgreeToTheEula RG_LICESE=%RGLICENSE%
  displayName: 'Install DMS Headlessly'

Note I’m using the guidance from this page, making sure to accept the EULA and I’m passing in my Redgate License as a variable, that I have specified for the pipeline and kept secret. This will put Data Masker in the default location in C:\Program Files\… and means we will then be able to call it. I do however need to make sure that this now works. So I saved my pipeline and ran it to see what happened:

Fabulous, that all works nicely. Now to pass the DMSMaskSet file to Data Masker and get it to run – ah but I forgot, I’m going to need a PARFILE as per the cmdline documentation that specifies where the files etc. are for the run. So I create my PARFILE.txt as such:

MASKINGSET=C:\download\DMDB_MaskingTime.DMSMaskSet
LOGFILEDIR=C:\download\DMlogs
DATASETSDIR=C:\Program Files\Red Gate\Data Masker for SQL Server 7\DataSets 
REPORTSDIR=C:\download\DMlogs
INTERIM_REPORTS=false

As you can see, very simple. But to make sure these files are “simply” in the right folder, and because I don’t have time to explore how I could pass an environment variable into the .txt file itself, I’m going to add a quick file copy task to make sure that my masking set and PARFILE both make it into that location:

- task: CopyFiles@2
  inputs:
    SourceFolder: '$(Build.Repository.LocalPath)'
    Contents: '**'
    TargetFolder: 'C:\download'
  displayName: 'Copy files from working directory'

Of course now that we have Data Masker installed on the pipeline VM, the masking set AND the PARFILE… let’s get masking!

- task: CmdLine@2
  inputs:
    script: '"C:\Program Files\Red Gate\Data Masker for SQL Server 7\DataMaskerCmdLine.exe" PARFILE=C:\download\PARFILE.txt'
  displayName: 'Run Data Masker'

I agree the command is now less than impressive given all the prep work, but when you run all of this in it’s entirety…

Big success! But let’s check the data of course and make sure it is as we expect:

It all works, Lynne has been masked to Muna (and the rest has been masked too!), and I didn’t need to have Data Masker installed on a VM in my environment, with an Azure DevOps self hosted agent to run it – I could just do it programmatically. #Winning.

I’ll put the full YAML below however some caveats:

  • You cannot run this pipeline a LOT in a short space of time, I found that out. The RedGate downloads page is not particularly designed for this process, so it should probably be run sparingly otherwise the pipeline times out because the server, by way of balancing, will prevent you from pulling the file too often.
  • You will need to update the PowerShell step reasonably often if you want the DMS most recent installer, or make it more futureproof to grab the latest version – I just didn’t investigate that.
  • Data Masker is dependent on the power of the machine it is running on in MANY ways and Azure DevOps pipeline VMs are not particularly the most powerful beasts in the world – so if you have a lot of masking that needs doing, I would be weary of this method and might stick to an Application Server VM you’ve got hanging around, just in case.
  • This is likely to change a lot in the future so may not be relevant when you’re reading it after a few months – so before implementing anything like this, if it’s been a few months, contact me or Redgate and just confirm that there isn’t some better way of doing this, if I haven’t already blogged it!

Thanks for stopping by and have a great week!

Full YAML:

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: |
      $source = 'https://download.red-gate.com/installers/DataMaskerforSQLServer/2021-03-15/DataMaskerforSQLServer.exe'
      New-Item -ItemType directory -Path C:\download
      $destination = 'C:\download\DataMaskerforSQLServer.exe'
      Invoke-WebRequest -Uri $source -OutFile $destination
      New-Item -ItemType directory -Path C:\download\DMlogs
  displayName: 'Download Data Masker'

- task: CmdLine@2
  inputs:
    script: |
      "C:\download\DataMaskerforSQLServer.exe" products "Data Masker for SQL Server" log c:\download temp c:\download /IAgreeToTheEula RG_LICESE=%RGLICENSE%
  displayName: 'Install DMS Headlessly'

- task: CopyFiles@2
  inputs:
    SourceFolder: '$(Build.Repository.LocalPath)'
    Contents: '**'
    TargetFolder: 'C:\download'
  displayName: 'Copy files from working directory'

- task: CmdLine@2
  inputs:
    script: '"C:\Program Files\Red Gate\Data Masker for SQL Server 7\DataMaskerCmdLine.exe" PARFILE=C:\download\PARFILE.txt'
  displayName: 'Run Data Masker'

Moving teams and users between multiple SQL Clone Servers

“Portability should be the default.”
Larry Wall

Greetings one-and-all! It’s been a long time hasn’t it?

Well – I’ll be honest. I’ve not been great… I’ve been struggling. With the pressures of everything going on; from National Lockdown 3: The Lockdown Strikes Back, to working hard, and trying to lose weight it’s all been a bit too much – but that’s not really any excuse for neglecting y’all is it?

As we go further into 2021 I can’t promise that you’ll see LOADS more here, but I’m determined to bring you something, and so long as it helps 1 person out there, I’ll consider it a success.

Well I was asked a question yesterday I wasn’t really sure of the answer to, but is definitely a nice-to-know and I thought I would share it with all of you.

“If I have multiple SQL Clone Servers, or even just a new SQL Clone Server, how do I move the teams and permissions from one to the other without manually setting everything up again?”

Quite the head-scratcher indeed!

Emoji Emotion GIF by Anne Arundel Community College

Well fortunately we know where to start – in SQL Clone every action, every change is recorded by virtue of a back-end SQL Server DB called the SQL Clone Config DB (you may remember this from such classics as Using things weirdly part 1!) and this holds all of the information we might need to move things over.

Let’s start by taking a look at the schema in the DB:

There is already fairly comprehensive documentation on moving the whole SQL Clone Config DB and even the SQL Clone Server to another instance here, but we ONLY care about users and permissions for this particular run, so we’ll focus on that.

The schema is reasonably straight forward and well-named thankfully which means we can isolate the tables we’ll need for the move. Anything that mentions Teams or Principles (Users) is generally going to be in our remit. I took a reasonably fresh install of SQL Clone and added a few users and teams (I’ve been watching a lot of Sword Art Online lately, so you might sense a trend with these teams…

Now when we investigate some of the tables in the config DB we’ll notice that everything is really nicely normalized and mapped together:

But what does this look like in a fresh install with NO teams defined?

Wait… no teams defined/ activated but there’s… a team. The default team.

aaron paul what GIF by Breaking Bad

It turns out once you hit the “Enable Teams” option in the SQL Clone Dashboard, that entry magically goes away though and it looks more like we’re expecting:

Now we know the lay of the land – how do we move our users? Well to answer that I’m going to work on 3 assumptions here:

  • I’m going to stick to just these 3 tables (Principles, Teams and the mappings). There’s a wider question around Images and Clones etc. that we’ve added for those teams and any overlap that may occur, but we’re working on the belief that the SQL Clone Server we’re moving to is segregated and independent (so will not have access to the same images), so it only needs the users and teams.
  • The target SQL Clone Config DB we’re moving these to is brand new, and does not already have any users/permissions or teams set up except clicking the “enable teams” button in the dash.
  • The same user has set both SQL Clone Servers up, and is therefore position 1 in the dbo.SqlClonePrinciples table, however you could easily overwrite this one value manually if needed or use a local copy with the same #1 user to generate the script initially E.g. Chris.Unwin is ID1 in both the Source and Target Config DBs below, and then just add in the missing user manually to the target config DB.

1. SQL Data Compare – my recommended

SQL Data Compare is our friend here. It knows all about constraints and identities and anything we may come across. This is a fairly small scale need however if you open it up and simply compare those 3 tables we’ve been looking at, it should be able to pick up the users and permissions to insert, and generate us a script quickly and easily we can run against the target Config DB:

I’m going to generate the SQL Script and then run it against my target – notice how Data Compare takes care of any issues we might face with constraints and identity inserts to make sure everything 100% ties up? Good isn’t it!?

This was my script:

SET NUMERIC_ROUNDABORT OFF
GO
SET ANSI_PADDING, ANSI_WARNINGS, CONCAT_NULL_YIELDS_NULL, ARITHABORT, QUOTED_IDENTIFIER, ANSI_NULLS, NOCOUNT ON
GO
SET DATEFORMAT YMD
GO
SET XACT_ABORT ON
GO
SET TRANSACTION ISOLATION LEVEL Serializable
GO
BEGIN TRANSACTION

PRINT(N'Drop constraints from [dbo].[TeamSqlClonePrincipalMappings]')
ALTER TABLE [dbo].[TeamSqlClonePrincipalMappings] NOCHECK CONSTRAINT [FK_TeamSqlClonePrincipalMappings_SqlClonePrincipals_SqlClonePrincipalId]
ALTER TABLE [dbo].[TeamSqlClonePrincipalMappings] NOCHECK CONSTRAINT [FK_TeamSqlClonePrincipalMappings_Teams_TeamId]

PRINT(N'Drop constraint FK_TeamImageMappings_Teams_TeamId from [dbo].[TeamImageMappings]')
ALTER TABLE [dbo].[TeamImageMappings] NOCHECK CONSTRAINT [FK_TeamImageMappings_Teams_TeamId]

PRINT(N'Drop constraint FK_TeamInstanceMappings_Teams_TeamId from [dbo].[TeamInstanceMappings]')
ALTER TABLE [dbo].[TeamInstanceMappings] NOCHECK CONSTRAINT [FK_TeamInstanceMappings_Teams_TeamId]

PRINT(N'Add rows to [dbo].[SqlClonePrincipals]')
SET IDENTITY_INSERT [dbo].[SqlClonePrincipals] ON
INSERT INTO [dbo].[SqlClonePrincipals] ([Id], [FriendlyName], [Role], [Sid], [IsGroup]) VALUES (2, N'RED-GATE\Chris.Kerswell', 2, 0x010500000000000515000000fd43461e19525f12828ba62895680000, 0)
INSERT INTO [dbo].[SqlClonePrincipals] ([Id], [FriendlyName], [Role], [Sid], [IsGroup]) VALUES (3, N'RED-GATE\David.Ong', 3, 0x010500000000000515000000fd43461e19525f12828ba628cb7b0000, 0)
SET IDENTITY_INSERT [dbo].[SqlClonePrincipals] OFF
PRINT(N'Operation applied to 2 rows out of 2')

PRINT(N'Add rows to [dbo].[Teams]')
SET IDENTITY_INSERT [dbo].[Teams] ON
INSERT INTO [dbo].[Teams] ([Id], [Name], [Color], [IncludeAllPrincipals], [IncludeAllImages], [IncludeAllInstances]) VALUES (2, N'TeamKirito', N'#373737', 0, 0, 0)
INSERT INTO [dbo].[Teams] ([Id], [Name], [Color], [IncludeAllPrincipals], [IncludeAllImages], [IncludeAllInstances]) VALUES (3, N'TeamAsuna', N'#fc9003', 0, 1, 1)
SET IDENTITY_INSERT [dbo].[Teams] OFF
PRINT(N'Operation applied to 2 rows out of 2')

PRINT(N'Add rows to [dbo].[TeamSqlClonePrincipalMappings]')
INSERT INTO [dbo].[TeamSqlClonePrincipalMappings] ([TeamId], [SqlClonePrincipalId]) VALUES (2, 3)
INSERT INTO [dbo].[TeamSqlClonePrincipalMappings] ([TeamId], [SqlClonePrincipalId]) VALUES (3, 2)
PRINT(N'Operation applied to 2 rows out of 2')

PRINT(N'Add constraints to [dbo].[TeamSqlClonePrincipalMappings]')
ALTER TABLE [dbo].[TeamSqlClonePrincipalMappings] WITH CHECK CHECK CONSTRAINT [FK_TeamSqlClonePrincipalMappings_SqlClonePrincipals_SqlClonePrincipalId]
ALTER TABLE [dbo].[TeamSqlClonePrincipalMappings] WITH CHECK CHECK CONSTRAINT [FK_TeamSqlClonePrincipalMappings_Teams_TeamId]
ALTER TABLE [dbo].[TeamImageMappings] WITH CHECK CHECK CONSTRAINT [FK_TeamImageMappings_Teams_TeamId]
ALTER TABLE [dbo].[TeamInstanceMappings] WITH CHECK CHECK CONSTRAINT [FK_TeamInstanceMappings_Teams_TeamId]
COMMIT TRANSACTION
GO

and success! Everything has been replicated as we expect:

I also tried this on a VM that does not have the same AD access as I do locally and the users were imported successfully, but because they can’t be resolved they don’t show in the Clone dash so it doesn’t cause us any dead or invalid relationship issues.

2. & 3. Script from one DB to the other or SSIS load

Another simple-ish option would be to write the INSERT INTO SELECT FROM statement, or create a very simple SSIS package if you can connect to both – you would have to take care of the Constraints and Identity Inserts yourself, but it would be equally effective:

Important Note(s)

You will have to manually tie up any differences if you have already started using the new install with Teams and Permissions as THAT will get MUCH more complicated, especially if you have already assigned images and instances and users to teams – so this process is best done at the very start of life for the new SQL Clone Server.

Also. You will have to restart the SQL Clone Server Service after loading the users and teams in, as SQL Clone won’t recognize they’re there straight away.

Conclusion

This is going to be a rare occasion for most – the SQL Clone Server is not designed nor intended to deal with being duplicated many times. Although there is a relatively easy method for MOVING SQL Clone Servers simply moving a part of the Config like Users & Teams is definitely possible, but is not 100% fool-proof, so on said rare occasion, just be very, very careful!

3 RDBMS’, 3 models, 3 end-to-end deployment pipelines with Azure DevOps and Redgate Deploy

“Choice is the most powerful tool we have. Everything boils down to choice. Every choice we make shuts an infinite number of doors and opens an infinite number of doors.”
– Lori Deschene (https://tinybuddha.com/)

Picking a Set-Up

One of the hardest parts of my job is that at any moments notice we could be asked to walk through better database change management processes. That’s not the challenge, the problem is that it could be with any kind of tech stack. I might need a Git Repo of some shape or form (Azure DevOps, plain ol’ Git, Bitbucket etc.) and then a CI server of some kind (Azure DevOps, GitLab, TeamCity, Bamboo etc.) and finally something to handle releases (Azure DevOps, Octopus Deploy, Bamboo etc.) – this is fairly easy to reproduce in multiple combinations with automation, terraform etc. but when you’re actually helping someone set it up – you’ve got to know where all the bits go.

The Redgate tools work with all of these options and combinations so making sure we’re setting everything up right usually means questions about the Repo/CI/CD tooling people choose.

The commonality above and the one I run into the most for all 3 stages, is Azure DevOps. Its straightforward to understand, all in the same place and just plain fun to use (AND it supports emojis ^_^).

Finally now, we have to pick a Relational Database Management System (RDBMS) to use – Redgate Deploy is one of the newest offerings from Redgate and it comprises capabilities for “Database DevOps” across MS SQL Server, Oracle Database and 18 (well actually 19 now thanks to Flyway v7!) other RDBMs‘! So instead of choosing, I’m going to pick the two key ones there, and one of the 18 others: MSSQL, Oracle DB and PostgreSQL.

One final question I had to ask of myself was what models I wanted to use. There are a couple of choices available within the Redgate solution, specifically for MSSQL and Oracle at the moment, so I decided that I would do State based deployments for Oracle and Hybrid deployments for MSSQL, given that PostgreSQL will have to be migrations anyway. Fear not though, the setup is not hugely dissimilar when it comes to the actual pipelines!

Setting up Azure DevOps Repos

This stage was relatively easy – I simply created 3 new projects in my DefaultCollection where I’m going to put the repos for each of the DBs.

and then I created 3 readme files, and cloned all 3 git repos down onto my machine as local repos:

and we’re ready to go!

A quick note: I’m using a mixture of Azure DevOps hosted (for PostgreSQL) and Azure DevOps Server locally installed on my Virtual Machine (for MSSQL/Oracle) with a local agent present to run everything below – you can adopt this methodology or you can use the hosted version, but for the Oracle solution below at least you will need a local agent available (unless you use the DockerHub Image for Schema/Data Compare).

Microsoft SQL Server

The first thing I need to do for all of these is to pick the databases I’ll be working on – for me I’m rather lucky as our demonstration environment has a rather nifty set of databases for me to choose from!

I’m going with SQL Source Control (the MSSQL State component in Redgate Deploy) and SQL Change Automation (the MSSQL Migrations component) both plugged into Management Studio (SSMS) with a set of databases called the ScaryDBA_Dev/Test/Prod environments (which I used SQL Clone to create the copies of), in homage to the wonderful Grant Fritchey.

So the first thing we need to do is get Dev under source control – we’ve refreshed back from Prod so there shouldn’t be any differences and we’re using the Hybrid model, so we’ll need to create the State first. I do this by going to SQL Source Control in SSMS, and linking my DB to Git, creating a State Folder in the top level of my local repo as I do so:

Then once linked I go ahead and source control the initial schema (not sure how? Watch the Redgate University videos here):

Next I setup my Migrations project using SQL Change Automation, creating the Migrations folder in the same top level of my local repo, but now instead of pointing to the database, I’m pointing to my SQL Source Control generated State folder:

Now at this point we get the options to choose filters and comparison options – I would recommend if you’re not sure speak to someone at Redgate or look up the documentation – I often see people wanting to filter out Security/Users/Roles at this stage so it might be worth a look! I just carried on as I only have a few objects anyway!

Connect to the target and create a baseline script (i.e. what does Prod look like now?) again, because I have a minimal setup I’ll go straight from my “Prod” database:

Commit and push and we’re on our way – everything is in version control:

Now i may have cheated by doing MSSQL first – because now actually building and deploying the project is pretty straight forward – much like I have done in previous posts here and here I just used the SQL Change Automation plugins from the Azure DevOps marketplace to first build:

and then deploy the project:

and it all succeeded… the 2nd time around when I remembered to specify which DB I was deploying to!

Oracle Database

The first thing I need to do for all of these is to pick the schemas I’ll be working on… wait, Deja Vu! – well once again I have a little set of schemas present on the demonstration machine that will serve me just fine!

Because we’re working in the State setup, out of Redgate Deploy I’m going to use Source Control for Oracle which allows me to specify the remote repo, the folder to create and even the fact I’m using Azure DevOps Git:

(Step 1 was simply providing the connection details to my Oracle Database, hence why I was on step 2!) – I select the Schema I’ll be putting in Source Control and even get a nifty run down of the structure:

Hit next and give a name to the Project (unsurprisingly I went with HR) and then check in all of your initial objects:

Now one thing that you may have noticed if you’re following along that I should clarify (and which I forgot when setting up this blog post):

  1. You don’t need to specify the local repo you cloned down because Source Control for Oracle handles this itself in the back end, if you want it to be part of a local repo with other code in it, use the Working Folder instead
  2. If you are using Git and NOT the working folder, committing will also Push your objects to the remote – you’ve been warned!

As above, I now head over to Pipelines and hit Create New Pipeline! I check out my repo with the schema objects in it, and add a job to my agent. But what am I going to pick? Well unlike SQL Change Automation there’s not a plugin available on the Azure DevOps Marketplace, we’ll need some good old fashioned command line calls!

First, let’s clean out the CI Schema, I’m going to use the script to remove all objects from the Redgate documentation site and make a call to run the script using sqlplus (I’m storing the file locally but you could even include it in your repo under a build folder maybe?)

echo on
Call exit | sqlplus hr/[passwordredacted]@//localhost:1521/CI @C:\DemoFiles\DropAllObjects.sql
echo off

Next we’ll add a call to the cmdline of Schema Compare for Oracle to build the database from our repo, using the files that were checked out by the agent (an Azure DevOps pre-defined environment variable) – again we’re using a similar script from the Redgate DevOps for Oracle site but because we’re deploying ALL objects from version control, we don’t really want a report per say, this is just to test the schema can be built from the ground up:

"C:\Program Files\Red Gate\Schema Compare for Oracle 5\sco.exe" /deploy /source $(Build.SourcesDirectory)\Schema{HR} /target SYSTEM/[passwordredacted]@localhost:1521/CI{HR} AS SYSDBA /indirect 

echo Build database from state:%ERRORLEVEL%
 
rem IF ERRORLEVEL is 0 then there are no changes.
IF %ERRORLEVEL% EQU 0 (
    echo ========================================================================================================
    echo == Warning - No schema changes detected. == echo ========================================================================================================
)
 
rem IF ERRORLEVEL is 61 there are differences, which we expect.
IF %ERRORLEVEL% EQU 61 (
    echo ========================================================================================================
    echo == Objects were found and built. ==
    echo ========================================================================================================
    rem Reset the ERRORLEVEL to 0 so the build doesn't fail 
    SET ERRORLEVEL=0
)

and assuming this all works, we’ll package up the files into a zip and publish them as an artifact so we can consume them at the release stage!

and guess what? It all just worked *cough* on build #23 when I got the syntax right finally…

Of course we can add additional stages to the build as well, such as a check for Invalid Objects and some Unit Testing, but I’ll keep this pretty lean for now!

Now, just like we did for MSSQL we’re going to set up a new deployment pipeline, grab the artifact we’re publishing from the build, enable a CD trigger and we’re going to deploy to, in this case, Acceptance.

Let’s first create a job on the agent to unpack the zip file and see how far we get – I’m just going to dump them in a DeploymentState folder in the working directory:

and… awww thanks Azure DevOps, I needed to hear that!

and now we add yet another command line task, but this one is just going to do a comparison, it’s not actually going to deploy anything – because we’re going to add a manual intervention step to approve the deployment first! I had a little help again from the Redgate docs for this one, because I keep having to catch cmdline error codes – if I was wise like Alex Yates I probably would have just handled this with PowerShell…

echo off
rem  We generate the deployment preview script artifact here
"C:\Program Files\Red Gate\Schema Compare for Oracle 5\sco.exe" /abortonwarnings:high /b:hdre /i:sdwgvac /source $(System.DefaultWorkingDirectory)\DeploymentState\Schema{HR} /target SYSTEM/Redgate1@localhost:1521/Acceptance{HR} AS SYSDBA /indirect /report:$(System.DefaultWorkingDirectory)\DeploymentState\changes_report.html /scriptfile:$(System.DefaultWorkingDirectory)\DeploymentState\deployment_script.sql > $(System.DefaultWorkingDirectory)\DeploymentState\Warnings.txt

echo Warnings exit code:%ERRORLEVEL%
rem In the unlikely event that the exit code is 63, this mean that a deployment warning has exceeded the allowable threshold (eg, data loss may have been detected)
rem If this occurs it is recommended to review the script, customize it, and perform a manual deployment
 
IF %ERRORLEVEL% EQU 0 (
    echo ========================================================================================================
    echo == No schema changes to deploy
    echo ========================================================================================================

    GOTO END
)
 
IF %ERRORLEVEL% EQU 63 (
    echo ========================================================================================================
    echo == High Severity Warnings Detected! Aborting the build. 
    echo == Review the deployment script and consider deploying manually.
    echo ========================================================================================================
    rem Aborting deployment because high severity warnings were detected
        SET ERRORLEVEL=1
    GOTO END
)
 
rem This is the happy path where we've identified changes and not detected any high warnings
IF %ERRORLEVEL% EQU 61 (
    echo ========================================================================================================
    echo == Schema changes found to deploy - generating deployment script for review
    echo ========================================================================================================
    rem Set ERROLEVEL to 0 so the build job doesn't fail
	SET ERRORLEVEL=0
    GOTO END
)
 
:END
EXIT /B %ERRORLEVEL%

I then throw in an agentless job (Manual Intervention Step) and then finally (once I have reviewed the deployment report that is produced) one further cmdline call to actually run the deployment script again my Acceptance target:

echo on
Call exit | sqlplus hr/[passwordRedacted]@//localhost:1521/Acceptance @$(System.DefaultWorkingDirectory)/DeploymentState\deployment_script.sql
echo off

I have saved my pipeline, now it’s time to test. So I’m going to make a very quick change (so that something is produced) and see what happens…

Boom. Pipeline done.

One word on this though – I haven’t included an awful lot of frills (error handling, checks, NuGet instead of Zip etc.) so you’re free to bulk this out how you see fit, but by golly it works! Also make sure you tick this on the second Agent Job, else it’ll wipe out your working directory – something that obviously definitely did not happen to me…

PostgreSQL

This one might be cheating a little. As you know I’ve already setup a CI pipeline with Flyway before, using Azure SQL DBs and the Flyway Docker container as part of the build, and in some cases even tSQLt for Unit Testing too! But this is PostgreSQL, and this is a new blog post, darn it!

Still getting your head around Flyway? Check out the Redgate University videos!

I started out by creating myself a PostgreSQL 10 server in the Azure Portal, because:

  • I can
  • I didn’t want a local install of PostgreSQL
  • I’m not self sabotaging

and I set up a Dev and Test database on it – that is once I remembered to allow my client IP address *sigh* and then connected from Azure Data Studio:

I already have some basic scripts from my last demo that I can use – so I pulled down the latest version of Flyway (V7) and unzipped it into my files:

Then I created a SQL folder in my local repository for the PostgreSQLPipeline (and popped a couple of migrations in – I’m using the StackOverflow scripts, adapted for PostgreSQL from Kendra Little’s GitHub, thank you Kendra!) – in the previous posts we’ve had to source control the state or initial baseline of the database, however as we’re using Flyway for PostgreSQL this requires us to create and name/order the migrations ourselves, so we have plenty of control over that – hence why we can jump straight into building some scripts this time around.

Finally, I pointed the config file for Flyway to that, also taking the opportunity to point it at my Dev DB using the PostgreSQL JDBC:

Now i didn’t really NEED to do this step and try things out against Dev, because I already have the scripts, so I could have just started building the pipeline – but it’s always worthwhile getting local validation first by running things against Dev and then migrating up!

A quick Flyway Info later and we were good to go – the scripts are recognized so we know we’ve set everything up correctly.

One git add / commit / push and everything is in my repo:

Now as you may know from my other post we can do 1 of 2 things here – we can now either build what we eventually push to the repo using a cmdline call (like we did with the Oracle build) to a machine where we have Flyway installed, or we can use the Docker image.

I’m actually going to use Docker again but this time, instead of specifying the various credentials in a config file that was getting passed to the container, I’m actually going to use Azure DevOps environment variables and build the connection string that way – it’s really easy to keep the variables secret in Pipelines, so I can pass my JDBC connection, complete with Username and Password, as well as my Flyway license key, without worrying someone might get hold of them!

I’m actually going to build against a live PostgreSQL database before deploying, so I also created another DB for me to use: demodb_ci

I actually stole the YAML from my previous pipeline (below) and updated the variables accordingly:

trigger:
- master
 
pool:
  vmImage: 'ubuntu-latest'
 
steps:
- task: DockerInstaller@0
  inputs:
    dockerVersion: '17.09.0-ce'
  displayName: 'Install Docker'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run --rm -v $(FLYWAY_LOCATIONS):/flyway/sql flyway/flyway clean -url=$(JDBC) -licenseKey=$(licenseKey) -user=$(userName) -password=$(password) -enterprise 
  displayName: 'Clean build schema'
 
- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run --rm -v $(FLYWAY_LOCATIONS):/flyway/sql flyway/flyway migrate -url=$(JDBC) -licenseKey=$(licenseKey) -user=$(userName) -password=$(password) -enterprise 
  displayName: 'Run flyway build'

and it ran just fine! Well actually it failed first, because I didn’t have permissions from the IP address that the container was running from, but fortunately Azure has a handy switch in the PostgreSQL Server settings to simply allow Azure Services traffic through the firewall:

Once that was sorted, the first stage (as always) is to download Docker and then we have 2 Flyway containers steps:

1 – Clean the schema and make sure the database is empty
2 – Migrate the schema changes

Then we have two options – we could do like we did in the Oracle pipeline and zip up the files, spitting them out at Release stage and consuming them, either calling Flyway from the command line, or we can go ahead and promote our deployment using the same pipeline.

I’m lazy, so I’m going for the latter!

In a normal “production like” situation I would probably take the opportunity to test and check etc. like I did above, but let’s keep this super lean – if the build works, I trust the deployment. Lets go ahead and deploy to Production – I’ll add this as an additional task in my YAML:

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run --rm -v $(FLYWAY_LOCATIONS):/flyway/sql flyway/flyway migrate -url=$(ProdJDBC) -licenseKey=$(licenseKey) -user=$(userName) -password=$(password) -enterprise 
  displayName: 'Promote to Production'

And the deployment was successful! Phew – I think I’ve earned a cup of tea!

Conclusion

In this blog post I have demonstrated 3 different (and initially very simple*) approaches to the source control and deployment of database changes – but there’s actually a much wider combination we could have adopted – all 3 models with MSSQL, all 3 models with Oracle, and Migrations for up to 18 other systems like DB2, Snowflake and even SAP HANA! But what did I need to do ALL of this? A single solutionRedgate Deploy**.

Thank you for stopping by! Have an amazing week!

*There is a lot missing from the code I have provided, like additional error handling, tests etc. and all of the above CAN be improved – but did we manage to build and deploy across three different systems all using Azure DevOps? Yes we did. If you intend on using any of the above, please ensure you build in the necessary controls and process around it and always pick what is best for you and your team.

**Redgate Deploy is going from strength to strength, expect to see a wide range of improvements made over the coming months – I won’t be surprised if this blog post is already out of date by the time I finish writing it – that’s how awesome the teams working on all of this are!

Using things weirdly – Part 3: Moving from State to Hybrid-ish Source Control with Microsoft SSDT and Redgate SQL Change Automation

“It works on MY machine”
Everyone

Some time back I blogged “which database source control model works well for you” and it has come in SO handy when explaining the different models to people – but the one question I get asked not infrequently, is:

Is there any way that I can use Redgate SQL Change Automation with Visual Studio based SSDT?

It’s always a really difficult question to answer because fundamentally SQL Source Control (Redgate’s state based tool) and SSDT (Microsoft’s state based tool) functionally seek to do the same thing, making them competitors. However there are, on the odd occasion, good reasons as to why I am asked the question and one of those same scenarios came up today:

  • Our developers work in Visual Studio and have already been using SSDT for a long time, it works for them, we just want to extend it with Migrations to handle complex changes.

So the option here is, leave it as it is, or try to work with both. Not always going to be my first choice but it got me thinking.

Starting from a memory

A few months ago, when life was “normal” and I was working in my office with *gulp* PEOPLE, I tried to make this scenario work by simply linking SQL Change Automation to the project folder created by SSDT but unfortunately it was riddled with problems. The SSDT importer and repo look like this:

And if you point SQL Change Automation at the local repo with this in it will correctly tell you:

Now of course this wasn’t unexpected. It’s not designed to work this way, is it? No. But way back then I did figure out, shrewdly, that if I used SQL Source Control to carry out an initial commit just to a working folder, it would generate a RedGate.ssc and RedGateDatabaseInfo.xml file and you can copy them into the SSDT repo to trick SQL Change Automation into thinking that it’s a SQL Source Control repo… unfortunately this trick no longer works. Sad.

sad a christmas story GIF

Add a hop and a step

But what got me thinking today was the context with which the question was asked. It was more about separation of duties. Once the developers have effectively done their job and delivered the change into the repo, their job was effectively done! “That’s how it should look moving forwards. What’s next?” – and then I had an idea.

Given that SSDT allows you to push and pull the code and apply it to your own database, what is stopping us from using SQL Change Automation to pick up on the changes against the database we sync our changes to from our SSDT project?

Genius. Evil genius.

So I created a new Database to simulate having another developer on my instance and gave it to Peter Parker:

You can then do a schema compare to another DB from your project, effectively PULL down changes from the remote to your local repository, and then sync them back up to your local development DB; this is how Devs stay up to date with each other but could, in this methodology, be how DBAs or senior developers pull down the changes to their local DB, where they test the new state, and then generate a new migration from it.

So I made a change on my dev database and captured it in the project right click on the project name > schema compare > dev db compare to project > update) and then committed and pushed:

and sure enough my repository was updated:

But then I simulated pulling down the change and applying it to Peter Parker’s DB (again using Schema Compare) and then I created a SQL Change Automation project in VS, in the same solution but pointing the project to a migrations folder in the repo:

Yes I accidentally called the project Database1 don’t remind me I’m embarrassed enough!

Then I added my baseline database:

It created the baseline and the project immediately with no issues and picked up on the changes I had made using SSDT:

and I was able to commit my project and changes into my repository in Azure DevOps:

It was just that easy! Now what this means for the development process is that developers _could_ feasibly work with SSDT, as they are comfortable with it, and then more senior members of the team can generate migration scripts from there, building the database from scratch and deploying in a reliable, repeatable fashion.

Just to prove to you my build even ran green from this:

So in summary what this gives us is the ability to adapt a regular SSDT workflow, one that developers are comfortable with and which has been in the team for months or years, add in the knowledge of DBAs or team leads, a greater separation of duties for high risk schema changes, and the control and flexibility (and peace of mind) that comes with a migrations based deployment process.

Nice.

The fine print

I’m sure by now you’ve realized something: this is not, nor will it ever (I believe) be a supported workflow. If you implement the above in a production sense for something other than just testing then it’s not something you’ll be able to get help with from one of the Redgate engineers if you need to troubleshoot.

Also, if you’re going to introduce a sequence of changes like this to achieve the hybrid model, it does make more sense that you implement SQL Source Control for the state side (given that it’s right there in the SQL Toolbelt with SQL Change Automation anyway).

But IS IT POSSIBLE to achieve a similar, Visual Studio based* hybrid workflow with SSDT and SQL Change Automation by using a database to ‘hop’ the changes across?

Yes, it certainly looks that way!

*If you’re planning on using SSDT in Azure Data Studio too then this workflow could also work for you, SQL Change Automation is present in SSMS and VS so it’s really up to you!

Using things weirdly – Part 2: Static Data in the Hybrid Model (w/ Redgate SQL Source Control and Change Automation)

“Where’s your will to be weird?”
Jim Morrison

I had some really positive feedback on my last “Using things weirdly” post, and truth be told, I love to use things weirdly. The number of times I’ve heard: “Oh, well, sure yeah I guess it works that way too…” is just too many to count. So imagine how my eyes lit up when I realized that you can do something weird with one of my favorite things to talk about at the moment, The Hybrid Model.

Now if you don’t know about the Hybrid Model then I would suggest you check out my post here that’s all about the different source control models available for your databases!

The Problem

Across both the State based and Migrations based offerings within the SQL Toolbelt, you have access to something very cool: the ability to easily (alongside the schema) source control your static data. Now don’t ‘@’ me because you think I should be referring to it as “Semi-Static” because it might change occasionally and ‘that’s not truly static then is it?‘; I could easily also refer to it as ‘Lookup Data’ or ‘Reference Data’, basically whatever you class things like “Country Codes” and “Currency Codes” as.

Whatever you call it, it can go into your VCS like any part of the database schema:

SQL Source Control: State Based
SQL Change Automation: Migrations Based

But. One thing that – as of writing this RIGHT NOW (23/07/2020 10:09am BST) – is not officially available in the Hybrid Model combining these two methods… is Static Data. The Data tab in SQL Change Automation even disappears when you set it up as a Hybrid workflow:

And this gives me a sad.

sad monty python and the holy grail GIF

The Solution?

Got your Hybrid Model setup and ready to go? Let’s use it weirdly!

1 – Use SQL Source Control to commit some static data to your state repository. This is as easy as right clicking on your (highlighted green) source control linked database and selecting “Other SQL Source Control Tasks” > “Link/Unlink Static Data“, and picking your tables.

Nothing should be showing in SQL Change Automation:

2 – Unlink SQL Change Automation from the state repository for a moment and link it instead directly to the development database. This will cause it to go into Migrations-First mode. You can do this by clicking the blue source name in SSMS and picking Existing Database instead:

Because it’s technically the same database as you’re source controlling in the state repository, it should all just work™ and should tell you there are no dev changes to the source. Then you’ll see the “Data” tab has been enabled:

3 – Select the same tables to source control as you did with SQL Source Control by using the Add Tables wizard:

BUT WAIT!!

shocked oh my god GIF

Isn’t it now going to generate a migration for our static data?? This isn’t included in the baseline or anything at all so far, so is it going to try and insert all of my static data into tables later on that already have it?

No. Actually we’re fine!

4 – Generate the static data migration script (and look at it for peace of mind). Notice that the script actually has checks in there – because we’re newly adding these tables, the migration will check if the tables are empty before trying to run the script, and only AFTER this migration will SQL Change Automation start generating the differential, incremental static data migrations:

Commit this migration script to your migrations repository, and that’s all we need to do here!

5 – All that’s left now, is to re-hook-up the Hybrid pipeline, follow the same steps you did before in Step 2 but this time, instead of an existing database, just link it back to the state repository like it was before. If you’ve done this right, you should see no changes pending for migrations:

BUT you will notice that if you change any static data with SQL Source Control, it should now show up in SQL Change Automation!

Change to static data prior to commit in SQL Source Control
Change before migration script generation in SQL Change Automation
Generated migration script to be committed

Conclusion

Is it an intended use? No absolutely not, the reason it’s disabled is that with all things at Redgate they are considered heavily to ensure users are offered the best possible user experience, functionality and essentially something that meets requirements across the board.

But. We can use it weirdly to, as i say, just make it work™.

What have you used weirdly lately? Let me know!

xRDBMS Database Continuous Integration with Flyway, Azure DevOps and Docker… the simple way.

“Some people try to make everything complicated, be the person who tries to make everything simple.”
Dave Waters

Simplicity is in my blood. That’s not to say I am ‘simple’ in the sense I cannot grasp more than the most basic concepts, but more that I am likely to grasp more complex problems and solutions when they are phrased in simple ways.

This stems from my love of teaching others (on the rare occasion it falls to me to do so), where I find the moment that everything just ‘clicks’ and the realization comes over them to be possibly one of the most satisfying moments one can enjoy in life.

shocked star trek GIF

Now recently I’ve been enjoying getting my head around Flyway – an open source JDBC based migrations tool that brings the power of schema versioning and deployments together with the agility that developers need to focus on innovation in Development. There’s something about Flyway that just… ‘clicks’.

It doesn’t really matter what relational database you’re using; MySQL, IBM DB2, even SAP HANA! You can achieve at least the core tenants of database DevOps with this neat and simple little command line tool – there’s not even an installer, you just have to unzip!

Now I’ve had a lot of fun working with Flyway so far and, thanks to a few people (Kendra, Julia – i’m looking at you both!) I have been able to wrap my head around it to, I would say, a fair standard. Caveat on that – being a pure SQL person please don’t ask me about Java based migrations, I’m not quite there yet!! But there is one thing that I kept asking myself:

“When I’m talking to colleagues and customers about Database DevOps, I’m always talking about the benefits of continuous integration; building the database from scratch to ensure that everything builds and validates…” etc. etc. so why haven’t I really come across this with Flyway yet?

think tom hanks GIF by The Late Show With Stephen Colbert

Probably for a few reasons. You can include Flyway as a plugin in your Maven and Gradle configurations, so people writing java projects already get that benefit. It can easily form part Flyway itself by virtue is simply small incremental scripts and developers can go backwards and forwards however and as many times as they like with the Flyway Migrate, Undo and Clean commands, so is there really a need for a build? And most importantly, Flyway’s API just allows you to build it in. So naturally you’re building WITH the application.

But naturally when you’re putting your code with other people’s code, things have to be tested and verified, and I like to do this in isolation too – especially for databases that are decoupled from the application, or if you have a number of micro-service style databases you’d want to test all in parallel etc. it’s a great way to shift left. So I started asking myself if there was some way I could implement a CI build using Flyway in Azure DevOps, like I would any of the other database tooling I use on a regular basis? Below you’ll find the product of my tinkering, and a whole heap of help from Julia and Kendra, without whom I would still be figuring out what Baseline does!

Option 1) The simplest option – cmdline

Flyway can be called via the command line and it doesn’t get more simple than that.

You can pass any number of arguments and switches to Flyways command line, including specifying what config files it’s going to be using – which means that all you have to do, is unzip the Flyway components on a dedicated build server (VM or on-prem) and then, after refreshing the migrations available, invoke the command line using Azure DevOps pipelines (or another CI tool) to run Flyway with the commands against a database on the build server (or somewhere accessible to the build server) and Bingo!

No Idea Build GIF by Rooster Teeth

And that’s all there is to it! You get to verify that all of the migrations up to the very latest in your VCS will run, and even if you don’t have the VERY base version as a baseline migration, you can still start with a copy of the database – you could even use a Clone for that!

But yes, this does require somewhere for Flyway to exist prior to us running with our migrations… wouldn’t it be even easier if we could do it without even having to unzip Flyway first?

Option 2) Also simple, but very cool! Flyway with Docker

Did you know that Flyway has it’s own docker image? No? Well it does!* Not only that but we can map our own version controlled Migration scripts and Config files to the container so that, if it can point at a database, you sure as heck know it’s going to migrate to it!

*Not sure what the heck all of this Docker/Container stuff is? You’re not alone! Check out this great video on all things containers from The Simple Engineer!

This was the method I tried, and it all started with putting a migration into Version Control. Much like I did for my post on using SQL Change Automation with Azure SQL DB – I set up a repo in Azure DevOps, cloned it down to my local machine and I added a folder for the migrations:

Into this I proceeded to add my base script for creating the DMDatabase (the database I use for EVERYTHING, for which you can find the scripts here):

Once I had included my migration I did the standard

Git add .
Git commit -m "Here is some code"
Git push

and I had a basis from which to work.

Next step then was making sure I had a database to work with. Now the beauty of Flyway means that it can easily support 20+ RDBMS’ so I was like a child at a candy store! I didn’t know what to pick!

For pure ease and again, simplicity, I went for good ol’ SQL Server – or to be precise, I created an Azure SQL Database (at the basic tier too so it’s only costing £3 per month!):

Now here’s where it gets customizable. You don’t NEED to actually even pass in a whole config file to this process. Because the Flyway container is going to spin up everything that would come with an install of Flyway, you can pass it switches to override the default behavior specified in the config file. You can adapt this either by hard-coding strings or by using Environment Variables alongside the native switches – this means you could pass in everything you might need securely through Azure Pipeline’s own methods.

I, on the other hand, was incredibly lazy and decided to use the same config file I use for my Dev environment, but I swapped out the JDBC connection to instead be my Build database:

I think saved this new conf file in my local repo under a folder named Build Configuration – in case I want to add any logic later on to include in the build (like the tSQLt framework and tests! Hint Hint!)

This means that I would only need to specify 2 things as variables, the location of my SQL migrations, and the config file. So the next challenge was getting the docker container up and running, which fortunately it’s very easy to do in Azure Pipelines, here was the entirety of the YAML to run Flyway in a container (and do nothing with it yet):

trigger:
- master

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: DockerInstaller@0
  inputs:
    dockerVersion: '17.09.0-ce'
  displayName: 'Install Docker'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run flyway/flyway -v
  displayName: 'Run Flyway'

So, on any changes to the main branch we’ll be spinning up a Linux VM, grabbing Docker and firing up the Flyway container. That’s it. Simple.

So now I just have to pass in my config file, which is already in my ‘build config’ folder, and my migrations which are in my VCS root. To do this it was a case of mapping where Azure DevOps stores the files from Git during the build to the containers own mount location in which it expects to find the relevant conf and sql files. Fortunately Flyway and Docker have some pretty snazzy and super clear documentation on this – so it was a case of using:

-v [my sql files in vcs]:/flyway/sql

as part of the run – though I had to ensure I also cleaned the build environment first, otherwise it would just be like deploying to a regular database, and we want to make sure we can build from the ground up every single time! This lead to me having the following environment variables:

As, rather helpfully, all of our files from Git are copied to the working directory during the build and we can use the environment variable $(Build.Repository.LocalPath) to grab them! This lead to me updating my YAML to actually do some Flyway running when we spin up the container!

trigger:
- master

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: DockerInstaller@0
  inputs:
    dockerVersion: '17.09.0-ce'
  displayName: 'Install Docker'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run -v $(FLYWAY_LOCATIONS):/flyway/sql -v $(FLYWAY_CONFIG_FILES):/flyway/conf flyway/flyway clean -enterprise
  displayName: 'Clean build schema'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run -v $(FLYWAY_LOCATIONS):/flyway/sql -v $(FLYWAY_CONFIG_FILES):/flyway/conf flyway/flyway migrate -enterprise
  displayName: 'Run flyway for schema'

Effectively, this will spin up the VM in ADO, download and install Docker, fire up the Flyway container and then 1) clean the target schema (my Azure SQL DB in this case) and 2) then migrate all of the migrations scripts in the repo up to the latest version – and this all seemed to work great!*

*Note: I have an enterprise Flyway licenses which enables loads of great features and support, different version comparisons can be found described here.

So now, whenever I add Flyway SQL migrations to my repo as part of a branch, I can create a PR, merge them back into Trunk and trigger an automatic build against my Flyway build DB in Azure SQL:

Conclusion

Getting up and running with Flyway is so very very easy, anyone can do it – it’s part of the beauty of the technology, but it turns out getting the build up and running too, when you’re not just embedding it directly within your application, is just as straightforward and it was a great learning curve for me!

The best part about this though – is that everything above can be achieved using pretty much any relational database management system you would like, either via the command line and a dedicated build server, or via the Docker container at build time. So get building!

ready lets go GIF

Classification by design: Including data cataloging in a DevOps process

“The most powerful tool we have as developers is automation.”
Scott Hanselman

It is no secret that I love to talk about data protection, specifically from the perspective of structured data. When we talk about database development practices, we often find ourselves talking about 3 things most often:

  • Source Control
  • Continuous Integration and Continuous Delivery/Deployment (CI/CD)
  • Estate Monitoring

Some people refer to this as “DataOps“, others refer to it as “DevDataOps” but in reality, it’s all DevOps guys. This may be an unpopular opinion (and if it clashes with yours please forgive me, it’s just my opinion) but just because a certain niche area hasn’t been specifically called out within a subset of DevOps doesn’t mean you have to invent your own term for it!

Now this leads me on to DevSecOps, or as I like to call it… More secure DevOps.

rimshot GIF

No but seriously this is a slightly different case – DevSecOps is like DevOps but fortified with security from the ground up. There’s a fantastic article and diagram of this on Plutora from Mark Robinson of how this looks (below) and if you haven’t read his article I would definitely go and give it a read!

DevSecOps Diagram

Good DevOps practice is a combination of different things working together, bringing the right mentality, the principles, processes and amazing tools at our disposal like automation but this all includes security from the ground up too. DevOps is about putting those principles and practices in place to strengthen the pipeline, so why don’t we treat security in the same way?

Take, for example, 3 pieces of legislation that have been very much in the spotlight:

The controller shall implement appropriate technical and organisational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed.
GDPR (Europe) Art. 25 “Data protection by design and by default”

Processing agents shall adopt security, technical and administrative
measures
able to protect personal data from unauthorized accesses and accidental or unlawful situations of destruction, loss, alteration, communication or any type of improper or unlawful processing.

– LGPD (Brazil) Chapter VII, Art. 46 “Security and Secrecy of Data”

“A Controller or Processor is required to implement appropriate technical and organisational measures to demonstrate that Processing is performed in accordance with this Law…”
DIFC LAW NO. 5 OF 2020 (Dubai) Part 2D, Art. 14 (2) “Accountability and notification”

There’s a common running theme here and although lots of global legislation will either allude to, or directly tell you ways you can be compliant and what some of these “organizational” and “technical” measures are, it’s still pretty blurry.

How do we know what we can do? How do we know what “default” and “design” mean in this context? Well, we build it into the DevOps process.

Now I could sit here forever and talk about why transforming your database development, deployment and provisioning processes allows us to be more secure, but that’s a lot of material and it might have to come in chunks! So what we’re going to focus on today is as the title suggests: Data Classification and Cataloging.

Why is Cataloging important?

Cataloging structured data is incredibly important because it can be one of the first steps we take to securing sensitive Personally Identifiable Information (PII) or Protected Health Information (PHI) wherever it exists across our database environments. It allows us to make strengthened, contextual decisions about the data we hold including how we treat it in pre-Production, how long we retain it for and which systems and processes consume it.

But the most important part of this is simply: it tells us where the risk is.

Read through any of the most recent data protection laws and you will notice that a few things come up quite a lot including “Data Protection Impact Assessment“, or DPIA. Effectively if you can assess the risk of processing activities you can more readily answer the data protection questions and challenges you may face.

Knowing where your data resides can be the first step to helping you assess this risk, and to more readily answer your own data questions. If you want to read more about Cataloging specifically and why it is useful, you can read more about it on my previous blog here.

Where does Cataloging fit into DevOps?

This one is simple to answer. Once you have fully classified your entire estate, you’re not done. No, if you’re a development house or indeed even a single developer – if you are making any schema changes to the tables holding that sensitive data, you’re never done.

The reason for this is that Cataloging is an evergreen activity – if you update tables by removing columns, adding columns, splitting tables, adding tables… anything! Well then you need to be ready to make sure that you are:

a) Prepared and equipped with knowledge of the tables you’re working on and if this is a high risk activity.

b) Updating classification information to reflect the new “truth”, i.e. if you’re adding a column that will collect people’s Twitter handles, then that column should be classified as sensitive, and this should be reflected the moment it is deployed to Production.

So it is important to have the correct people working on this, with the right knowledge, preparation and processes and using the correct tools ensuring that those updates are persisted properly and securely through your deployment pipelines.

Huh… people, processes and tools… That sounds familiar!

The Process: SQL Data Catalog, SQL Change Automation and Azure DevOps

For this little experiment of mine I used Redgate’s SQL Change Automation (Migrations First approach in SQL Server Management Studio) and SQL Data Catalog to both develop & deploy and classify/categorize respectively, and for simple version control and orchestration of this pipeline I opted for Azure DevOps (with SQL Change Automation CI/CD plugins):

NOTE: Heads up, all of the code I used for this can be found right here in my GitHub, feel free to have at it!

Step 1) Setup SQL Change Automation project with an Azure DevOps Git Repo and then create the YAML file to build it, and create a deployment process to Acceptance and the Production.

Ok – that’s a bit of a mouthful and a massive ask. There’s a lot of things there you have to be familiar with, but we don’t have time to go into right here. BUT fortunately if you checkout Redgate University right here, my blog post on using Change Automation with Azure DB and Redgate’s Product Learning section, you’ll be a Database DevOps ‘Whizz-Kid’ in no time!

Fast forward a little and I had my example databases, VCS and pipeline all up and running:

Step 2)The “Theory”: This is where things get interesting. So we have an example pipeline set up and we are able to completely deploy all the way through to “Production” so let’s talk theory.

In SQL Data Catalog I have covered both my Production and Acceptance Databases:

Now, in development we don’t make changes directly to Production, so why should Classification be any different? Now how you adapt the above code is up to you, feel free to split it, move it around, incorporate it into Pull Requests if you want to… But I’m going with a bit more of a simple situation.

Situation: Developer makes a change in Development, which gets committed, reviewed and merged o the main branch, resulting in a build and a deployment, in this case to Acceptance and then it is later deployed to Production.

Now, by Acceptance we should only have the “good work”, i.e. all of our testing is shifted left within DevOps so Acceptance is basically the last stop before Production. Therefore we should classify the work we have done on Acceptance, crucially, before it gets to Production and starts gathering sensitive data, and then copy this classification up on deployment.

Ideal: We should have no columns on Production that have not been classified.

Step 3) – In Practice: Fortunately it’s very easy to automate a lot of these steps with SQL Data Catalog utilizing it’s PowerShell cmdlets and REST API. The cmdlets are fully documented and very easy to use (docs here). This allows us to easily scan, classify and copy classifications up to other databases, but we’ll also need to do some checks and report if there are discrepancies, as part of the deployment pipeline that can be investigated.

  • Are there any columns on Acceptance that aren’t classified but have been deployed to Production? (failure to comply with process)
  • Are there any columns on Production that have not been classified? (classification drift)
  • Are there any unclassified columns on Acceptance that have not yet been deployed to Production (for pipeline hygiene purposes)

The other part of this ‘fun’ is reporting what has been changed in the same process. Now fortunately SQL Change Automation spits out a Changes.json file with its Release Artifacts and we can steal that away and find out how many tables have been created or changed in this release and report that back so we can correlate what has been done and what is missing:

So actually getting this up and running is just going to require 3 things:

  1. The PowerShell script from GitHub (or your own personalized variant) as a step in your production deployment
  2. Data Catalog available and pointed at Acceptance and Production (or your versions of these environments)
  3. Variables set in Azure DevOps to fill the gaps (e.g. Where is Data Catalog? Whats my PowerShell Auth token? What are my Acceptance and PROD DBs called? etc.)

3 is the last step there so you’ll need something like this to run the script:

  • DatabaseDeploymentJSON – where the JSON file will be with the latest changes in the Prod release
  • DataCatalogAuthToken – Your PowerShell Auth token from Settings in Data Catalog
  • DataCatalogUrl – The full URL to your Data Catalog installation, missing the “\” at the end (ending :15156)
  • ExportPath (Optional) – I specified the path for my Database Deployment Resources to save typing it out in the Redgate plugins
  • ProdDB / StageDB – As you would expect, the Production and Acceptance/Staging DBs you’re deploying to/from
  • ProdInstance / StageInstance – As above, except the instance the Database are located on

In the variables above the Instance and DB names are purely used within Data Catalog, so there’s no need to worry about anything happening to the actual databases themselves!

Once you’ve run through the deployment pipeline a couple of times and the changes.json file is being produced, you can go ahead and copy the script into an inline PowerShell script step in your release and you should find it will fire to life! I simulated an example by modifying my Contacts table and my Articles table, adding 1 column each and deploying both to Acceptance. I then classified just 1 of these in Acceptance in Data Catalog:

and then approved the deployment to Production and tada!

Ok you probably can’t make all that out, but it effectively says:

(Information) Table dbo.Articles was modified in this deployment.
(Information) Table dbo.Contacts was modified in this deployment.

That much we knew!

1 column(s) with classifications were discovered on VoiceOfTheDBA Acceptance that are not classified in VoiceOfTheDBA Production:
dbo.Articles.TestingPineapple

Excellent, we classified that one so it gets copied up and we can verify that in data catalog against Production:

and finally, we get a warning about Production now containing unclassified columns:

(Alert) The following columns have been discovered on VoiceOfTheDBA Production that require classification:

dbo.Contacts.TestingPineapple

You should classify these columns in VoiceOfTheDBA Acceptance prior to the next deployment.

Just as we expected. Success!

Happy Tom And Jerry GIF

Conclusion

Classification and categorization belongs as part of DevOps, if you expect the context for your business decisions around data to remain evergreen and informed then it cannot sit on the shoulders of one or two people to support it, and it cannot live in a manually updated Excel sheet or document.

By including it within the DevOps process, not only do you add an additional layer of security but you also make it an automated, team activity that can be audited, checked and easily kept up to date.

Is this DevSecOps? Well… not really no. Is this a more secure approach to Database DevOps? Absolutely! Happy DevOpsing!