“Keep it simple, stupid!” – My year 12 & 13 English Lit. Teacher
Recently I’ve been fascinated with something really cool. A couple of my colleagues at Redgate wrote a GitHook which allows you to easily switch branches using Redgate SQL Clone. You can see the hook here with full instructions – and I thought it was pretty neat.
But it got me thinking – I’ve posted a lot about when people are using just Azure SQL DBs (PaaS), about Masking and DB Change Automation, but when you’re using AzureSQL for Dev and Test DBs as well as Prod, you still don’t have the same agility one would expect from a local copy, like a clone.
But, the above GitHook leverages PowerShell (among some other fancy wizardry) so, what if we could do this exact same thing, using the PowerShell Az module to dynamically create and switch Azure SQL DBs in our own private resource groups every time we checkout a branch?
I don’t imagine it would be fast because I’m restricted to using very low tier SQL DBs by my tiny allowance of (as Kendra Little calls them) “Azure Bucks”, but it should absolutely be possible!
So I decided to write a PowerShell script to do just that.
The first question I had to come up with an answer to was, how do I replace the Clone “Image” in this process, because I need something that is effectively a copy of our Production (or as near as possible) environment so we have something to base EVERY copy from – so I created the idea of a Golden Copy within the script; effectively this golden copy could be created by copying masking and copying back down from Production using something like Redgate Data Masker and my scripts here in GitHub but as a stop-gap, if it doesn’t find one in the Resource Group and Server you select, it will simply create one from your Dev DB. Best answer I could think of, you’re welcome to improve it!
All that remained was effectively to go through and just mimic the functionality of the Clone script but using Az: so if you are switching to a new branch where you don’t already have an existing Dev DB, then you get a new copy of Golden. If you’re switching to a branch you’ve checked out before, it renames the DBs to swap you back to the correct branch.
Here is an empty AzureSQL DB called DMDatabase_Dev:
When I now issue the git checkout “feature/newfeature” command it asks me to sign in to my Azure account:
and then gets to work:
And… that was it really.
I now have an Azure SQL DB called DMDatabase_Dev_master as I switched from the master branch, and I have a branch new DMDatabase_Dev DB that I can use for my featurebranch. You’ll notice I didn’t include -b in my git command, let’s assume a colleague is already working on this branch. I can now just update my copy (of my golden copy) with their work:
And we’re good to go!
But now if I switch back to my main branch, the object is gone and I can carry on with work on this branch:
It was really straightforward I can’t believe I haven’t seen this in use in more places, but hey guess what? The PowerShell is yours right here if you want it:
You should have a Dev DB and you should update the values at the top of the script with the Dev DB name, server and resource group it is in
The script make reference to and creates a Golden copy DB so that you have something you should always be creating from, for consistency – so when you get started, create your own “Golden copy” back from Test/UAT or something if you can – maybe using the script mentioned above – it should be the name of your Dev DB appended with “_Golden”
If you want to change how it is authenticated so you don’t have to enter your credentials each time, then go for it – this was just the simplest method for me (and it’s currently 11:05pm so I’m going to bed!)
Feel free to improve it, I’m sure there are plenty improvements that can be made, but it’s a starter for 10 for anyone out there just getting started with development in Azure SQL. Plus it’s kinda neat!
“Is there any way that I can use Redgate SQL Change Automation with Visual Studio based SSDT?“
It’s always a really difficult question to answer because fundamentally SQL Source Control (Redgate’s state based tool) and SSDT (Microsoft’s state based tool) functionally seek to do the same thing, making them competitors. However there are, on the odd occasion, good reasons as to why I am asked the question and one of those same scenarios came up today:
Our developers work in Visual Studio and have already been using SSDT for a long time, it works for them, we just want to extend it with Migrations to handle complex changes.
So the option here is, leave it as it is, or try to work with both. Not always going to be my first choice but it got me thinking.
Starting from a memory
A few months ago, when life was “normal” and I was working in my office with *gulp* PEOPLE, I tried to make this scenario work by simply linking SQL Change Automation to the project folder created by SSDT but unfortunately it was riddled with problems. The SSDT importer and repo look like this:
And if you point SQL Change Automation at the local repo with this in it will correctly tell you:
Now of course this wasn’t unexpected. It’s not designed to work this way, is it? No. But way back then I did figure out, shrewdly, that if I used SQL Source Control to carry out an initial commit just to a working folder, it would generate a RedGate.ssc and RedGateDatabaseInfo.xml file and you can copy them into the SSDT repo to trick SQL Change Automation into thinking that it’s a SQL Source Control repo… unfortunately this trick no longer works. Sad.
Add a hop and a step
But what got me thinking today was the context with which the question was asked. It was more about separation of duties. Once the developers have effectively done their job and delivered the change into the repo, their job was effectively done! “That’s how it should look moving forwards. What’s next?” – and then I had an idea.
Given that SSDT allows you to push and pull the code and apply it to your own database, what is stopping us from using SQL Change Automation to pick up on the changes against the database we sync our changes to from our SSDT project?
Genius. Evil genius.
So I created a new Database to simulate having another developer on my instance and gave it to Peter Parker:
You can then do a schema compare to another DB from your project, effectively PULL down changes from the remote to your local repository, and then sync them back up to your local development DB; this is how Devs stay up to date with each other but could, in this methodology, be how DBAs or senior developers pull down the changes to their local DB, where they test the new state, and then generate a new migration from it.
So I made a change on my dev database and captured it in the project right click on the project name > schema compare > dev db compare to project > update) and then committed and pushed:
and sure enough my repository was updated:
But then I simulated pulling down the change and applying it to Peter Parker’s DB (again using Schema Compare) and then I created a SQL Change Automation project in VS, in the same solution but pointing the project to a migrations folder in the repo:
Yes I accidentally called the project Database1 don’t remind me I’m embarrassed enough!
Then I added my baseline database:
It created the baseline and the project immediately with no issues and picked up on the changes I had made using SSDT:
and I was able to commit my project and changes into my repository in Azure DevOps:
It was just that easy! Now what this means for the development process is that developers _could_ feasibly work with SSDT, as they are comfortable with it, and then more senior members of the team can generate migration scripts from there, building the database from scratch and deploying in a reliable, repeatable fashion.
Just to prove to you my build even ran green from this:
So in summary what this gives us is the ability to adapt a regular SSDT workflow, one that developers are comfortable with and which has been in the team for months or years, add in the knowledge of DBAs or team leads, a greater separation of duties for high risk schema changes, and the control and flexibility (and peace of mind) that comes with a migrations based deployment process.
The fine print
I’m sure by now you’ve realized something: this is not, nor will it ever (I believe) be a supported workflow. If you implement the above in a production sense for something other than just testing then it’s not something you’ll be able to get help with from one of the Redgate engineers if you need to troubleshoot.
Also, if you’re going to introduce a sequence of changes like this to achieve the hybrid model, it does make more sense that you implement SQL Source Control for the state side (given that it’s right there in the SQL Toolbelt with SQL Change Automation anyway).
But IS IT POSSIBLE to achieve a similar, Visual Studio based* hybrid workflow with SSDT and SQL Change Automation by using a database to ‘hop’ the changes across?
Yes, it certainly looks that way!
*If you’re planning on using SSDT in Azure Data Studio too then this workflow could also work for you, SQL Change Automation is present in SSMS and VS so it’s really up to you!
“You never know what you can do until you try, and very few try unless they have to.” – C.S. Lewis
Well I don’t have to, but many of the people I speak to on a daily basis are moving into GitLab, so it’s about time I tried it! You can find here testament to the mistakes I make as I try to set up a full end-to-end database change management process with SQL Change Automation and GitLab.
PLEASE NOTE (edit 18/12/2021): If you are just starting out with Redgate source control and deployment processes and are looking into using GitLab for database deployments, please read my updated blog post here using the newest source control and deployment technology Flyway Desktop)
Will it all work perfectly? I don’t doubt that everything will fall over at some point, but let’s see how we get on all the same, and hopefully if you’re setting up this same pipeline, you’ll be able to avoid the errors and failings I inevitably cause! So here we go!
Let’s set up a GitLab Project (and rename the default branch)
Naturally, I didn’t have a GitLab account, so I had to set one up. I’m assuming that if you’re using it already or you’ve just started using it you’re taking advantage of the more business features but I’ve just stuck with the good ol’ free account for now! It was remarkably simple, sign up, email address, confirm and here we are:
Ok there is something very cool I like about setting up a new project, can you tell what it is?
You can completely set up a new blank project but they have templates, you can import projects OR, and I love this, you can setup a full CI/CD pipeline from another repo! Having done this before in Azure DevOps it was not easy, let me tell you. It really seems like Azure DevOps hates you for setting up CI/CD from an external repo, even though it has plenty of helpful ways of doing so!
So I initialized my repository with a README and updated it:
Don’t ever say I’m not descriptive enough!
The first thing I did was a renamed my default branch to ‘trunk’ by going to branches, creating the new branch and then in Settings > Repository changing it to the default and then swapping out the protected status with the outdated master:
Then finally delete the old default in Repository > Branches:
Excellent. Now it’s time to clone trunk onto my machine as we will need the local repository to put our change automation project in!
I created a folder called GitLab test and cloned the mostly empty repository into it:
Create a new SQL Change Automation project and push it to trunk
In SSMS I opened up the most recent version of SQL Change Automation an created a new project called “DoggosAreCoolDB” using a copy of a Dev database I had lying around from a previous demonstration (BlogsDotRedgate):
Then I created my baseline as a migration script against the up-stream copy, BlogsDotRedgate_Integration, because who has access to Prod for this? Am I right? *cough & shifty eyes* not me!
I successfully generated my baseline and a change script (I added a column to a table, nice and simple) and then committed them to my local repo, and pushed! Forget branching, today isn’t about that, we’re just PUSHING TO TRUNK, WOO-HOO!
Setting up the CI/CD Pipeline
Now that we have our project and migrations in GitLab we can build out a pipeline! So first stop I went straight to CI/CD > Pipelines and was presented and I hit “Get Started”:
They immediately throw you into a Quick Start “Help” style guide which is immediately a little un-intuitive but surprisingly helpful if you read the whole thing. Effectively we need a YAML file called .gitlab-ci.yml that will store our pipeline as code telling it how and where to build, and we need a runner to actually fire up and execute these steps.
In my experience with some other CI/CD tools, it’s been advantageous to actually create the Runner / Agent first on the machines you’re going to be using, so as I just have my laptop to do this on, I will set one up on there! I found the full documentation for a Windows Runner here, and followed it just so I would have it available.
The GitLab Runner was up and running in my services but I’ll be darned if I can see them anywhere in GitLab…
Aha! So it turns out after a bit of digging that you need to registerthe runner specifically using the CI/CD section on the project settings, that was probably my bad for not reading the documentation thoroughly but my counter-argument… who actually does? So I issued the register command, applied tags and a description and chose my runner type, I chose shell because I need to be able to run PowerShell on the machine (I’ll need the SQL Change Automation PowerShell components availableonthe machine where the Build and Deployment are happening of course):
So I can build my project I’m going to need to know where the repo is cloned to during the process (i.e. to find the .sqlproj file) so by taking a look I managed to find a list of environment variables that can be used in the YAML file, just to be sure though, I created and committed the most basic YAML file that would just echo back the location of the cloned files:
After this let me know the environment variable worked correctly and the build pipeline was being fired up correctly on my private runner, I tried something a little more ambitious, building the .sqlproj file using the cmdlet reference from the SQL Change Automation documentation for help:
I’m still using the same machine for the release portion too, so naturally I can use the same runner for this, if you have other servers you’re deploying to you will of course need additional runners.
We can very easily extend what we already have in our YAML file by just telling the process to create and export a new build artifact – I’m going to name it the same as everything else, and then append the BuildId to the end of the file so we always get something unique:
You’ll notice how I’m exporting the NuGet package to the project directory and then uploading it, this is so that we’ll have access to it to release but also so that we can use the artifacts argument in our YAML to upload the file and make it a downloadable package through the GitLab interface (if you go to that SPECIFIC job):
Whilst we’re on a roll here (and things haven’t gone wrong for a while) I’m going to add 2 additional stages ALL AT ONCE to both “Create a Database Release Artifact” and “Deploy from a Database Release Artifact” using, once again, the SQL Change Automation PowerShell cmdlets.
Woo-Hoo! I’m invincible!
I broke it.
Can you see what I did wrong? The error is:
New-DatabaseReleaseArtifact : The specified value for the Source parameter is neither a valid
41database connection string nor a path to an existing NuGet package file or scripts folder:
So 2 fun things. 1 – I forgot to highlight there was an environment variable at one point, so it was just looking for the name of the variable in the path and 2) it keeps erroring out saying my NuGet file isn’t a NuGet file, weird right?
On further inspection it is yet another mistake I made. I’m using the job ID to name the NuGet package, which means when it tries to find the file it’s 2 steps ahead because each stage is counted as a different job! Duh!
A few quick changes should hopefully sort this out! I’m going to put the instance of the pipeline ID in ($env:CI_PIPELINE_ID) and see if that makes a difference!
Wait. Did it just say the pipeline ran? SUCESSFULLY? That’s exactly what it said! We can verify that this actually happened as well by checking the DatabaseDeploymentResources folder for the Release Artifact to Integration:
And everything is there! Note you won’t have a changes.html report just yet because this is the first time we’ve successfully deployed to Integration, however if we run 1 more change through (I’ll add a stored procedure):
Now of course we can add additional stages to this, for manual intervention or to promote to other environments, but I’m going to call it a win here and retire (until the next post) gracefully. I’m sure you’re all wondering what my final YAML file looked like too – well (counterintuitively) I’ve popped it all into GitHub for you and pasted it below. Enjoy!
“Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives.” – William A. Foster
I know what you’re thinking.
“Chris. Your title looks like it was written to be a corporate whitepaper that I see ads for when I’m browsing social media; it should have a CLICK HERE button, a boilerplate photo of a smiling person holding a pen and it should say something like ‘executives hate them, find out their secret here!’“
But something has become immediately obvious to me in the last few months, I still speak to people daily who are forced to:
Work on empty (schema-only) / heavily subset databases
Work on old, out of date and/or irrelevant data
Make decisions without knowing enough about their data or what they hold
When speaking to them though it becomes immediately obvious that the reason there is no dedicated option available for developers is actually not related to the “traditional” problems that one would expect. You would naturally assume that the reason for not refreshing these environments is because of the large amounts of space or time taken to refresh often enough, or even that ‘we simply cannot due to sensitive PII and regulatory concerns‘.
No. In fact it comes down to, as all things do, time andmoney.
In the most recent State of Database DevOps report (2020 that is), a whopping 70% of 2000+respondents replied that they were using a shared development database and this comes with a whole heap of associated problems, like poor code quality, looser controls around sensitive data and defective deployments. Just these figures alone already point to the solution being to spin up copies for developers on demand and it’s not like we can’t do that. There is SO much technology in the world, across almost all database platforms, that will allow us to virtualize, containerize, sanitize… (effectively all of the ‘izes‘) our databases so that we can have full, safe, realistic copies as frequently as we like. So what is stopping us?
From experience, it’s justification*. People going to senior stakeholders and saying “we need this technology” and hearing a cacophony of classic business challenges back: “but is it broken?”, “do we really NEED it?“, “it costs HOW much!?!“, “how much time will it take to implement?” etc… It’s dev and test hygiene, not a sexy major modernization project like using Azure Arc, using Blockchain or creating Artificial Intelligence. Who cares that developers have to share a database? We’ve got bigger Tofish to fry!
*Sometimes, but much less frequently, it’s down to complexity of implementation – but we’ll leave that one aside for now!
As you will know from my last post on why now is the time to adopt better working practices, it’s important for us to highlight the gains that can be made from newer, updated practices, and why now is not the time to be closing our minds off to a better way of life. It’s not going to be easy to sum this up in 5 points, and there are many other benefits to solid database provisioning but these are in my opinion, the ones that will revolutionize the way you develop.
Very importantside notefor this blog post: there are lots of subjective key practices, processes and tools that can form part of the “database provisioning process” specifically and they will vary wildly by experience, opinion and company – so for the purposes of the below I will be describing the benefits of a process that involves 3 primary components / steps, given these are the three I tackle most often:
Data Identification and Classification / Cataloging
Data De-Identification i.e. Data Masking
Data Provisioning i.e. Real Time Database Cloning / Provisioning
1 – Increase developer happiness / contentedness
Developers are employed to do 1 thing: innovate. It’s even in the name! Developers are on the cutting edge and are focused on providing value to end users as quickly and efficiently as possible, with shortened release cycles, incremental stories and optimized workflows they can produce this innovation. But a big part of the story is the setup.
Even if you’re working to a more agile methodology it is hard to deliver and test changes which are, in development environments, fundamentally destructive and experimental if you are sharing a workspace with multiple colleagues. Writing on shared Word documents can be frustrating at the best of times, so how can developers be expected to produce high-quality, rigorously tested, game-changing code when at any minute another developer can take the environment down, cause it to run slowly, or overwrite those changes with their own? When you cannot produce changes in an isolated, sandbox environment where they can be individually assessed, re-worked and improved then you have no guarantee that it should be promoted.
All of these sound like arguments that are focused around the production of code, but in fact these issues can all have a huge impact on something that is widely under regarded and scrutinized: developer happiness.
Developers are the people who make stuff go, and without them feeling content and valued in their roles, we can’t expect our productivity and product quality to reflect that – so when developers witness the poor management of their code, something they have worked so hard on as it goes sliding down the priority list or gets rolled back or overwritten etc. they don’t feel motivated to continue doing the best that they can do.
With dedicated environments for dev and test, for different branches, pull requests etc. developers can finally work on innovative and exciting projects, and optimize the code that goes out the door to end users.
2 – Develop a common language about data & make better decisions
It’s very hard to speak about things when you use different language to describe the same thing. That much is obvious. In the United Kingdom alone we have many different words for bread rolls. So when someone comes into a sandwich shop in London and asks for a “Stotty”, can you guarantee that the person serving will know exactly what they mean, exactly when they say it?
No. There will be a gap where some translation will be required: some “down-time“, if you will. Now imagine taking something as simple as a bread roll and applying it to an enterprise data estate… you’re going to have a very bad time.
As I talked about in my blog posts here (importance of database classification) and here (classifications role in DevOps) before you can really make a fully informed decision about your data, you must know 2 simple things:
What data you hold
Where your data is
I should hurriedly add that I don’t just mean sensitive data now – all data deserves to be classified because whether you’re a full stack developer adding a column to a table you’ve never used before, an auditor trying to carry out a Data Protection Impact Assessment (DPIA) and trying desperately to include the database, or you’re a BI developer setting up some new reports or processes, you’re going to need to know about the data. This is where people have questions, and this is where you shouldn’t have to reply on anecdotal knowledge or being pushed around from one person to another at the company who supposedly “might be able to help“.
Better insights into data leads to better practices, less waiting (waste reduction) and greater insight. When we then act on this insight we move faster and deliver greater value in our pipelines.
Have you picked up on the trend yet? How all of these are going to end? Well don’t spoil the ending for those who haven’t, they’ll have to wait fort he conclusion!
3 – Move faster and better enable the DevOps pipeline
It’s apt that I’m listening to an amazing EDM remix of the Green Hill Zone from Sonic when writing this section, but isn’t this just what we need as a business? We want to be able to move faster, or to put it in more ‘agile’ terms, we need to be able to pivot and adapt with only a moments notice. Until now, the database has been a monolithic and difficult to steer behemoth, and it shows in our processes.
A tangible example of what I mean when I say “move faster”, is branching. It’s fairly commonplace now for a developer to be able to clone a repository and checkout a specific branch, create new branches etc. without fear of switching between those branches and what it might entail. On a dev environment, especially when one is working database-first with your changes (it does make sense to know how the changes will impact the database first – that’s all I’m saying) it is, without a reasonable process in place, exceptionally difficult to easily switch between branches and keep work separate.
This often forces developers to stick to one environment when changes are all made in tandem and can play havoc when it comes to capturing those changes in the right place – a manual state-based comparison of a dev database with multiple branches of work on it to a target upstream could be disastrous.
4 – Minimize space constraints on new copies, on premise or in the cloud
Space is always a big player in these conversations, and for some it’s enough to boil it down to “well just how much space can we save??” and that’s enough to put a dollar value on the ROI, and people storm ahead with a solution (that’s not always right for them).
But space is a very real problem, much as we (as technology professionals) like to believe that in these modern times of cloud-native solutions, easily scaled serverless-compute VMs and Big Data Clusters, we know there are still a LOT of people out there firefighting legacy, necessary technology and wrestling with what they CAN get out of backups or their SAN tech.
Even using cloud providers costs money, data egress and ingress costs $, BLOB storage costs $, additional security measures cost $. So it’s really not ideal when our databases, for historical reasons or by virtue of the sheer AMOUNT of data we hold and process, are 5, 10, 50, 100TB+, because we’re going to be struggling with this Dev/Test issue still for years to come.
As before with point 3, database virtualization has come of age and has now we have a lot of different solutions from containerization through DBaaS that can aid us in minimizing the amount of space that we ACTUALLY require, meaning we have less money that we need to pour into maintaining large, unwieldy Dev/Test environments or paying a large bill for the privilege of doing so in the cloud (and when developers will be using their dev machines anyway it just makes sense to see what we can do to leverage this existing hardware).
Whilst this one doesn’t directly add specific value to the end of the pipeline, or speed up this delivery, it can help reduce overhead costs associated with the infrastructure needed when providing this value.
5 – Work on realistic data without worrying about data breaches
This is probably one of the most obvious reasons given that I tend to blog about data regulations and compliance ALL THE TIME but I feel like I need to keep saying this.
If you remove all of the data from development and test database copies, this will not help with development and developers will have nothing meaningful to go on, nor any testing that isn’t limited to pre-defined values.
If you leave all of the data in development and test database copies, all you’re doing is duplicating your attack surface area and creating a lot of potential risks for that data to be surfaced where it shouldn’t be – on the internet, in screenshots, emails and of course, hacked.
So there needs to be a happy medium where we can have both the useful data that gives us insight and intelligence of a full data set, the business logic, trends, demographics etc. that we need during testing or analytics – but it should also be sanitized so that data subjects contained therein cannot be re-identified. Static masking, applied to lower environments allows us to retain the data with none of the data.
Protective measures can be built into the DevOps process from the very beginning as you’ve already seen right here on my blog; so as long as it is a part of the process, and we have multiple controls (or guard rails) that allow us to operate safely and quickly without fearing that same speed will cause us to release any sensitive information, allowing us to focus on one thing, value.
As you’ve seen above, it all comes down to time and money but there are many ways to save andspeed up within a DevOps process by means of a good, solid database provisioning process. Whilst none of these reasons comes with a fixed ROI (unless you have ALL of your pre-prod database storage costs to hand) they contribute to something far better than that:
The ease of delivering value.
In a world where we can be concerned about everything, and where it’s hard to keep up with emerging technologies – it makes sense to start pruning away blockers to the process, the problems that are stopping us from delivering value faster – THAT is the theme and point of this blog post; our end users. We’re already delivering excellent value to them, we trust our developers and teams, but what’s stopping them from moving faster with database changes? Adopting a good provisioning process will mean you start to notice all of the above become true of your database development lifecycle.
“Thetime is always right to do what is right.” – Martin Luther King Jr.
Over the past few months, we have been on lock-down. The product of a devastating and deadly disease that has well and truly stamped it’s legacy on human history forever. But it is out of these times that we receive a glimpse, a look into what is possible, and what humanity can do. It is out of this fight, out of these ever decreasing odds that we finally see what a combined effort can do, and what a focus on our fellow people can bring about. Don’t know what I mean, take a look at the Good News Network and subscribe (just like me) to see the best of us.
But it only works, we only triumph, when we work together.
Recent Example: Scientists at Oxford University have seen a tremendous breakthrough with their virus efforts and are making unprecedented strides towards a viable vaccine – but it involves an incredibly strong partnership with pharmaceutical companies and governments worldwide for staged testing, large scale results gathering and continuous improvement.
So. Why then in the last few months have I been speaking to people who say things like:
“We have delayed our non-BAU process implementation, because we have seen a spike in usage, and we need all hands on deck.“
“We have been investigating tooling and processes to help our development teams, but this is on hold as we come to terms with this newer way of working.“
“Our teams are keen to adopt more agile ways of working, but they are overburdened at the moment, so we have decided to postpone any research into this for the next few months.“
It is feasible, amid a global panic, that people and companies will do (and definitely have done) what comes naturally to us; that is to “bunker down“. We believe that if we shift all efforts from projects and ongoing testing / new processes, we can have all hands available to deal with anything that comes our way. Processes are established for a reason, right? Legacy methods of dealing with ad-hoc changes and semi-frequent deployments, waterfall-esk development cycles and decade(s) old systems represent the familiar, thesafe… Supposedly.
Now so, more than ever, it is time to actually change direction and to put more effort into some of the key principles and processes that will lead us to DevOps nirvana; it is this trinity of people, processes and tooling that can ultimately be the salvation for many global companies as they try to maintain their agility and competitiveness within an uncertain, shifting post-pandemic international market. There are many reasons why I, and many others, believe this but I have detailed 3 key reasons below:
1 – Delaying DevOps creates waste and costs businesses money
DevOps is a culmination of learning, experience and effort and it cannot be classified as one single thing, however it is possible to define a number of things that DevOps is and what it most certainly isnot. One of the things that DevOps is, is “the constant delivery of value to end users“; the idea that by adopting certain technical measures and working practices we can minimize the time to delivery of new features and functions, which equates to greater value for us, our end users and significantly more agility to shift in different directions as required.
These ideas of flow and value streams are covered quite nicely by Lean IT, which extended from Lean Manufacturing principles and it is nicely explained here (and I would highly recommend you read The Phoenix Project if you haven’t already) – but the purpose of it (in a similar vein to agile principles) in this sense, is that it defines a number of things that don’t add value to the resulting product or service. These ‘things’ are referred to as waste and this is precisely what we should be looking to remove from our existing, legacy processes because why would we work on anything that doesn’t deliver any value?
There are a few different types of waste but I want to highlight three important ones that often exist as a result of legacy processes still being in place:
Defects – Includes lack of testing (poor execution) and hot fixing environments (unauthorized changes)
Waiting – Including everything from waiting for refreshed environments to waiting for feedback/results and even manual processeslike deployment approvals
Motion (excess) – Effectively doing the same thing over and over again, fire-fighting problems that arise on a near daily basis, engaging and monopolizing resources constantly who could otherwise be working on other, more important or value-add tickets (for those of you who HAVE read the Phoenix Project, see Brent as an example!)
The product of this waste is very simple and it fits into 3 main buckets: poor customer experience, increased costs and lost productivity. All of these things boil down to one fundamental truth – bad practices cost us moneyand reputation, transformation now could help us prevent this, and people will remember us for stepping up when we needed to.
2 – The workforce is increasingly distributed/remote and needs to collaborate better
At the beginning of 2020, one of the biggest questions faced by companies all around the globe was “how do I find and retain talent?”; this is not a new question and had already been around for years. Companies restricted to their offices (base OR satellite) realized they were increasingly fighting for candidates in one of the most competitive markets served by an ever dwindling local pool of options.
This was a situation which necessitated companies to stretch beyond their existing capabilities to enable a better quality experience for remote and distance workers and/or teams – a situation which would later be exacerbated 100x-fold by the global pandemic crisis. Not only has this crisis confirmed that most companies, certainly those feeding software markets, can work remotely, it has also posed the question of if we should work remotely more often, and has fundamentally changed the way we as a workforce will continue to work in the future.
One thing is for sure, whilst we will try to “return to normal” as much as possible, normal has been forever changed and remote working and collaboration is here to stay. That’s the important word in play here, collaboration.
The spirit of DevOps as I’ve mentioned before is good quality communication, collaboration and accountability. But at the heart of those three ideas is visibility. In an office we can over-hear, we can drop-in or bring things up “over the water cooler”/”at the coffee machine”. In a remote working context, that isn’t possible. So we have to adhere to 2 of the most important practices in modern day software development: transfer knowledge and record decisions.
When we work and communicate better in a remote/distributed workforce, and use tools and processes available to us, people don’t make unauthorized changes, or make decisions that affect you that you weren’t aware of until 3 weeks after they were made. It becomes easier to make decisions and generate better work faster, rather than being paralyzed by indecision and uncertainty as to whether you hold the latest version of the truth or if it is outdated. When we adopt the right processes and tools into our DevOps methodology, we knowfor certain what we should be doing, and why we should be doing it.
Process-wise this can easily take the form of common functions, many already at the disposal of teams when remote; stand-ups, retrospectives, mob- and pair- programming, OKR and sprint planning, there are lots of different ways for us to work well and know what we’re supposed to be working on at all times (and why). Tooling-wise we can then match these how we will be doing something with the respective record of what is being done, what decisions have been made and crucially, why. Using work management software like Trello, bug/feature tracking software like JIRA or Azure DevOps work items, source controlling everything (even having a strong branching and merging strategy to control workflow) with rigorous testing routines, policies and pull requests and automation all lead to better informed, happier*, well-performing developers with a crucial sense of purpose.
*Important side note: It is also crucial that we don’t simply lose ourselves in the business benefits completely – developers, testers etc. are all human and we all crave job satisfaction and happiness in our roles. Yes you might be able to increase your deployment frequency, minimize costs or complaints, but nothing compares to a satisfied, motivated team, which DevOps can help breed and inspire.
3 – DevOps breeds innovation and improves company performance, with a tangible return on investment (and not just financial!)
Automation is one of the single greatest ways we can modernize our processes, and is often the first principle we think of when adopting DevOps practices; taking something that is manual or held together by legacy scripts that forever fail and cause outages, and instead continuously integrating, continuously testing and continuously improving using the latest processes and tooling available to us. Automation allows us to create high cycle rates, enhance and multiply the feedback options we have within our pipeline(s) and allows us to reduce manual concerns and issues, to dedicate teams to the very thing they were employed to do: innovate. We only need watch how Netflix does DevOps to realize what we can unlock.
On a weekly basis I discuss existing processes with developers all over the world and one key trend always emerges that we need to focus on: they have a process that is currently manual, and it needs improving. I have lost track of the number of times I’ve been told about a process where developers generate scripts, manually test themselves and then “do x” with it, whether that be just deploying to Production themselves (without review), or putting it in an ever mounting pile of scripts in a folder on a file share for someone to sift through periodically.
Across every single one of the conversations mentioned above that I have, there is not a single discussion that doesn’t include some kind of quantifiable cost to the business, whether that be downtime, customer refunds, regulatory penalties and even in extreme cases, high developer churn. All can still be expressed in terms of 2 things: Time and Money.
If we remove those roadblocks for our developers, if we give them tools to enable them to more easily do their jobs, we put in place processes that allow them to more easily deliver that innovation, and tight automated controls to remove error-prone, manual jobs – we end up with something more akin to harmony. By this of course I mean the “constant delivery of value to end users“. This creates a more positive user experience, allows us to respond more quickly in an uncertain market and make decisions on what we should or, just as importantly, shouldn’t do, faster.
The world at the moment is a very uncertain place and has destroyed jobs, companies and whole industries. We should expect that consumer confidence is at an all time low and as we all come out of lock-down across the globe, we should be prepared to metaphorically “put our best foot forward” to help our developers believe in what we’re doing, stimulate faith in our industries and ultimately deliver more value bidirectionally.
But this idea only works if we do this now.
DevOps is more than just “picking up some new tools” or “rolling out agile” to development teams. It is fundamentally a mindset change that can drastically and fundamentally alter the underlying motivations and thinking within an organisation, allowing you to focus on the most important thing – delivering value, faster.
There are always times where delaying large scale roll-outs is a pertinent decision to make, and a hard one at that! But DevOps seeks to unify every part of of the development cycle; giving you greater visibility, communication, accountability and control, with maximum flexibility to test, validate and even pivot where needed.
But the time to do DevOps is not “once everything improves“, “once we get back to normal” or “when we have more time“… it is now. Now is the opportunity we all have to capitalize on the wave of change we have been consistently waiting to implement, to strengthen our position and future growth in our markets. DevOps is how we can come out on the other side of this disaster ready to embrace new technologies and ways of thinking, to respond to our customers needs, and deliver value and speed at scale.
“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” – Sir Arthur Conan Doyle, Sherlock Holmes
EDIT: Modified 23/12/2020 to include updates to Flyway in v7, method 4 below
Can you tell I’m loving Flyway at the moment? Well I am. It’s JUST SO GOOD! Honestly there are so many things you can do with it! Don’t know what I’m talking about? Check out my posts on xRDBMS DevOps with Flyway and tSQLt unit tests with Flyway and you’ll see what I mean!
As a result of the above posts though I was asked a question that I had to think about for a little bit before having the best possible answer, how can we seed some testing data INTO the build database so that we can run some meaningful tests against it?
This makes perfect sense to me, but there’s also a few different ways to do this – so let’s go fly(way)!
1 – Test Data Migration Scripts
In my previous posts on Flyway (above) I talked about having an entirely separate build folder present within the repository, and a folder of test migrations alongside our schema migrations – I called these the Build_Config folder, (containing the build configuration file) and the Test_Migrations folder (unsurprisingly containing testing migrations) in the _Migrations location:
I was using the same build config for 2 purposes; 1) to build the schema migrations from the base version, by passing it the Schema_Migrations location dynamically and 2) then building the tSQLt framework and testing objects by passing it the Test_Migrations location dynamically.
This actually worked surprisingly well, but even beyond this – the same method can be repurposed, or added to, by augmenting your testing scripts and adding a data insertion task (as an additional script or group of scripts). In my folder, I can simply add a migration like this:
Because of course I like dogs.
and once pushed to the repository and the build has run we should be able to verify our testing data is present:
A bonus win for this step of course, is that where Devs have their own Flyway config files locally for their development databases they could also overwrite this behavior and point the testing and/or data scripts at their own database so they have some seed data to work with too!
2 – Add a data generation step to the pipeline
There are SO MANY technologies out on the inter-webs for generating data. SO MANY. Many of them also have a command line or PowerShell module that we can use to easily invoke them against a target, especially if that target is going to be persistent like my Flyway Azure SQL Build DBs!
There are numerous ways to invoke tools and applications and fortunately good CI/CD tools like Azure DevOps offer multiple ways to, for instance, run PowerShell or CLI steps from within the pipeline – so we could easily invoke SQL Data Generator on a VM or physical machine we have an Azure DevOps agent on – but this thinking also opens up the possibility of using something like Chocolatey to dynamically install the software on the Azure DevOps hosted pool VM during build (for the Redgate tools at the moment I suppose you’d need a Windows VM).
I will be writing a future blog post about this step because it sounds _very_ interesting, but I’m not sure yet what can be done specifically using Chocolatey or if I’ll have to look elsewhere, although I have read this post in the past (thanks Paul!) detailing limitations and a great workaround using Azure DevOps, so it’s likely that’ll be my first port of call!
Just to give you an idea of end result with SQL Data Generator specifically though:
3 – Use existing data, don’t generate
Ok this one is going to be controversial already, I can tell! Let’s all stay calm!
The best data to be tested is our data. What we have in Production is what will have these changes deployed to it… eventually! So shouldn’t we just test against that? Well. Maybe, maybe not depending on what is in there.
There’s a few methods to achieve this – my personal favorite would be to use a SQL Clone, spin that up on a build VM rather than using an Azure SQL DB, and we can have all the data in an instant. Of course if we hold any sensitive PII/PHI then we should ensure that is protected first!
Of course there are lots of other options, like restoring a backup or spinning up a container etc. and these can all just be a stage in the YAML file before invoking Flyway but the point is, if we use an existing copy of our Prod database from some source or another, it will have 2 things we really care about:
Data. Ready to go, ready to test, ready to give us the best possible insight into our changes.
The flyway_schema_history table. Instead of running EVERY migration we’ve ever written, which could take a while for a large team, we run only the latest migrations to check that they would deploy happily to the Production target.
To get this stage to work though, you would need to do a couple of things differently:
The build DB would have to be created from the clone/backup/other every time instead of simply cleaning the schema down.
You would need to remove the Flyway Clean step from the pipeline in my previous post, because it would otherwise drop all the tables (and then we wouldn’t have any data!)
By extension, this also makes the callback to remove the tSQLt objects void, so you can remove that too.
4 (Bonus Method) – Script Migrations
In Flyway v7 the team added the ability to also run script Migrations and Callbacks which mean it is possible to invoke .ps1, .bat, .cmd, .sh, .bash, and .py files as part of the version control > build and migration process.
This means that you can include a script to invoke any loading or processing of data you may prefer – you could invoke a data generation utility, data masking and of course anything else that can be invoked with these file formats. A good example of this might be calling Data Generator as above, or you could use DBATools, DTM Data Generator or even a more platform agnostic approach by using a Hazy generator to produce and then load an incredibly realistic data set.
There are a lot of different ways to generate data, you can generate completely synthetic data, you can mask data or use Prod data, it’s up to you! Ultimately it will just for another part of your pipeline – just be careful of ordering! You don’t want to try generating data into a table that hasn’t been built yet.
Respect your YAML file and you’ll get schema, data and unit tests and this will lead to one thing. Greater insight, earlier.
“If you truly have faith in your convictions, then your convictions should be able to stand criticism and testing.” – DaShanne Stokes
Welcome fellow Test–Driven-Development enthusiasts… is what I would say if i actually ever did TDD and didn’t just, you know… write regular unit tests after the fact instead.
I’m going to be honest, I love the idea of TDD but have I ever actually been able to do it? No. Have competent developers been able to do it successfully? Yes, of course. Don’t know anything about TDD? You’re in luck! Click here for an introduction (don’t worry though, THIS post is not going to be about TDD anyway, so you can also keep reading).
But one thing we can all agree on is that testing is pretty important. Testing has evolved over the years though and there are a million-and-one ways to test your code, but one of the most difficult and frustrating things to test, from experience, is database code.
Some people argue that the days of testing, indeed, the days of stored procedures themselves are gone and that everything we do in databases should be tested using a combination of different logic and scripting languages like Python or PowerShell… but we’re not quite there yet, are we?
Fortunately though we’re not alone in this endeavor, we have access to one of the best ways to test T-SQL code: tsqlt. You can read more about tsql at the site here but in short – we have WAYS to test your SQL Server* code. The only problem is, when you’re using a migrations approach… how?
*There are also many ways to unit test code from other RDBMS’ of course, like utPLSQL for Oracle Database or pgTAP for PostgreSQL – would this method work for those? Maybe! Try adapting the method below and let me know how you get on!
I’ve already talked about how implementing tests is easier for state based database source control in a previous post because we can easily filter tests out when deploying to later stage environments, however with migrations this can be a real pain because you have to effectively work on tests like you would any normal database changes, and maybe even check them in at the same time – so ultimately, they should be managed in the same way as database schema migrations… but we can’t filter them out of migrations or easily pick and choose what migrations get run against test and Prod, without a whole lot of manual intervention.
Basically. It’s a mess.
But during my last post about Flyway I was inspired. This simple and easy to use technology just seems to make things really easy and seemingly has an option for EVERYTHING, so the question I started asking myself was: “How hard would it be to adapt this pipeline to add unit tests?” and actually although there were complications, it was still easier than I thought it would be! Here’s how you can get up and running with the tSQLt framework and Flyway migrations.
1 – Download the scripts to create the tSQLt framework and tests from the site
Ok this was the easiest step of them all, largely because in the zip file you download from the tsqlt website all you have is a set of scripts, first needed to enable CLR and the second to install the tsqlt framework:
As part of my previous pipeline I’m actually using Azure SQL Database as my development environment, where RECONFIGURE is not a supported keyword and where we don’t need to run the CLR script anyway, so all I needed was the tSQLt.class.sql file.
The good thing about this is that we can copy it across into a migration and have this as our base test class migration, and then any tests we write on top of it will just extend it – so as long as we remember to update it _fairly_ frequently with any new tsqlt update, we should be fine! (Flyway won’t throw an error because these are non persistent build objects, so no awkward checksum violations to worry about!)
2 – Adapt the folder structure in the repository for tests
I added 2 new folders to my _Migrations top level folder, a Schema_Migrations folder and a Test_Migrations folder. When you pass Flyway a location for migrations, it will recursively scan folders in that location looking for migrations to run in order. I copied the migrations I had previously into the Schema Migrations folder and then my new tSQLt creating migration into the Test Migrations folder. This allows them to be easily coupled by developers, whether you’re writing unit tests or practicing TDD:
You’ll have noticed I called my base testing migration V900__ – this is because I do still want complete separation and if we have a V5 migration in schema migrations and a V5 testing migration, we’re going to have some problems.
3 – Add a callback to handle removal of the objects
As I was putting this together, I noticed that I could use flyway migrate to run the tSQLt framework against my Dev database, but every time I tried to then flyway clean that database I got a very nasty error stating that the tSQLt assembly could not be removed because of dependent objects.
Flyway does not handle complex dependencies very well unfortunately, that’s where you’d use an industry leading comparison tool like SQL Compare so, with some advise from teh wonderful Flyway team, I set to work on a callback. A callback is how you can hook into Flyway’s own processes, telling it to do something before, during or after certain commands. In my case we were going to remove all of the tSQLt objects prior to running Flyway clean to remove the rest of the schema. To make it future proof (in case objects are added or removed from the tSQLt framework), I wrote a couple of cursors to go through the different objects that were dependent on the assembly and remove them, rather than generating a script I know to have all of the tSQLt objects in right now. You can find the code for the callback in my GitHub here, you are welcome to it!
All you have to do is name it beforeClean.sql and ensure it is in the directory with your other sql migrations so that it will pick this up and run it – I put it in my Test_Migrations folder, because I only want it to run this callback when cleaning the build DB, as this is the only place we’re utilizing automated unit tests… for now!
4 – Update the Azure DevOps pipeline
I’ve got my callback, I’ve got my tSQLt migration and the folder structure is all correct and is pushed to Azure DevOps but naturally it is breaking the build *sad* but fortunately all we now have to do is update the YAML pipeline file:
You will notice a couple of important things that I have highlighted above:
I’m cleaning the build schema using the Test_Migrations repository – this is because that is where my callback is and I need that to run before the clean otherwise it will fail due to the tSQLt assembly issue (line 17)
I am running the migrate for the tests and the schema separately in the file, instead of just calling flyway to recursively run everything in the _Migrations folder. This is because I want them to be 2 separate steps, in case I need to modify or remove either one of them, or insert other steps in between and so that I can see the testing output in a separate stage of the CI pipeline (lines 23 and 29).
Caveat: As a result of (Option 2) running the 2 processes separately, it means running Flyway twice but specifying the Schema_Build and Test_Build folders in the YAML as being mapped to Flyway’s sql directory (lines 16 and 22 in the file above) but the problem this causes is that the second time Flyway runs, when it recursively scans the Test_Migrations folder it will not find the migrations that are present in the Flyway_Schema_History table, resulting in an error as Flyway is unable to find and resolve the migrations locally.
The way to fix this though is pretty simple – you find the line in the Flyway Config file that says “IgnoreMissingMigrations” which will allow it to easily continue. We wouldn’t have to worry about this setting though, if we were just recursively looking to migrate the Schema and Test migrations in the same step (but I’m a control freak tee-hee).
Now, once committed this all runs really successfully. Velvety smooth one might even say… but we’re not actually testing anything yet.
5 – Add some tests!
I’ve added a single tSQLt test to my repository (also available at the same GitHub link), it was originally created by George Mastros and is part of the SQLCop analysis tests – checking if I have any user procedures named “SP_”, as we know that is bad practice – and I have wrapped it up in a new tSQLt test class ready to run.
You’ll notice I also have a V999.9__ migration in the folder too, the purpose of this was to ‘top and tail’ the migrations; first have a script to set up tSQLt that could be easily maintained in isolation and then end with a script that lets me do just 1 thing: execute all of the tests. You can do this by simply executing:
and we should be able to capture this output in the relevant stage of the pipeline.
Some of you may be asking why I chose to have the run unit tests as part of the setting up of the testing objects – this was because I had 2 options:
I’m already executing scripts against the DB with Flyway, I may as well just carry on!
The only other way I could think to do it was via a PowerShell script or run SQL job in Azure DevOps but the 2 plugins I tried fell over because I was using a Ubuntu machine for the build.
So naturally being the simple person I am, I opted for 1! But you could easily go for the second if you prefer!
6 – Test, Test, Test
Once you’ve handled the setup, got the callback in place (and also followed the steps from the last blog post to get this set up in the first place!) you should be able to commit it all these changes and have a build that runs, installs tSQLt and then runs your tests:
I realize there are a lot of “Warnings” in there, but that is just Azure DevOps capturing the output, the real part of this we’re interested in is lines 31-40 and if we clean up the warnings a little you’ll get:
|Test Execution Summary|
|No|Test Case Name|Dur(ms)|Result |
Test Case Summary:
1 test case(s) executed, 1 succeeded, 0 failed, 0 errored.
But if I introduce a migration to Flyway with a new Repeatable Migration that creates a stored procedure named SP_SomeNewProc…
|Test Execution Summary|
|No|Test Case Name|Dur(ms)|Result |
Test Case Summary:
1 test case(s) executed, 0 succeeded, 1 failed, 0 errored.
It even tells us the name of the offending sproc:
All I have to do now is make the corresponding change to remove SP_ in dev against a bug fix branch, push it, create a PR, approve and merge it in and then boom, the build is right as rain again:
Thus bringing us back into line with standard acceptable practice, preventing us from delivering poor coding standards later in the pipeline and ensuring that we test our code before deploying.
Just because you adopt a more agile, migrations based method of database development and deployment, doesn’t mean that you have to give up on automated testing during Continuous Integration, and you can easily apply these same principles to any pipeline. With just a couple of tweaks you can easily have a fully automated Flyway pipeline (even xRDBMS) and incorporate Unit Tests too!
“Until one has loved an animal, a part of one’s soul remains unawakened.” – Anatole France
TL;DR: If you love to walk, jog or run – sign up to a fantastic cause here to support a shelter in need!
The Full Story: This is going to be a reasonably short post, because I’m hoping if you’ve made it this far, you’ll read to the end.
This August, I will be “running for the Ridge“. This is an event taking place (virtually of course) where one can run, jog or walk any distance you like, as long as it is above 5km (3.11 miles) for Adults or 2km (1.25 miles) for ages 12 and below, wherever you like, anytime throughout the months of August and September.
For the last 2 months, I have aspired to be a healthier version of myself – I have managed to adopt healthier eating habits; removing a lot of the processed sugars and oils I was eating, not drinking alcohol during the week, always exercising at least once a day etc. and as a result I have managed to lose just over 12kg (26.5 lbs) and I feel so much better for it.
Running has been an instrumental part of the improvement in my quality of life and now I’m happy to be supporting something close to my heart with it.
So why this? What makes “running for the Ridge” special?
Well, because it’s in aid of Jacob’s Ridge – an animal shelter in Spain that, thanks to the impact of one nasty global pandemic, is sorely in need of donations in order to stay open and safeguard the well-being of the animals that it looks after, a cause deserving of support all in itself.
So that’s why this is special – because you can directly do something to:
Positively impact the lives of rescued animals
Positively impact your own health and mental well-being (in the form of exercise)
Do something nice with friends or family (at a responsible social distance where necessary)
To me, this just sounds like the easiest decision ever; run/jog/walk a distance of my own choosing, on my own comfortable running route and animals benefit from that?Amazing.
You will even get an awesome sustainable wooden medal for participating! And the cost? 12 GBP per person. That’s it. You can even get sponsored by friends, family and/or colleagues too if you want to contribute more!
So that’s it. That’s all I wanted to say. I’ll be running for the Ridge this August and you know – if you have a spare couple hours or you enjoy walking/jogging/running alone or in a group, think about putting it to good use and let’s get out, have some fun and do some good!
“Doing the best at this moment puts you in the best place for the next moment.” – Oprah Winfrey
One of the things that comes up the most when I’m talking to people about database classification is multi language support. Often people are resigned to:
Waiting on additional language support from their vendor of choice (defer)
Writing their own software to do it in their native tongue (shift effort)
Going the long way around i.e. manual classification (brute force)
Do nothing (abandon)
There are currently over 7000 languages in the world and when it comes down to classification, you can’t expect that whatever your choice of technology for this classification workflow will have packs or native support for your languages and/or the languages your databases have been configured in.
Now, deferring and abandoning as above really aren’t options any more and if you’ve read my post here then you’ll know that database classification is a key concern to address before trying to base any business decisions on data. But how then do we tackle t he above issues of shifting effort and brute force?
Sometimes it’s about finding ways to still be effective and still take advantage of the benefits such a classification technology affords you.
With this in mind – I wanted to share with you 3 tips and tricks in SQL Data Catalog to make it work better for your language(s):
1 – Customize your taxonomy to be relevant / Passen Sie Ihre Taxonomie an, um relevant zu sein / Ajustați-vă taxonomia pentru a fi relevante
The great thing about SQL Data Catalog is that it is incredibly easy to alter the taxonomy to best reflect your view of your data. Not all standard fields will be relevant to you and some fields that are ABSOLUTELY necessary will be missing.
So add them in!
Let’s say, in my country of Christopia, it is mandatory that people identify a certain government specified “Classification Category” from 1-5, where 1 is the most sensitive and 5 is basically public knowledge. I can add that as a category and tag:
That goes equally for fields that we will not be making use of. If it is irrelevant then remove it. Simple.
I will say this for the above step though – don’t think of this as being the easiest and most obvious part, in many ways this is actually the hardest step because you will need buy in. Guess who also cares how you describe data stored in databases? Data Governance, Legal, InfoSec, Database Admins… the list goes on. Just make sure you do the 2 most fundamental parts that are required for success here – communicate & collaborate. If you have buy in from other concerned parties, you will establish a ‘common language’ with which data can be described and consumed, and based on which complex data-driven business decisions can be made.
2 – Customize your search rules / Passen Sie Ihre Suchregeln an / Personalizați-vă regulile de căutare
Going through every single column is a colossal waste of time, especially when we have more than about 6 databases. I’ve spoken with people who have spent day, weeks and even months working on classifications in SSMS or Excel, only to find out that 90% of their columns matched common regular expressions, and moreover that 50-60% of their columns were out of scope anyway!
Fortunately within SQL Data Catalog, you can configure the rules that the tool runs against columns and tables to identify any clearly sensitive columns which it will then suggest to you each time it finds something that matches.
In my experience, between the analytical filters and the suggestions, you can make excellent progress very very quickly, and do you know the best part?
Data Catalog is evergreen.
It constantly keeps track of your ever evolving schema and checks the suggestions against them and that means if somebody makes a change to the schema (like in my post from last week) you’ll get the suggestions coming through and you can easily catch anything that might put your sensitive data in harms way.
3 – Use PowerShell / Verwenden Sie PowerShell / Utilizați PowerShell
Ok I get it – this is my catch all for everything whenever I talk about using Data Catalog – but did you know, using PowerShell with Data Catalog is really easy.
Like even I can do it easy.
But the good thing is, when you use PowerShell you’re not constrained by particular functions or waiting for things to happen, rather you can take control of the process and fill it with your own logic! There’s even a full on guide that shows you:
How to generate an authorization token
A full cmdlet reference
Complete worked examples
and the best part is it’s just part of the documentation! Don’t you love it when people keep documentation up to date and useful? Yeah me too!
Bonus Tip / Bonus-Tipp / Sfaturi Bonus
If you work in an environment that is multi-lingual and has a mix of languages associated with different databases which haven’t been standardized, that should just highlight further that this is a team game.
Pairing and mobbing, just like in coding, can be vital and powerful tools at your disposal – so it’s important to have the right people in place to handle this.
Classification is not infallible, much as you may wish to think it is. Data is data, and if there’s one thing I have learned about data it’s that it always manages to surprise you, so a second pair of eyes (or more) can be invaluable in helping you weed out some of the problems you may be facing.
“Some people try to make everything complicated, be the person who tries to make everything simple.” – Dave Waters
Simplicity is in my blood. That’s not to say I am ‘simple’ in the sense I cannot grasp more than the most basic concepts, but more that I am likely to grasp more complex problems and solutions when they are phrased in simple ways.
This stems from my love of teaching others (on the rare occasion it falls to me to do so), where I find the moment that everything just ‘clicks’ and the realization comes over them to be possibly one of the most satisfying moments one can enjoy in life.
Now recently I’ve been enjoying getting my head around Flyway – an open source JDBC based migrations tool that brings the power of schema versioning and deployments together with the agility that developers need to focus on innovation in Development. There’s something about Flyway that just… ‘clicks’.
It doesn’t really matter what relational database you’re using; MySQL, IBM DB2, even SAP HANA! You can achieve at least the core tenants of database DevOps with this neat and simple little command line tool – there’s not even an installer, you just have to unzip!
Now I’ve had a lot of fun working with Flyway so far and, thanks to a few people (Kendra, Julia– i’m looking at you both!) I have been able to wrap my head around it to, I would say, a fair standard. Caveat on that – being a pure SQL person please don’t ask me about Java based migrations, I’m not quite there yet!! But there is one thing that I kept asking myself:
“When I’m talking to colleagues and customers about Database DevOps, I’m always talking about the benefits of continuous integration; building the database from scratch to ensure that everything builds and validates…” etc. etc. so why haven’t I really come across this with Flyway yet?
Probably for a few reasons. You can include Flyway as a plugin in your Maven and Gradle configurations, so people writing java projects already get that benefit. It can easily form part Flyway itself by virtue is simply small incremental scripts and developers can go backwards and forwards however and as many times as they like with the Flyway Migrate, Undo and Clean commands, so is there really a need for a build? And most importantly, Flyway’s API just allows you to build it in. So naturally you’re building WITH the application.
But naturally when you’re putting your code with other people’s code, things have to be tested and verified, and I like to do this in isolation too – especially for databases that are decoupled from the application, or if you have a number of micro-service style databases you’d want to test all in parallel etc. it’s a great way to shift left. So I started asking myself if there was some way I could implement a CI build using Flyway in Azure DevOps, like I would any of the other database tooling I use on a regular basis? Below you’ll find the product of my tinkering, and a whole heap of help from Julia and Kendra, without whom I would still be figuring out what Baseline does!
Option 1) The simplest option – cmdline
Flyway can be called via the command line and it doesn’t get more simple than that.
You can pass any number of arguments and switches to Flyways command line, including specifying what config files it’s going to be using – which means that all you have to do, is unzip the Flyway components on a dedicated build server (VM or on-prem) and then, after refreshing the migrations available, invoke the command line using Azure DevOps pipelines (or another CI tool) to run Flyway with the commands against a database on the build server (or somewhere accessible to the build server) and Bingo!
And that’s all there is to it! You get to verify that all of the migrations up to the very latest in your VCS will run, and even if you don’t have the VERY base version as a baseline migration, you can still start with a copy of the database – you could even use a Clone for that!
But yes, this does require somewhere for Flyway to exist prior to us running with our migrations… wouldn’t it be even easier if we could do it without even having to unzip Flyway first?
Option 2) Also simple, but very cool! Flyway with Docker
Did you know that Flyway has it’s own docker image? No? Well it does!* Not only that but we can map our own version controlled Migration scripts and Config files to the container so that, if it can point at a database, you sure as heck know it’s going to migrate to it!
This was the method I tried, and it all started with putting a migration into Version Control. Much like I did for my post on using SQL Change Automation with Azure SQL DB – I set up a repo in Azure DevOps, cloned it down to my local machine and I added a folder for the migrations:
Into this I proceeded to add my base script for creating the DMDatabase (the database I use for EVERYTHING, for which you can find the scripts here):
Once I had included my migration I did the standard
Git add .
Git commit -m "Here is some code"
and I had a basis from which to work.
Next step then was making sure I had a database to work with. Now the beauty of Flyway means that it can easily support 20+ RDBMS’ so I was like a child at a candy store! I didn’t know what to pick!
For pure ease and again, simplicity, I went for good ol’ SQL Server – or to be precise, I created an Azure SQL Database (at the basic tier too so it’s only costing £3 per month!):
Now here’s where it gets customizable. You don’t NEED to actually even pass in a whole config file to this process. Because the Flyway container is going to spin up everything that would come with an install of Flyway, you can pass it switches to override the default behavior specified in the config file. You can adapt this either by hard-coding strings or by using Environment Variables alongside the native switches – this means you could pass in everything you might need securely through Azure Pipeline’s own methods.
I, on the other hand, was incredibly lazy and decided to use the same config file I use for my Dev environment, but I swapped out the JDBC connection to instead be my Build database:
I think saved this new conf file in my local repo under a folder named Build Configuration – in case I want to add any logic later on to include in the build (like the tSQLt framework and tests! Hint Hint!)
This means that I would only need to specify 2 things as variables, the location of my SQL migrations, and the config file. So the next challenge was getting the docker container up and running, which fortunately it’s very easy to do in Azure Pipelines, here was the entirety of the YAML to run Flyway in a container (and do nothing with it yet):
So, on any changes to the main branch we’ll be spinning up a Linux VM, grabbing Docker and firing up the Flyway container. That’s it. Simple.
So now I just have to pass in my config file, which is already in my ‘build config’ folder, and my migrations which are in my VCS root. To do this it was a case of mapping where Azure DevOps stores the files from Git during the build to the containers own mount location in which it expects to find the relevant conf and sql files. Fortunately Flyway and Docker have some pretty snazzy and super clear documentation on this – so it was a case of using:
-v [my sql files in vcs]:/flyway/sql
as part of the run – though I had to ensure I also cleaned the build environment first, otherwise it would just be like deploying to a regular database, and we want to make sure we can build from the ground up every single time! This lead to me having the following environment variables:
As, rather helpfully, all of our files from Git are copied to the working directory during the build and we can use the environment variable $(Build.Repository.LocalPath) to grab them! This lead to me updating my YAML to actually do some Flyway running when we spin up the container!
Effectively, this will spin up the VM in ADO, download and install Docker, fire up the Flyway container and then 1) clean the target schema (my Azure SQL DB in this case) and 2) then migrate all of the migrations scripts in the repo up to the latest version – and this all seemed to work great!*
*Note: I have an enterprise Flyway licenses which enables loads of great features and support, different version comparisons can be found described here.
So now, whenever I add Flyway SQL migrations to my repo as part of a branch, I can create a PR, merge them back into Trunk and trigger an automatic build against my Flyway build DB in Azure SQL:
Getting up and running with Flyway is so very very easy, anyone can do it – it’s part of the beauty of the technology, but it turns out getting the build up and running too, when you’re not just embedding it directly within your application, is just as straightforward and it was a great learning curve for me!
The best part about this though – is that everything above can be achieved using pretty much any relational database management system you would like, either via the command line and a dedicated build server, or via the Docker container at build time. So get building!