“Keep it simple, stupid!” – My year 12 & 13 English Lit. Teacher
Recently I’ve been fascinated with something really cool. A couple of my colleagues at Redgate wrote a GitHook which allows you to easily switch branches using Redgate SQL Clone. You can see the hook here with full instructions – and I thought it was pretty neat.
But it got me thinking – I’ve posted a lot about when people are using just Azure SQL DBs (PaaS), about Masking and DB Change Automation, but when you’re using AzureSQL for Dev and Test DBs as well as Prod, you still don’t have the same agility one would expect from a local copy, like a clone.
But, the above GitHook leverages PowerShell (among some other fancy wizardry) so, what if we could do this exact same thing, using the PowerShell Az module to dynamically create and switch Azure SQL DBs in our own private resource groups every time we checkout a branch?
I don’t imagine it would be fast because I’m restricted to using very low tier SQL DBs by my tiny allowance of (as Kendra Little calls them) “Azure Bucks”, but it should absolutely be possible!
So I decided to write a PowerShell script to do just that.
The first question I had to come up with an answer to was, how do I replace the Clone “Image” in this process, because I need something that is effectively a copy of our Production (or as near as possible) environment so we have something to base EVERY copy from – so I created the idea of a Golden Copy within the script; effectively this golden copy could be created by copying masking and copying back down from Production using something like Redgate Data Masker and my scripts here in GitHub but as a stop-gap, if it doesn’t find one in the Resource Group and Server you select, it will simply create one from your Dev DB. Best answer I could think of, you’re welcome to improve it!
All that remained was effectively to go through and just mimic the functionality of the Clone script but using Az: so if you are switching to a new branch where you don’t already have an existing Dev DB, then you get a new copy of Golden. If you’re switching to a branch you’ve checked out before, it renames the DBs to swap you back to the correct branch.
Here is an empty AzureSQL DB called DMDatabase_Dev:
When I now issue the git checkout “feature/newfeature” command it asks me to sign in to my Azure account:
and then gets to work:
And… that was it really.
I now have an Azure SQL DB called DMDatabase_Dev_master as I switched from the master branch, and I have a branch new DMDatabase_Dev DB that I can use for my featurebranch. You’ll notice I didn’t include -b in my git command, let’s assume a colleague is already working on this branch. I can now just update my copy (of my golden copy) with their work:
And we’re good to go!
But now if I switch back to my main branch, the object is gone and I can carry on with work on this branch:
It was really straightforward I can’t believe I haven’t seen this in use in more places, but hey guess what? The PowerShell is yours right here if you want it:
You should have a Dev DB and you should update the values at the top of the script with the Dev DB name, server and resource group it is in
The script make reference to and creates a Golden copy DB so that you have something you should always be creating from, for consistency – so when you get started, create your own “Golden copy” back from Test/UAT or something if you can – maybe using the script mentioned above – it should be the name of your Dev DB appended with “_Golden”
If you want to change how it is authenticated so you don’t have to enter your credentials each time, then go for it – this was just the simplest method for me (and it’s currently 11:05pm so I’m going to bed!)
Feel free to improve it, I’m sure there are plenty improvements that can be made, but it’s a starter for 10 for anyone out there just getting started with development in Azure SQL. Plus it’s kinda neat!
“Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives.” – William A. Foster
I know what you’re thinking.
“Chris. Your title looks like it was written to be a corporate whitepaper that I see ads for when I’m browsing social media; it should have a CLICK HERE button, a boilerplate photo of a smiling person holding a pen and it should say something like ‘executives hate them, find out their secret here!’“
But something has become immediately obvious to me in the last few months, I still speak to people daily who are forced to:
Work on empty (schema-only) / heavily subset databases
Work on old, out of date and/or irrelevant data
Make decisions without knowing enough about their data or what they hold
When speaking to them though it becomes immediately obvious that the reason there is no dedicated option available for developers is actually not related to the “traditional” problems that one would expect. You would naturally assume that the reason for not refreshing these environments is because of the large amounts of space or time taken to refresh often enough, or even that ‘we simply cannot due to sensitive PII and regulatory concerns‘.
No. In fact it comes down to, as all things do, time andmoney.
In the most recent State of Database DevOps report (2020 that is), a whopping 70% of 2000+respondents replied that they were using a shared development database and this comes with a whole heap of associated problems, like poor code quality, looser controls around sensitive data and defective deployments. Just these figures alone already point to the solution being to spin up copies for developers on demand and it’s not like we can’t do that. There is SO much technology in the world, across almost all database platforms, that will allow us to virtualize, containerize, sanitize… (effectively all of the ‘izes‘) our databases so that we can have full, safe, realistic copies as frequently as we like. So what is stopping us?
From experience, it’s justification*. People going to senior stakeholders and saying “we need this technology” and hearing a cacophony of classic business challenges back: “but is it broken?”, “do we really NEED it?“, “it costs HOW much!?!“, “how much time will it take to implement?” etc… It’s dev and test hygiene, not a sexy major modernization project like using Azure Arc, using Blockchain or creating Artificial Intelligence. Who cares that developers have to share a database? We’ve got bigger Tofish to fry!
*Sometimes, but much less frequently, it’s down to complexity of implementation – but we’ll leave that one aside for now!
As you will know from my last post on why now is the time to adopt better working practices, it’s important for us to highlight the gains that can be made from newer, updated practices, and why now is not the time to be closing our minds off to a better way of life. It’s not going to be easy to sum this up in 5 points, and there are many other benefits to solid database provisioning but these are in my opinion, the ones that will revolutionize the way you develop.
Very importantside notefor this blog post: there are lots of subjective key practices, processes and tools that can form part of the “database provisioning process” specifically and they will vary wildly by experience, opinion and company – so for the purposes of the below I will be describing the benefits of a process that involves 3 primary components / steps, given these are the three I tackle most often:
Data Identification and Classification / Cataloging
Data De-Identification i.e. Data Masking
Data Provisioning i.e. Real Time Database Cloning / Provisioning
1 – Increase developer happiness / contentedness
Developers are employed to do 1 thing: innovate. It’s even in the name! Developers are on the cutting edge and are focused on providing value to end users as quickly and efficiently as possible, with shortened release cycles, incremental stories and optimized workflows they can produce this innovation. But a big part of the story is the setup.
Even if you’re working to a more agile methodology it is hard to deliver and test changes which are, in development environments, fundamentally destructive and experimental if you are sharing a workspace with multiple colleagues. Writing on shared Word documents can be frustrating at the best of times, so how can developers be expected to produce high-quality, rigorously tested, game-changing code when at any minute another developer can take the environment down, cause it to run slowly, or overwrite those changes with their own? When you cannot produce changes in an isolated, sandbox environment where they can be individually assessed, re-worked and improved then you have no guarantee that it should be promoted.
All of these sound like arguments that are focused around the production of code, but in fact these issues can all have a huge impact on something that is widely under regarded and scrutinized: developer happiness.
Developers are the people who make stuff go, and without them feeling content and valued in their roles, we can’t expect our productivity and product quality to reflect that – so when developers witness the poor management of their code, something they have worked so hard on as it goes sliding down the priority list or gets rolled back or overwritten etc. they don’t feel motivated to continue doing the best that they can do.
With dedicated environments for dev and test, for different branches, pull requests etc. developers can finally work on innovative and exciting projects, and optimize the code that goes out the door to end users.
2 – Develop a common language about data & make better decisions
It’s very hard to speak about things when you use different language to describe the same thing. That much is obvious. In the United Kingdom alone we have many different words for bread rolls. So when someone comes into a sandwich shop in London and asks for a “Stotty”, can you guarantee that the person serving will know exactly what they mean, exactly when they say it?
No. There will be a gap where some translation will be required: some “down-time“, if you will. Now imagine taking something as simple as a bread roll and applying it to an enterprise data estate… you’re going to have a very bad time.
As I talked about in my blog posts here (importance of database classification) and here (classifications role in DevOps) before you can really make a fully informed decision about your data, you must know 2 simple things:
What data you hold
Where your data is
I should hurriedly add that I don’t just mean sensitive data now – all data deserves to be classified because whether you’re a full stack developer adding a column to a table you’ve never used before, an auditor trying to carry out a Data Protection Impact Assessment (DPIA) and trying desperately to include the database, or you’re a BI developer setting up some new reports or processes, you’re going to need to know about the data. This is where people have questions, and this is where you shouldn’t have to reply on anecdotal knowledge or being pushed around from one person to another at the company who supposedly “might be able to help“.
Better insights into data leads to better practices, less waiting (waste reduction) and greater insight. When we then act on this insight we move faster and deliver greater value in our pipelines.
Have you picked up on the trend yet? How all of these are going to end? Well don’t spoil the ending for those who haven’t, they’ll have to wait fort he conclusion!
3 – Move faster and better enable the DevOps pipeline
It’s apt that I’m listening to an amazing EDM remix of the Green Hill Zone from Sonic when writing this section, but isn’t this just what we need as a business? We want to be able to move faster, or to put it in more ‘agile’ terms, we need to be able to pivot and adapt with only a moments notice. Until now, the database has been a monolithic and difficult to steer behemoth, and it shows in our processes.
A tangible example of what I mean when I say “move faster”, is branching. It’s fairly commonplace now for a developer to be able to clone a repository and checkout a specific branch, create new branches etc. without fear of switching between those branches and what it might entail. On a dev environment, especially when one is working database-first with your changes (it does make sense to know how the changes will impact the database first – that’s all I’m saying) it is, without a reasonable process in place, exceptionally difficult to easily switch between branches and keep work separate.
This often forces developers to stick to one environment when changes are all made in tandem and can play havoc when it comes to capturing those changes in the right place – a manual state-based comparison of a dev database with multiple branches of work on it to a target upstream could be disastrous.
4 – Minimize space constraints on new copies, on premise or in the cloud
Space is always a big player in these conversations, and for some it’s enough to boil it down to “well just how much space can we save??” and that’s enough to put a dollar value on the ROI, and people storm ahead with a solution (that’s not always right for them).
But space is a very real problem, much as we (as technology professionals) like to believe that in these modern times of cloud-native solutions, easily scaled serverless-compute VMs and Big Data Clusters, we know there are still a LOT of people out there firefighting legacy, necessary technology and wrestling with what they CAN get out of backups or their SAN tech.
Even using cloud providers costs money, data egress and ingress costs $, BLOB storage costs $, additional security measures cost $. So it’s really not ideal when our databases, for historical reasons or by virtue of the sheer AMOUNT of data we hold and process, are 5, 10, 50, 100TB+, because we’re going to be struggling with this Dev/Test issue still for years to come.
As before with point 3, database virtualization has come of age and has now we have a lot of different solutions from containerization through DBaaS that can aid us in minimizing the amount of space that we ACTUALLY require, meaning we have less money that we need to pour into maintaining large, unwieldy Dev/Test environments or paying a large bill for the privilege of doing so in the cloud (and when developers will be using their dev machines anyway it just makes sense to see what we can do to leverage this existing hardware).
Whilst this one doesn’t directly add specific value to the end of the pipeline, or speed up this delivery, it can help reduce overhead costs associated with the infrastructure needed when providing this value.
5 – Work on realistic data without worrying about data breaches
This is probably one of the most obvious reasons given that I tend to blog about data regulations and compliance ALL THE TIME but I feel like I need to keep saying this.
If you remove all of the data from development and test database copies, this will not help with development and developers will have nothing meaningful to go on, nor any testing that isn’t limited to pre-defined values.
If you leave all of the data in development and test database copies, all you’re doing is duplicating your attack surface area and creating a lot of potential risks for that data to be surfaced where it shouldn’t be – on the internet, in screenshots, emails and of course, hacked.
So there needs to be a happy medium where we can have both the useful data that gives us insight and intelligence of a full data set, the business logic, trends, demographics etc. that we need during testing or analytics – but it should also be sanitized so that data subjects contained therein cannot be re-identified. Static masking, applied to lower environments allows us to retain the data with none of the data.
Protective measures can be built into the DevOps process from the very beginning as you’ve already seen right here on my blog; so as long as it is a part of the process, and we have multiple controls (or guard rails) that allow us to operate safely and quickly without fearing that same speed will cause us to release any sensitive information, allowing us to focus on one thing, value.
As you’ve seen above, it all comes down to time and money but there are many ways to save andspeed up within a DevOps process by means of a good, solid database provisioning process. Whilst none of these reasons comes with a fixed ROI (unless you have ALL of your pre-prod database storage costs to hand) they contribute to something far better than that:
The ease of delivering value.
In a world where we can be concerned about everything, and where it’s hard to keep up with emerging technologies – it makes sense to start pruning away blockers to the process, the problems that are stopping us from delivering value faster – THAT is the theme and point of this blog post; our end users. We’re already delivering excellent value to them, we trust our developers and teams, but what’s stopping them from moving faster with database changes? Adopting a good provisioning process will mean you start to notice all of the above become true of your database development lifecycle.
“[My] kids haven’t responded to my GDPR requests so I don’t think I’m legally allowed to tell them when dinner’s on the table.” – @mrdaveturner
Ah masking. You would have thought I’d be sick of it by now, no? No, fortunately now, more so than ever, I find myself answering question after question and tackling use-case after use-case. So when I was asked this week:
I thought to myself, well that sounds easy enough… and it was! I know what you’re thinking, c’mon Chris, surely there is more to it? But no, it’s actually pretty straight forward!
I pointed them at the PowerShell module and cmdlets for SQL Provision and the Azure DevOps plugin to automate all of their Provisioning and Masking process, thinking all the while “pffft, they could have made this harder!” and then…
#1Figure out where you want Data Masking to run in your process
This empty Azure deployment stage looks good enough for now! If you wanted to chain other processes either side of it, that’s cool too! Maybe you have your own provisioning process in place and you want to point Data Masker at it to sanitize it? Makes sense to me! For now I’m going to stick with a single agent job for simplicity.
#2 Figure out what is actually going to run Data Masker
Data Masker is a client install and as such will need to be installed on a *gasp*actual machine!
No but seriously, any server you have lying around, physical or VM will do the trick as long as it meets these requirements. Now this Server/VM will need to have an Azure DevOps agent on it already, which of course is the ideal candidate for being the “thing” that calls Data Masker – this could be the Staging/Non-Functional/Pre-Prod environment also of course, so you could copy down PROD and then immediately invoke masking.
The PARFILE you could of course dynamically replace with variables so that it only calls the relevant parameter file for that particular database as well, a nice benefit!
My PARFILE just simply looked like this:
It was calling a local Data Masker set “AzureFun” – now the thing to bear in mind is that Data Masker will run with the Windows authentication credentials that are being run as by the Azure DevOps agent, unless you specify otherwise. In this case because the Azure DevOps agent has the correct permissions to update the databases on this instance anyway I’m fine to use Windows Authentication:
It’s very easy to simply call the command line of Data Masker for SQL Server directly from Azure DevOps, does this same approach work from other CI/CD tools? If they can call executables on the target server then absolutely! So it’s very easily included in the process – you just have to think about where Data Masker is installed and what credentials you’re using for it!
Bonus Point – what about if it’s all Azure SQL Database?
You had to do it didn’t you, you had to say it!
“But Chris, now we know we can call this all from Azure DevOps, what if wewanted to mask and copy Azure SQL Databases into Dev/Test etc.?”
Well actually the good thing is, it’s also pretty similar! When you’re connecting Data Masker to an Azure SQL DB you only need to specify this in the connections in the controller. Again, authentication will likely have to be SQL Auth at this point, and you need to be in Cloud mode, and I’d recommend setting the connection timeout to 10s rather than the standard 5s, but it can still be called as normal from the PARFILE:
So the Data Masker element is reasonably straight forward – that’s the good news. But the thing you REALLY need to stop and think about is:
Where are our Dev and Test copies going to BE?
Option #1: If they’re going to be on VMs or local dev and test servers / developer machines then you could follow a similar approach to one I laid out in this blog post for Redgate in which you create a BACPAC file and split it out on premise before importing it and then provisioning from there. And you could use this code in my Github to achieve something very similar. Caveat: I am no PowerShell guru, who do you think I am? Rob Sewell? Chrissy LeMaire? No. Sadly not. So you can build your own logic around my code though, have at it, I don’t mind! ^_^
Option #2: Keeping everything in Azure. You can copy databases around in Azure and it seems to work pretty well! So I wrote this PowerShell (also in my GitHub for y’all) to effectively copy a PROD DB into the same resource group, mask it and then copy it across to a Dev/Test resource group, dropping the temp copy so as not to incur lots of extra Azure costs (this is just one of the methods I’ve seen people use, again it’s up to you!) – again, see the caveat in option #1 above for my statement on PowerShell! The good thing is, you can use the ‘&’ simply from PowerShell to call Data Masker’s command line.
Either of these options can be run from Azure DevOps also as part of your provisioning or working processes, but instead of including a call to the command line, you can run a fun PowerShell script instead:
Second Conclusion *sigh*
There are lots of ways to get what you need into Dev and Test, but these copies should be masked if they contain personal, identifying information. There are some methods above but there are plenty of others out there on the internet and if you’re not sure about getting started with data masking; try my post here – happy masking!
But it’s happening more and more now. We have to work from home and this is starting to turn up some problems.
And no. I don’t just mean that my wife is now acutely and accurately aware of how annoying I am.
I’m talking about clones. Specifically SQL Clone, and clones do not work great from home.
Now for anyone out there who has never heard of SQL Clone, where have you been?? It’s an incredibly intuitive, reliable tool for rapid provisioning of database copies (*cough* definitely not just lifted from the website) – in any case, it’s pretty darn cool.
One of the coolest things about this technology though is how it seamlessly plugs into the Microsoft Ecosystem by leveraging the VHD(X) technology available in x64 Windows. This means there’s no special file systems, no hard or software “appliances”, it’s very much plug and play (with a little tinkering around ports 14145 and 14146 of course!)
Naturally though it does come with it’s challenges, as does all technology, and it has been highlighted recently more so than ever.
Whilst SQL Clone is a beautiful, elegant and easily automated answer to the database provisioning problem – there are 2 gotcha’s that you must be aware of when you start using the solution. SQL Clone relies on the relationship between the Image file (the centralized, virtualized parent, as it were that exists on a windows fileshare) and the Clones themselves (the 40mb diff disks on the hosts).
Now many people choose to put Clones onto their workstations, which for many of us is our laptops. We have SQL Server Developer installed and we pull down a few clones for different branches, for testing etc. etc. and all is well with the world.
When you’re in the office that is.
When a user queries or works with a SQL Clone and it requires any of the schema/data that was in the original copy and is not in the changes the user has made, a call is made back to the image file (the VHD mounted copy) to fetch it. When you’re in an office setting (with your cup of coffee in one hand and a colleague sat next to the other telling you about their weekend) this is fine because you’re connected directly to the company network and therefore the link between the clone on your laptop and the image file on the fileshare is short, strong and stable.
At home though this isn’t the case. Many of us work on VPNs that are “ok” on internet connections that can only be described as temperamental. So what happens when a clone tries to call back to the image file across this VPN, which is now much further away, across a sea of uncertainty and poor connection?
Bad things happen. Either the Clone cannot connect to the Image and it decides it no longer knows what it is, and it falls into Recovery Pending for a while until the connection is re-established, or if the connection is present it is just so slow.
This isn’t a fault of the tool, I hasten to add, it is just the nature of the technology. Much as we would expect our RDP sessions to be a little laggy if we were based in Greenland and the server was in Australia, it is just part of life.
So… are there solutions? You bet there are!
Option 1: The “jump box”
Many people I have worked with have found that they are still able to leverage SQL Clone whilst working from home by reducing the physical distance between the Clone and the Image, and have done so by introducing a Dev/Test “jump box”.
The way this works is by having an instance of SQL Server available within the company network onto which the Clones are provisioned.
This works great because it means that the link between the Clone and the Image is once again short and strong and stable, relying on the company network, but you can easily connect to or RDP onto this jump box if you need to work with them. Still using your VPN and internet from home? Check! Able to work with a Clone now though? Check!
Option 2: The Cloud
Welcome… to the world of tomorrow!
By which of course I mean, the world that has been available to us for a while. Infrastructure as a Service (or IaaS) allows us to very easily spin up an additional dev/test server which can be used in the interim. Whilst this does incur a little extra cost in infrastructure it still means that you won’t need as much space on the VMs themselves, thank you SQL Clone!
As long as you have a fileshare available in say Azure or AWS and a Windows VM you’re pretty much good to go and once you’ve got Clones and Images up in the Cloud, you’re reliant on the networks of the providers, and I’ve got to say, they’re not too shabby at all!
Sneaky side note: ALL demos I give of SQL Clone actually run on an EC2 VM with a 91GB Image on a fileshare and it works great!
Option 3: Roll-your-own solution
Ok. I realize this one is not really a solution. But the thing we’ve highlighted here is that it’s all about the distance between the Clone and the Image. Those two love-birds cannot exist long distance.
So if you might have another ingeniously simple way of solving this, I would recommend having a read of this magnificent document written by an incredibly clever ex-colleague of mine and the “How It Works” documentation and then let me know what YOU come up with:
There are a few ways of getting around this problem but they’re not always obvious or clear, and at this trying time, you just want solutions you have in place to work. We all do.
So if you are working with any of Redgate’s tools, or even if you just have a question that I or a colleague might be able to help with – please reach out to us, talking to us won’t cost you anything, but it sure as heck might gain you something!