Refreshing SQL Server Development workflows with Redgate SQL Provision

“If you quit on the process, you are quitting on the result.”
Idowu Koyenikan

SQL Provision is really cool. But you knew that, didn’t you? It’s obvious – we get teeny-tiny clones, based on an image with completely sanitized data, that we can use for just about anything in dev and test, and if we break them? Boom! There’s a new one.

I’m not just talking about refreshing Dev & Test environments though, oh no! I’m talking:

  • Clones as baseline with SQL Change Automation – baseline scripts for projects are a thing of the past, goodbye invalid object headaches!
  • Clones every single time you switch a branch – keeping everything separate and not cross-pollinating database work between branches
  • Clones to check Pull Requests instead of relying solely on the code itself in Version Control

Watch my session on all 3 of these from Redgate Streamed back in August: https://www.red-gate.com/hub/events/redgate-events/redgate-streamed/redgate-streamed-global-august-26

But one question always comes up about clones in any workflow and that is – how often should I refresh Images and Clones?

This question obviously depends a lot on the process, but in reality I think the question should be less about clones and more about the images themselves. Clones are transient and can be flipped at a moment’s notice, but the image, or the “clone tax” as Steve Jones calls it, is the thing that takes time, resources and space.

I’m going to have my own go at answering this question as I would in any customer meeting or architecture session – but if you want some excellent detailed advice and examples, check out this awesome documentation page here: https://documentation.red-gate.com/clone/how-sql-clone-improves-database-devops/self-service-disposable-databases-for-development-and-testing

Q: So, how often should we refresh it?

A: It depends on your use of the Clone – how often do you need up to date data?

As a rule of thumb though, I tend to see the following behaviours:

  • Customer Support – overnight during the working week: Where you have people who need data to troubleshoot customer issues, it always helps for that data to be as close to now as possible. You want an image on standby so that at any second a member of support can pull down a copy to look through (if it NEEDS to have sensitive data for this purpose, you can restrict who can create clones from these images by using SQL Clone’s Teams functionality).
  • BI / MIS and Report testing – once a week (if not more often): Business Intelligence and reporting workflows can mean that you’re mostly reading from your clones, in which case they should stay small and you should be able to move seamlessly between them. But if your ETL process puts a very heavy load on your clones (like truncating and re-populating tables), you may cause bloat and need to rethink your refresh frequency – more often where possible, perhaps overnight, so that any transformations are captured in the new images and, by extension, the clones.
  • “BAU” Development (Schema and Static Data Changes) – every 1 or 2 weeks: If you’re not making a large number of changes to your clone, or they are limited to schema and static data only, then you should be absolutely fine with a wider refresh cadence – keeping the clones around for the whole sprint, or only refreshing once during the sprint, means everyone can more easily stay on the same, consistent environment.
  • Ad-Hoc and Test workflows – once per month: There are going to be times when you occasionally need a copy of the live DB, but the fact that the schema is only 99% up to date and the data is a few weeks old isn’t a big deal. You can pull one down from this “cold copy” for any kind of test, destructive or otherwise, or to validate certain behaviours / sense-check whether an update or query will work. It’s also handy to maintain a slightly older copy where possible if you need to start digging into failed updates made in development, so you have a milestone to compare against.

Again – these workflows may vary, and you may need to refresh more or less frequently based on the differences being recorded, bloat, space available on the fileshare, etc., but generally I find customers are pretty happy with this.

Q: Once we have our refresh rate in place – how do we move developers across?

A: This is a great question I get a lot of the time, and it stems from the fact that a developer may have made a few dozen changes to a clone before a frequent refresh blows it away (and they forgot to commit to version control – D’oh!) – so it’s important to bear in mind that development work, and as a result the cloning of environments, is not “cut and dried“. We should give developers a chance to move across as and when they’re ready, so I often end up recommending the workflow below to ease this process.

For the sake of this proposed workflow I’m assuming a couple of things:

  1. The selected workflow is BAU Development and we want to refresh once per week
  2. We have enough space available on our fileshare to allow for 2 (or more) distinct copies of the primary image
  3. Clones are being delivered to jump boxes / VMs within the network that are always connected (and not developer machines), and we can control when they are deleted
  4. We operate on a standard western work schedule where the week begins on Sunday, Saturday and Sunday are considered non-working days and developers typically work anywhere between 8am and 6pm
  5. This can all be automated using SQL Clone’s PowerShell module

Week 1 – Sunday night

  • We create Image A of the primary database from the most recent backup file onto the fileshare, applying data masking

Week 1 – Monday to Friday

  • Developers X, Y and Z create their own clones from Image A as they begin the working week
  • The clones are linked to a Git repo where, using SQL Change Automation, the developers commit all changes they make to their clones throughout the week
  • Developer X finishes with their changes, makes their final commit and push on Thursday and works on a different task on Friday

Week 2 – Sunday night

  • We create Image B of the primary database – with slightly more up-to-date (and sanitized) data, capturing any deployed changes the team committed and pushed to Git previously
  • We retain Image A for now, but check which developers still have clones of it (Developers Y and Z) and either nudge them in the team stand-up that they only have a few days left, or automate an email to those developers warning them their clones are now 1 week old

Week 2 – Monday morning

  • Developer X creates their new clone from Image B and links it to Git ready to start making changes

Week 2 – Tuesday to Friday

  • Gradually, over the course of the week, as Developers Y and Z finish their tasks and commit their changes, they remove their clones and create new ones from Image B
  • A final reminder, as an email or a notification in MS Teams / Slack, goes out on Friday morning that any remaining clones of Image A will be deleted over the weekend

Week 3 – Sunday night

  • Image A with no clones remaining is deleted (or any remaining clones are deleted first) and Image C is created to begin the cycle again
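
If you want to automate that Sunday-night step, a minimal sketch using the SQL Clone PowerShell module might look something like the below. The server names, paths, image naming and the masking step are all assumptions for illustration – check the SQL Clone PowerShell documentation for the exact cmdlets and parameters in your version before relying on any of it.

```powershell
# Rough sketch of the Sunday-night refresh – all names and paths are placeholders.
Connect-SqlClone -ServerUrl 'http://sqlclone-server:14145'   # your SQL Clone server

$sqlInstance = Get-SqlCloneSqlServerInstance -MachineName 'ImageBuilder' -InstanceName 'SQL2019'
$imageShare  = Get-SqlCloneImageLocation -Path '\\fileshare\SQLCloneImages'

# Create this week's image from the latest backup, applying a Data Masker masking set.
# (The New-SqlCloneMask / -Modifications usage is from memory – verify it against the docs.)
$maskingSet = New-SqlCloneMask -Path '\\fileshare\MaskingSets\Primary.DMSMaskSet'
$imageName  = 'Image_{0:yyyyMMdd}' -f (Get-Date)

New-SqlCloneImage -Name $imageName `
                  -SqlServerInstance $sqlInstance `
                  -BackupFileName @('\\fileshare\Backups\Primary_latest.bak') `
                  -Destination $imageShare `
                  -Modifications @($maskingSet) | Wait-SqlCloneOperation

# Retire the two-week-old image only once it has no clones left on it.
$oldImage = Get-SqlCloneImage -Name 'Image_20201004'   # placeholder name for "Image A"
if (-not (Get-SqlClone -Image $oldImage)) {
    Remove-SqlCloneImage -Image $oldImage | Wait-SqlCloneOperation
}
```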

Conclusion

Although this workflow requires the duplication of the central image, it has a number of benefits:

  • It is easily automated using PowerShell
  • The source control process suffers minimal disruption and developers don’t need to rush to finish anything
  • We don’t accidentally destroy developer work – the onus is on the developer to ensure work is committed
  • If, for any reason, the image creation process fails, you still have a persisting image, so developers aren’t blocked from working while you wait for the image process to be completed manually
  • Moving to newer clones is a more organic process
  • If you wanted to maintain an image throughout the week and refresh a second image overnight for more up-to-date data, you can simply re-purpose the principles above. This could then serve a number of different teams and workflows simultaneously

Bonus Point – Naming Conventions

Many people choose to append a date stamp to the images they create, like Image_A_16102020, so we know when it was taken and which is the latest. This is good practice, but be warned: if you’re using clones as a baseline or for branch switching etc., you will need a persistent image name, else that link will break. An alternative is to always use the same name for the most current image and simply rename the older image with the date stamp, e.g. Image_A is current, but before creation of a new Image_A it is renamed to Image_A_16102020 – this will not disrupt the clones that already exist on it, and it allows you to always know which one is the most recent.

Azure DevOps Masking a.k.a “point, no click”

“[My] kids haven’t responded to my GDPR requests so I don’t think I’m legally allowed to tell them when dinner’s on the table.”
@mrdaveturner

Ah, masking. You would have thought I’d be sick of it by now, no? No! Fortunately, now more so than ever, I find myself answering question after question and tackling use-case after use-case. So when I was asked this week:

“Chris, is there a way for us to call Data Masker for SQL Server directly from Azure DevOps?”

I thought to myself, well that sounds easy enough… and it was! I know what you’re thinking – c’mon Chris, surely there is more to it? But no, it’s actually pretty straightforward!

I pointed them at the PowerShell module and cmdlets for SQL Provision and the Azure DevOps plugin to automate their whole provisioning and masking process, thinking all the while “pffft, they could have made this harder!” and then…

“No sorry Chris, is there a way for us to call JUST Data Masker for SQL Server directly from Azure DevOps?”

Ah! Now that’s an interesting one!

#1 Figure out where you want Data Masking to run in your process

An empty Azure DevOps deployment stage is good enough for now! If you wanted to chain other processes either side of it, that’s cool too! Maybe you have your own provisioning process in place and you want to point Data Masker at it to sanitize the result? Makes sense to me! For now I’m going to stick with a single agent job for simplicity.

#2 Figure out what is actually going to run Data Masker

Data Masker is a client install and as such will need to be installed on a *gasp* actual machine!

No but seriously, any server you have lying around, physical or VM, will do the trick as long as it meets these requirements. Now this server/VM will need to have an Azure DevOps agent on it already, which makes it the ideal candidate for being the “thing” that calls Data Masker – this could also be the Staging/Non-Functional/Pre-Prod environment, of course, so you could copy down PROD and then immediately invoke masking.

#3 Call the command line from Azure DevOps

In your pipeline steps you can specify the calling of an executable on the machine where the agent resides. Fortunately Data Masker has a wonderful command line available that you can call – you can read all about it here: https://documentation.red-gate.com/dms/data-masker-help/general-topics/about-command-line-automation

You could, of course, dynamically replace the PARFILE with variables so that it only calls the relevant parameter file for that particular database – a nice benefit!

My PARFILE was very simple – it just called a local Data Masker masking set named “AzureFun”. The thing to bear in mind is that Data Masker will run with the Windows authentication credentials of the Azure DevOps agent, unless you specify otherwise. In this case, because the Azure DevOps agent already has the correct permissions to update the databases on this instance, I’m fine to use Windows Authentication.
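
For anyone who wants a starting point, a minimal sketch of the inline PowerShell that an agent job could run is below. The install path, executable name and argument format are assumptions on my part – check the command-line documentation linked above for your version of Data Masker, and swap the hard-coded paths for pipeline variables as you see fit.

```powershell
# Sketch only: call Data Masker's command line from the Azure DevOps agent machine.
# The install path, exe name and PARFILE location are placeholders – verify against the docs.
$dataMasker = 'C:\Program Files\Red Gate\Data Masker for SQL Server\DataMasker.exe'
$parFile    = 'C:\DataMasker\AzureFun.parfile.txt'   # could be swapped for a pipeline variable per database

& $dataMasker "PARFILE=$parFile"   # runs under the agent's Windows credentials unless the PARFILE says otherwise
if ($LASTEXITCODE -ne 0) { throw "Data Masker exited with code $LASTEXITCODE" }
```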

Conclusion

It’s very easy to simply call the command line of Data Masker for SQL Server directly from Azure DevOps – but does this same approach work from other CI/CD tools? If they can call executables on the target server, then absolutely! So it’s very easily included in the process – you just have to think about where Data Masker is installed and what credentials you’re using for it!

Bonus Point – what about if it’s all Azure SQL Database?

You had to do it didn’t you, you had to say it!

“But Chris, now we know we can call this all from Azure DevOps, what if we wanted to mask and copy Azure SQL Databases into Dev/Test etc.?”

Well actually, the good thing is, it’s also pretty similar! When you’re connecting Data Masker to an Azure SQL DB you only need to specify this in the connection settings in the controller. Again, authentication will likely have to be SQL Auth at this point, you need to be in Cloud mode, and I’d recommend setting the connection timeout to 10s rather than the standard 5s – but it can still be called as normal via the PARFILE.

So the Data Masker element is reasonably straightforward – that’s the good news. But the thing you REALLY need to stop and think about is:

Where are our Dev and Test copies going to BE?

Option #1: If they’re going to be on VMs or local dev and test servers / developer machines, then you could follow a similar approach to the one I laid out in this blog post for Redgate, in which you create a BACPAC file and split it out on premise before importing it and then provisioning from there. You could use this code in my GitHub to achieve something very similar. Caveat: I am no PowerShell guru – who do you think I am? Rob Sewell? Chrissy LeMaire? No. Sadly not. You can build your own logic around my code though – have at it, I don’t mind! ^_^
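
As a very rough illustration of the first hop in that approach – exporting the Azure SQL Database to a BACPAC before importing and provisioning it on premise – something like the below could sit at the start of the process. The SqlPackage path, server, database and credentials are all placeholders, and the real splitting/import/provisioning logic lives in the blog post and GitHub code linked above.

```powershell
# Sketch only: export an Azure SQL Database to a BACPAC ready for the on-premise import.
# The SqlPackage.exe path, connection details and output path below are placeholders.
$sqlPackage = 'C:\Program Files\Microsoft SQL Server\160\DAC\bin\SqlPackage.exe'
$bacpac     = 'C:\Temp\MyAzureDb.bacpac'

& $sqlPackage '/Action:Export' `
    "/SourceConnectionString:Server=tcp:myserver.database.windows.net,1433;Database=MyAzureDb;User ID=deploy;Password=$env:DB_PASSWORD;Encrypt=True;" `
    "/TargetFile:$bacpac"

if ($LASTEXITCODE -ne 0) { throw "SqlPackage export failed with exit code $LASTEXITCODE" }
```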

Option #2: Keeping everything in Azure. You can copy databases around in Azure and it seems to work pretty well! So I wrote this PowerShell (also in my GitHub for y’all) to effectively copy a PROD DB into the same resource group, mask it and then copy it across to a Dev/Test resource group, dropping the temp copy so as not to incur lots of extra Azure costs (this is just one of the methods I’ve seen people use – again, it’s up to you!). Again, see the caveat in option #1 above for my statement on PowerShell! The good thing is, you can simply use ‘&’ from PowerShell to call Data Masker’s command line – there’s a rough sketch of the whole flow below.
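
Here is a heavily simplified sketch of that copy → mask → copy → drop flow using the Az PowerShell module – it is not the script from my GitHub, the resource group, server and database names are placeholders, and the Data Masker call carries the same caveats as earlier.

```powershell
# Sketch: copy PROD alongside itself, mask the copy, copy it to Dev/Test, then drop the temp copy.
# All names are placeholders; database copies are asynchronous in Azure, so a real script should
# poll until each copy is online before moving on.
Connect-AzAccount | Out-Null

# 1. Copy PROD into a temporary database on the same server
New-AzSqlDatabaseCopy -ResourceGroupName 'rg-prod' -ServerName 'sql-prod' -DatabaseName 'MyDb' `
                      -CopyDatabaseName 'MyDb_MaskMe'

# 2. Mask the temporary copy by calling Data Masker's command line (see the earlier caveats)
& 'C:\Program Files\Red Gate\Data Masker for SQL Server\DataMasker.exe' "PARFILE=C:\DataMasker\AzureFun.parfile.txt"
if ($LASTEXITCODE -ne 0) { throw 'Masking failed - not copying anything to Dev/Test' }

# 3. Copy the masked database across to the Dev/Test resource group and server
New-AzSqlDatabaseCopy -ResourceGroupName 'rg-prod' -ServerName 'sql-prod' -DatabaseName 'MyDb_MaskMe' `
                      -CopyResourceGroupName 'rg-dev' -CopyServerName 'sql-dev' -CopyDatabaseName 'MyDb_Dev'

# 4. Drop the temporary masked copy so it doesn't rack up extra Azure costs
Remove-AzSqlDatabase -ResourceGroupName 'rg-prod' -ServerName 'sql-prod' -DatabaseName 'MyDb_MaskMe' -Force
```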

Either of these options can also be run from Azure DevOps as part of your provisioning or working processes – instead of including a call to the command line in the pipeline step, you simply run the PowerShell script instead.

Second Conclusion *sigh*

There are lots of ways to get what you need into Dev and Test, but these copies should be masked if they contain personal, identifying information. There are some methods above, but there are plenty of others out there on the internet, and if you’re not sure about getting started with data masking, try my post here – happy masking!

Provisioning local Dev copies from AWS RDS (SQL Server)

“It’s still magic even if you know how it’s done.”
Terry Pratchett

For a long time now I have worked heavily with Redgate’s SQL Provision software. I was involved from the very beginning, when SQL Clone was but a few months old and just finding its little sheepish (teehee) legs in the world, and before Redgate Data Masker was “a thing”.

The one thing I’ve never comprehensively been able to help a customer with, though, was PaaS. Platform as a Service has plagued my time with this wonderful software, because, helpfully, you simply cannot take an Image (one of SQL Clone’s VHDX files) directly from an Azure SQL Database or from an Amazon AWS RDS instance.

But then in January 2019 I did some research and wrote this article on how you could achieve provisioning from Azure through the BACPAC file export method. This was great, and several customers decided it was good enough for them and adopted it – in fact, they completely PowerShell-ed out the process (there are links to something similar in my GitHub, which I used for my PASS Summit 2019 demo). However, this never solved my AWS problem.

I’ll be the first to admit, I didn’t even try. AWS for me was “here be dragons” and I was a complete n00b; I didn’t even know what the dashboard would look like! However, in early December 2019 I was on a call with a customer who mentioned that they would like to provision directly from RDS SQL Server, without any “additional hops” like the BACPAC Azure method. On the same day, Kendra Little (sorry Kendra, you seem to be the hero of most of my blogs!) shared some insight that it was possible, with AWS, to output .bak files directly to an S3 bucket. That got me thinking: if we can get access to a .bak file directly from S3, surely we could provision it all the way to dev with little-to-no involvement in the process?

My reaction to this news was that it was the universe telling me to get off my backside and do something about it. So, with renewed determination, I set off into the world of AWS.

1 – Setup

Now naturally, I am not a company. Shock. So I don’t have any pre-existing infrastructure available in AWS for me to tinker with, and that was the first challenge. “Can I use anything in AWS for free?” – The answer? Actually, yes! AWS has a free tier for people like myself who are reeeeeally stingy curious, which at the very least will let me better understand how to interact with the various moving parts for this.

First step. I’m going to need a Database in RDS, so I went to my trusty DMDatabase (scripts here for y’all) which I use for EVERYTHING, on-premise, in Azure, EV-ERY-THING.

In AWS I went to RDS and set up a SQL Server Express instance called dmdatabaseprod (which fortunately kept it on the free tier). Luckily, AWS provides an easy getting-started tutorial for this, which you can find here – why re-invent the wheel? After creating the DB I had some major issues actually connecting to it in SQL Server Management Studio; I thought I had allowed all the correct ports for traffic, put it in the right security groups, blah blah blah… and guess what it was?

Public accessibility. Was set. To “No“. *cough* Well, that problem was sorted quickly, so it was on to the next challenge.
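
For anyone who prefers a script to clicking through the console, a roughly equivalent instance could be created from the AWS CLI with something like the below – I did it through the console, so the identifier, instance class, credentials and storage here are just placeholders, and note the --publicly-accessible flag that caught me out (see `aws rds create-db-instance help` for the full option list).

```powershell
# Sketch: create a similar SQL Server Express RDS instance via the AWS CLI (values are placeholders).
aws rds create-db-instance `
    --db-instance-identifier dmdatabaseprod `
    --engine sqlserver-ex `
    --license-model license-included `
    --db-instance-class db.t3.micro `
    --allocated-storage 20 `
    --master-username admin `
    --master-user-password 'ChangeMePlease123!' `
    --publicly-accessible        # the setting I forgot - without it SSMS can't reach the instance from outside
```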

2 – Backing up to an S3 bucket

I can take no credit for this step whatsoever because it was the wonderful Josh Burns Tech who saved me. He created a video showing exactly what I wanted to do and you can see this, with full instructions and scripts here: https://joshburnstech.com/2019/06/aws-rds-sql-server-database-restore-and-backup-using-s3/

After following the advice of Josh and walking through his steps – getting a new S3 bucket set up and configured, and creating a new backup of my DMDatabase – I was a good step of the way there! My .bak was nicely sat in my S3 bucket – marvelous!
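
For reference, the backup itself boils down to calling the msdb stored procedures that AWS exposes once the SQLSERVER_BACKUP_RESTORE option group is attached (exactly what Josh’s guide sets up). Below is a minimal sketch, with the endpoint, credentials, database name and S3 ARN as placeholders, wrapped in Invoke-Sqlcmd so it can be scheduled from PowerShell.

```powershell
# Sketch: run a native backup of the RDS database straight into the S3 bucket, then check on the task.
# Endpoint, credentials, database name and ARN are placeholders.
$rds = 'dmdatabaseprod.xxxxxxxxxx.eu-west-1.rds.amazonaws.com'

Invoke-Sqlcmd -ServerInstance $rds -Username 'admin' -Password 'ChangeMePlease123!' -Query @"
EXEC msdb.dbo.rds_backup_database
     @source_db_name           = 'DMDatabase',
     @s3_arn_to_backup_to      = 'arn:aws:s3:::my-dmdatabase-bucket/DMDatabase.bak',
     @overwrite_s3_backup_file = 1;
"@

# Poll this until the task reports SUCCESS
Invoke-Sqlcmd -ServerInstance $rds -Username 'admin' -Password 'ChangeMePlease123!' `
              -Query "EXEC msdb.dbo.rds_task_status @db_name = 'DMDatabase';"
```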

3 – Making the S3 bucket visible to SQL Server

This was the tricky bit. My approach to solving this problem was “I need SQL Server to be able to see the .bak file to be able to create an image and clones from it. So, logically, I need it to be mapped as a network drive of some kind?” – simple, no? It turns out that this was the best approach from what I found online, but there were a number of ways of tackling it.

I started out using this article from Built With Cloud, which was super informative and helpful. I managed to get rClone running and the S3 bucket was showing as a local drive, which was exactly what I wanted.

But I ran into a problem – SQL Server could not access the mapped drive.

So is there another way? I found a bunch of resources online for CloudBerry, TnT Drive and MountainDuck but, like I mentioned, I’m on a very limited budget ($0), so naturally… I put this on Twitter. I received a tonne of replies giving some examples and ideas, and the one idea that kept coming up time and time again was AWS Storage Gateway. I had never heard of it, nor did I have any idea of how it worked.

So. Back to Google (or in my case Ecosia, it’s a search engine that plants trees if you search with them, what’s not to love???)

To simplify it: Storage Gateway is a solution that is deployed “on-premise”, i.e. as a hardware gateway appliance or a virtual machine, and it allows you to effectively use your S3 (or other AWS cloud storage service) locally by acting as the middle-person between AWS and your on-premise systems, with fancy local caching for super low latency network and disk performance. There are a few different types you can utilize, but for this exercise I went with “File Gateway”. From Amazon: “A File Gateway is a type of Storage Gateway used to integrate your existing on-premise application with the Amazon S3. It provides NFS (Network File System) and SMB (Server Message Block) access to data in S3 for any workloads that require working with objects.”

Sounds ideal. Time to set it up!

I have access to VMware Workstation Pro on my machine, so I downloaded the OVF template for VMware ESXi and loaded it up in VMware (the default username and password threw me a little, but it turns out it’s admin and password as standard, and you can change it as you configure).

Then it was a bit of a checkbox exercise from there.

Now I wasn’t 101% sure of exactly how best to set up my fancy new Gateway, so fortunately I found this super helpful video, funnily enough from Teach Me Cloud as opposed to the aforementioned Built With Cloud, and although it was a little out of date, I also had one of Redgate’s finest engineers (the wonderful Nick) on hand to help out. Between the video and us (mostly Nick) we were able to get everything connected!

But I ran into the same problem. SQL Server couldn’t access the backup file.

Fortunately though, after some frantic Googling, we managed to find a very straightforward article that fixed all of our pain! We needed to map the drive in SQL Server itself – thanks Ahmad! Now, yes, I did use XP_CMDSHELL (insert DBAReactions gif at the mere mention of it here), but this was for testing purposes anyway – I’m sure there are other ways to get around this problem!
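
In case it saves anyone the same frantic Googling, the fix boils down to mapping the share from inside SQL Server, so the drive exists in the context of the SQL Server service account rather than just your own session. A hedged sketch is below – the share path, drive letter and instance name are placeholders, and xp_cmdshell should really be switched back off once you’re done testing.

```powershell
# Sketch: map the File Gateway share from *inside* SQL Server via xp_cmdshell, so the
# SQL Server service account can see the backup file. Paths and names are placeholders.
$sql = @"
-- enable xp_cmdshell (remember to turn it back off afterwards!)
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 1;           RECONFIGURE;

-- map the Storage Gateway share as a drive visible to the SQL Server service
EXEC xp_cmdshell 'net use S: \\my-storage-gateway\my-dmdatabase-bucket /persistent:yes';

-- sanity check: can SQL Server see the backup?
EXEC xp_cmdshell 'dir S:\';
"@

Invoke-Sqlcmd -ServerInstance 'MyImageServer\SQL2019' -Query $sql
```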

…and guess what? It worked. Huzzah!

My poorly named image “blah” (let’s pretend it says “HeyLookImAnImageFromAnRDSBackupFileArentIClever“) means I can now schedule my PowerShell process to create new images at the beginning of every week to refresh my DMDatabase clone environment, with no manual steps needed!

Conclusion

Whilst there are a number of steps involved, you can easily take advantage of some of the fantastic features offered by AWS, like Storage Gateway, and even if your database is hosted in RDS you can fully provision copies back into IaaS (Infrastructure as a Service) or on-premise workflows, keeping your pre-production copies up to date and useful in development!

Just remember to mask it too!

P.S. I’m sure you could probably find some clever way of using the free rClone method I also mentioned and having it readable by SQL Server, but I haven’t figured that out yet – I’ll blog about it when I do!