5 (Noticeable) Business Benefits of Secure Database Provisioning

“Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives.”
William A. Foster

I know what you’re thinking.

Chris. Your title looks like it was written to be a corporate whitepaper that I see ads for when I’m browsing social media; it should have a CLICK HERE button, a boilerplate photo of a smiling person holding a pen and it should say something like ‘executives hate them, find out their secret here!’

But something has become immediately obvious to me in the last few months, I still speak to people daily who are forced to:

  • Work in shared development models
  • Work on empty (schema-only) / heavily subset databases
  • Work on old, out of date and/or irrelevant data
  • Make decisions without knowing enough about their data or what they hold

When speaking to them though it becomes immediately obvious that the reason there is no dedicated option available for developers is actually not related to the “traditional” problems that one would expect. You would naturally assume that the reason for not refreshing these environments is because of the large amounts of space or time taken to refresh often enough, or even that ‘we simply cannot due to sensitive PII and regulatory concerns‘.

No. In fact it comes down to, as all things do, time and money.

paid make it rain GIF by Thalia de Jong

In the most recent State of Database DevOps report (2020 that is), a whopping 70% of 2000+ respondents replied that they were using a shared development database and this comes with a whole heap of associated problems, like poor code quality, looser controls around sensitive data and defective deployments. Just these figures alone already point to the solution being to spin up copies for developers on demand and it’s not like we can’t do that. There is SO much technology in the world, across almost all database platforms, that will allow us to virtualize, containerize, sanitize… (effectively all of the ‘izes‘) our databases so that we can have full, safe, realistic copies as frequently as we like. So what is stopping us?

From experience, it’s justification*. People going to senior stakeholders and saying “we need this technology” and hearing a cacophony of classic business challenges back: “but is it broken?”, “do we really NEED it?“, “it costs HOW much!?!“, “how much time will it take to implement?” etc… It’s dev and test hygiene, not a sexy major modernization project like using Azure Arc, using Blockchain or creating Artificial Intelligence. Who cares that developers have to share a database? We’ve got bigger Tofish to fry!

*Sometimes, but much less frequently, it’s down to complexity of implementation – but we’ll leave that one aside for now!

As you will know from my last post on why now is the time to adopt better working practices, it’s important for us to highlight the gains that can be made from newer, updated practices, and why now is not the time to be closing our minds off to a better way of life. It’s not going to be easy to sum this up in 5 points, and there are many other benefits to solid database provisioning but these are in my opinion, the ones that will revolutionize the way you develop.

Very important side note for this blog post: there are lots of subjective key practices, processes and tools that can form part of the “database provisioning process” specifically and they will vary wildly by experience, opinion and company – so for the purposes of the below I will be describing the benefits of a process that involves 3 primary components / steps, given these are the three I tackle most often:

  • Data Identification and Classification / Cataloging
  • Data De-Identification i.e. Data Masking
  • Data Provisioning i.e. Real Time Database Cloning / Provisioning

1 – Increase developer happiness / contentedness

Developers are employed to do 1 thing: innovate. It’s even in the name! Developers are on the cutting edge and are focused on providing value to end users as quickly and efficiently as possible, with shortened release cycles, incremental stories and optimized workflows they can produce this innovation. But a big part of the story is the setup.

Even if you’re working to a more agile methodology it is hard to deliver and test changes which are, in development environments, fundamentally destructive and experimental if you are sharing a workspace with multiple colleagues. Writing on shared Word documents can be frustrating at the best of times, so how can developers be expected to produce high-quality, rigorously tested, game-changing code when at any minute another developer can take the environment down, cause it to run slowly, or overwrite those changes with their own? When you cannot produce changes in an isolated, sandbox environment where they can be individually assessed, re-worked and improved then you have no guarantee that it should be promoted.

All of these sound like arguments that are focused around the production of code, but in fact these issues can all have a huge impact on something that is widely under regarded and scrutinized: developer happiness.

Developers are the people who make stuff go, and without them feeling content and valued in their roles, we can’t expect our productivity and product quality to reflect that – so when developers witness the poor management of their code, something they have worked so hard on as it goes sliding down the priority list or gets rolled back or overwritten etc. they don’t feel motivated to continue doing the best that they can do.

With dedicated environments for dev and test, for different branches, pull requests etc. developers can finally work on innovative and exciting projects, and optimize the code that goes out the door to end users.

2 – Develop a common language about data & make better decisions

It’s very hard to speak about things when you use different language to describe the same thing. That much is obvious. In the United Kingdom alone we have many different words for bread rolls. So when someone comes into a sandwich shop in London and asks for a “Stotty”, can you guarantee that the person serving will know exactly what they mean, exactly when they say it?

The Office Reaction GIF

No. There will be a gap where some translation will be required: some “down-time“, if you will. Now imagine taking something as simple as a bread roll and applying it to an enterprise data estate… you’re going to have a very bad time.

As I talked about in my blog posts here (importance of database classification) and here (classifications role in DevOps) before you can really make a fully informed decision about your data, you must know 2 simple things:

  1. What data you hold
  2. Where your data is

I should hurriedly add that I don’t just mean sensitive data now – all data deserves to be classified because whether you’re a full stack developer adding a column to a table you’ve never used before, an auditor trying to carry out a Data Protection Impact Assessment (DPIA) and trying desperately to include the database, or you’re a BI developer setting up some new reports or processes, you’re going to need to know about the data. This is where people have questions, and this is where you shouldn’t have to reply on anecdotal knowledge or being pushed around from one person to another at the company who supposedly “might be able to help“.

Better insights into data leads to better practices, less waiting (waste reduction) and greater insight. When we then act on this insight we move faster and deliver greater value in our pipelines.

Have you picked up on the trend yet? How all of these are going to end? Well don’t spoil the ending for those who haven’t, they’ll have to wait fort he conclusion!

3 – Move faster and better enable the DevOps pipeline

It’s apt that I’m listening to an amazing EDM remix of the Green Hill Zone from Sonic when writing this section, but isn’t this just what we need as a business? We want to be able to move faster, or to put it in more ‘agile’ terms, we need to be able to pivot and adapt with only a moments notice. Until now, the database has been a monolithic and difficult to steer behemoth, and it shows in our processes.

Yank Tug Of War GIF by BEERLAND

A tangible example of what I mean when I say “move faster”, is branching. It’s fairly commonplace now for a developer to be able to clone a repository and checkout a specific branch, create new branches etc. without fear of switching between those branches and what it might entail. On a dev environment, especially when one is working database-first with your changes (it does make sense to know how the changes will impact the database first – that’s all I’m saying) it is, without a reasonable process in place, exceptionally difficult to easily switch between branches and keep work separate.

This often forces developers to stick to one environment when changes are all made in tandem and can play havoc when it comes to capturing those changes in the right place – a manual state-based comparison of a dev database with multiple branches of work on it to a target upstream could be disastrous.

This is why taking advantage of something like database virtualization, allowing you to spin up copies of databases in seconds, could be the answer. You can automate the provisioning of environments as githooks, during Pull Request automation or as release candidates and the experience will be exactly the same across the board – boom *code base*, fresh and ready to go. When developers can move fast, value comes through a whole lot faster.

4 – Minimize space constraints on new copies, on premise or in the cloud

Space is always a big player in these conversations, and for some it’s enough to boil it down to “well just how much space can we save??” and that’s enough to put a dollar value on the ROI, and people storm ahead with a solution (that’s not always right for them).

But space is a very real problem, much as we (as technology professionals) like to believe that in these modern times of cloud-native solutions, easily scaled serverless-compute VMs and Big Data Clusters, we know there are still a LOT of people out there firefighting legacy, necessary technology and wrestling with what they CAN get out of backups or their SAN tech.

Even using cloud providers costs money, data egress and ingress costs $, BLOB storage costs $, additional security measures cost $. So it’s really not ideal when our databases, for historical reasons or by virtue of the sheer AMOUNT of data we hold and process, are 5, 10, 50, 100TB+, because we’re going to be struggling with this Dev/Test issue still for years to come.

As before with point 3, database virtualization has come of age and has now we have a lot of different solutions from containerization through DBaaS that can aid us in minimizing the amount of space that we ACTUALLY require, meaning we have less money that we need to pour into maintaining large, unwieldy Dev/Test environments or paying a large bill for the privilege of doing so in the cloud (and when developers will be using their dev machines anyway it just makes sense to see what we can do to leverage this existing hardware).

Whilst this one doesn’t directly add specific value to the end of the pipeline, or speed up this delivery, it can help reduce overhead costs associated with the infrastructure needed when providing this value.

5 – Work on realistic data without worrying about data breaches

This is probably one of the most obvious reasons given that I tend to blog about data regulations and compliance ALL THE TIME but I feel like I need to keep saying this.

If you remove all of the data from development and test database copies, this will not help with development and developers will have nothing meaningful to go on, nor any testing that isn’t limited to pre-defined values.

If you leave all of the data in development and test database copies, all you’re doing is duplicating your attack surface area and creating a lot of potential risks for that data to be surfaced where it shouldn’t be – on the internet, in screenshots, emails and of course, hacked.

So there needs to be a happy medium where we can have both the useful data that gives us insight and intelligence of a full data set, the business logic, trends, demographics etc. that we need during testing or analytics – but it should also be sanitized so that data subjects contained therein cannot be re-identified. Static masking, applied to lower environments allows us to retain the data with none of the data.

The Next Generation Data GIF by Star Trek

Protective measures can be built into the DevOps process from the very beginning as you’ve already seen right here on my blog; so as long as it is a part of the process, and we have multiple controls (or guard rails) that allow us to operate safely and quickly without fearing that same speed will cause us to release any sensitive information, allowing us to focus on one thing, value.

Conclusion

As you’ve seen above, it all comes down to time and money but there are many ways to save and speed up within a DevOps process by means of a good, solid database provisioning process. Whilst none of these reasons comes with a fixed ROI (unless you have ALL of your pre-prod database storage costs to hand) they contribute to something far better than that:

The ease of delivering value.

In a world where we can be concerned about everything, and where it’s hard to keep up with emerging technologies – it makes sense to start pruning away blockers to the process, the problems that are stopping us from delivering value faster – THAT is the theme and point of this blog post; our end users. We’re already delivering excellent value to them, we trust our developers and teams, but what’s stopping them from moving faster with database changes? Adopting a good provisioning process will mean you start to notice all of the above become true of your database development lifecycle.

Delays are not DevOps, delaying DevOps is worse: Why we need better working practices now more than ever.

“The time is always right to do what is right.”
– Martin Luther King Jr.

Over the past few months, we have been on lock-down. The product of a devastating and deadly disease that has well and truly stamped it’s legacy on human history forever. But it is out of these times that we receive a glimpse, a look into what is possible, and what humanity can do. It is out of this fight, out of these ever decreasing odds that we finally see what a combined effort can do, and what a focus on our fellow people can bring about. Don’t know what I mean, take a look at the Good News Network and subscribe (just like me) to see the best of us.

But it only works, we only triumph, when we work together.

Recent Example: Scientists at Oxford University have seen a tremendous breakthrough with their virus efforts and are making unprecedented strides towards a viable vaccine – but it involves an incredibly strong partnership with pharmaceutical companies and governments worldwide for staged testing, large scale results gathering and continuous improvement.

So. Why then in the last few months have I been speaking to people who say things like:

  • We have delayed our non-BAU process implementation, because we have seen a spike in usage, and we need all hands on deck.
  • We have been investigating tooling and processes to help our development teams, but this is on hold as we come to terms with this newer way of working.
  • Our teams are keen to adopt more agile ways of working, but they are overburdened at the moment, so we have decided to postpone any research into this for the next few months.

It is feasible, amid a global panic, that people and companies will do (and definitely have done) what comes naturally to us; that is to “bunker down“. We believe that if we shift all efforts from projects and ongoing testing / new processes, we can have all hands available to deal with anything that comes our way. Processes are established for a reason, right? Legacy methods of dealing with ad-hoc changes and semi-frequent deployments, waterfall-esk development cycles and decade(s) old systems represent the familiar, the safe… Supposedly.

Now so, more than ever, it is time to actually change direction and to put more effort into some of the key principles and processes that will lead us to DevOps nirvana; it is this trinity of people, processes and tooling that can ultimately be the salvation for many global companies as they try to maintain their agility and competitiveness within an uncertain, shifting post-pandemic international market. There are many reasons why I, and many others, believe this but I have detailed 3 key reasons below:

1 – Delaying DevOps creates waste and costs businesses money

DevOps is a culmination of learning, experience and effort and it cannot be classified as one single thing, however it is possible to define a number of things that DevOps is and what it most certainly is not. One of the things that DevOps is, is “the constant delivery of value to end users“; the idea that by adopting certain technical measures and working practices we can minimize the time to delivery of new features and functions, which equates to greater value for us, our end users and significantly more agility to shift in different directions as required.

These ideas of flow and value streams are covered quite nicely by Lean IT, which extended from Lean Manufacturing principles and it is nicely explained here (and I would highly recommend you read The Phoenix Project if you haven’t already) – but the purpose of it (in a similar vein to agile principles) in this sense, is that it defines a number of things that don’t add value to the resulting product or service. These ‘things’ are referred to as waste and this is precisely what we should be looking to remove from our existing, legacy processes because why would we work on anything that doesn’t deliver any value?

There are a few different types of waste but I want to highlight three important ones that often exist as a result of legacy processes still being in place:

  • Defects Includes lack of testing (poor execution) and hot fixing environments (unauthorized changes)
  • Waiting Including everything from waiting for refreshed environments to waiting for feedback/results and even manual processes like deployment approvals
  • Motion (excess)Effectively doing the same thing over and over again, fire-fighting problems that arise on a near daily basis, engaging and monopolizing resources constantly who could otherwise be working on other, more important or value-add tickets (for those of you who HAVE read the Phoenix Project, see Brent as an example!)

The product of this waste is very simple and it fits into 3 main buckets: poor customer experience, increased costs and lost productivity. All of these things boil down to one fundamental truth – bad practices cost us money and reputation, transformation now could help us prevent this, and people will remember us for stepping up when we needed to.

2 – The workforce is increasingly distributed/remote and needs to collaborate better

At the beginning of 2020, one of the biggest questions faced by companies all around the globe was “how do I find and retain talent?”; this is not a new question and had already been around for years. Companies restricted to their offices (base OR satellite) realized they were increasingly fighting for candidates in one of the most competitive markets served by an ever dwindling local pool of options.

This was a situation which necessitated companies to stretch beyond their existing capabilities to enable a better quality experience for remote and distance workers and/or teams – a situation which would later be exacerbated 100x-fold by the global pandemic crisis. Not only has this crisis confirmed that most companies, certainly those feeding software markets, can work remotely, it has also posed the question of if we should work remotely more often, and has fundamentally changed the way we as a workforce will continue to work in the future.

One thing is for sure, whilst we will try to “return to normal” as much as possible, normal has been forever changed and remote working and collaboration is here to stay. That’s the important word in play here, collaboration.

The spirit of DevOps as I’ve mentioned before is good quality communication, collaboration and accountability. But at the heart of those three ideas is visibility. In an office we can over-hear, we can drop-in or bring things up “over the water cooler”/”at the coffee machine”. In a remote working context, that isn’t possible. So we have to adhere to 2 of the most important practices in modern day software development: transfer knowledge and record decisions.

When we work and communicate better in a remote/distributed workforce, and use tools and processes available to us, people don’t make unauthorized changes, or make decisions that affect you that you weren’t aware of until 3 weeks after they were made. It becomes easier to make decisions and generate better work faster, rather than being paralyzed by indecision and uncertainty as to whether you hold the latest version of the truth or if it is outdated. When we adopt the right processes and tools into our DevOps methodology, we know for certain what we should be doing, and why we should be doing it.

Process-wise this can easily take the form of common functions, many already at the disposal of teams when remote; stand-ups, retrospectives, mob- and pair- programming, OKR and sprint planning, there are lots of different ways for us to work well and know what we’re supposed to be working on at all times (and why). Tooling-wise we can then match these how we will be doing something with the respective record of what is being done, what decisions have been made and crucially, why. Using work management software like Trello, bug/feature tracking software like JIRA or Azure DevOps work items, source controlling everything (even having a strong branching and merging strategy to control workflow) with rigorous testing routines, policies and pull requests and automation all lead to better informed, happier*, well-performing developers with a crucial sense of purpose.

*Important side note: It is also crucial that we don’t simply lose ourselves in the business benefits completely – developers, testers etc. are all human and we all crave job satisfaction and happiness in our roles. Yes you might be able to increase your deployment frequency, minimize costs or complaints, but nothing compares to a satisfied, motivated team, which DevOps can help breed and inspire.

3 – DevOps breeds innovation and improves company performance, with a tangible return on investment (and not just financial!)

Automation is one of the single greatest ways we can modernize our processes, and is often the first principle we think of when adopting DevOps practices; taking something that is manual or held together by legacy scripts that forever fail and cause outages, and instead continuously integrating, continuously testing and continuously improving using the latest processes and tooling available to us. Automation allows us to create high cycle rates, enhance and multiply the feedback options we have within our pipeline(s) and allows us to reduce manual concerns and issues, to dedicate teams to the very thing they were employed to do: innovate. We only need watch how Netflix does DevOps to realize what we can unlock.

On a weekly basis I discuss existing processes with developers all over the world and one key trend always emerges that we need to focus on: they have a process that is currently manual, and it needs improving. I have lost track of the number of times I’ve been told about a process where developers generate scripts, manually test themselves and then “do x” with it, whether that be just deploying to Production themselves (without review), or putting it in an ever mounting pile of scripts in a folder on a file share for someone to sift through periodically.

Across every single one of the conversations mentioned above that I have, there is not a single discussion that doesn’t include some kind of quantifiable cost to the business, whether that be downtime, customer refunds, regulatory penalties and even in extreme cases, high developer churn. All can still be expressed in terms of 2 things: Time and Money.

If we remove those roadblocks for our developers, if we give them tools to enable them to more easily do their jobs, we put in place processes that allow them to more easily deliver that innovation, and tight automated controls to remove error-prone, manual jobs – we end up with something more akin to harmony. By this of course I mean the “constant delivery of value to end users“. This creates a more positive user experience, allows us to respond more quickly in an uncertain market and make decisions on what we should or, just as importantly, shouldn’t do, faster.

The world at the moment is a very uncertain place and has destroyed jobs, companies and whole industries. We should expect that consumer confidence is at an all time low and as we all come out of lock-down across the globe, we should be prepared to metaphorically “put our best foot forward” to help our developers believe in what we’re doing, stimulate faith in our industries and ultimately deliver more value bidirectionally.

But this idea only works if we do this now.

Conclusion

DevOps is more than just “picking up some new tools” or “rolling out agile” to development teams. It is fundamentally a mindset change that can drastically and fundamentally alter the underlying motivations and thinking within an organisation, allowing you to focus on the most important thing – delivering value, faster.

There are always times where delaying large scale roll-outs is a pertinent decision to make, and a hard one at that! But DevOps seeks to unify every part of of the development cycle; giving you greater visibility, communication, accountability and control, with maximum flexibility to test, validate and even pivot where needed.

But the time to do DevOps is not “once everything improves“, “once we get back to normal” or “when we have more time“… it is now. Now is the opportunity we all have to capitalize on the wave of change we have been consistently waiting to implement, to strengthen our position and future growth in our markets. DevOps is how we can come out on the other side of this disaster ready to embrace new technologies and ways of thinking, to respond to our customers needs, and deliver value and speed at scale.

3 methods for seeding test data during CI builds with Flyway

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
Sir Arthur Conan Doyle, Sherlock Holmes

EDIT: Modified 23/12/2020 to include updates to Flyway in v7, method 4 below

Can you tell I’m loving Flyway at the moment? Well I am. It’s JUST SO GOOD! Honestly there are so many things you can do with it! Don’t know what I’m talking about? Check out my posts on xRDBMS DevOps with Flyway and tSQLt unit tests with Flyway and you’ll see what I mean!

As a result of the above posts though I was asked a question that I had to think about for a little bit before having the best possible answer, how can we seed some testing data INTO the build database so that we can run some meaningful tests against it?

This makes perfect sense to me, but there’s also a few different ways to do this – so let’s go fly(way)!

flying i believe i can fly GIF

1 – Test Data Migration Scripts

In my previous posts on Flyway (above) I talked about having an entirely separate build folder present within the repository, and a folder of test migrations alongside our schema migrations – I called these the Build_Config folder, (containing the build configuration file) and the Test_Migrations folder (unsurprisingly containing testing migrations) in the _Migrations location:

I was using the same build config for 2 purposes; 1) to build the schema migrations from the base version, by passing it the Schema_Migrations location dynamically and 2) then building the tSQLt framework and testing objects by passing it the Test_Migrations location dynamically.

This actually worked surprisingly well, but even beyond this – the same method can be repurposed, or added to, by augmenting your testing scripts and adding a data insertion task (as an additional script or group of scripts). In my folder, I can simply add a migration like this:

Because of course I like dogs.

lana del rey yes GIF

and once pushed to the repository and the build has run we should be able to verify our testing data is present:

A bonus win for this step of course, is that where Devs have their own Flyway config files locally for their development databases they could also overwrite this behavior and point the testing and/or data scripts at their own database so they have some seed data to work with too!

2 – Add a data generation step to the pipeline

There are SO MANY technologies out on the inter-webs for generating data. SO MANY. Many of them also have a command line or PowerShell module that we can use to easily invoke them against a target, especially if that target is going to be persistent like my Flyway Azure SQL Build DBs!

Because I have access to it and because I’m using essentially SQL Server DBs, I could easily use Redgate SQL Data Generator – but to get the data you need you could use anything from DBATools Data Generation (also SQL Server) to FillDB for MySQL (which looks awesome and you could easily use this for Step 1 above too!)

There are numerous ways to invoke tools and applications and fortunately good CI/CD tools like Azure DevOps offer multiple ways to, for instance, run PowerShell or CLI steps from within the pipeline – so we could easily invoke SQL Data Generator on a VM or physical machine we have an Azure DevOps agent on – but this thinking also opens up the possibility of using something like Chocolatey to dynamically install the software on the Azure DevOps hosted pool VM during build (for the Redgate tools at the moment I suppose you’d need a Windows VM).

sassy pants chocolate GIF

I will be writing a future blog post about this step because it sounds _very_ interesting, but I’m not sure yet what can be done specifically using Chocolatey or if I’ll have to look elsewhere, although I have read this post in the past (thanks Paul!) detailing limitations and a great workaround using Azure DevOps, so it’s likely that’ll be my first port of call!

Just to give you an idea of end result with SQL Data Generator specifically though:

3 – Use existing data, don’t generate

Ok this one is going to be controversial already, I can tell! Let’s all stay calm!

happy chill GIF

The best data to be tested is our data. What we have in Production is what will have these changes deployed to it… eventually! So shouldn’t we just test against that? Well. Maybe, maybe not depending on what is in there.

There’s a few methods to achieve this – my personal favorite would be to use a SQL Clone, spin that up on a build VM rather than using an Azure SQL DB, and we can have all the data in an instant. Of course if we hold any sensitive PII/PHI then we should ensure that is protected first!

Of course there are lots of other options, like restoring a backup or spinning up a container etc. and these can all just be a stage in the YAML file before invoking Flyway but the point is, if we use an existing copy of our Prod database from some source or another, it will have 2 things we really care about:

  1. Data. Ready to go, ready to test, ready to give us the best possible insight into our changes.
  2. The flyway_schema_history table. Instead of running EVERY migration we’ve ever written, which could take a while for a large team, we run only the latest migrations to check that they would deploy happily to the Production target.

To get this stage to work though, you would need to do a couple of things differently:

  1. The build DB would have to be created from the clone/backup/other every time instead of simply cleaning the schema down.
  2. You would need to remove the Flyway Clean step from the pipeline in my previous post, because it would otherwise drop all the tables (and then we wouldn’t have any data!)
  3. By extension, this also makes the callback to remove the tSQLt objects void, so you can remove that too.

4 (Bonus Method) – Script Migrations

In Flyway v7 the team added the ability to also run script Migrations and Callbacks which mean it is possible to invoke .ps1, .bat, .cmd, .sh, .bash, and .py files as part of the version control > build and migration process.

This means that you can include a script to invoke any loading or processing of data you may prefer – you could invoke a data generation utility, data masking and of course anything else that can be invoked with these file formats. A good example of this might be calling Data Generator as above, or you could use DBATools, DTM Data Generator or even a more platform agnostic approach by using a Hazy generator to produce and then load an incredibly realistic data set.

Conclusion

There are a lot of different ways to generate data, you can generate completely synthetic data, you can mask data or use Prod data, it’s up to you! Ultimately it will just for another part of your pipeline – just be careful of ordering! You don’t want to try generating data into a table that hasn’t been built yet.

Respect your YAML file and you’ll get schema, data and unit tests and this will lead to one thing. Greater insight, earlier.

thumbs up GIF

Using things weirdly – Part 2: Static Data in the Hybrid Model (w/ Redgate SQL Source Control and Change Automation)

“Where’s your will to be weird?”
Jim Morrison

I had some really positive feedback on my last “Using things weirdly” post, and truth be told, I love to use things weirdly. The number of times I’ve heard: “Oh, well, sure yeah I guess it works that way too…” is just too many to count. So imagine how my eyes lit up when I realized that you can do something weird with one of my favorite things to talk about at the moment, The Hybrid Model.

Now if you don’t know about the Hybrid Model then I would suggest you check out my post here that’s all about the different source control models available for your databases!

The Problem

Across both the State based and Migrations based offerings within the SQL Toolbelt, you have access to something very cool: the ability to easily (alongside the schema) source control your static data. Now don’t ‘@’ me because you think I should be referring to it as “Semi-Static” because it might change occasionally and ‘that’s not truly static then is it?‘; I could easily also refer to it as ‘Lookup Data’ or ‘Reference Data’, basically whatever you class things like “Country Codes” and “Currency Codes” as.

Whatever you call it, it can go into your VCS like any part of the database schema:

SQL Source Control: State Based
SQL Change Automation: Migrations Based

But. One thing that – as of writing this RIGHT NOW (23/07/2020 10:09am BST) – is not officially available in the Hybrid Model combining these two methods… is Static Data. The Data tab in SQL Change Automation even disappears when you set it up as a Hybrid workflow:

And this gives me a sad.

sad monty python and the holy grail GIF

The Solution?

Got your Hybrid Model setup and ready to go? Let’s use it weirdly!

1 – Use SQL Source Control to commit some static data to your state repository. This is as easy as right clicking on your (highlighted green) source control linked database and selecting “Other SQL Source Control Tasks” > “Link/Unlink Static Data“, and picking your tables.

Nothing should be showing in SQL Change Automation:

2 – Unlink SQL Change Automation from the state repository for a moment and link it instead directly to the development database. This will cause it to go into Migrations-First mode. You can do this by clicking the blue source name in SSMS and picking Existing Database instead:

Because it’s technically the same database as you’re source controlling in the state repository, it should all just work™ and should tell you there are no dev changes to the source. Then you’ll see the “Data” tab has been enabled:

3 – Select the same tables to source control as you did with SQL Source Control by using the Add Tables wizard:

BUT WAIT!!

shocked oh my god GIF

Isn’t it now going to generate a migration for our static data?? This isn’t included in the baseline or anything at all so far, so is it going to try and insert all of my static data into tables later on that already have it?

No. Actually we’re fine!

4 – Generate the static data migration script (and look at it for peace of mind). Notice that the script actually has checks in there – because we’re newly adding these tables, the migration will check if the tables are empty before trying to run the script, and only AFTER this migration will SQL Change Automation start generating the differential, incremental static data migrations:

Commit this migration script to your migrations repository, and that’s all we need to do here!

5 – All that’s left now, is to re-hook-up the Hybrid pipeline, follow the same steps you did before in Step 2 but this time, instead of an existing database, just link it back to the state repository like it was before. If you’ve done this right, you should see no changes pending for migrations:

BUT you will notice that if you change any static data with SQL Source Control, it should now show up in SQL Change Automation!

Change to static data prior to commit in SQL Source Control
Change before migration script generation in SQL Change Automation
Generated migration script to be committed

Conclusion

Is it an intended use? No absolutely not, the reason it’s disabled is that with all things at Redgate they are considered heavily to ensure users are offered the best possible user experience, functionality and essentially something that meets requirements across the board.

But. We can use it weirdly to, as i say, just make it work™.

What have you used weirdly lately? Let me know!

Flyway and tSQLt – migrating to warmer test climates

“If you truly have faith in your convictions, then your convictions should be able to stand criticism and testing.”
DaShanne Stokes

Welcome fellow TestDriven-Development enthusiasts… is what I would say if i actually ever did TDD and didn’t just, you know… write regular unit tests after the fact instead.

I’m going to be honest, I love the idea of TDD but have I ever actually been able to do it? No. Have competent developers been able to do it successfully? Yes, of course. Don’t know anything about TDD? You’re in luck! Click here for an introduction (don’t worry though, THIS post is not going to be about TDD anyway, so you can also keep reading).

But one thing we can all agree on is that testing is pretty important. Testing has evolved over the years though and there are a million-and-one ways to test your code, but one of the most difficult and frustrating things to test, from experience, is database code.

gilmore girls shot of cynicism GIF

Some people argue that the days of testing, indeed, the days of stored procedures themselves are gone and that everything we do in databases should be tested using a combination of different logic and scripting languages like Python or PowerShell… but we’re not quite there yet, are we?

Fortunately though we’re not alone in this endeavor, we have access to one of the best ways to test T-SQL code: tsqlt. You can read more about tsql at the site here but in short – we have WAYS to test your SQL Server* code. The only problem is, when you’re using a migrations approach… how?

*There are also many ways to unit test code from other RDBMS’ of course, like utPLSQL for Oracle Database or pgTAP for PostgreSQL – would this method work for those? Maybe! Try adapting the method below and let me know how you get on!

I’ve already talked about how implementing tests is easier for state based database source control in a previous post because we can easily filter tests out when deploying to later stage environments, however with migrations this can be a real pain because you have to effectively work on tests like you would any normal database changes, and maybe even check them in at the same time – so ultimately, they should be managed in the same way as database schema migrations… but we can’t filter them out of migrations or easily pick and choose what migrations get run against test and Prod, without a whole lot of manual intervention.

Basically. It’s a mess.

mess fail GIF

But during my last post about Flyway I was inspired. This simple and easy to use technology just seems to make things really easy and seemingly has an option for EVERYTHING, so the question I started asking myself was: “How hard would it be to adapt this pipeline to add unit tests?” and actually although there were complications, it was still easier than I thought it would be! Here’s how you can get up and running with the tSQLt framework and Flyway migrations.

1 – Download the scripts to create the tSQLt framework and tests from the site

Ok this was the easiest step of them all, largely because in the zip file you download from the tsqlt website all you have is a set of scripts, first needed to enable CLR and the second to install the tsqlt framework:

As part of my previous pipeline I’m actually using Azure SQL Database as my development environment, where RECONFIGURE is not a supported keyword and where we don’t need to run the CLR script anyway, so all I needed was the tSQLt.class.sql file.

The good thing about this is that we can copy it across into a migration and have this as our base test class migration, and then any tests we write on top of it will just extend it – so as long as we remember to update it _fairly_ frequently with any new tsqlt update, we should be fine! (Flyway won’t throw an error because these are non persistent build objects, so no awkward checksum violations to worry about!)

2 – Adapt the folder structure in the repository for tests

I added 2 new folders to my _Migrations top level folder, a Schema_Migrations folder and a Test_Migrations folder. When you pass Flyway a location for migrations, it will recursively scan folders in that location looking for migrations to run in order. I copied the migrations I had previously into the Schema Migrations folder and then my new tSQLt creating migration into the Test Migrations folder. This allows them to be easily coupled by developers, whether you’re writing unit tests or practicing TDD:

You’ll have noticed I called my base testing migration V900__ – this is because I do still want complete separation and if we have a V5 migration in schema migrations and a V5 testing migration, we’re going to have some problems.

3 – Add a callback to handle removal of the objects

As I was putting this together, I noticed that I could use flyway migrate to run the tSQLt framework against my Dev database, but every time I tried to then flyway clean that database I got a very nasty error stating that the tSQLt assembly could not be removed because of dependent objects.

Flyway does not handle complex dependencies very well unfortunately, that’s where you’d use an industry leading comparison tool like SQL Compare so, with some advise from teh wonderful Flyway team, I set to work on a callback. A callback is how you can hook into Flyway’s own processes, telling it to do something before, during or after certain commands. In my case we were going to remove all of the tSQLt objects prior to running Flyway clean to remove the rest of the schema. To make it future proof (in case objects are added or removed from the tSQLt framework), I wrote a couple of cursors to go through the different objects that were dependent on the assembly and remove them, rather than generating a script I know to have all of the tSQLt objects in right now. You can find the code for the callback in my GitHub here, you are welcome to it!

Animated GIF

All you have to do is name it beforeClean.sql and ensure it is in the directory with your other sql migrations so that it will pick this up and run it – I put it in my Test_Migrations folder, because I only want it to run this callback when cleaning the build DB, as this is the only place we’re utilizing automated unit tests… for now!

4 – Update the Azure DevOps pipeline

I’ve got my callback, I’ve got my tSQLt migration and the folder structure is all correct and is pushed to Azure DevOps but naturally it is breaking the build *sad* but fortunately all we now have to do is update the YAML pipeline file:

trigger:
- master

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: DockerInstaller@0
  inputs:
    dockerVersion: '17.09.0-ce'
  displayName: 'Install Docker'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run -v $(FLYWAY_LOCATIONS)/Test_Migrations:/flyway/sql -v $(FLYWAY_CONFIG_FILES):/flyway/conf flyway/flyway clean -enterprise
  displayName: 'Clean build schema'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run -v $(FLYWAY_LOCATIONS)/Schema_Migrations:/flyway/sql -v $(FLYWAY_CONFIG_FILES):/flyway/conf flyway/flyway migrate -enterprise
  displayName: 'Run flyway for schema'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run -v $(FLYWAY_LOCATIONS)/Test_migrations:/flyway/sql -v $(FLYWAY_CONFIG_FILES):/flyway/conf flyway/flyway migrate -enterprise
  displayName: 'Run flyway for tSQLt'

You will notice a couple of important things that I have highlighted above:

  1. I’m cleaning the build schema using the Test_Migrations repository – this is because that is where my callback is and I need that to run before the clean otherwise it will fail due to the tSQLt assembly issue (line 17)
  2. I am running the migrate for the tests and the schema separately in the file, instead of just calling flyway to recursively run everything in the _Migrations folder. This is because I want them to be 2 separate steps, in case I need to modify or remove either one of them, or insert other steps in between and so that I can see the testing output in a separate stage of the CI pipeline (lines 23 and 29).

Caveat: As a result of (Option 2) running the 2 processes separately, it means running Flyway twice but specifying the Schema_Build and Test_Build folders in the YAML as being mapped to Flyway’s sql directory (lines 16 and 22 in the file above) but the problem this causes is that the second time Flyway runs, when it recursively scans the Test_Migrations folder it will not find the migrations that are present in the Flyway_Schema_History table, resulting in an error as Flyway is unable to find and resolve the migrations locally.

The way to fix this though is pretty simple – you find the line in the Flyway Config file that says “IgnoreMissingMigrations” which will allow it to easily continue. We wouldn’t have to worry about this setting though, if we were just recursively looking to migrate the Schema and Test migrations in the same step (but I’m a control freak tee-hee).

Now, once committed this all runs really successfully. Velvety smooth one might even say… but we’re not actually testing anything yet.

5 – Add some tests!

I’ve added a single tSQLt test to my repository (also available at the same GitHub link), it was originally created by George Mastros and is part of the SQLCop analysis tests – checking if I have any user procedures named “SP_”, as we know that is bad practice – and I have wrapped it up in a new tSQLt test class ready to run.

You’ll notice I also have a V999.9__ migration in the folder too, the purpose of this was to ‘top and tail’ the migrations; first have a script to set up tSQLt that could be easily maintained in isolation and then end with a script that lets me do just 1 thing: execute all of the tests. You can do this by simply executing:

EXEC tSQLt.RunAll

and we should be able to capture this output in the relevant stage of the pipeline.

Some of you may be asking why I chose to have the run unit tests as part of the setting up of the testing objects – this was because I had 2 options:

  1. I’m already executing scripts against the DB with Flyway, I may as well just carry on!
  2. The only other way I could think to do it was via a PowerShell script or run SQL job in Azure DevOps but the 2 plugins I tried fell over because I was using a Ubuntu machine for the build.

So naturally being the simple person I am, I opted for 1! But you could easily go for the second if you prefer!

6 – Test, Test, Test

Once you’ve handled the setup, got the callback in place (and also followed the steps from the last blog post to get this set up in the first place!) you should be able to commit it all these changes and have a build that runs, installs tSQLt and then runs your tests:

I realize there are a lot of “Warnings” in there, but that is just Azure DevOps capturing the output, the real part of this we’re interested in is lines 31-40 and if we clean up the warnings a little you’ll get:

+----------------------+
|Test Execution Summary|
+----------------------+
|No|Test Case Name|Dur(ms)|Result |
+--+---------------------------------------+-------+-------+ 
|1 |[somenewclass].[testProceduresNamedSP_]|144|Success|
------------------------------------------------------------
Test Case Summary: 
1 test case(s) executed, 1 succeeded, 0 failed, 0 errored. 
------------------------------------------------------------------

But if I introduce a migration to Flyway with a new Repeatable Migration that creates a stored procedure named SP_SomeNewProc…

+----------------------+
|Test Execution Summary|
+----------------------+
|No|Test Case Name|Dur(ms)|Result |
+--+---------------------------------------+-------+-------+ 
|1 |[somenewclass].[testProceduresNamedSP_]|184|Failure|
------------------------------------------------------------
Test Case Summary: 
1 test case(s) executed, 0 succeeded, 1 failed, 0 errored. 
------------------------------------------------------------------

It even tells us the name of the offending sproc:

All I have to do now is make the corresponding change to remove SP_ in dev against a bug fix branch, push it, create a PR, approve and merge it in and then boom, the build is right as rain again:

Thus bringing us back into line with standard acceptable practice, preventing us from delivering poor coding standards later in the pipeline and ensuring that we test our code before deploying.

Conclusion

Just because you adopt a more agile, migrations based method of database development and deployment, doesn’t mean that you have to give up on automated testing during Continuous Integration, and you can easily apply these same principles to any pipeline. With just a couple of tweaks you can easily have a fully automated Flyway pipeline (even xRDBMS) and incorporate Unit Tests too!

I will be running for the Ridge, will you?

“Until one has loved an animal, a part of one’s soul remains unawakened.”
Anatole France

TL;DR: If you love to walk, jog or run – sign up to a fantastic cause here to support a shelter in need!

The Full Story: This is going to be a reasonably short post, because I’m hoping if you’ve made it this far, you’ll read to the end.

This August, I will be “running for the Ridge“. This is an event taking place (virtually of course) where one can run, jog or walk any distance you like, as long as it is above 5km (3.11 miles) for Adults or 2km (1.25 miles) for ages 12 and below, wherever you like, anytime throughout the months of August and September.

For the last 2 months, I have aspired to be a healthier version of myself – I have managed to adopt healthier eating habits; removing a lot of the processed sugars and oils I was eating, not drinking alcohol during the week, always exercising at least once a day etc. and as a result I have managed to lose just over 12kg (26.5 lbs) and I feel so much better for it.

Running has been an instrumental part of the improvement in my quality of life and now I’m happy to be supporting something close to my heart with it.

So why this? What makes “running for the Ridge” special?

Well, because it’s in aid of Jacob’s Ridge – an animal shelter in Spain that, thanks to the impact of one nasty global pandemic, is sorely in need of donations in order to stay open and safeguard the well-being of the animals that it looks after, a cause deserving of support all in itself.

So that’s why this is special – because you can directly do something to:

  • Positively impact the lives of rescued animals
  • Positively impact your own health and mental well-being (in the form of exercise)
  • Do something nice with friends or family (at a responsible social distance where necessary)

To me, this just sounds like the easiest decision ever; run/jog/walk a distance of my own choosing, on my own comfortable running route and animals benefit from that? Amazing.

You will even get an awesome sustainable wooden medal for participating! And the cost? 12 GBP per person. That’s it. You can even get sponsored by friends, family and/or colleagues too if you want to contribute more!

So that’s it. That’s all I wanted to say. I’ll be running for the Ridge this August and you know – if you have a spare couple hours or you enjoy walking/jogging/running alone or in a group, think about putting it to good use and let’s get out, have some fun and do some good!

You can sign up here: https://www.jacobsridge.com/product-page/run-for-the-ridge – just sign up on the site and they’ll send you all the instructions direct to your email.

If you do sign up, tweet me @PlantBasedSQL with your entry number and I’ll give it a big ol’ like! I’m number 21 🙂

xRDBMS Database Continuous Integration with Flyway, Azure DevOps and Docker… the simple way.

“Some people try to make everything complicated, be the person who tries to make everything simple.”
Dave Waters

Simplicity is in my blood. That’s not to say I am ‘simple’ in the sense I cannot grasp more than the most basic concepts, but more that I am likely to grasp more complex problems and solutions when they are phrased in simple ways.

This stems from my love of teaching others (on the rare occasion it falls to me to do so), where I find the moment that everything just ‘clicks’ and the realization comes over them to be possibly one of the most satisfying moments one can enjoy in life.

shocked star trek GIF

Now recently I’ve been enjoying getting my head around Flyway – an open source JDBC based migrations tool that brings the power of schema versioning and deployments together with the agility that developers need to focus on innovation in Development. There’s something about Flyway that just… ‘clicks’.

It doesn’t really matter what relational database you’re using; MySQL, IBM DB2, even SAP HANA! You can achieve at least the core tenants of database DevOps with this neat and simple little command line tool – there’s not even an installer, you just have to unzip!

Now I’ve had a lot of fun working with Flyway so far and, thanks to a few people (Kendra, Julia – i’m looking at you both!) I have been able to wrap my head around it to, I would say, a fair standard. Caveat on that – being a pure SQL person please don’t ask me about Java based migrations, I’m not quite there yet!! But there is one thing that I kept asking myself:

“When I’m talking to colleagues and customers about Database DevOps, I’m always talking about the benefits of continuous integration; building the database from scratch to ensure that everything builds and validates…” etc. etc. so why haven’t I really come across this with Flyway yet?

think tom hanks GIF by The Late Show With Stephen Colbert

Probably for a few reasons. You can include Flyway as a plugin in your Maven and Gradle configurations, so people writing java projects already get that benefit. It can easily form part Flyway itself by virtue is simply small incremental scripts and developers can go backwards and forwards however and as many times as they like with the Flyway Migrate, Undo and Clean commands, so is there really a need for a build? And most importantly, Flyway’s API just allows you to build it in. So naturally you’re building WITH the application.

But naturally when you’re putting your code with other people’s code, things have to be tested and verified, and I like to do this in isolation too – especially for databases that are decoupled from the application, or if you have a number of micro-service style databases you’d want to test all in parallel etc. it’s a great way to shift left. So I started asking myself if there was some way I could implement a CI build using Flyway in Azure DevOps, like I would any of the other database tooling I use on a regular basis? Below you’ll find the product of my tinkering, and a whole heap of help from Julia and Kendra, without whom I would still be figuring out what Baseline does!

Option 1) The simplest option – cmdline

Flyway can be called via the command line and it doesn’t get more simple than that.

You can pass any number of arguments and switches to Flyways command line, including specifying what config files it’s going to be using – which means that all you have to do, is unzip the Flyway components on a dedicated build server (VM or on-prem) and then, after refreshing the migrations available, invoke the command line using Azure DevOps pipelines (or another CI tool) to run Flyway with the commands against a database on the build server (or somewhere accessible to the build server) and Bingo!

No Idea Build GIF by Rooster Teeth

And that’s all there is to it! You get to verify that all of the migrations up to the very latest in your VCS will run, and even if you don’t have the VERY base version as a baseline migration, you can still start with a copy of the database – you could even use a Clone for that!

But yes, this does require somewhere for Flyway to exist prior to us running with our migrations… wouldn’t it be even easier if we could do it without even having to unzip Flyway first?

Option 2) Also simple, but very cool! Flyway with Docker

Did you know that Flyway has it’s own docker image? No? Well it does!* Not only that but we can map our own version controlled Migration scripts and Config files to the container so that, if it can point at a database, you sure as heck know it’s going to migrate to it!

*Not sure what the heck all of this Docker/Container stuff is? You’re not alone! Check out this great video on all things containers from The Simple Engineer!

This was the method I tried, and it all started with putting a migration into Version Control. Much like I did for my post on using SQL Change Automation with Azure SQL DB – I set up a repo in Azure DevOps, cloned it down to my local machine and I added a folder for the migrations:

Into this I proceeded to add my base script for creating the DMDatabase (the database I use for EVERYTHING, for which you can find the scripts here):

Once I had included my migration I did the standard

Git add .
Git commit -m "Here is some code"
Git push

and I had a basis from which to work.

Next step then was making sure I had a database to work with. Now the beauty of Flyway means that it can easily support 20+ RDBMS’ so I was like a child at a candy store! I didn’t know what to pick!

For pure ease and again, simplicity, I went for good ol’ SQL Server – or to be precise, I created an Azure SQL Database (at the basic tier too so it’s only costing £3 per month!):

Now here’s where it gets customizable. You don’t NEED to actually even pass in a whole config file to this process. Because the Flyway container is going to spin up everything that would come with an install of Flyway, you can pass it switches to override the default behavior specified in the config file. You can adapt this either by hard-coding strings or by using Environment Variables alongside the native switches – this means you could pass in everything you might need securely through Azure Pipeline’s own methods.

I, on the other hand, was incredibly lazy and decided to use the same config file I use for my Dev environment, but I swapped out the JDBC connection to instead be my Build database:

I think saved this new conf file in my local repo under a folder named Build Configuration – in case I want to add any logic later on to include in the build (like the tSQLt framework and tests! Hint Hint!)

This means that I would only need to specify 2 things as variables, the location of my SQL migrations, and the config file. So the next challenge was getting the docker container up and running, which fortunately it’s very easy to do in Azure Pipelines, here was the entirety of the YAML to run Flyway in a container (and do nothing with it yet):

trigger:
- master

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: DockerInstaller@0
  inputs:
    dockerVersion: '17.09.0-ce'
  displayName: 'Install Docker'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run flyway/flyway -v
  displayName: 'Run Flyway'

So, on any changes to the main branch we’ll be spinning up a Linux VM, grabbing Docker and firing up the Flyway container. That’s it. Simple.

So now I just have to pass in my config file, which is already in my ‘build config’ folder, and my migrations which are in my VCS root. To do this it was a case of mapping where Azure DevOps stores the files from Git during the build to the containers own mount location in which it expects to find the relevant conf and sql files. Fortunately Flyway and Docker have some pretty snazzy and super clear documentation on this – so it was a case of using:

-v [my sql files in vcs]:/flyway/sql

as part of the run – though I had to ensure I also cleaned the build environment first, otherwise it would just be like deploying to a regular database, and we want to make sure we can build from the ground up every single time! This lead to me having the following environment variables:

As, rather helpfully, all of our files from Git are copied to the working directory during the build and we can use the environment variable $(Build.Repository.LocalPath) to grab them! This lead to me updating my YAML to actually do some Flyway running when we spin up the container!

trigger:
- master

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: DockerInstaller@0
  inputs:
    dockerVersion: '17.09.0-ce'
  displayName: 'Install Docker'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run -v $(FLYWAY_LOCATIONS):/flyway/sql -v $(FLYWAY_CONFIG_FILES):/flyway/conf flyway/flyway clean -enterprise
  displayName: 'Clean build schema'

- task: Bash@3
  inputs:
    targettype: 'inline'
    script: docker run -v $(FLYWAY_LOCATIONS):/flyway/sql -v $(FLYWAY_CONFIG_FILES):/flyway/conf flyway/flyway migrate -enterprise
  displayName: 'Run flyway for schema'

Effectively, this will spin up the VM in ADO, download and install Docker, fire up the Flyway container and then 1) clean the target schema (my Azure SQL DB in this case) and 2) then migrate all of the migrations scripts in the repo up to the latest version – and this all seemed to work great!*

*Note: I have an enterprise Flyway licenses which enables loads of great features and support, different version comparisons can be found described here.

So now, whenever I add Flyway SQL migrations to my repo as part of a branch, I can create a PR, merge them back into Trunk and trigger an automatic build against my Flyway build DB in Azure SQL:

Conclusion

Getting up and running with Flyway is so very very easy, anyone can do it – it’s part of the beauty of the technology, but it turns out getting the build up and running too, when you’re not just embedding it directly within your application, is just as straightforward and it was a great learning curve for me!

The best part about this though – is that everything above can be achieved using pretty much any relational database management system you would like, either via the command line and a dedicated build server, or via the Docker container at build time. So get building!

ready lets go GIF

Classification by design: Including data cataloging in a DevOps process

“The most powerful tool we have as developers is automation.”
Scott Hanselman

It is no secret that I love to talk about data protection, specifically from the perspective of structured data. When we talk about database development practices, we often find ourselves talking about 3 things most often:

  • Source Control
  • Continuous Integration and Continuous Delivery/Deployment (CI/CD)
  • Estate Monitoring

Some people refer to this as “DataOps“, others refer to it as “DevDataOps” but in reality, it’s all DevOps guys. This may be an unpopular opinion (and if it clashes with yours please forgive me, it’s just my opinion) but just because a certain niche area hasn’t been specifically called out within a subset of DevOps doesn’t mean you have to invent your own term for it!

Now this leads me on to DevSecOps, or as I like to call it… More secure DevOps.

rimshot GIF

No but seriously this is a slightly different case – DevSecOps is like DevOps but fortified with security from the ground up. There’s a fantastic article and diagram of this on Plutora from Mark Robinson of how this looks (below) and if you haven’t read his article I would definitely go and give it a read!

DevSecOps Diagram

Good DevOps practice is a combination of different things working together, bringing the right mentality, the principles, processes and amazing tools at our disposal like automation but this all includes security from the ground up too. DevOps is about putting those principles and practices in place to strengthen the pipeline, so why don’t we treat security in the same way?

Take, for example, 3 pieces of legislation that have been very much in the spotlight:

The controller shall implement appropriate technical and organisational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed.
GDPR (Europe) Art. 25 “Data protection by design and by default”

Processing agents shall adopt security, technical and administrative
measures
able to protect personal data from unauthorized accesses and accidental or unlawful situations of destruction, loss, alteration, communication or any type of improper or unlawful processing.

– LGPD (Brazil) Chapter VII, Art. 46 “Security and Secrecy of Data”

“A Controller or Processor is required to implement appropriate technical and organisational measures to demonstrate that Processing is performed in accordance with this Law…”
DIFC LAW NO. 5 OF 2020 (Dubai) Part 2D, Art. 14 (2) “Accountability and notification”

There’s a common running theme here and although lots of global legislation will either allude to, or directly tell you ways you can be compliant and what some of these “organizational” and “technical” measures are, it’s still pretty blurry.

How do we know what we can do? How do we know what “default” and “design” mean in this context? Well, we build it into the DevOps process.

Now I could sit here forever and talk about why transforming your database development, deployment and provisioning processes allows us to be more secure, but that’s a lot of material and it might have to come in chunks! So what we’re going to focus on today is as the title suggests: Data Classification and Cataloging.

Why is Cataloging important?

Cataloging structured data is incredibly important because it can be one of the first steps we take to securing sensitive Personally Identifiable Information (PII) or Protected Health Information (PHI) wherever it exists across our database environments. It allows us to make strengthened, contextual decisions about the data we hold including how we treat it in pre-Production, how long we retain it for and which systems and processes consume it.

But the most important part of this is simply: it tells us where the risk is.

Read through any of the most recent data protection laws and you will notice that a few things come up quite a lot including “Data Protection Impact Assessment“, or DPIA. Effectively if you can assess the risk of processing activities you can more readily answer the data protection questions and challenges you may face.

Knowing where your data resides can be the first step to helping you assess this risk, and to more readily answer your own data questions. If you want to read more about Cataloging specifically and why it is useful, you can read more about it on my previous blog here.

Where does Cataloging fit into DevOps?

This one is simple to answer. Once you have fully classified your entire estate, you’re not done. No, if you’re a development house or indeed even a single developer – if you are making any schema changes to the tables holding that sensitive data, you’re never done.

The reason for this is that Cataloging is an evergreen activity – if you update tables by removing columns, adding columns, splitting tables, adding tables… anything! Well then you need to be ready to make sure that you are:

a) Prepared and equipped with knowledge of the tables you’re working on and if this is a high risk activity.

b) Updating classification information to reflect the new “truth”, i.e. if you’re adding a column that will collect people’s Twitter handles, then that column should be classified as sensitive, and this should be reflected the moment it is deployed to Production.

So it is important to have the correct people working on this, with the right knowledge, preparation and processes and using the correct tools ensuring that those updates are persisted properly and securely through your deployment pipelines.

Huh… people, processes and tools… That sounds familiar!

The Process: SQL Data Catalog, SQL Change Automation and Azure DevOps

For this little experiment of mine I used Redgate’s SQL Change Automation (Migrations First approach in SQL Server Management Studio) and SQL Data Catalog to both develop & deploy and classify/categorize respectively, and for simple version control and orchestration of this pipeline I opted for Azure DevOps (with SQL Change Automation CI/CD plugins):

NOTE: Heads up, all of the code I used for this can be found right here in my GitHub, feel free to have at it!

Step 1) Setup SQL Change Automation project with an Azure DevOps Git Repo and then create the YAML file to build it, and create a deployment process to Acceptance and the Production.

Ok – that’s a bit of a mouthful and a massive ask. There’s a lot of things there you have to be familiar with, but we don’t have time to go into right here. BUT fortunately if you checkout Redgate University right here, my blog post on using Change Automation with Azure DB and Redgate’s Product Learning section, you’ll be a Database DevOps ‘Whizz-Kid’ in no time!

Fast forward a little and I had my example databases, VCS and pipeline all up and running:

Step 2)The “Theory”: This is where things get interesting. So we have an example pipeline set up and we are able to completely deploy all the way through to “Production” so let’s talk theory.

In SQL Data Catalog I have covered both my Production and Acceptance Databases:

Now, in development we don’t make changes directly to Production, so why should Classification be any different? Now how you adapt the above code is up to you, feel free to split it, move it around, incorporate it into Pull Requests if you want to… But I’m going with a bit more of a simple situation.

Situation: Developer makes a change in Development, which gets committed, reviewed and merged o the main branch, resulting in a build and a deployment, in this case to Acceptance and then it is later deployed to Production.

Now, by Acceptance we should only have the “good work”, i.e. all of our testing is shifted left within DevOps so Acceptance is basically the last stop before Production. Therefore we should classify the work we have done on Acceptance, crucially, before it gets to Production and starts gathering sensitive data, and then copy this classification up on deployment.

Ideal: We should have no columns on Production that have not been classified.

Step 3) – In Practice: Fortunately it’s very easy to automate a lot of these steps with SQL Data Catalog utilizing it’s PowerShell cmdlets and REST API. The cmdlets are fully documented and very easy to use (docs here). This allows us to easily scan, classify and copy classifications up to other databases, but we’ll also need to do some checks and report if there are discrepancies, as part of the deployment pipeline that can be investigated.

  • Are there any columns on Acceptance that aren’t classified but have been deployed to Production? (failure to comply with process)
  • Are there any columns on Production that have not been classified? (classification drift)
  • Are there any unclassified columns on Acceptance that have not yet been deployed to Production (for pipeline hygiene purposes)

The other part of this ‘fun’ is reporting what has been changed in the same process. Now fortunately SQL Change Automation spits out a Changes.json file with its Release Artifacts and we can steal that away and find out how many tables have been created or changed in this release and report that back so we can correlate what has been done and what is missing:

So actually getting this up and running is just going to require 3 things:

  1. The PowerShell script from GitHub (or your own personalized variant) as a step in your production deployment
  2. Data Catalog available and pointed at Acceptance and Production (or your versions of these environments)
  3. Variables set in Azure DevOps to fill the gaps (e.g. Where is Data Catalog? Whats my PowerShell Auth token? What are my Acceptance and PROD DBs called? etc.)

3 is the last step there so you’ll need something like this to run the script:

  • DatabaseDeploymentJSON – where the JSON file will be with the latest changes in the Prod release
  • DataCatalogAuthToken – Your PowerShell Auth token from Settings in Data Catalog
  • DataCatalogUrl – The full URL to your Data Catalog installation, missing the “\” at the end (ending :15156)
  • ExportPath (Optional) – I specified the path for my Database Deployment Resources to save typing it out in the Redgate plugins
  • ProdDB / StageDB – As you would expect, the Production and Acceptance/Staging DBs you’re deploying to/from
  • ProdInstance / StageInstance – As above, except the instance the Database are located on

In the variables above the Instance and DB names are purely used within Data Catalog, so there’s no need to worry about anything happening to the actual databases themselves!

Once you’ve run through the deployment pipeline a couple of times and the changes.json file is being produced, you can go ahead and copy the script into an inline PowerShell script step in your release and you should find it will fire to life! I simulated an example by modifying my Contacts table and my Articles table, adding 1 column each and deploying both to Acceptance. I then classified just 1 of these in Acceptance in Data Catalog:

and then approved the deployment to Production and tada!

Ok you probably can’t make all that out, but it effectively says:

(Information) Table dbo.Articles was modified in this deployment.
(Information) Table dbo.Contacts was modified in this deployment.

That much we knew!

1 column(s) with classifications were discovered on VoiceOfTheDBA Acceptance that are not classified in VoiceOfTheDBA Production:
dbo.Articles.TestingPineapple

Excellent, we classified that one so it gets copied up and we can verify that in data catalog against Production:

and finally, we get a warning about Production now containing unclassified columns:

(Alert) The following columns have been discovered on VoiceOfTheDBA Production that require classification:

dbo.Contacts.TestingPineapple

You should classify these columns in VoiceOfTheDBA Acceptance prior to the next deployment.

Just as we expected. Success!

Happy Tom And Jerry GIF

Conclusion

Classification and categorization belongs as part of DevOps, if you expect the context for your business decisions around data to remain evergreen and informed then it cannot sit on the shoulders of one or two people to support it, and it cannot live in a manually updated Excel sheet or document.

By including it within the DevOps process, not only do you add an additional layer of security but you also make it an automated, team activity that can be audited, checked and easily kept up to date.

Is this DevSecOps? Well… not really no. Is this a more secure approach to Database DevOps? Absolutely! Happy DevOpsing!

My wife and I started a food blog!

“Veganism is not a “sacrifice.” It is a joy.”
Gary L Francione

This is just a short one in the grand schema of things, you probably all know that I enjoy blogging but a lot of my posts end up quite long and rambling. This one however will be quite concise.

My wife and I started a food blog.

There. I said it.

She had been saying to me for some time that she wanted a place to take all the vegan recipes we make and to put them up on the interwebs for people to find and make and get as much joy out of as we do, so we finally did it. It stalled a few times and then eventually, like 2 weeks ago, we finally got the first couple of posts up.

If you’re interested in the sort of plant-based cooking we do at home, you’re looking to go plant based or just looking to help the animals out, you can find our blog right here:

https://TheSnugVegans.com/

We look forward to dining with you!

The Snug Vegans