“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
– Sir Arthur Conan Doyle, Sherlock Holmes
EDIT: Modified 23/12/2020 to include updates to Flyway in v7, method 4 below
Can you tell I’m loving Flyway at the moment? Well I am. It’s JUST SO GOOD! Honestly there are so many things you can do with it! Don’t know what I’m talking about? Check out my posts on xRDBMS DevOps with Flyway and tSQLt unit tests with Flyway and you’ll see what I mean!
As a result of the above posts though I was asked a question that I had to think about for a little bit before having the best possible answer, how can we seed some testing data INTO the build database so that we can run some meaningful tests against it?
This makes perfect sense to me, but there’s also a few different ways to do this – so let’s go fly(way)!
1 – Test Data Migration Scripts
In my previous posts on Flyway (above) I talked about having an entirely separate build folder present within the repository, and a folder of test migrations alongside our schema migrations – I called these the Build_Config folder, (containing the build configuration file) and the Test_Migrations folder (unsurprisingly containing testing migrations) in the _Migrations location:
I was using the same build config for 2 purposes; 1) to build the schema migrations from the base version, by passing it the Schema_Migrations location dynamically and 2) then building the tSQLt framework and testing objects by passing it the Test_Migrations location dynamically.
This actually worked surprisingly well, but even beyond this – the same method can be repurposed, or added to, by augmenting your testing scripts and adding a data insertion task (as an additional script or group of scripts). In my folder, I can simply add a migration like this:
Because of course I like dogs.
and once pushed to the repository and the build has run we should be able to verify our testing data is present:
A bonus win for this step of course, is that where Devs have their own Flyway config files locally for their development databases they could also overwrite this behavior and point the testing and/or data scripts at their own database so they have some seed data to work with too!
2 – Add a data generation step to the pipeline
There are SO MANY technologies out on the inter-webs for generating data. SO MANY. Many of them also have a command line or PowerShell module that we can use to easily invoke them against a target, especially if that target is going to be persistent like my Flyway Azure SQL Build DBs!
Because I have access to it and because I’m using essentially SQL Server DBs, I could easily use Redgate SQL Data Generator – but to get the data you need you could use anything from DBATools Data Generation (also SQL Server) to FillDB for MySQL (which looks awesome and you could easily use this for Step 1 above too!)
There are numerous ways to invoke tools and applications and fortunately good CI/CD tools like Azure DevOps offer multiple ways to, for instance, run PowerShell or CLI steps from within the pipeline – so we could easily invoke SQL Data Generator on a VM or physical machine we have an Azure DevOps agent on – but this thinking also opens up the possibility of using something like Chocolatey to dynamically install the software on the Azure DevOps hosted pool VM during build (for the Redgate tools at the moment I suppose you’d need a Windows VM).
I will be writing a future blog post about this step because it sounds _very_ interesting, but I’m not sure yet what can be done specifically using Chocolatey or if I’ll have to look elsewhere, although I have read this post in the past (thanks Paul!) detailing limitations and a great workaround using Azure DevOps, so it’s likely that’ll be my first port of call!
Just to give you an idea of end result with SQL Data Generator specifically though:
3 – Use existing data, don’t generate
Ok this one is going to be controversial already, I can tell! Let’s all stay calm!
The best data to be tested is our data. What we have in Production is what will have these changes deployed to it… eventually! So shouldn’t we just test against that? Well. Maybe, maybe not depending on what is in there.
There’s a few methods to achieve this – my personal favorite would be to use a SQL Clone, spin that up on a build VM rather than using an Azure SQL DB, and we can have all the data in an instant. Of course if we hold any sensitive PII/PHI then we should ensure that is protected first!
Of course there are lots of other options, like restoring a backup or spinning up a container etc. and these can all just be a stage in the YAML file before invoking Flyway but the point is, if we use an existing copy of our Prod database from some source or another, it will have 2 things we really care about:
- Data. Ready to go, ready to test, ready to give us the best possible insight into our changes.
- The flyway_schema_history table. Instead of running EVERY migration we’ve ever written, which could take a while for a large team, we run only the latest migrations to check that they would deploy happily to the Production target.
To get this stage to work though, you would need to do a couple of things differently:
- The build DB would have to be created from the clone/backup/other every time instead of simply cleaning the schema down.
- You would need to remove the Flyway Clean step from the pipeline in my previous post, because it would otherwise drop all the tables (and then we wouldn’t have any data!)
- By extension, this also makes the callback to remove the tSQLt objects void, so you can remove that too.
4 (Bonus Method) – Script Migrations
In Flyway v7 the team added the ability to also run script Migrations and Callbacks which mean it is possible to invoke .ps1, .bat, .cmd, .sh, .bash, and .py files as part of the version control > build and migration process.
This means that you can include a script to invoke any loading or processing of data you may prefer – you could invoke a data generation utility, data masking and of course anything else that can be invoked with these file formats. A good example of this might be calling Data Generator as above, or you could use DBATools, DTM Data Generator or even a more platform agnostic approach by using a Hazy generator to produce and then load an incredibly realistic data set.
There are a lot of different ways to generate data, you can generate completely synthetic data, you can mask data or use Prod data, it’s up to you! Ultimately it will just for another part of your pipeline – just be careful of ordering! You don’t want to try generating data into a table that hasn’t been built yet.
Respect your YAML file and you’ll get schema, data and unit tests and this will lead to one thing. Greater insight, earlier.