It's been a long gap since my last article here, sadly. The pandemic created various new pressures for me, my family, and my friends, and so it was an easy decision to prioritize doing "real work" and deprioritize writing about that work. But one of the functions of this site is to give me a historical record of the things I was working on, and looking back over the past many months I'm realizing how much harder that context is to recover without some articles to refer to.

With that said, my aim here is to try to sum up what is at this point a multi-year intermittent research effort on testing Terraform modules, which only in the last few months became an official project sponsored by HashiCorp.

Historical Context

I started my initial research into testing Terraform modules before I joined the Terraform team at HashiCorp, and I've found various paper notes from that time where I was trying out different permutations of tests written in the Terraform language, tests written in external general-purpose languages, unit tests with mocked providers, integration tests with real providers, and various other fussy details about the ergonomics of writing different sorts of tests.

Concurrently with my early experiments, Terratest and Kitchen-Terraform both emerged as great examples of one particular approach: integration tests written in external general-purpose languages, with Terratest requiring tests written in Go and Kitchen-Terraform requiring tests written in Ruby. Both of these have been used well in the community, but neither has emerged as a significant de-facto standard, and I think in part that's because no single general-purpose language is comfortable and familiar to all Terraform users, and Terraform already has its own language to learn anyway.

Alongside those other efforts, then, I began using a lightweight convention in some modules I was maintaining: create a subdirectory called tests which, in my initial implementations, was itself a root Terraform module that called the parent directory as a child module multiple times, like this:

module "defaults" {
  source = "../"

  # no arguments; default values only
}

module "with_reticulator" {
  source = "../"

  enable_reticulator = true
}

I ran these "tests" just by switching into this directory and running normal Terraform commands:

cd tests
terraform init
terraform apply
terraform destroy

In later incarnations of this I found it more convenient to create subdirectories under tests, each of which is a smaller root module containing only one call, because that makes it easier to iterate on a single example during main development, although it's admittedly harder to run all of the test configurations together at once to verify the final implementation.
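
For example, the tests directory in that style might contain one subdirectory per scenario, each holding a tiny configuration along these lines (the directory and file names here are purely illustrative, since the convention was entirely informal):

# tests/with_reticulator/test_with_reticulator.tf

module "with_reticulator" {
  # The module under test is now two levels up
  # rather than one.
  source = "../.."

  enable_reticulator = true
}

Each of these smaller root modules then gets its own init, apply, destroy cycle, run from inside its own directory.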

I liked that this didn't require any extra software alongside Terraform, but it had two significant drawbacks:

  • As noted above, the workflow for running multiple tests and collecting the results is not very convenient.

  • Additionally, these test cases can only distinguish between a successful run and a failed run. There isn't any good way to automatically verify that the resulting infrastructure is actually working as desired.

To help with the second of these drawbacks, I wrote a special Terraform provider called testing which contains some data sources whose only purpose is to compare values and fail if they don't match, allowing me to force an apply to fail in arbitrary situations and thus include additional checks in the signal produced by the test configurations.

Once Terraform v0.13 introduced the decentralized namespace for providers and an open provider registry, I published that provider essentially as-is as apparentlymart/testing. It's still available as I write this, ready to use in any test configuration that you'd run with Terraform v0.13 or later.
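
Since it's published in the public registry, any test configuration can pull it in with an ordinary Terraform v0.13-style provider requirement:

terraform {
  required_providers {
    # The assertion-checking data sources described above
    # live in this provider.
    testing = {
      source = "apparentlymart/testing"
    }
  }
}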

Due to a bit of an accident of history, I ended up contributing one of the modules I'd been maintaining independently to be adopted as the official HashiCorp Terraform module hashicorp/dir/template, forgetting that I'd used this informal testing pattern there. Amusingly, then, there is now an official module that is (at least at the time of writing this) using this pattern, although that is a coincidence rather than intentional.

All of the above brings us to the point where module testing became an official priority for the Terraform team, with explicitly-planned research being the first order of business during the v0.15 development period.

Which direction to take?

The main thing I'd learned in my efforts so far is that there are at least infinity plus one different ways to slice this problem, and naturally everyone has their own sense of which parts of the problem are important, which are just unimportant implementation details, and which parts of systems they already know they'd rather just import and not think about too much.

I'm obviously not immune to those biases myself, and so for this first foray into testing as an official research project I picked a relatively modest avenue to explore: how far can we get with this idea of writing integration tests within the main Terraform language, without introducing an external language and language runtime?

This starting point is itself a bias: why should we prioritize writing tests in the main Terraform language? Although I'll be the first to admit that I prefer the design tradeoffs of that approach, the primary rationale is that Terratest and Kitchen-Terraform already did a great job of deeply exploring the other extreme where tests are written in an external language, and so I already have access to various examples of tests written with that strategy and the pros and cons folks discovered while doing it.

With that said then, Terraform v0.15 includes only a very modest step forward compared to my de-facto pattern of using a special provider and a tests subdirectory. The two significant differences are:

  • Terraform CLI now has an experimental terraform test command which natively understands that directory structure and essentially runs the init, apply, destroy sequence for each of the test suites in turn, avoiding the need to script that outside of Terraform.

  • There's an experimental built-in provider called terraform.io/builtin/test which is more-or-less a cut-down version of apparentlymart/testing, although lightly redesigned to exploit the fact that terraform test is a more specialized command than terraform apply: the provider and the command are integrated together, allowing us to produce more detailed reports than just whether each configuration succeeded or failed as a whole. (There's a sketch of how these two pieces fit together just after this list.)
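
To sketch how those two pieces fit together: a test suite is a directory like tests/defaults under the module being tested, containing a configuration along roughly the following lines. (This is simplified and subject to change as the experiment evolves, and the reticulator_enabled output here is just a hypothetical output of the module under test.)

terraform {
  required_providers {
    # The experimental built-in test provider described above.
    test = {
      source = "terraform.io/builtin/test"
    }
  }
}

module "main" {
  # The module under test is two levels above the suite directory.
  source = "../.."
}

resource "test_assertions" "defaults" {
  # An identifier for this group of assertions in the test report.
  component = "defaults"

  equal "reticulator" {
    description = "reticulator is disabled by default"
    got         = module.main.reticulator_enabled
    want        = false
  }
}

With one or more suites like that in place, running terraform test from the main module directory initializes, applies, and destroys each suite in turn, reporting the outcome of each assertion along the way.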

My aim with this baby step, then, is to try to make this hypothetical pattern more visible to module authors and ask those who have some time and motivation to join the experiment to try writing some tests and share what they came up with. I'm hoping to gather a collection of examples, both successful and unsuccessful: first to show that it's practical to write tests in the Terraform language itself (although that is not a foregone conclusion) and then, if the answer is yes, to build a list of rough points in the initial prototype which we can iterate on in later phases.

This is the part of the article where we reach the present day: the Terraform team shipped Terraform CLI v0.15.0-beta1 today, including the work I've described above. For the rest of this article I'm going to gaze into the future a little, but with the caveat that feedback could very well invalidate this whole line of inquiry, in which case I'm ready to abandon it and seek a different path.

I know that there is also a cohort particularly interested in writing true unit tests for Terraform modules, where the focus is on testing only the expressions in the configuration files, with the underlying provider operations mocked away somehow. I've not forgotten this use-case, but some customer interviews at the start of this process (thanks to our product manager, Petros Kolyvas!) showed us that integration testing was the more common user expectation and so, in the interest of starting somewhere, I decided to start there. I do intend to return to unit testing use-cases at a later stage, but I want to make sure we have a good official story for integration testing first.

Language Features for Testing

The documentation and other commentary I've written so far for the prototype testing features are careful to point out that I used a provider as the vehicle for test assertions only because that allowed for a more self-contained experiment, avoiding the need for cross-cutting conditional behaviors across the whole Terraform language runtime that would likely have been disruptive to other work in progress.

Reasonably, that has prompted various people to ask me to speculate about what a more language-integrated version of this might look like. I don't want to stack the deck in favor of this particular answer, but I'd be lying if I said I hadn't already been thinking a few steps ahead about where this path might lead, and so to conclude this I'd like to say a little about the rough direction I'm imagining if the current experiment is successful.

I'm not going to discuss specific syntax here, because it's far too early to get into that, but I do want to introduce some concepts that I'm thinking about and intending to consider as we move toward a more final design.

I think it's fair to say that a lot of Terraform users come from a background where they've primarily used general-purpose imperative-style programming languages, like Ruby, Python, JavaScript, Go, etc. That's true for me too.

When coming from that background, it's common to focus on testing paradigms that also fit that mold: commonly, we write an imperative program that commands the system under test to perform various operations and then runs sequential assertions against it. This is the testing model adopted by both Terratest and Kitchen-Terraform, as I discussed earlier.

For various reasons, the Terraform language has adopted a different paradigm. The Terraform documentation commonly uses the word "declarative" to describe Terraform, and indeed declarative programming is a paradigm recognized by the computer science community, although it's often difficult to pin down a definition for it any more precise than "not imperative", describing it only in terms of what it doesn't do.

As someone who has been deeply involved in the modern development of the Terraform language, I typically reach for literature from functional programming circles when considering new design problems, because in practice (possibly by accident more than by design) Terraform's declarative style has left it more at home with functional programming structures like list comprehensions than with imperative programming structures like loops and other explicit control flow.

The functional programming community has — I think it's fair to say — a broader sense of "testing" than imperative practitioners like myself are typically exposed to. Depending on the implementation language, functional programs can often be more amenable to being shown to be correct by formal proofs rather than by explicit testing, and in situations where that isn't true I think a functional practitioner would perhaps be more likely to reach for property-based testing, where the programmer typically writes down expressions describing predicates that must hold for all inputs and then calls the system under test with various randomly-generated inputs to see if those predicates always hold. Amongst other reasons, this can be because in functional/declarative code we're more often describing just the what and not the how of the implementation, and thus example-based test code can quickly end up just becoming an identical copy of the code it's aiming to test.

(I'm not going to get fully into the weeds of property-based testing here, but if you're interested in learning more about it then I'd suggest starting by reading about QuickCheck, which is a popular utility library for property-based testing in the Haskell language. Even if Haskell is not a familiar language for you I expect you'll find that there's a QuickCheck-like library written for your favorite language that you can use to experiment with this testing paradigm.)

With that said, I don't think it's particularly practical to use textbook property-based testing techniques for Terraform module integration testing, because the typical strategy of rapidly calling a function with various sorts of correctly-structured random garbage isn't a good fit for modules that can take several minutes to deploy and that often incur a direct monetary cost for whoever is running the tests.

Instead, I'm considering adopting some of the principles of Design by contract to get a similar result by having module developers write configuration to express the following:

  • For each significant subcomponent of the module (input variables, resources, output values), write preconditions and postconditions as expressions that are expected to hold true for all possible input. (In my initial design sketches this follows a similar model to the validation blocks we added for input variables in an earlier Terraform version; there's an example of those just after this list.)

  • Then, alongside the main module implementation code, write one or more examples of an interesting set of inputs to the module, along with some additional metadata such as whether the example is expected to succeed or fail and, in the latter case, some sense of how it ought to fail (which of the specific module subcomponents should report the problem).
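
For the first of those, the existing validation blocks give a reasonable sense of the general shape I have in mind, just scoped today to a single input variable. (The variable and rule here are hypothetical, purely for illustration.)

variable "bucket_name" {
  type = string

  validation {
    # A precondition that must hold for every caller of the module.
    condition     = length(var.bucket_name) <= 63
    error_message = "The bucket_name value must be at most 63 characters long."
  }
}

The sketches I've been making essentially generalize that idea to the other objects in a module, such as resources and output values.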

A nice property of this model is that the work done to author the tests can provide additional benefits outside of automated testing: Terraform could run the preconditions and postconditions under normal, non-test-harness conditions too, and thus catch dynamic violations of the invariants that might arise due to remote system outages or defects.

Additionally, depending on how the "test examples" end up being written, it might be possible to extract the ones marked as expecting success and include them in generated documentation for the module, creating a mechanism similar to the "doctests" seen in other language ecosystems, which aim to prevent documented examples from becoming invalid over time under ongoing maintenance.

Again, feedback from the initial experiment might lead me in an entirely different direction than what I've described here, so we'll have to wait to see what's really coming next, but I hope the above is at least enough to satisfy the curiosity of several folks who have asked me to speculate on where this experiment might lead!