Terraform's various providers are essentially adapters from Terraform's "idealized" view of an object lifecycle to the practical reality of actual vendor APIs.
As a result, it became clear early in their development that unit testing alone cannot give complete coverage of their functionality: the changing behaviors of the remote API significantly affect how a provider behaves. Testing end-to-end against real APIs has therefore been an effective way to verify that provider behaviors are not accidentally broken by ongoing maintenance, and to recognize when upstream API changes have inadvertently broken a provider's assumptions.
The Terraform provider teams refer to these end-to-end tests as acceptance tests, and primarily run them just prior to starting work on a system and then again after a change is implemented, to verify that it doesn't inadvertently affect unrelated functionality.
However, because Terraform's providers were originally developed in the same codebase as Terraform itself, the acceptance testing framework was developed as a Go package wrapping Terraform Core's internals, directly calling into the same underlying APIs as Terraform CLI does. When we split providers into separate codebases for Terraform 0.10, the providers began depending on Terraform CLI's repository as an upstream library dependency, and thus retained the acceptance testing functionality through their vendored copies of Terraform Core.
That architecture became particularly problematic during the Terraform 0.12 effort because it meant that each provider repository could only depend on one version of Terraform Core at a time, making it impossible to run the provider acceptance tests against both Terraform 0.11 and Terraform 0.12 without resorting to import rewriting hacks.
In the aftermath of Terraform 0.12, I did some prototyping work around other ways providers might implement acceptance tests such that they are not coupled to any particular release of Terraform Core.
Terraform is not a Go library
Although the layout of Terraform's git repository allows parts of it to be treated as a Go library, the only public interface to Terraform Core is via Terraform CLI commands.
We've not put technical measures in place to block importing of Terraform Core packages from external codebases, but we also don't make any effort to preserve Go API compatibility for these between releases: a change to Terraform Core is free to break any of the internal APIs as long as it simultaneously repairs all of the callers.
Splitting the providers into separate codebases in Terraform 0.10 without first creating a formal architectural boundary between the providers and Terraform Core unfortunately created a de-facto library API out of a subset of Terraform Core functionality. We accepted this for Terraform v0.10, v0.11, and v0.12 just because other work took priority. However, Terraform v0.12 created particular tensions for that accidental architecture because it involved significant changes to Terraform internals and inter-package interfaces.
In order to complete the Terraform v0.12 release without adding significant provider changes to its scope, we left lots of dead code in Terraform Core that is no longer used by Terraform Core itself but is still imported by the providers. By far the most significant bother in that regard is that the provider acceptance testing framework essentially depends on the whole of Terraform Core, because it runs Terraform Core in-process as if it were a Go library.
We recently released a separated SDK for Terraform providers to sever the dependency between providers and Terraform Core, but in its initial incarnation it is effectively a fork of Terraform Core itself in order to support the various existing acceptance tests in published providers.
Acceptance Testing via Terraform CLI
Given that Terraform CLI is the public interface to Terraform, and that acceptance tests are intended as end-to-end tests, I've been investigating the practicality of implementing acceptance testing in terms of official Terraform CLI release executables running as child processes.
The result of that investigation was an experimental Go library called `terraform-plugin-test`.
Since it's just a prototype, it has an API that maps quite directly to the main Terraform CLI operations, and it is thus not compatible with Terraform's existing acceptance testing framework. However, it does illustrate that it is technically feasible to write tests that orchestrate Terraform CLI processes and capture results, by exploiting some new features added in Terraform v0.12 that allow machine inspection of plan and state artifacts.
To exercise it, I wrote a test function for Terraform's `null` provider using this new package, with the following implementation:
```go
package null

import (
	"reflect"
	"testing"
)

func TestResource(t *testing.T) {
	wd := testHelper.RequireNewWorkingDir(t)
	defer wd.Close()

	wd.RequireSetConfig(t, `
resource "null_resource" "test" {
  triggers = {
    a = 1
  }
}
`)
	wd.RequireInit(t)
	defer wd.RequireDestroy(t)
	wd.RequireApply(t)

	state := wd.RequireState(t)
	if got, want := len(state.Values.RootModule.Resources), 1; got != want {
		t.Fatalf("wrong number of resource instance objects in state %d; want %d", got, want)
	}
	instanceState := state.Values.RootModule.Resources[0]
	if got, want := instanceState.Type, "null_resource"; got != want {
		t.Errorf("wrong resource type in state\ngot: %s\nwant: %s", got, want)
	}
	if got, want := instanceState.AttributeValues["triggers"], map[string]interface{}{"a": "1"}; !reflect.DeepEqual(got, want) {
		t.Errorf("wrong 'triggers' value\ngot: %#v (%T)\nwant: %#v (%T)", got, got, want, want)
	}

	// Now we'll plan without changing anything. That should produce an empty plan.
	wd.RequireCreatePlan(t)
	plan := wd.RequireSavedPlan(t)
	if got, want := len(plan.ResourceChanges), 1; got != want {
		t.Fatalf("wrong number of resource changes in plan %d; want %d", got, want)
	}
	instanceChange := plan.ResourceChanges[0]
	if got, want := instanceChange.Type, "null_resource"; got != want {
		t.Errorf("wrong resource type in plan\ngot: %s\nwant: %s", got, want)
	}
	if !instanceChange.Change.Actions.NoOp() {
		t.Errorf("wrong action in plan\ngot: %#v\nwant no-op", instanceChange.Change.Actions)
	}

	// Now we'll change the triggers, which should cause null_resource to show
	// as needing replacement in the plan.
	wd.RequireSetConfig(t, `
resource "null_resource" "test" {
  triggers = {
    a = 2
  }
}
`)
	wd.RequireCreatePlan(t)
	plan = wd.RequireSavedPlan(t)
	if got, want := len(plan.ResourceChanges), 1; got != want {
		t.Fatalf("wrong number of resource changes in plan %d; want %d", got, want)
	}
	instanceChange = plan.ResourceChanges[0]
	if got, want := instanceChange.Type, "null_resource"; got != want {
		t.Errorf("wrong resource type in plan\ngot: %s\nwant: %s", got, want)
	}
	if !instanceChange.Change.Actions.DestroyBeforeCreate() {
		t.Errorf("wrong action in plan\ngot: %#v\nwant destroy then create (replace)", instanceChange.Change.Actions)
	}

	// For good measure, we'll apply that saved plan to make sure the destroy
	// works (should be a no-op, but successful).
	wd.RequireApply(t)
}
```
Behind the scenes, the methods of this `wd` object are manipulating a temporary directory. `RequireSetConfig` writes a configuration file into that directory (or fails the test if it cannot), `RequireInit` runs `terraform init`, `RequireApply` runs `terraform apply`, etc.
Some initialization code not shown here prepares the global `testHelper` object that knows where to find a `terraform` executable to run, and then `RequireNewWorkingDir` establishes the temporary working directory and places a minimal set of files inside to ensure that within this particular working directory the `null` provider is interpreted as the current code we're working on.
Test Program as Plugin Server
Terraform plugins are child processes that are launched by Terraform CLI itself and speak a Terraform-specific RPC protocol over a local socket in order to integrate into the Terraform operation lifecycle.
Normally when testing Terraform provider changes against a real Terraform CLI executable we must separately run `go install` to build a binary of the provider and make sure it's in one of the locations where Terraform CLI will discover it. That adds extra overhead to the develop/test cycle, and a key advantage of acceptance tests is shortening that cycle by directly testing the current provider code in a dev environment, implicitly built by the `go test` command.
To preserve that benefit, `terraform-plugin-test` employs a sneaky trick: the test program itself is able to initialize as a Terraform provider plugin. The `go test` command internally builds a temporary executable containing the test code for the target package and then runs that executable, capturing its output. Aside from specific expectations about what it will write to stdout and how it will exit, the test program is just a normal executable, and can in principle do anything a normal program could do.
My modified `null` provider includes the following special `TestMain` in its test package:
```go
func TestMain(m *testing.M) {
	testHelper = initProviderTesting("null", Provider)
	status := m.Run()
	testHelper.Close()
	os.Exit(status)
}

// initProviderTesting is a helper that would ideally be provided by the SDK
// itself if the SDK were to adopt tftest as its testing mechanism, but it
// lives here for now for ease of prototyping.
func initProviderTesting(name string, providerFunc plugin.ProviderFunc) *tftest.Helper {
	if tftest.RunningAsPlugin() {
		// The test program is being re-launched as a provider plugin via our
		// stub program.
		plugin.Serve(&plugin.ServeOpts{
			ProviderFunc: providerFunc,
		})
		os.Exit(0)
	}
	return tftest.AutoInitProviderHelper(name)
}
```
A Go test program uses `TestMain`, when defined, as its entry point, and so the code shown above runs soon after the test program is launched. The `initProviderTesting` helper then makes a decision: if it seems to be running as a child process of Terraform then it will start a plugin protocol server and begin serving requests, but otherwise it will just initialize our singleton `testHelper` object and then run the tests as normal.
`tftest.RunningAsPlugin` exploits the fact that when Terraform launches a program as a plugin it sets a specific environment variable that the child program can use as a heuristic to recognize that situation. Normal Terraform plugin programs just use that to print out a message when not run as a plugin, but our special test program can use it to switch between being a test program and being a plugin.
To complete the trick, `wd.RequireInit` pre-populates the plugins directory within its temporary working directory with a symlink to the test program, named such that Terraform will find it and accept it as a plugin for provider `null`. When the test later runs `wd.RequireApply`, the test program launches Terraform CLI as a child process, which then in turn launches a second copy of the provider's test program as a grandchild process.
Logs and Debugging
An unfortunate consequence of this plugin-launching trick is that the actual provider code is not running in the same process as the test program itself, and so any logging it produces is intercepted by Terraform CLI rather than directly visible in the test logs as before.
This also complicates an already-complicated story when trying to use interactive debuggers like Delve with Terraform providers: unless you want to run three copies of the debugger at once, you must decide whether to attach the debugger to the test program that's actually running the tests, or to Terraform CLI, or to the second copy of the test program that's running the main provider plugin code.
For the logging story, I expect there's a solution where the test harness could run Terraform CLI in a way that causes it to write its logs to a special file descriptor that the test program can then capture and repeat, though that would require some changes to Terraform CLI itself and thus would not be compatible with any already-released Terraform executables, including all of the Terraform 0.11 releases.
I didn't investigate this area further. If we move forward with this testing strategy, this'll be an area where more research is required.
Compatibility with existing acceptance tests
The API I wrote for this prototype is quite low-level, wrapping individual Terraform CLI operations, while real existing provider acceptance tests are written against a considerably higher-level API: the test gives a sequence of configurations to try, and the test framework itself handles the individual refresh, plan, and apply operations for each configuration.
It seems possible in principle to change the implementation of the existing test framework to call into the "working directory" API of `terraform-plugin-test` instead of calling into the Terraform Core internal API, and thus allow existing tests to run unmodified. Any test that relies on the test code and the "real" code running in the same process would fail (e.g. if the test is mocking something via an internal API call), but most existing acceptance tests do not make such assumptions and may "just work".
With that said, there's a significant amount of additional overhead in running Terraform CLI multiple times per test step, and the existing test framework's level of abstraction would require each test to unconditionally call a non-trivial sequence of `terraform-plugin-test` methods:
- `RequireInit` to initialize the working directory
- `RequireCreatePlan` to create an initial plan
- `RequireSavedPlan` to obtain the contents of the saved plan for analysis
- `RequireApply` to apply the saved plan
- `RequireCreatePlan` again to make sure that the apply was convergent
- `RequireSavedPlan` to inspect the saved plan and make sure it's empty
Provider acceptance test suites for the larger providers already take many hours to run, so the additional overhead of launching six Terraform child processes that in turn launch their own plugin child processes may prove too expensive.
The existing acceptance testing framework API is also particularly focused on testing "create", "update", and "destroy" actions. It has no first-class support for test assertions against the generated plan. For many providers the plan implementation is likely to be unit-testable rather than acceptance-testable, but being able to directly test the result of planning would allow us to directly test complex situations in remote APIs where changing one attribute implicitly affects the value of another, etc. Hopefully the provider acceptance testing framework will grow to include explicit plan testing in the future, too.
Next Steps
Along with releasing the Terraform plugin SDK as a separate repository, HashiCorp has also established a new team dedicated to maintaining the SDK and associated provider development tools.
As a consequence, any next steps in this area will not be mine to make. The team might choose to continue research along the same lines as I described above, or may choose to take another direction entirely if the above drawbacks seem insurmountable. Keep an eye on the Terraform Plugin SDK repository for further updates!