Terraform's various providers are essentially adapters from Terraform's "idealized" view of an object lifecycle to the practical reality of actual vendor APIs.
As a result, it became clear early in their development that unit testing alone cannot give complete coverage of their functionality: the changing behaviors of the remote API significantly affect how a provider behaves. Testing end-to-end against real APIs has therefore been an effective way to verify that provider behaviors are not accidentally broken by ongoing maintenance, and to recognize when upstream API changes have inadvertently broken a provider's assumptions.
The Terraform provider teams refer to these end-to-end tests as acceptance tests, and primarily run them just prior to starting work on a system and then again after a change is implemented, to verify that it doesn't inadvertently affect unrelated functionality.
However, because Terraform's providers were originally developed in the same codebase as Terraform itself, the acceptance testing framework was developed as a Go package wrapping Terraform Core's internals, directly calling into the same underlying APIs as Terraform CLI does. When we split providers into separate codebases for Terraform 0.10, the providers began depending on Terraform CLI's repository as an upstream library dependency, and thus retained the acceptance testing functionality through their vendored copies of Terraform Core.
That architecture became particularly problematic during the Terraform 0.12 effort because it meant that each provider repository could only depend on one version of Terraform Core at a time, making it impossible to run the provider acceptance tests against both Terraform 0.11 and Terraform 0.12 without resorting to import rewriting hacks.
In the aftermath of Terraform 0.12, I did some prototyping work around other ways providers might implement acceptance tests such that they are not coupled to any particular release of Terraform Core.
Terraform is not a Go library
Although the layout of Terraform's git repository allows parts of it to be treated as a Go library, the only public interface to Terraform Core is via Terraform CLI commands.
We've not put technical measures in place to block importing of Terraform Core packages from external codebases, but we also don't make any effort to preserve Go API compatibility for these between releases: a change to Terraform Core is free to break any of the internal APIs as long as it simultaneously repairs all of the callers.
Splitting the providers into separate codebases in Terraform 0.10 without first creating a formal architectural boundary between the providers and Terraform Core unfortunately created a de-facto library API out of a subset of Terraform Core functionality. We accepted this for Terraform v0.10, v0.11, and v0.12 just because other work took priority. However, Terraform v0.12 created particular tensions for that accidental architecture because it involved significant changes to Terraform internals and inter-package interfaces.
In order to complete the Terraform v0.12 release without adding significant provider changes to its scope, we left lots of dead code in Terraform Core that is no longer used by Terraform Core itself but is still imported by the providers. By far the most significant bother in that regard is that the provider acceptance testing framework essentially depends on the whole of Terraform Core, because it runs Terraform Core in-process as if it were a Go library.
We recently released a separated SDK for Terraform providers to sever the dependency between providers and Terraform Core, but in its initial incarnation it is effectively a fork of Terraform Core itself in order to support the various existing acceptance tests in published providers.
Acceptance Testing via Terraform CLI
Given that Terraform CLI is the public interface to Terraform, and that acceptance tests are intended as end-to-end tests, I've been investigating the practicality of implementing acceptance testing in terms of official Terraform CLI release executables running as child processes.
The result of that investigation was an experimental Go library called `terraform-plugin-test`.
Since it's just a prototype, it has an API that maps quite directly to the main Terraform CLI operations, and it is thus not compatible with Terraform's existing acceptance testing framework. However, it does illustrate that it is technically feasible to write tests that orchestrate Terraform CLI processes and capture results, by exploiting some new features added in Terraform v0.12 that allow machine inspection of plan and state artifacts.
To exercise it, I wrote a test function for Terraform's `null` provider using this new package, with the following implementation:
```go
package null

import (
	"reflect"
	"testing"
)

func TestResource(t *testing.T) {
	wd := testHelper.RequireNewWorkingDir(t)
	defer wd.Close()

	wd.RequireSetConfig(t, `
resource "null_resource" "test" {
  triggers = {
    a = 1
  }
}
`)
	wd.RequireInit(t)
	defer wd.RequireDestroy(t)
	wd.RequireApply(t)

	state := wd.RequireState(t)
	if got, want := len(state.Values.RootModule.Resources), 1; got != want {
		t.Fatalf("wrong number of resource instance objects in state %d; want %d", got, want)
	}
	instanceState := state.Values.RootModule.Resources[0]
	if got, want := instanceState.Type, "null_resource"; got != want {
		t.Errorf("wrong resource type in state\ngot: %s\nwant: %s", got, want)
	}
	if got, want := instanceState.AttributeValues["triggers"], map[string]interface{}{"a": "1"}; !reflect.DeepEqual(got, want) {
		t.Errorf("wrong 'triggers' value\ngot: %#v (%T)\nwant: %#v (%T)", got, got, want, want)
	}

	// Now we'll plan without changing anything. That should produce an empty plan.
	wd.RequireCreatePlan(t)
	plan := wd.RequireSavedPlan(t)
	if got, want := len(plan.ResourceChanges), 1; got != want {
		t.Fatalf("wrong number of resource changes in plan %d; want %d", got, want)
	}
	instanceChange := plan.ResourceChanges[0]
	if got, want := instanceChange.Type, "null_resource"; got != want {
		t.Errorf("wrong resource type in plan\ngot: %s\nwant: %s", got, want)
	}
	if !instanceChange.Change.Actions.NoOp() {
		t.Errorf("wrong action in plan\ngot: %#v\nwant no-op", instanceChange.Change.Actions)
	}

	// Now we'll change the triggers, which should cause null_resource to show
	// as needing replacement in the plan.
	wd.RequireSetConfig(t, `
resource "null_resource" "test" {
  triggers = {
    a = 2
  }
}
`)
	wd.RequireCreatePlan(t)
	plan = wd.RequireSavedPlan(t)
	if got, want := len(plan.ResourceChanges), 1; got != want {
		t.Fatalf("wrong number of resource changes in plan %d; want %d", got, want)
	}
	instanceChange = plan.ResourceChanges[0]
	if got, want := instanceChange.Type, "null_resource"; got != want {
		t.Errorf("wrong resource type in plan\ngot: %s\nwant: %s", got, want)
	}
	if !instanceChange.Change.Actions.DestroyBeforeCreate() {
		t.Errorf("wrong action in plan\ngot: %#v\nwant destroy then create (replace)", instanceChange.Change.Actions)
	}

	// For good measure, we'll apply that saved plan to make sure the destroy
	// works (should be a no-op, but successful).
	wd.RequireApply(t)
}
```
Behind the scenes, the methods of this `wd` object are manipulating a temporary directory. `RequireSetConfig` writes a configuration file into that directory (or fails the test if it cannot), `RequireInit` runs `terraform init`, `RequireApply` runs `terraform apply`, etc.
Some initialization code not shown here prepares the global `testHelper` object that knows where to find a `terraform` executable to run, and then `RequireNewWorkingDir` establishes the temporary working directory and places a minimal set of files inside to ensure that within this particular working directory the `null` provider is interpreted as the current code we're working on.
Test Program as Plugin Server
Terraform plugins are child processes that are launched by Terraform CLI itself and speak a Terraform-specific RPC protocol over a local socket in order to integrate into the Terraform operation lifecycle.
Normally when testing Terraform provider changes against a real Terraform CLI executable we must separately run `go install` to build a binary of the provider and make sure it's in one of the locations where Terraform CLI will discover it. That adds extra overhead to the develop/test cycle, and a key advantage of acceptance tests is shortening that cycle by directly testing the current provider code in a dev environment, implicitly built by the `go test` command.
To preserve that benefit, `terraform-plugin-test` employs a sneaky trick: the test program itself is able to initialize as a Terraform provider plugin. The `go test` command internally builds a temporary executable containing the test code for the target package and then runs that executable, capturing its output. Aside from specific expectations about what it will write to stdout and how it will exit, the test program is just a normal executable, and can in principle do anything a normal program could do.
My modified `null` provider includes the following special `TestMain` in its test package:
```go
func TestMain(m *testing.M) {
	testHelper = initProviderTesting("null", Provider)
	status := m.Run()
	testHelper.Close()
	os.Exit(status)
}

// initProviderTesting is a helper that would ideally be provided by the SDK
// itself if the SDK were to adopt tftest as its testing mechanism, but it
// lives here for now for ease of prototyping.
func initProviderTesting(name string, providerFunc plugin.ProviderFunc) *tftest.Helper {
	if tftest.RunningAsPlugin() {
		// The test program is being re-launched as a provider plugin via our
		// stub program.
		plugin.Serve(&plugin.ServeOpts{
			ProviderFunc: providerFunc,
		})
		os.Exit(0)
	}
	return tftest.AutoInitProviderHelper(name)
}
```
A Go test program uses `TestMain`, when defined, as its entry point, and so the code shown above runs soon after the test program is launched. The `initProviderTesting` helper then makes a decision: if it seems to be running as a child process of Terraform then it will start a plugin protocol server and begin serving requests, but otherwise it will just initialize our singleton `testHelper` object and then run the tests as normal.
`tftest.RunningAsPlugin` exploits the fact that when Terraform launches a program as a plugin it sets a specific environment variable that the child program can use as a heuristic to recognize that situation. Normal Terraform plugin programs just use that to print out a message when not run as a plugin, but our special test program can use it to switch between being a test program and being a plugin.
To complete the trick, `wd.RequireInit` pre-populates the plugins directory within its temporary working directory with a symlink to the test program, named such that Terraform will find it and accept it as a plugin for provider `null`. When the test later runs `wd.RequireApply`, the test program launches Terraform CLI as a child process, which then in turn launches a second copy of the provider's test program as a grandchild process.
Logs and Debugging
An unfortunate consequence of this plugin-launching trick is that the actual provider code is not running in the same process as the test program itself, and so any logging it produces is intercepted by Terraform CLI rather than directly visible in the test logs as before.
This also complicates an already-complicated story when trying to use interactive debuggers like Delve with Terraform providers: unless you want to run three copies of the debugger at once, you must decide whether to attach the debugger to the test program that's actually running the tests, or to Terraform CLI, or to the second copy of the test program that's running the main provider plugin code.
For the logging story, I expect there's a solution where the test harness could run Terraform CLI in a way that causes it to write its logs to a special file descriptor that the test program can then capture and repeat, though that would require some changes to Terraform CLI itself and thus would not be compatible with any already-released Terraform executables, including all of the Terraform 0.11 releases.
I didn't investigate this area further. If we move forward with this testing strategy, this'll be an area where more research is required.
Compatibility with existing acceptance tests
The API I wrote for this prototype is quite low-level, wrapping individual Terraform CLI operations, while real existing provider acceptance tests are written against a considerably higher-level API: the test gives a sequence of configurations to try, and the test framework itself handles the individual refresh, plan, and apply operations for each configuration.
It seems possible in principle to change the implementation of the existing test framework to call into the "working directory" API of `terraform-plugin-test` instead of calling into the Terraform Core internal API, and thus allow existing tests to run unmodified. Any test that relies on the test code and the "real" code running in the same process would fail (e.g. if the test is mocking something via an internal API call), but most existing acceptance tests do not make such assumptions and may "just work".
With that said, there's a significant amount of additional overhead in running Terraform CLI multiple times per test step, and the existing test framework's level of abstraction would require each test to unconditionally call a non-trivial sequence of `terraform-plugin-test` methods:
- `RequireInit` to initialize the working directory
- `RequireCreatePlan` to create an initial plan
- `RequireSavedPlan` to obtain the contents of the saved plan for analysis
- `RequireApply` to apply the saved plan
- `RequireCreatePlan` again to make sure that the apply was convergent
- `RequireSavedPlan` to inspect the saved plan and make sure it's empty
Provider acceptance test suites for the larger providers already take many hours to run, so the additional overhead of launching six Terraform child processes that in turn launch their own plugin child processes may prove too expensive.
The existing acceptance testing framework API is also particularly focused on testing "create", "update", and "destroy" actions. It has no first-class support for test assertions against the generated plan. For many providers the plan implementation is likely to be unit-testable rather than acceptance-testable, but being able to directly test the result of planning would allow us to directly test complex situations in remote APIs where changing one attribute implicitly affects the value of another, etc. Hopefully the provider acceptance testing framework will grow to include explicit plan testing in the future, too.
Next Steps
Along with releasing the Terraform plugin SDK as a separate repository, HashiCorp has also established a new team dedicated to maintaining the SDK and associated provider development tools.
As a consequence, any next steps in this area will not be mine to make. The team might choose to continue research along the same lines as I described above, or may choose to take another direction entirely if the above drawbacks seem insurmountable. Keep an eye on the Terraform Plugin SDK repository for further updates!