Tech Debt in Terraform's Plugin SDK

A few months ago, I posted a bit of a Terraform 0.12 retrospective, describing from my perspective (and reflecting only my own views, not those of my team or employer) the various steps and unexpected challenges along the way.

This article is, in a sense, a continuation of that: Terraform v0.12 development hit some additional wrinkles in the subsequent months which we are now emerging from. As before, this article is entirely my own perspective and contains no opinions of HashiCorp nor of anyone else on the Terraform team.

The Terraform Plugin SDK

I work primarily on Terraform Core, the execution engine that implements Terraform's standard workflow, but all of the interesting actions taken while executing a Terraform plan are the domain of Terraform providers, which are implemented as out-of-process plugins by a growing number of people who have deep expertise with the various cloud platforms and other systems that Terraform can interact with.

Terraform Core interacts with these provider plugins using a local RPC protocol over a socket. This architecture is intended to insulate Terraform Core from any possible crashing bugs in the plugins, ensuring that a Terraform operation can still exit cleanly and commit partially-updated state in the event of a bug in a plugin.

The Plugin SDK, then, is the "server-side" implementation of this protocol and a bridge to the provider code itself. We refer to the plugins as the "server" because from the perspective of our local socket connection that is accurate: Terraform Core launches the plugin, waits for it to create a listen socket, and then connects to that socket. For similar reasons, we consider Terraform Core to be the "client" in this protocol.
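The launch, listen, and connect sequence described above can be sketched in a few lines of Go. This is only an illustration under loud assumptions: a goroutine stands in for the separately launched plugin process, a channel stands in for however the plugin announces its listen address, and the one-line text exchange is a placeholder for the real RPC protocol. The names launchPlugin, callPlugin, and announcement are hypothetical and do not come from Terraform or its SDK.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// announcement is a hypothetical message the "plugin" sends once it is
// listening, or once it has failed to start.
type announcement struct {
	addr string
	err  error
}

// launchPlugin plays the server role. A goroutine stands in for the plugin
// process: it opens a listen socket, announces the address, and then serves
// exactly one line-based request.
func launchPlugin() <-chan announcement {
	ready := make(chan announcement, 1)
	go func() {
		ln, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			ready <- announcement{err: err}
			return
		}
		defer ln.Close()
		ready <- announcement{addr: ln.Addr().String()}

		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		line, err := bufio.NewReader(conn).ReadString('\n')
		if err != nil {
			return
		}
		fmt.Fprintf(conn, "plugin handled: %s", line)
	}()
	return ready
}

// callPlugin plays the client role: launch the plugin, wait for its listen
// socket to exist, connect to it, and make one request.
func callPlugin(request string) (string, error) {
	a := <-launchPlugin()
	if a.err != nil {
		return "", a.err
	}
	conn, err := net.Dial("tcp", a.addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := fmt.Fprintf(conn, "%s\n", request); err != nil {
		return "", err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(reply), nil
}

func main() {
	reply, err := callPlugin("plan")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(reply)
}
```

Even in this toy form, the side that starts everything (the caller of callPlugin) is the one that dials, which is why it ends up being the "client" from the socket's point of view.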

Provider plugins written in Go use the Plugin SDK as a library, then implement service-provider-specific behavior in terms of the service provider's own SDK, often reaching out to remote hosts over the network. The SDK raises the level of abstraction above that of the wire protocol, making providers more convenient to implement and, as we will see, imposing some conventions that were not enforced by Terraform Core itself until v0.12.

In the early days

Terraform's provider model was established very early on. The earliest release tag we have in GitHub is v0.1.0 from back in July 2014, and by then the provider plugin model was already in place and several plugins had been implemented, although the providers were still maintained within the main Terraform repository and bundled in Terraform's distribution archives.

Until development of the SDK started in 2014, providers were implemented directly against the protocol's RPC stub interface. In Terraform's early life prior to this, the exact scope of Terraform was not as well-defined and so the plugin protocol was, essentially, a direct extension of Terraform Core's own workflow, sharing many of the same Go types, and as a result depending on a number of implementation details.

I consider this a good example of a careful application of technical debt: exposing Terraform Core's expectations directly avoided imposing any additional assumptions that may make later development more difficult. Instead, the Terraform team of the time elected to wait and learn more about the project's needs and scope.

In practice this caused no immediate problem: providers were slightly harder to develop, but since they were all at that point being developed by the same core team and being released in lock-step with Terraform itself, the team could ensure that all of the providers followed similar conventions and make any adjustments needed as the assumptions evolved.

Once the Terraform Core assumptions became relatively stable, the early versions of the SDK (known then just as helper/schema, its package path in the repository) began to impose certain assumptions on the plugin side, leaving the protocol itself still just a raw representation of Terraform's needs while avoiding the need to carefully re-implement the same conventions across many different providers.

It turns out that the SDK was the component responsible for a number of behaviors that users grew to take for granted as standard Terraform behavior, even though they were actually just convention from Terraform's perspective. For example, the SDK was responsible for the ability to use both the arguments and exported attributes from a resource in interpolation expressions; Terraform Core imposed no such correlation of attributes between the configuration, the diff, and the state.

This architecture did a great job of keeping our options open while we learned more about the problem space and introduced additional features to the RPC protocol, with the SDK often acting as a useful abstraction layer to avoid making updates to every single provider whenever the protocol changed.

The core/provider split

Technical debt, much like the financial instrument it uses as metaphor, gives us increased leverage in the short term in return for greater cost in the long term. The longer the debt remains "unpaid", the greater the cost. The challenge, then, is in deciding when is the right time to pay off that debt.

As the number and size of providers grew, it became clear that having a single team maintaining all of Terraform Core and all providers in a single repository was not scaling well. Organizational changes to better partition the work went a long way to help, but eventually (in early 2017) it became clear that providers needed to have a release lifecycle independent of each other and of Terraform Core.

Some providers move very quickly, possibly producing multiple releases per week, while others are just "done" and never change. Terraform Core has been between these two extremes: development has been constant, but the changes tend to be more involved and released in larger chunks of work in order to minimize workflow disruption for users.

Terraform 0.10 introduced for the first time the idea of automatically installing providers downloaded separately from Terraform itself, along with a means for users to specify which versions of each provider their configurations were compatible with. We achieved that largely just by packaging the same plugin binaries into their own distribution archives; the binaries themselves were largely unchanged from what had been bundled with Terraform in earlier versions.

This change of approach was an enormous benefit in unleashing Terraform's ecosystem. As I write this, there are dozens of different developers working on different Terraform providers, across many different companies. Some providers are maintained by engineers who work for the vendor the provider corresponds to, others are maintained by HashiCorp in partnership with a vendor, and many more are maintained by third-party developers who graciously contribute their work to be used by the broader community.

Unfortunately, in the excitement of making this change, we missed a critical point in the repayment of our technical debt: the tight coupling between Terraform Core and providers in the plugin protocol worked just fine when everything was released and upgraded on the same cadence, but once provider releases took on a life of their own we realized too late that future changes to the plugin protocol would now be much harder to achieve.

Drift of conventions

The concrete wire protocol used by providers was one objective limitation of how we split out provider development. Another more subjective effect was that decentralizing provider development over so many different teams meant that over time some conventions began to drift.

While the Plugin SDK did enforce certain assumptions, it too had quite a lot of flexibility in its design, for similar reasons of keeping options open. Early provider developers had implicitly followed certain conventions of which features to use in which combinations, how to name arguments, how closely to follow external API design vs. making Terraform-specific adjustments, etc.

Without that informal central design effort, different provider teams began to combine SDK features in new and interesting ways. Differing assumptions in upstream APIs began to surface as subtle differences in style between the providers, and various SDK features that had been added into the mix over the years developed some very interesting and complex emergent behaviors when combined.

With Terraform Core essentially allowing providers free rein, the SDK itself was the only enforcer of shared conventions. Different tradeoffs in provider design led to Terraform Core interpreting results in subtly different ways. Some of these results seemed reasonable at first encounter, but in many cases we later found them to cause usability issues or outright buggy behavior.

When Terraform 0.12 introduced for the first time a consistent type system within Terraform Core itself, Terraform Core was required to take a more opinionated stance on what is considered "correct" provider behavior so that it could in turn rely on those assumptions to implement the new configuration language features and fix certain bugs that had resulted from SDK and provider inconsistencies.

Plugin protocol version 5

As discussed in my earlier article, the configuration language improvements for Terraform 0.12 required ensuring that values can be passed to and returned from providers in a way that doesn't lose type information. This required a significant redesign of Terraform's plugin RPC protocol.

The plugin protocol had a simple versioning mechanism at its introduction, where the client and each server must simply agree on a version integer in order to begin communication. This mechanism was used only to return an explicit error when an incompatibility was detected, to avoid any strange or dangerous behavior that may arise if actually attempting to communicate with mismatched protocols.

By the time we did the core/provider split in Terraform 0.10, the protocol was already at version 4. The protocol did not change at all throughout the 0.10 and 0.11 series, and so it took until 0.12 for us to need to contend with a protocol change in the post-provider-split world. The new protocol introduced in Terraform 0.12 has version number 5.

Knowing that many users would not be able to upgrade to Terraform 0.12 immediately, we wanted to ensure that ongoing provider releases would be usable both with Terraform 0.11 and 0.12, and thus for the first time we needed to offer dual-version plugins. Along with protocol 5 then, we introduced a new negotiation mechanism which allows the plugin client to indicate which versions it supports and invite the server to choose a single mutually-supported version for communication. When a plugin is launched by a client that doesn't support this negotiation protocol, it defaults to protocol 4 under the assumption that it is talking to Terraform 0.10 or 0.11.
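As a rough sketch of the negotiation just described, the server-side choice might look something like the following Go function. The function name, the use of plain integer slices, and the rule of preferring the highest shared version are all assumptions made for illustration; the text above only says that the server picks a single mutually-supported version and falls back to protocol 4 when the client does not negotiate.

```go
package main

import (
	"errors"
	"fmt"
)

// legacyProtocolVersion is the version assumed when the client offers no
// list of versions at all, i.e. it predates negotiation.
const legacyProtocolVersion = 4

// chooseProtocolVersion is a hypothetical helper. clientVersions is empty
// when the client does not support negotiation. Preferring the highest
// shared version is an assumption of this sketch.
func chooseProtocolVersion(clientVersions, serverVersions []int) (int, error) {
	if len(clientVersions) == 0 {
		return legacyProtocolVersion, nil
	}
	offered := make(map[int]bool, len(clientVersions))
	for _, v := range clientVersions {
		offered[v] = true
	}
	best := -1
	for _, v := range serverVersions {
		if offered[v] && v > best {
			best = v
		}
	}
	if best < 0 {
		return 0, errors.New("no mutually supported protocol version")
	}
	return best, nil
}

func main() {
	// A dual-version plugin talking to a client that offers 4 and 5.
	fmt.Println(chooseProtocolVersion([]int{4, 5}, []int{4, 5}))
	// The same plugin launched by a client that does not negotiate.
	fmt.Println(chooseProtocolVersion(nil, []int{4, 5}))
	// A client and plugin with nothing in common.
	fmt.Println(chooseProtocolVersion([]int{5}, []int{4}))
}
```

The key difference from the original single-integer check is that a mismatch is no longer the only possible outcome of disagreement: a plugin that speaks both protocols can serve either generation of client.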

Another big part of the problem here was ensuring that the SDK itself could work with both protocols. It had never had to deal with a dual-protocol situation before and was not designed with that in mind, so to avoid significant changes to providers' own source code we implemented a "shim" layer that translates protocol 5 requests into protocol-4-style operations (using the various legacy types from Terraform Core) and then translates the responses back to protocol 5 to return.
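To make the shape of such a shim concrete, here is a deliberately tiny Go sketch. It assumes, purely for illustration, that the legacy side sees every attribute as a string while the new side carries typed values, and that a schema is available to recover types on the way back. All of the names (AttrType, legacyHandler, toLegacy, fromLegacy, shim) are hypothetical and are not the real SDK or protocol types.

```go
package main

import (
	"fmt"
	"strconv"
)

// AttrType is a hypothetical, two-member stand-in for a type system.
type AttrType int

const (
	TypeString AttrType = iota
	TypeNumber
)

// legacyHandler stands in for existing provider code that only understands
// a flat map of strings.
type legacyHandler func(attrs map[string]string) map[string]string

// toLegacy flattens typed values into strings, discarding type information.
func toLegacy(typed map[string]interface{}) (map[string]string, error) {
	flat := make(map[string]string, len(typed))
	for k, v := range typed {
		switch tv := v.(type) {
		case string:
			flat[k] = tv
		case int:
			flat[k] = strconv.Itoa(tv)
		default:
			return nil, fmt.Errorf("attribute %q: unsupported type %T", k, v)
		}
	}
	return flat, nil
}

// fromLegacy uses the schema to turn the legacy strings back into typed values.
func fromLegacy(flat map[string]string, schema map[string]AttrType) (map[string]interface{}, error) {
	typed := make(map[string]interface{}, len(flat))
	for k, s := range flat {
		t, ok := schema[k]
		if !ok {
			return nil, fmt.Errorf("attribute %q is not in the schema", k)
		}
		switch t {
		case TypeNumber:
			n, err := strconv.Atoi(s)
			if err != nil {
				return nil, fmt.Errorf("attribute %q: %w", k, err)
			}
			typed[k] = n
		default:
			typed[k] = s
		}
	}
	return typed, nil
}

// shim wraps a legacy handler so that it can be called with typed values.
func shim(h legacyHandler, schema map[string]AttrType) func(map[string]interface{}) (map[string]interface{}, error) {
	return func(req map[string]interface{}) (map[string]interface{}, error) {
		flat, err := toLegacy(req)
		if err != nil {
			return nil, err
		}
		return fromLegacy(h(flat), schema)
	}
}

func main() {
	schema := map[string]AttrType{"id": TypeString, "name": TypeString, "count": TypeNumber}
	legacy := func(attrs map[string]string) map[string]string {
		return map[string]string{"id": "example-" + attrs["name"], "name": attrs["name"], "count": attrs["count"]}
	}
	resp, err := shim(legacy, schema)(map[string]interface{}{"name": "web", "count": 3})
	fmt.Println(resp, err)
}
```

The round trip only works when the schema and the legacy handler agree about what every string means, which hints at why a shim is sensitive to any quirk in how individual resources used the old representation.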

Shim explosion

At the time I wrote my previous article, we had largely finished development of Terraform Core v0.12 (some remaining bugs notwithstanding) and we thought we were in good shape with the SDK shims to protocol 5. Initial manual testing against hand-made builds of existing providers gave us the sense that things were working well, and so we were getting ready to begin making provider releases that were compatible with protocol 5.

Unfortunately, as we worked with the various provider teams to try the upgrade and run the existing acceptance tests, it became clear that we actually had a rather intractable problem: the flexibility the SDK offered meant that similar results could be obtained in multiple ways, and that combined with outright bugs in the old SDK (which we had originally assumed were as-yet-unexplained Core bugs) to create a situation where there was no possible shim implementation that would preserve the emergent behavior of all resources across all providers.

The provider developers themselves did nothing wrong here: they used their best judgement to map lots of very different upstream APIs into Terraform's shared language via the abstractions offered in the SDK. The SDK's abstraction was designed to be flexible to maximize the ability to meet new requirements, but intentional flexibility was often not distinguishable from buggy behavior, and bugs that did emerge were reported so far from their root causes that we hadn't yet identified the SDK as the root cause.

It was at this point that our two different forms of technical debt came to collect: designing the old protocol around Terraform's implementation details required the new protocol to deviate quite a lot from it, making the shimming behaviors non-trivial. By retaining flexibility in the SDK rather than making it increasingly opinionated, we allowed mutually-incompatible assumptions to emerge that had to be decided either one way or the other in order to shim.

Related to this, we also learned late in the process about some unintended flexibility in how Terraform 0.11 mapped configuration to a data structure for the provider to consume. Combined with the previously-mentioned SDK flexibility, it allowed some unexpected configuration approaches that some providers had come to rely on, and which the intentionally-more-constrained new language can no longer support exactly, even with shims in place.

As a result, we're now forced to require certain changes to providers that we did not expect at the outset. Investigating many different providers and the differing assumptions they made has been our main task since my last article in October. There are still some details left to figure out as I write this, but we've now got a much stronger understanding of the scope of these differences and, as a nice bonus, we can finally explain the root cause of a number of strange bugs that have been cropping up intermittently for years!

Conclusion

For me, the big learning in this addendum to the Terraform 0.12 process is that technical debt requires careful management.

I totally agree with the tradeoffs made in Terraform's early life, since they are what allowed Terraform to grow its capabilities so quickly and to react to user feedback. With that said, as people come and go from teams and priorities shift, it can be easy to lose track of exactly what debt remains unpaid, and thus to miss the prime opportunity to repay that debt.

With the benefit of twenty-twenty hindsight it would be easy to look back now and say that Terraform Core should've been enforcing a set of conventions from the outset in order to ensure a consistent user experience, but those decisions were made in a very different context and with a lot more uncertainty.

Instead, I wish that we'd been more aware of these items of debt when we were planning the core/provider split. Had we been more conscious of that accrued "interest", we might've implemented our protocol version negotiation mechanism earlier, and put in place sociotechnical means to guide provider developers towards a more consistent set of assumptions and design tradeoffs.

That, for me, is the most relevant insight I can take with me to future projects. Not that technical debt is universally bad (it's not!), but that we need to be more deliberate in accounting for it so that we can be on the lookout for good opportunities to pay it down.