The Long Story of Terraform v0.12

Terraform v0.12 was the longest major release development cycle in Terraform's history. Depending on how you count it, the total elapsed time for this release is anywhere from nine months to over a year. The release was originally pencilled in for some time in April 2018 but as I write this in October 2018 we have only just released the first alpha build. What gives?

A variety of factors contributed to the protracted development time for this release, many of which were purely technical but some of which were more sociotechnical concerns. This article is a personal retrospective on the v0.12 release development process. As always with my development log, the views in this article are my own, and not those of HashiCorp or of anyone else on the Terraform team.

Before v0.12

The primary theme of the v0.12 release cycle was improving Terraform's configuration language to make it more readable and to add some new capabilities that had been frequently requested in prior releases. It was also an opportunity to fix some of the bugs and inconsistencies in the language implementations in prior versions.

However, work in this area actually began long before the start of the v0.12 development cycle: Back in Terraform 0.7 — some time before I joined the Terraform team at HashiCorp — the team had attempted an incremental improvement to introduce first-class support for lists and maps, where before these had been essentially “faked” using formatting conventions against strings.

During that work it became clear that Terraform's old assumptions about configuration value types ran very deep, and so v0.7 introduced some support for lists and maps, but with several limitations that users quickly found frustrating: most functions could deal only with shallow lists of strings or maps of strings, and use of these types was hamstrung by the lack of features in the language itself for transforming and filtering these collection types.

At the conclusion of the work in v0.7, Terraform internally had at least five different representations of types and values, with various differences between them:

Structures written directly in HCL syntax were represented as a subset of Go types passed around as interface{} values, with values each having one of the concrete types string, int, float, bool, []interface{}, or map[string]interface{}.
Terraform later took strings from the HCL result and passed them through HIL (the separately implemented string interpolation language), which internally used a slightly different subset of Go types than HCL, with an additional layer of its own type specifications specializing these. HIL values could be strings, integers, floats, bools, lists (of any single type), or maps (of any single type). HIL also included a special "unknown" type that was used to represent values that Terraform will not know until the “apply” phase.
To preserve the assumptions in other parts of Terraform in the 0.7 release, HIL always converted any primitive-typed value to a string on exit from its own interpreter, creating another representation which only supports strings, lists, and maps.
Terraform State and Plan in v0.7 retained their historical representation as a map[string]string in Go, which as of v0.7 was known as the "flatmap" representation. It supported list and map structures by using conventions with dot-separated keys in the map and then attempting to recover the original structure by interpreting those naming conventions. All leaf values were converted to strings if they were not already, and maps could only be maps of strings due to the naming collision that would otherwise be caused by map keys containing the dot character.
Terraform's plugin development SDK — the helper/schema package in the Terraform Core repository as of v0.7 — has its own schema definition mechanism that attempts to abstract over all of the others listed above, so that provider developers don't need to separately handle each one. The SDK's type system includes object types (the resource types themselves and their nested block structures), strings, integers, floats, bools, list of any single type, set of any single type, and map of any single type.

As you might imagine, Terraform was performing a lot of conversions between the different representations here, in many cases needing to make lossy mapping decisions such as converting numbers and bools to strings. The rough edges between these were the cause of some very strange bugs, including GitHub issue #13512 which arose from a particularly gross chain of conversions.
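To make the lossy "flatmap" conversions concrete, here is a minimal sketch in Python (Terraform itself is written in Go, and this is not its actual implementation). The function name flatten is hypothetical, and the ".#" and ".%" count markers are an assumption about the naming convention that should be read as illustrative.

```python
def flatten(key, value, out=None):
    """Flatten nested lists and maps into a flat dict of string keys and string values."""
    out = {} if out is None else out
    if isinstance(value, dict):
        out[f"{key}.%"] = str(len(value))  # assumed element-count marker for maps
        for k, v in value.items():
            flatten(f"{key}.{k}", v, out)
    elif isinstance(value, list):
        out[f"{key}.#"] = str(len(value))  # assumed element-count marker for lists
        for i, v in enumerate(value):
            flatten(f"{key}.{i}", v, out)
    elif isinstance(value, bool):
        out[key] = "true" if value else "false"
    else:
        out[key] = str(value)  # every leaf becomes a string
    return out


# Lossy leaves: the number 8080 and the string "8080" flatten identically,
# so the original type cannot be recovered from the flat form alone.
as_number = flatten("port", 8080)
as_string = flatten("port", "8080")

# Ambiguous keys: a map key containing a dot yields the same leaf key
# as a nested map would.
dotted = flatten("tags", {"a.b": "x"})
nested = flatten("tags", {"a": {"b": "x"}})
```

Both as_number and as_string end up as a single "port" entry holding a string, and both dotted and nested contain the key "tags.a.b", which is the kind of collision that restricted maps to string values.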

It was not long after the v0.7 release that I started taking an interest in continuing this work to smooth out Terraform's language implementations. The Terraform community had discovered a number of great techniques and workarounds to do more with Terraform than it was originally intended to address, and I wanted to try to make some of those discoveries first-class in the language.

Still an open source contributor at the time, I got off to a good start with incremental work to improve the HIL parser, in which I also stated the next four incremental steps I wanted to take from here and saw some agreement that these were desirable improvements.

After this I managed to make several more incremental steps, including more intuitive operator precedence and boolean operations. The latter of these already started to hit against more fundamental limitations though: HIL's interpreter design did not allow for the control flow required to evaluate only one of the two result expressions in a conditional, and HIL's type system did not retain element type information for lists and maps. Together these left the conditional operator constrained to choosing between two primitive-valued expressions that must both evaluate with no errors.
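The control-flow limitation can be illustrated with a small Python sketch. This is not HIL's interpreter. The names eager_if and lazy_if are hypothetical and only contrast evaluating both branches up front with evaluating just the selected one.

```python
def eager_if(cond, true_val, false_val):
    # Both branch values were already computed by the caller, so an error
    # in the branch that is not selected still aborts the whole expression.
    return true_val if cond else false_val


def lazy_if(cond, true_thunk, false_thunk):
    # Only the selected branch is evaluated.
    return true_thunk() if cond else false_thunk()


items = []


def first_item():
    return items[0]  # raises IndexError when the list is empty


def fallback():
    return "none"


# Lazy evaluation never touches first_item() because the condition is false.
lazy_result = lazy_if(len(items) > 0, first_item, fallback)

# Eager evaluation fails while computing the unused branch.
try:
    eager_if(len(items) > 0, first_item(), fallback())
    eager_failed = False
except IndexError:
    eager_failed = True
```

An interpreter that works like eager_if cannot support the common pattern of guarding an expression that might fail with a condition that checks whether it is safe to evaluate.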

I eventually attempted to "fix" the HIL type system to track more information (in HIL PR #42) but that change ended up so invasive to the HIL API that it could not be merged. It also only addressed the type system issues within HIL itself, and did nothing about the other four representations that each had their own strange quirks.

It was at this point that I started building cty, which is in some sense a formalization of the idea of using a subset of Go types to represent a dynamic language type system, but built in such a way that all of the invariants of that system are enforced and callers interact with it via methods. My original intent was to start by retrofitting this into HIL similarly to my earlier PR — accepting the HIL API breaks that seemed inevitable by this point — and then gradually update the other four representations in Terraform to use this too, eventually removing all of the lossy conversions.
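As a rough illustration of what "invariants enforced, with callers interacting via methods" can mean, here is a minimal Python sketch. It is not cty's API: the Value class, its fields, and its methods are all hypothetical, and the rule that unknown operands produce an unknown result is an assumption made for the example.

```python
from dataclasses import dataclass

_UNKNOWN = object()  # sentinel for "not known until apply"
_PY_TYPES = {"string": str, "number": (int, float), "bool": bool}


@dataclass(frozen=True)
class Value:
    type_name: str  # "string", "number", or "bool"
    raw: object

    def __post_init__(self):
        # Invariant: a known value always matches its declared type.
        if self.type_name not in _PY_TYPES:
            raise TypeError(f"unsupported type {self.type_name!r}")
        if self.raw is _UNKNOWN:
            return
        if self.type_name == "number" and isinstance(self.raw, bool):
            raise TypeError("bool is not a number")
        if not isinstance(self.raw, _PY_TYPES[self.type_name]):
            raise TypeError(f"{self.raw!r} is not a {self.type_name}")

    @classmethod
    def unknown(cls, type_name):
        return cls(type_name, _UNKNOWN)

    def is_known(self):
        return self.raw is not _UNKNOWN

    def add(self, other):
        if self.type_name != "number" or other.type_name != "number":
            raise TypeError("add requires two numbers")
        if not (self.is_known() and other.is_known()):
            # The type is still tracked even though the value is not known.
            return Value.unknown("number")
        return Value("number", self.raw + other.raw)


total = Value("number", 1).add(Value("number", 2))
pending = Value("number", 1).add(Value.unknown("number"))
```

Because values can only be built through the checked constructor, code that receives a Value does not need to re-validate it, which is the property that passing bare interface{} values around cannot give.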

I had the basic ideas of cty fleshed out in early 2017, and began to try to integrate it into HIL. However, I ran into many of the same challenges as my previous PR, and eventually abandoned this idea. Instead, I started to prototype what I was then calling zcl, which is in a sense a "spiritual successor" for HCL and HIL, incorporating the language design ideas from both but with a new implementation that is built around cty concepts from the ground up and merging the two languages together.

In these early days I didn't have a concrete idea whether Terraform would ever make use of this new implementation or whether instead I was prototyping for some eventual improvements to the existing HCL and HIL engines. But I anticipated that such broad changes would be controversial and so I made a point of sharing my design notes with Mitchell Hashimoto (at that point the Terraform lead along with his numerous other responsibilities!) to get a sense of where his reaction was on a scale from "definitely not" to "that seems interesting". (It was the latter!)

Not long after this the story took an interesting direction in that HashiCorp reached out to see if I would be interested in joining the Terraform team full-time. At that moment I was on a short break while I moved across the country, but eventually I accepted and joined the team in April 2017.

Work on cty and zcl went cold for a while after I joined as I got settled in to the team and worked on other things, but we all knew that the Terraform language needed further improvements and so it wasn't long before the conversation shifted to how we could get from the zcl prototype to a worthy major update to HCL itself, which in the interests of reducing internal confusion we began to call "HCL 2".

Terraform v0.11's focus was on improving the module concept in the Terraform language, which is a feature that is implemented within Terraform itself rather than in the underlying HCL or HIL libraries. This gave some runway for us to figure out how we might gracefully transition from the original HCL and HIL implementations over to "HCL 2", which eventually became the focus for Terraform v0.12. For notational ease I'm going to henceforth refer to the original HCL implementation as "HCL 1".

Gradual Repair

Those who were paying close attention to Terraform GitHub issues towards the end of the v0.11 release cycle may remember me making several vague comments about transitioning to a new configuration language implementation, sometimes also mentioning a plan to produce an "opt-in preview release" that we'd originally hoped to do in the later releases in the v0.11 line.

That was the public-facing manifestation of an effort to swap all HCL and HIL calls for equivalent "HCL 2" calls only within Terraform's configuration layer, leaving the rest of Terraform Core unchanged while we'd get to gather feedback on the language changes before launching it "for real" in v0.12.0.

The opt-in HCL 2 parser was actually merged into Terraform as early as v0.10.7, though omitted from released builds using a compile-time flag because it was not able to progress past the static validation stage. In later updates we "shimmed" the decoded values from HCL 2 to approximately match the HCL and HIL config representations described in the previous section, and got that passing down into Terraform Core smuggled inside the old "raw config" type.

This direction seemed promising at first, with Terraform's basic functionality operating in a contrived test environment using mocked providers. However, it wasn't long before we ran into our first two points of friction:

In order to deal with the JSON syntax ambiguities that plagued Terraform's use of HCL 1, the HCL 2 API requires a description of the expected configuration structure to be provided by the calling application during decoding. Although Terraform Providers built with the SDK do have a schema, in Terraform v0.11 and prior that schema is not visible to Terraform Core.
The RPC protocol Terraform used to communicate with provider plugins used the Go net/rpc package to serialize and transmit the exact same configuration struct types used internally by Terraform Core, which meant that their exact internal structure was effectively part of the wire protocol and could therefore not be changed without breaking compatibility.

Both of these issues indicated that we would need to make breaking changes to the provider RPC protocol in order to complete the transition. From a user's standpoint, the effect is that already-released providers would not be compatible with this new release.
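The coupling in the second point can be sketched in Python as follows. This is an analogy only, using JSON rather than Go's net/rpc encoding, and every class, field, and function name here is hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ConfigV1:  # hypothetical internal structure before a refactor
    raw: dict
    computed_keys: list


@dataclass
class ConfigV2:  # the same data after fields are renamed internally
    values: dict
    unknown_paths: list


def coupled_encode(cfg):
    # The wire shape mirrors whatever the internal fields happen to be.
    return json.dumps(asdict(cfg), sort_keys=True)


def wire_encode_v1(cfg):
    # An explicit message format that is independent of internal names.
    return json.dumps({"attributes": cfg.raw, "unknown": cfg.computed_keys}, sort_keys=True)


def wire_encode_v2(cfg):
    return json.dumps({"attributes": cfg.values, "unknown": cfg.unknown_paths}, sort_keys=True)


old = ConfigV1(raw={"ami": "ami-bcdef123"}, computed_keys=["id"])
new = ConfigV2(values={"ami": "ami-bcdef123"}, unknown_paths=["id"])

coupled_changed = coupled_encode(old) != coupled_encode(new)
explicit_stable = wire_encode_v1(old) == wire_encode_v2(new)
```

With the coupled encoding, an internal rename changes the bytes on the wire. With a separate message format, only the small translation function changes, and already-released peers keep working.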

Upgrade Fatigue

By this point we'd been making some non-trivial breaking changes in v0.9, v0.10 and v0.11, and so we were very conscious of the fact that these complex upgrade cycles were causing friction for users.

Although some breaking changes for v0.12 were unavoidable, we resolved to try to anticipate what other breaking changes might be required as we iterate towards an eventual v1.0 release and see if we could pack those into as few further releases as possible.

That longer-horizon exploration put development work on pause for a month or so, and the outcome of it was the following expanded scope for the v0.12 release:

Reserve in the configuration language certain attribute and variable names that we expect to need for other planned features, such as the for_each argument within resource blocks, so that these features can be introduced later without further breaking changes.
Adopt a new provider RPC protocol that is unlikely to require more breaking changes in future. This would include switching from net/rpc to gRPC, keeping the RPC message formats cleanly separated from Terraform's internal structures, and using the HCL 2 type system for configuration and state representations.
Adopt a new JSON state representation that is unlikely to require any significant breaking changes. Most importantly, this would move away from the "flatmap" encoding to a more intuitive JSON encoding of the HCL 2 type system.
Adopt a new saved plan file format that is less coupled to Terraform's internal structures, uses the HCL 2 type system for before and after values, and has more room for future expansion without further overhauls. Plan files are not portable between Terraform versions, so compatibility here is not as important for the moment, but some third-party tools do still analyze saved plans and so we wanted to minimize future churn for these.

The provider RPC protocol was a particularly sensitive issue because there are numerous existing provider codebases and we knew that making significant changes to those as part of the v0.12 scope would be a bridge too far. As a result, it became the job of the provider SDK to support both the old and new protocols and translate the new protocol onto the existing types and interfaces. Providers can then in principle just upgrade their vendored SDK to get support for the new protocol, but as a result the SDK cannot yet make use of any of the additional possibilities of the new protocol.

With this expanded scope set, this work became the full focus of the Terraform Core team during the v0.12 development cycle. With changes to so many subsystems happening at once, and with the plugin SDK depending directly on several Terraform Core internal types, the gradual repair approach became impractical and so we were instead forced to settle for a "break everything and then fix it" approach.

Refactoring Terraform

Go is a static, nominal-typing language, and the language design intentionally resists "clever" abstractions and dynamic shim behaviors. In a Python, JavaScript, or Ruby program, one might approach a refactor of this magnitude by using duck typing to pass a "close enough" object to existing functions until they have been updated to use the new types. Such an approach is not available for Go, since types are checked by name.
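For contrast, here is a minimal Python sketch of the duck-typing shim that is possible in a dynamic language. The class and function names are hypothetical and are not Terraform's real types.

```python
class OldModule:
    """Stand-in for a legacy type that existing functions were written against."""

    def __init__(self, name, variables):
        self.name = name
        self.variables = variables  # dict of variable name to default value

    def variable_names(self):
        return sorted(self.variables)


class NewModule:
    """Replacement type with a different internal layout but the same methods."""

    def __init__(self, name, variable_blocks):
        self.name = name
        self.variable_blocks = variable_blocks  # list of {"name": ..., "default": ...}

    def variable_names(self):
        return sorted(block["name"] for block in self.variable_blocks)


def describe(module):
    # Existing code: it only cares that the object has .name and .variable_names().
    return f"{module.name}: {', '.join(module.variable_names())}"


old = OldModule("network", {"cidr": "10.0.0.0/16", "region": "us-west-2"})
new = NewModule("network", [{"name": "region", "default": "us-west-2"},
                            {"name": "cidr", "default": "10.0.0.0/16"}])

old_description = describe(old)
new_description = describe(new)
```

Here describe accepts either object without modification. A Go function whose parameter is declared as one concrete struct type accepts only that named type, so each caller and callee has to be updated before the program compiles again.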

Recent versions of Go have some affordances for refactoring, such as type aliases which allow the same type to have two different names. Unfortunately, this particular effort was not limited to just renaming types, with the internal details of several types also having to change along the way.

In an attempt to minimize the amount of time spent in an unbuildable state, we began by copying the existing code to a new location and adapting it, relying only on unit tests to verify the behavior of what was otherwise dead code in the overall system. This resulted in the new top-level packages configs, states and plans, which are HCL2-flavored replacements for some existing functionality from the config and terraform packages.

However, it was inevitable that eventually we would need to begin swapping out old types for new, and as soon as the first mention of config.Module was replaced by configs.Module Terraform could be neither built nor tested until every other occurrence was updated. If these two types were exactly equivalent then this would be a trivial search and replace operation, but of course the types of some fields and the signatures of some methods also changed to adopt HCL 2 concepts, and so a significant amount of other related code had to be updated.

The result was a huge commit touching 130 separate files. A commit like this is never desirable, since commands like git annotate and git bisect will trip over this commit for years to come, but in this case it seemed like the least bad option in order to get this work done in a reasonable amount of time.

After merging this huge commit came the work of gradually repairing all of the failing tests. During that work we found that there were many problems that the Go compiler couldn't help us with at all due to some modelling decisions made in Terraform. The full details here could be a long article in themselves, but suffice to say that Terraform has a number of situations where components are bound dynamically to one another, and these connections are often not type-safe.

Deep vs. Shallow Testing

In software engineering we employ a number of different approaches to automated software testing, each with different tradeoffs. The most common distinction is in unit testing vs. integration testing, which are two extremes on a spectrum of how much of a program is covered by a particular test.

Terraform is a complex system and so much of its behavior emerges from how the subsystems are connected rather than from each unit individually. As a result, the Terraform codebase uses a lot of integration testing, in most cases exercising all of the layers from just behind the UI all the way to a mocked provider implementation.

These integration tests were very useful in this effort because they detected situations where we'd incorrectly re-assembled the parts after refactoring. However, a downside of these tests is that it is often unclear exactly what they are intending to test, and a single bug can manifest as hundreds of separate test failures that make root cause analysis difficult.

The Terraform Core team ended up spending several months (elapsed time) gradually reviewing all of the test failures one by one. In most cases the eventual fix was straightforward, but tracking down a particular problem could take several hours. In the early stages of this work, with so many failing tests it was often unclear whether a particular fix was progress or whether it had simply regressed another test.

This test repair time was the primary cause of the delayed release of Terraform v0.12. The large amount of integration test coverage in Terraform was definitely an asset for finding bugs we would not otherwise have detected until manual testing, but the sheer amount of code executed by each test makes them a very blunt instrument for debugging any regressions they uncover.

As I write this, we still have some remaining broken tests in the master branch which we accepted as a compromise in order to finally cut the alpha1 release. Our original goal was to keep the master branch stable throughout this work, but we did not anticipate just how much time we would spend finding root causes for test failures.

Scope Creep

Inevitably there were some changes required that we did not originally anticipate, adding additional changes and risk to what was already a very large release.

Some of these were within Terraform Core itself. For example, we realized during some automated testing with verified modules in Terraform Registry that some modules were relying on a validation bug in prior versions that allowed, in some cases, users to treat a nested block as if it were an attribute with a dynamic expression assigned to it. This was never intended and didn't work reliably, but the fact that existing modules had successfully worked around the issues meant that we needed to bring forward the implementation of the new dynamic block construct, originally intended to come in a later release.

The scope of this release also grew to include some other codebases, however. Terraform Registry parses the top-level constructs from any module imported into it in order to provide documentation, and so of course it also needed to be updated to support HCL 2 syntax. This and other similar integrations caused v0.12 development work to also enter the roadmap for other teams, requiring additional coordination and communication.

Finally, we had originally intended a refresh of the plan UI to fall into the next release, but it turned out to be simpler to complete a prototype of that new UI than to shim the new HCL 2 constructs to fit within the old flatmap-oriented rendering. As a result, the new plan rendering became a bonus feature for v0.12 that was originally intended for a subsequent major release:

Terraform has created an execution plan, shown below. Changes are indicated
with the following symbols:
  + create

  # aws_instance.nomad_server[0] will be created
  + resource "aws_instance" "nomad_server" {
      + ami           = "ami-bcdef123"
      + id            = (known after apply)
      + instance_type = "m3.medium"
      + private_ip    = (known after apply)
    }

  # aws_instance.nomad_server[1] will be created
  + resource "aws_instance" "nomad_server" {
      + ami           = "ami-bcdef123"
      + id            = (known after apply)
      + instance_type = "m3.medium"
      + private_ip    = (known after apply)
    }

Plan: 2 to create

Conclusion

Usually the goal of a retrospective is to identify things that went well and things that could be improved for next time. This has been such an unusual project in many ways that I think many of my takeaways do not generalize to other projects: most projects specifically avoid making large changes like this, due to the many tensions that are created.

With this said, my main takeaway from this project is that this should be the last of its type in Terraform for quite some time. Such a significant change to the configuration language was unavoidable given how much we've learned about Terraform use-cases since the initial v0.1 release, but that knowledge also equips us with some confidence that development can be more incremental from this point onwards.

Our experience with problems that the Go compiler could not help us detect led to some architectural changes to improve that situation in future. The most significant of these is the introduction of the addrs package for strongly-typed representations of references to common constructs within Terraform, where previously we relied heavily on runtime checks. Use of these address types also makes other signatures within Terraform easier to understand: the fact that a function takes a resource address rather than a resource instance address tells you that it's working with the pre-expanded form of resources we see in configuration rather than the expanded form retained in the state and in plans.
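The distinction between a resource address and a resource instance address can be sketched in Python as below. This is not the addrs package's API. The class names, fields, and the expand helper are hypothetical, and Python checks these annotations only with an external type checker, whereas Go enforces them at compile time.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Resource:
    """Pre-expanded address, as written in configuration."""
    type: str
    name: str

    def __str__(self):
        return f"{self.type}.{self.name}"

    def instance(self, key=None):
        return ResourceInstance(self, key)


@dataclass(frozen=True)
class ResourceInstance:
    """Expanded address, as recorded in state and plans."""
    resource: Resource
    key: Optional[Union[int, str]] = None

    def __str__(self):
        if self.key is None:
            return str(self.resource)
        if isinstance(self.key, int):
            return f"{self.resource}[{self.key}]"
        return f'{self.resource}["{self.key}"]'


def expand(resource: Resource, count: int) -> List[ResourceInstance]:
    # The signature alone says: configuration-level address in,
    # instance-level addresses out.
    return [resource.instance(i) for i in range(count)]


servers = Resource("aws_instance", "nomad_server")
instance_addrs = [str(addr) for addr in expand(servers, 2)]
```

With these inputs, instance_addrs holds the same two addresses that appear in the plan output shown earlier: aws_instance.nomad_server[0] and aws_instance.nomad_server[1].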

As I write this, the v0.12.0 release is still not fully complete, but we can at least see the light at the end of the tunnel. The remaining work is largely in coordination with the external dependencies mentioned in an earlier section. Our experience with this project has made it much clearer to us where Terraform has interfaces with other codebases and teams, and we now have sociotechnical mechanisms in place to deal with these interactions more deliberately in future.

Time will tell whether we made the right tradeoffs in our new file formats and protocols in this release. The remaining releases prior to 1.0 will give us some opportunities to address any misses before we consider these decisions as hard compatibility constraints.

As the first entry in my new public development log, this is inevitably a long one as I catch up on a multi-year development effort. Moving forward I hope to use shorter entries in this log to capture research results and learnings as they come up. I found it useful to keep private notes on various aspects of this work along the way, and I find that writing it all out in long form is a good way to organize these thoughts better for future consumption.