Terraform introduced the idea of "state environments" in Terraform 0.9 as a way to have multiple distinct state objects associated with a particular configuration.

State environments were added within Terraform's existing assumption that (unless you're using Terraform in an unusual way) the current working directory is a single root Terraform module. The design also assumed that the source of record for which environments exist would be the remote system where the state snapshots are stored, which was the most intuitive extension of Terraform's existing model in which one configuration had one state object.

In Terraform 0.10 we responded to user feedback that the term "environments" was confusing, because it overloaded some terminology that usually refers to deployment stages for applications, such as deploying to a staging environment before deploying to a production environment. In order to more clearly separate these concepts, we renamed "state environments" to "workspaces".

This initial workspaces concept was aimed at one specific need: being able to temporarily deploy a second "copy" of some infrastructure, often during development of the Terraform configuration, without disturbing the "real" infrastructure. In practice though, that need was not as common as initially supposed, and workspaces ended up being used as a building block for a similar problem: deploying the same infrastructure permanently as separate deployment stages, with different settings in each.

Workspaces as currently designed (as of Terraform 0.12) are not a good fit for that use-case, and so modelling separate deployment stages for the same configuration involves carefully coordinating several details separately:

  • Select the appropriate workspace.

  • Provide the appropriate set of variables to Terraform, often using a -var-file argument.

  • Make sure Terraform is initialized with the appropriate backend configuration, in architectures where each deployment stage has its own segregated Terraform state storage.

Unless you've got Terraform wrapped up in some special automation, it's easy to get one of the above details wrong and, if you're not paying close attention to the plan output, cause unexpected and unwanted changes to your infrastructure, such as accidentally binding production infrastructure to staging infrastructure.
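For example, switching to a hypothetical "staging" deployment stage by hand might involve a sequence like the following (the workspace name and file names here are illustrative only); forgetting or mismatching any one of these steps can silently target the wrong environment:

$ terraform init -backend-config=backend-staging.tfvars
$ terraform workspace select staging
$ terraform plan -var-file=staging.tfvars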

For that reason, the Terraform team's official recommendation in the documentation was not to follow the above model and instead to model separate deployment stages as entirely separate configurations that share child modules representing the common elements. That approach has the advantage that these separate root configurations can encapsulate both the variables passed to those shared modules and the separate backend configurations, so that switching to another directory is all that's required to fully switch between deployment stages, and Terraform can then be used in a standard way without risk of cross-contamination between the deployment stages.
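As a minimal sketch of that pattern (the directory, module, and bucket names here are purely illustrative), each deployment stage gets its own small root module that fixes both the backend configuration and the variables passed to the shared child module:

# environments/staging/main.tf -- hypothetical per-stage root configuration

terraform {
  backend "s3" {
    bucket = "awesomecorp-staging-tfstate"   # stage-specific state storage
    region = "us-west-2"
    key    = "awesomeapp/terraform.tfstate"
  }
}

module "app" {
  # Shared child module containing the common infrastructure definition
  source      = "../../modules/app"
  environment = "staging"
}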

The configuration-per-environment pattern provides a solution to isolation between deployment stages, and brings the full definition of those environments into the codebase rather than leaving it in a remote system, but it leaves some other use-cases unaddressed or only incompletely addressed, including but not limited to:

  • Explicitly representing the relationships between different configurations so that it is clearer when one workspace should be applied before another, etc.

  • Allowing certain variable values to be easily shared between different configurations without duplicating them in multiple locations.

  • Allowing backend configurations to be constructed systematically across a system, rather than independently configured from scratch for each configuration.

  • Explicitly marking which filesystem directories belong to a particular Terraform configuration for the purposes of uploading it to Terraform Cloud to run remote operations.

For these and other reasons, I've been investigating other possible approaches to modelling these concepts, based on lots of feedback from end-users who are using Terraform systematically for larger systems, often through some sort of wrapper automation today that smooths over the custom usage patterns they've adopted.

After absorbing a lot of varied feedback and teasing out some common elements, I've built an initial prototype of one possible way to alter Terraform's model to better support these use-cases. This is not a proposal for how to move forward, nor something that will ship in its current form; it is mainly aimed at experimenting with high-level ideas rather than implementation details. If elements of this prototype do move forward to real implementation in a future version of Terraform, they are likely to be quite different after exposure to further discussion and weighing of design tradeoffs.

Workspaces as Code

A very common theme in feedback was a frustration that Terraform's current idea of workspaces establishes the source of record for which workspaces exist to be the remote state storage, not the configuration.

In current Terraform, we configure a thing called a backend, which is responsible for deciding which workspaces exist, where state snapshots for those workspaces are stored, and where Terraform operations against each workspace will run by default (local vs. remote operations).

# Terraform 0.12-style backend configuration

terraform {
  backend "s3" {
    bucket               = "example"
    region               = "us-west-2"
    key                  = "terraform.tfstate"
    workspace_key_prefix = "awesomeapp"
  }
}

Having configured the backend as above, Terraform will list the contents of the named S3 bucket to look for objects whose key matches a certain pattern built from workspace_key_prefix and key in order to determine which workspaces exist. Creating a new workspace is implemented by writing a new object into the S3 bucket. There is no record in the Terraform configuration of which workspaces exist, and there is no straightforward way to automatically set input variables differently per workspace.
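For example, with the configuration above and a hypothetical workspace named "staging", the S3 backend stores the default workspace's state at the key argument directly and nests other workspaces under workspace_key_prefix:

terraform.tfstate                      # "default" workspace
awesomeapp/staging/terraform.tfstate   # workspace "staging"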

The concept of a "backend" in Terraform is also rather overloaded, especially with the introduction of remote operations (via the special "remote" backend) in a Terraform 0.11 minor release. That makes backends hard to explain conceptually and makes their configuration schema complicated.
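For comparison, remote operations today are configured through that same backend mechanism by selecting the special "remote" backend; the organization and workspace prefix below are illustrative only:

# Terraform 0.12-style "remote" backend configuration

terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "awesomecorp"

    workspaces {
      prefix = "awesomeapp-"
    }
  }
}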

With this primary feedback and some other secondary feedback in mind, I set myself the following requirements for an initial prototype:

  • Workspaces should be defined via configuration in the VCS repository, not via directly-mutable state in a remote system.

  • Each workspace has both a state storage location (similar to current Terraform) and a set of input variables to use when working with that workspace.

  • To continue to support the more complex situation where environments have differences that cannot be conveniently represented by variables alone, or where a larger system is decomposed into many layered subsystem configurations, each workspace should be permitted to have its own root configuration directory, too.

  • As far as possible without creating ambiguities or sequencing problems, workspace configuration should allow complex expressions as well as literal values, in order to permit systematic configuration of larger systems rather than just aligning various separate configurations by convention.

Initially I began designing some language constructs for use in normal Terraform configuration files to define workspaces. However, after imposing on myself the requirement that each workspace may have its own configuration directory, I switched to a separate file for configuring workspaces, which I've named the "project configuration file" for the sake of this prototype. That in turn established a new noun, "project", to represent the container for one or more workspaces.

A project configuration file is named .terraform-project.hcl and a moderately complex one representing two subsystem tiers across two environments might look something like this:

# Prototype "Terraform Project" configuration

locals {
  environments = toset(["QA", "PROD"])
}

workspace "network" {
  for_each = local.environments

  remote = "app.terraform.io/awesomecorp/network-${lower(each.key)}"
  config = "./modules/network"
  variables = {
    environment = each.key
  }
}

workspace "monitoring" {
  for_each = local.environments

  remote = "app.terraform.io/awesomecorp/monitoring-${lower(each.key)}"
  config = "./modules/monitoring"
  variables = {
    environment = each.key
    vpc_id      = workspace.network[each.key].vpc_id
  }
}

This project configuration language is built on the same HCL foundations but is distinct from the main Terraform language. This language includes a number of concepts, not all of which are illustrated in the above example:

  • Workspace configurations (workspace blocks) define either one or multiple workspaces with similar arguments, depending on whether for_each is set, each of which can have its own configuration path, variables, state storage, and possibility of remote operations.

  • Local values (in locals blocks) are equivalent to the feature of the same name in the main Terraform language, and allow factoring out common expressions so that we can refer to them in multiple locations elsewhere in the configuration.

  • Context values (in context blocks) are zero or more named values that can be set during terraform init to provide any settings required to work with a particular project in a particular context. For example, it might be necessary to provide the path to a file containing credentials information for a particular state storage implementation whose location can vary depending on which machine Terraform is being used from, without modifying the configuration itself.

  • Upstream workspace references (in upstream blocks) serve a similar function to the terraform_remote_state data source, establishing an explicit dependency on one or more workspaces declared in other projects so that their outputs can be used to configure workspaces from this project.
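Neither context values nor upstream workspace references appear in the example above. The following is a purely hypothetical sketch of how they might look, based only on the descriptions in this list; the block labels and attribute names here are my own illustration rather than the prototype's actual syntax:

# Hypothetical sketch only; not the prototype's actual syntax.

context "storage_credentials_file" {
  # Set during "terraform init" to a per-machine path for state storage
  # credentials, without modifying the configuration itself.
}

upstream "shared_network" {
  # An explicit dependency on a workspace declared in another project,
  # whose outputs could then be used when configuring this project's
  # workspaces.
  remote = "app.terraform.io/awesomecorp/shared-network"
}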

Workspace configurations are the most important concept, and each has the following elements:

  • A path to a directory containing the configuration for the workspace, which is the same directory as the project root by default. Configuration directories can be shared between workspaces if needed or can all be distinct, depending on what makes most sense for the project in question.

  • A set of variable values to use when working with that workspace. This is particularly useful when multiple workspaces share the same configuration directory, since this allows for minor adjustments in configuration in spite of the configuration files themselves all being shared.

  • An optional reference to a remote workspace, which allows easy configuration of both state storage and optional remote operations in the case where a user is using Terraform Cloud. This replaces the "remote" backend in current Terraform; its absence implies local operations only, optionally with remote state storage as described in the next bullet point.

  • For workspace configurations without remote set, a pluggable state storage implementation from a Terraform provider that will be used as the location for state storage, replacing all of the backends other than "remote".

Notice that the concept of a "backend" is not visible at all anymore. The idea of running operations remotely is built into the remote argument, which covers one feature currently associated with backends. That then simplifies the remainder of the problem to just one of state storage, which is a well-understood idea that dates back to very early versions of Terraform and whose interfaces have not changed along the way aside from the introduction of optional locking.

The configuration shown above, due to its use of for_each in each of the two workspace blocks, actually declares four distinct workspaces:

  • network.QA

  • network.PROD

  • monitoring.QA

  • monitoring.PROD

In this way, this project is representing a decomposed system consisting of two independently-deployable subsystems across two deployment stages. One of the output values from each of the network workspaces is then used as an input variable for the corresponding monitoring workspace, establishing dependency relationships between the workspaces in a similar sense to how the main Terraform language establishes dependency relationships between resources.

In order to scope things down for the initial prototype and avoid the need to refactor dozens of existing backend implementations, I restricted my initial exploration work to a subset of the above features:

  • Context values are parsed by the project language parser but there is not yet support for setting them during terraform init.

  • Upstream workspace references are not supported.

  • Remote workspaces and pluggable state storage are not implemented; state snapshots for a workspace are always stored on local disk in a hard-coded location used only for prototyping.

This reduced scope still allows exploring the new workflow implied by this design idea, without needing to design new interaction patterns with external systems like Terraform Cloud and state storage. Since this subset only supports local operations and local state storage, there's no significant value in implementing the context value idea either because the context is naturally scoped to the local project directory.

Prototype Workflow

I've implemented the prototype enough to demonstrate the slight workflow changes it implies. Using the project configuration shown above and a pair of simple placeholder root modules in ./modules/network and ./modules/monitoring, we can take it for a spin. First, we'll just make ourselves comfortable by seeing how Terraform understands the project configuration:

$ terraform workspace list
  monitoring.PROD
  monitoring.QA
  network.PROD
  network.QA

$ terraform workspace select network.QA
Switched to workspace "network.QA".

$ terraform workspace show
Workspace network.QA

Configuration Root: ./modules/network
Input Variables:
  environment = cty.StringVal("QA")
Dependencies:

With these commands we can see how Terraform understood the workspace configurations, and switch between them. Here I selected network.QA first because monitoring.QA depends on it. Note that the remote argument in the example has no effect here because that feature is not implemented in this initial prototype; state storage is always local and it always uses local operations.

Now we can use Terraform in a normal-ish way, though we don't need to manually switch into the ./modules/network directory first in order to apply the configuration from there:

$ terraform apply

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # null_resource.foo will be created
  + resource "null_resource" "foo" {
      + id       = (known after apply)
      + triggers = {
          + "environment" = "QA"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions in workspace "network.QA"?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

null_resource.foo: Creating...
null_resource.foo: Creation complete after 0s [id=5002033093637402238]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

vpc_id = vpc-f8ff454227516224

For demonstration purposes here I'm using a null_resource instance as a placeholder, and the vpc_id output is just a hash of its id (a timestamp).
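The placeholder root module in ./modules/network isn't shown in this article, but based on the plan output above it would look roughly like the following; the exact expression used for vpc_id is my own guess, consistent with the "hash of its id" description:

# Hypothetical reconstruction of ./modules/network (not the actual source)

variable "environment" {
  type = string
}

resource "null_resource" "foo" {
  triggers = {
    environment = var.environment
  }
}

output "vpc_id" {
  # Placeholder standing in for a real VPC id; the exact hashing here
  # is an assumption.
  value = "vpc-${substr(sha256(null_resource.foo.id), 0, 16)}"
}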

Due to a current limitation of the prototype, we are also required to apply network.PROD before moving on, because the workspace dependency analyzer isn't sophisticated enough to understand that monitoring.QA only depends on network.QA and not also on network.PROD. That follows the same pattern as above.

We can now switch to the monitoring.QA workspace and apply it, because we've determined the value needed to populate its vpc_id input variable:

$ terraform workspace select monitoring.QA
Switched to workspace "monitoring.QA".

$ terraform apply

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # null_resource.foo will be created
  + resource "null_resource" "foo" {
      + id       = (known after apply)
      + triggers = {
          + "environment" = "QA"
          + "vpc_id"      = "vpc-f8ff454227516224"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions in workspace "monitoring.QA"?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

null_resource.foo: Creating...
null_resource.foo: Creation complete after 0s [id=8468519291883101686]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

envirovpc = QA vpc-f8ff454227516224

I'm again using just a null_resource instance as a placeholder, but this time the vpc_id value automatically propagated from the outputs of the network.QA workspace into the input variables of the monitoring.QA workspace, allowing it to appear in the configuration of null_resource.foo.
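The placeholder in ./modules/monitoring presumably follows the same shape, this time accepting vpc_id as an input; again, this is a hypothetical reconstruction based only on the output shown above:

# Hypothetical reconstruction of ./modules/monitoring (not the actual source)

variable "environment" {
  type = string
}

variable "vpc_id" {
  type = string
}

resource "null_resource" "foo" {
  triggers = {
    environment = var.environment
    vpc_id      = var.vpc_id
  }
}

output "envirovpc" {
  value = "${var.environment} ${var.vpc_id}"
}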

Initial Observations

The process of building and experimenting with this prototype has already identified a number of details that require further thought.

Firstly, the ability to use outputs from one workspace to generate the configuration source path and state storage location for another workspace creates some chicken-and-egg problems in terraform init: we need to be able to analyze the configuration and state for all workspaces in order to determine which providers and modules to install, but if the workspaces a given workspace depends on haven't been applied even once yet then evaluating that downstream workspace configuration can fail. In the prototype this is handled just by emitting warnings during terraform init, but that would not be sufficient for real-world use.

Secondly, workspace blocks behave like resource blocks in the main language in that any dependencies we infer are always between the blocks themselves rather than the dynamically-produced instances of those blocks. In the above case, that means Terraform sees a dependency edge from monitoring.QA to network.PROD even though in practice that dependency relationship is unnecessary. HCL itself is currently unable to improve on this because it doesn't understand the meaning of each.key in the expression workspace.network[each.key].vpc_id. Addressing this might require some more complex static analysis in HCL, or some explicit mechanism within the project configuration language to represent dependencies between individual workspaces.

It might well be that in order to move forward with a design containing these concepts the project configuration language would need to be more restrictive in what it allows, and perhaps rely more on convention over configuration for certain values that we must determine statically. There's plenty of room for compromise in the space between current Terraform's requirement that the backend configurations be totally static and the full flexibility implied by this initial prototype.

I'm also still not sure about introducing a new separate language for defining workspaces at all. It might work out better to somehow infer workspace configuration through convention, such as a filesystem directory layout, though that potentially introduces all of the same concepts in a much less explicit way, and so might end up being worse in the end.

Where to from here?

I created this prototype only to help me understand the problem space better by attempting a strawman solution. It's very unlikely that any of the details in this initial prototype will make it into an official Terraform release: there's still plenty more research, design, and prototyping work to do before I can feel confident that I've fully understood the problem space and the associated solution space, and before the broader Terraform team is comfortable with any proposed direction.

Aside from the details I explored here, there are broader concerns to consider. For example, I need to consider how any changes we make in the modeling of workspaces could translate into Terraform Cloud and Terraform Enterprise, both of which cover a larger workflow space that includes code review and automatic plan and apply for VCS changes. What does it mean to merge a commit that includes changes to several workspaces at once?

I'd also like to explore the idea of a cascading speculative plan. Currently Terraform works on only one workspace at a time, but in a decomposed system making a change to one layer can lead to an implied, automatic change in another. To fully understand the scope of a change, it would be necessary to produce plans across multiple workspaces at once and have downstream plans be based on the plans from upstream workspaces, rather than their current state. That represents a pretty significant shift in how Terraform thinks about workspaces, so will be a particularly interesting area to explore.

I'm still at a very early stage with this exploration, so no real implementation work is expected in this area for quite some time. I'll share more in later articles as my research continues.