In Terraform today there is the mechanism of "remote state", which allows Terraform to write state snapshots to some remote network service instead of local disk, and relatedly to help coordinate multiple users working with the same workspace via central locking.
That mechanism is currently hidden behind the broader concept of "backends". Terraform's current concept of "backends" aggregates together a number of separate concerns, including:
Determining which workspaces exist for a particular configuration.
Storing state snapshots for each of the workspaces that exists.
Possibly running Terraform indirectly on a remote system.
All of the backends that exist today are embedded in Terraform itself. This is partly just an accident of history (we haven't yet attempted to split them out into plugins), but a notable blocker to splitting backends into separate plugins has been that the interface for running Terraform either locally or remotely is rather tightly coupled to Terraform itself, and so it isn't really practical to create a plugin interface boundary around that functionality.
In an earlier article I discussed a prototype of quite a significant set of changes to how Terraform thinks about these concepts, which included the idea of removing the idea of backends from the user-facing model altogether.
However, the most pressing concern in this area today is having the state storage and workspace management functionality for various different third-party services embedded in Terraform Core itself. This causes maintenance headaches because many of these implementations share business logic with a corresponding provider, and having them all together in the Terraform Core codebase tends to make the Terraform Core team a bottleneck for expanding the set that are available.
This week I've been looking at some less dramatic changes we could make to the backends implementation that could allow splitting the parts that interact with third-party services into our existing provider codebases, while still broadly retaining the same user-facing concepts and configuration language constructs.
Separating Concerns
As described above, backends are currently responsible for three concerns that are related but separable.
The lowest level component is the state storage, which consists of implementing the following four operations:
Save a new array of bytes.
Retrieve the most recently-saved array of bytes.
Acquire a lock that other co-operating concurrent instances cannot hold at the same time.
Release a previously-acquired lock.
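To make that more concrete, here's a rough sketch of those four operations expressed as a Go interface; the interface and method names here are hypothetical rather than anything from the current Terraform codebase:
// StateStorage is a hypothetical interface capturing the four low-level
// state storage operations described above.
type StateStorage interface {
	// Write saves a new array of bytes as the latest state snapshot.
	Write(snapshot []byte) error

	// Read retrieves the most recently-saved array of bytes.
	Read() ([]byte, error)

	// Lock acquires a lock that other co-operating concurrent instances
	// cannot hold at the same time, returning an ID for later release.
	Lock() (lockID string, err error)

	// Unlock releases a previously-acquired lock.
	Unlock(lockID string) error
}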
On top of that baseline, the next component is workspace management. For that, the four required operations are:
Provide a full list of currently-existing workspaces.
Create a new workspace.
Delete an existing workspace.
Produce the settings necessary for the state storage engine to service a particular existing workspace.
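Similarly, the workspace management operations might be sketched as another hypothetical Go interface, with the final operation producing whatever settings the state storage component needs for the selected workspace:
// StateStorageSettings is a placeholder for whatever configuration a
// particular state storage implementation needs for one workspace.
type StateStorageSettings map[string]interface{}

// WorkspaceManager is a hypothetical interface capturing the workspace
// management operations described above.
type WorkspaceManager interface {
	// Workspaces provides a full list of currently-existing workspaces.
	Workspaces() ([]string, error)

	// CreateWorkspace creates a new workspace with the given name.
	CreateWorkspace(name string) error

	// DeleteWorkspace deletes an existing workspace.
	DeleteWorkspace(name string) error

	// StateStorageSettingsFor produces the settings necessary for the
	// state storage engine to service the named existing workspace.
	StateStorageSettingsFor(name string) (StateStorageSettings, error)
}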
The remote operations functionality builds on workspace management by adding the possibility for a workspace to be associated with a remote execution API. That's relevant only for the special remote backend, so I won't discuss this part any further. For the rest of this article, I'm discussing all of the backends except the remote backend, which all run operations locally and vary only in how they implement workspace management and state storage.
Why separate Workspace Management from State Storage?
In Terraform's current model, the set of currently-existing workspaces and the state storage for those workspaces are coupled together by virtue of being managed by the same backend: if you're storing your state in Amazon S3 then the key listing of your S3 bucket is always authoritative for which workspaces exist.
However, this coupling has caused frustration for lots of users, particularly in architectures where each environment is split into an entirely separate account on a cloud provider.
In Terraform's current design, those using Amazon S3 for state storage are forced to keep all of the state snapshots for the workspaces of a particular configuration in the same S3 bucket. A common and very reasonable design choice is to have one state storage bucket per environment with a state snapshot per component, as opposed to having one state storage bucket per component with a state snapshot per environment. Terraform's current workspace management design is inappropriate for that approach.
As discussed in my previous article on this subject, it seems more appropriate for the workspace definitions to live in the configuration rather than "in the backend". The other article takes quite an extreme approach to get there, but by separating the idea of state storage from the idea of workspace management we can potentially get a similar effect without such a drastic change to the configuration model:
terraform {
  backend "local" {
    workspace "PROD" {
      state_storage "aws" {
        bucket         = "awesomecorp-terraform-PROD"
        dynamodb_table = "TerraformPROD"
        path           = "happyapp.tfstate"
      }
    }
    workspace "STAGE" {
      state_storage "aws" {
        bucket         = "awesomecorp-terraform-STAGE"
        dynamodb_table = "TerraformSTAGE"
        path           = "happyapp.tfstate"
      }
    }
  }
}
The above configuration makes a few different statements:
backend "local"
indicates that we want to run operations locally, rather than in a remote system like Terraform Cloud.workspace "PROD"
declares a workspace called "PROD", managed by the local backend.workspace "STAGE"
declares a workspace called "STAGE", managed by the local backend.The two
state_storage "aws"
blocks state that these two workspaces should have their state snapshots stored using the "aws" state storage implementation, which I'm assuming here represents the pairing of Amazon S3 for storage and DynamoDB for locking that the current "s3" backend uses.
Crucially, notice that the "workspace management" role is being handled here by the local backend via its configuration, rather than by the "s3" backend. Nonetheless, we can still (hypothetically) declare that the state snapshots for these workspaces should be kept in Amazon S3, but we give the exact location for each workspace because in this case we would no longer be scanning the S3 bucket in order to decide which workspaces exist.
The configuration example above is not really the point here; it's intended only to illustrate that workspace management and state storage are separable concepts. Each workspace has a state storage, but it's not necessary for the state storage to use the same source of record as the workspace management.
State Storage and Workspace Management in Providers
The most popular of the current Terraform backends correspond to a third-party cloud service that already has a Terraform provider:
The s3 backend is related to the aws provider.
The azurerm backend is related to the azurerm provider.
The consul backend is related to the consul provider.
(etc)
While this pairing isn't true for all of the existing backends, it seems conceptually nice to say that the AWS provider is the home for all of the Terraform features that interact directly with AWS APIs. It also creates an architectural convenience in that these pairings of backend and provider tend to share code for interacting with the vendor SDK, for collecting and passing credentials, etc.
For that reason, I'm very keen on the idea of adding state storage and workspace management to the suite of concepts currently offered by provider plugins, rather than introducing a new type of plugin. For those few backends that do not currently have an associated provider, it seems reasonable to create a provider that might initially only contain the state storage and workspace management components. Those providers could potentially grow to define resource types too, if that makes sense for the remote system in question and if there's sufficient interest in managing that system with Terraform.
If we were to make a significant overhaul of Terraform's model of workspaces as discussed in the earlier article then pivoting to state storage living in providers could be a logical addition to that, with workspace management then moving into the project definition language.
However, we'd ideally like to move the existing state storage and workspace management implementations out of Terraform Core without coupling that with a significant change in user-facing concepts and configuration language.
A Provider API for State Storage
Terraform's provider API is defined using Protocol Buffers, because the provider protocol is based on gRPC. We could add the following additional operations to the existing Provider service definition in order to expose state storage in providers:
rpc ValidateStateStorageConfig(ValidateStateStorageConfig.Request) returns (ValidateStateStorageConfig.Response);
rpc ReadStateStorage(ReadStateStorage.Request) returns (ReadStateStorage.Response);
rpc WriteStateStorage(WriteStateStorage.Request) returns (WriteStateStorage.Response);
rpc LockStateStorage(LockStateStorage.Request) returns (LockStateStorage.Response);
rpc UnlockStateStorage(UnlockStateStorage.Request) returns (UnlockStateStorage.Response);
Following the existing design of resource types, a provider could offer zero or more state storage implementations, each with a name that either matches or starts with the provider name, like our example "aws" above, or a more specific name like "aws_s3_dynamodb" if it would make sense to have multiple state storage implementations in the same provider.
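As a small illustration of that naming rule, a check along the lines of Terraform's existing resource type naming convention might look like this hypothetical Go helper:
import "strings"

// validStateStorageTypeName reports whether a state storage type name is
// acceptable for a given provider: it must either exactly match the provider
// name or begin with the provider name followed by an underscore.
func validStateStorageTypeName(providerName, typeName string) bool {
	return typeName == providerName ||
		strings.HasPrefix(typeName, providerName+"_")
}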
The contents of the request and response messages follow logically from the state storage operations I described above, so I won't show them all in full detail, but I'll include the hypothetical ValidateStateStorageConfig messages as an example to illustrate some general points:
message ValidateStateStorageConfig {
  message Request {
    string type_name = 1;
    DynamicValue config = 2;
  }
  message Response {
    repeated Diagnostic diagnostics = 1;
  }
}
From the provider's perspective, much like for resource types, it must define the set of state storage type identifiers it supports, like "aws" in our example above, and then for each one a schema for its configuration. Terraform Core would then pass the selected type name and configuration in the Request for each of these RPC operations, allowing the logic in the provider to identify a specific location to store state snapshots and a specific mechanism to manage locks.
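To illustrate the provider side of that, here's a loose sketch of how a provider written in Go might declare the state storage types it supports along with a (highly simplified) configuration schema for each; none of these names come from the real plugin SDK:
// stateStorageSchemas is a hypothetical registry mapping each supported
// state storage type name to the attributes its configuration accepts.
// A real implementation would use proper schema types rather than strings.
var stateStorageSchemas = map[string]map[string]string{
	"aws": {
		"bucket":         "string", // S3 bucket holding the state snapshots
		"dynamodb_table": "string", // DynamoDB table used for locking
		"path":           "string", // object key for a workspace's snapshot
	},
}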
A Provider API for Workspace Management
If we were willing to start over with an entirely new language for defining the workspaces for a project, or with a workspace block type within the local backend as I showed in an earlier section, the above interface for pluggable state storage would be sufficient.
However, to work within our current model of backends we need to support pluggable management of workspaces too. Ideally I'd like to support it in a way that allows for the decoupling of workspace management and state storage in future, since that seems a likely architectural direction.
The operations for workspace management in the provider API would be something like the following:
rpc ValidateWorkspaceManagementConfig(ValidateWorkspaceManagementConfig.Request) returns (ValidateWorkspaceManagementConfig.Response);
rpc ListManagedWorkspaces(ListManagedWorkspaces.Request) returns (ListManagedWorkspaces.Response);
rpc CreateManagedWorkspace(CreateManagedWorkspace.Request) returns (CreateManagedWorkspace.Response);
rpc DeleteManagedWorkspace(DeleteManagedWorkspace.Request) returns (DeleteManagedWorkspace.Response);
Just as with the state storages, I'm imagining that a provider declares support for zero or more named workspace management implementations, and defines a configuration schema for each one.
In order to interoperate with the state storage API described in the previous section without coupling it tightly to the workspace management API, I'd consider defining ListManagedWorkspaces as follows:
message ListManagedWorkspaces {
  message Request {
    string workspace_management_type_name = 1;
    DynamicValue config = 2;
  }
  message Response {
    repeated Diagnostic diagnostics = 1;
    repeated ManagedWorkspace workspaces = 2;
  }
  message ManagedWorkspace {
    string name = 1;
    string state_storage_type_name = 2;
    DynamicValue state_storage_config = 3;
  }
}
Specifically, each of the listed workspaces includes the properties state_storage_type_name and state_storage_config, which specify automatically-generated input to the state storage API. In a sense, the provider is calling into itself here, but doing so indirectly via Terraform Core so that the state storage API is still exposed as a first-class interface Terraform could potentially use directly in future.
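To show how Terraform Core might route that response back into the state storage API, here's a very rough Go sketch of looking up a workspace and then fetching its latest state snapshot; every type and method name here is hypothetical:
import "fmt"

// ManagedWorkspace mirrors the ManagedWorkspace message above in a
// simplified Go form, with the DynamicValue flattened to raw bytes.
type ManagedWorkspace struct {
	Name                 string
	StateStorageTypeName string
	StateStorageConfig   []byte
}

// providerClient stands in for the gRPC client Terraform Core would hold.
type providerClient interface {
	ListManagedWorkspaces(mgmtType string, mgmtConfig []byte) ([]ManagedWorkspace, error)
	ReadStateStorage(storageType string, storageConfig []byte) ([]byte, error)
}

// readWorkspaceState lists the workspaces via the workspace management API
// and then reads the chosen workspace's state via the state storage API,
// using the storage settings the provider itself returned for that workspace.
func readWorkspaceState(p providerClient, mgmtType string, mgmtConfig []byte, workspace string) ([]byte, error) {
	workspaces, err := p.ListManagedWorkspaces(mgmtType, mgmtConfig)
	if err != nil {
		return nil, err
	}
	for _, ws := range workspaces {
		if ws.Name == workspace {
			// The provider is effectively calling back into itself here,
			// but indirectly via Terraform Core.
			return p.ReadStateStorage(ws.StateStorageTypeName, ws.StateStorageConfig)
		}
	}
	return nil, fmt.Errorf("workspace %q does not exist", workspace)
}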
What of Backends?
That still leaves us with the question of what to do with Terraform's existing "backend" concept. It remains overloaded as both a way to select between local and remote operations and as a way to select different workspace management and state storage for local operations, so a direct mapping into the above APIs would likely not be suitable.
For that reason, I'm considering a multi-phase approach:
Implement the provider-based workspace management and state storage as described above, and make all of Terraform's backends except local and remote each be a thin wrapper around one workspace management implementation offered by a provider. To remain as compatible as possible, each new workspace management implementation would mimic the configuration schema of the built-in backend it's replacing.
For this step, Terraform Core is still the arbiter of which backends exist. It still knows that there's a backend called "s3", but rather than having the full implementation of that backend it would instead just know to request the AWS provider and call into its primary workspace management implementation, delegating all of the interesting business logic to the provider plugin. (See the sketch after this list for roughly what that delegation might look like.)
Introduce something like the local backend workspace block type I showed in an example above, making a small step towards workspaces being mastered in the configuration rather than in a remote system, and for the first time exposing the concept of state storage directly in the user model, allowing users to configure it directly in the configuration language.
At this point it would be possible to write new state storage implementations that Terraform Core isn't aware of at all.
Consider phasing out the backends other than local and remote altogether, providing an automatic migration process to convert existing remote workspace data into a backend "local" block with workspace blocks inside.
This can exploit the fact that the workspace management API is defined to produce just configuration for state storage, and thus ask that API one final time for the storage configuration of each workspace to freeze into workspace blocks in the configuration. After that, the workspace management API in providers would no longer be used.
Consider a more elaborate rethinking of the ideas of backends and workspaces along similar lines to my earlier article.
This would be the point where the configuration language for workspaces might change significantly, but all of the previous steps would've put all of the underlying concepts in place so that hopefully it would be practical to implement an automatic migration to the new model, whatever that turns out to be.
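For the first of the phases above, the delegation table Terraform Core might keep internally could be as simple as the following Go sketch; the provider addresses and workspace management type names are illustrative guesses only:
// legacyBackends is a hypothetical mapping from each existing backend name
// to the provider, and the workspace management implementation within it,
// that would take over the backend's business logic.
var legacyBackends = map[string]struct {
	ProviderSource          string // registry address of the provider to install
	WorkspaceManagementType string // workspace management type name in that provider
}{
	"s3":      {"hashicorp/aws", "aws"},
	"azurerm": {"hashicorp/azurerm", "azurerm"},
	"consul":  {"hashicorp/consul", "consul"},
	// ...and so on for the other backends, except "local" and "remote".
}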
None of the above is an actual plan. For one thing, even step one would require some significant internal refactoring work as described in the following section. However, I'm pleased to have been able to decompose this work into some smaller steps, even if these aren't exactly what we end up doing, because having all of the interactions with third-party services extracted from Terraform Core will make both Terraform Core and these external implementations easier to maintain.
The Catch: A Chicken/Egg Problem
An assumption in the potential plan in the previous section is that it's possible to install one or more providers as part of configuring a backend. That is easy to say but harder to do, for two reasons.
Firstly, terraform init currently analyzes both the configuration and the latest state snapshot in order to decide which providers to install. Analyzing the state is not usually necessary, but can be important in the tricky case where an object still exists in the remote system (tracked in the state) but is no longer represented in the configuration. Terraform would need to install a provider in order to delete that leftover object.
The problem here is that if state storage comes from a provider then we must install that provider before we can retrieve the state. That suggests splitting provider installation into two phases, with state retrieval in between:
Ask the backend which provider(s) it needs, and correlate those with any provider version constraints in the configuration to determine a suitable constraint set for installing just the providers the backend needs.
Ask the backend for the list of workspaces and the state for each workspace, at which point it will start up all of the plugins installed in the previous step to interact with the remote system.
Finally, use the configuration and the state together to produce the final constraint set for installing all of the providers needed by the configuration and state.
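Here's a very rough Go sketch of that sequence, using hypothetical stand-ins for Terraform's real installer and backend components, which are considerably more involved in practice:
// providerInstaller and stateBackend are hypothetical stand-ins for the real
// provider installer and backend interfaces in Terraform CLI.
type providerInstaller interface {
	// Install fetches providers matching the given version constraints,
	// keyed by provider source address.
	Install(constraints map[string]string) error
}

type stateBackend interface {
	RequiredProviders() map[string]string           // providers the backend itself needs
	Workspaces() ([]string, error)                  // uses the plugins installed in step 1
	StateSnapshot(workspace string) ([]byte, error) // latest snapshot for one workspace
}

// providersUsedInState is a placeholder for analyzing a state snapshot to
// find which providers it still refers to.
func providersUsedInState(snapshot []byte) []string {
	// In reality this would decode the snapshot and inspect its resources.
	return nil
}

// initProviders follows the sequence described above: install the backend's
// own providers first, use them to fetch state, then install everything the
// configuration and state together require.
func initProviders(inst providerInstaller, b stateBackend, configConstraints map[string]string) error {
	// Step 1: install just the providers the backend needs, honoring any
	// version constraints the configuration declares for those providers.
	backendNeeds := map[string]string{}
	for addr := range b.RequiredProviders() {
		backendNeeds[addr] = configConstraints[addr]
	}
	if err := inst.Install(backendNeeds); err != nil {
		return err
	}

	// Step 2: use those providers to discover workspaces and fetch the
	// latest state snapshot for each one.
	workspaces, err := b.Workspaces()
	if err != nil {
		return err
	}
	finalNeeds := map[string]string{}
	for addr, constraint := range configConstraints {
		finalNeeds[addr] = constraint
	}
	for _, ws := range workspaces {
		snapshot, err := b.StateSnapshot(ws)
		if err != nil {
			return err
		}
		for _, addr := range providersUsedInState(snapshot) {
			if _, ok := finalNeeds[addr]; !ok {
				finalNeeds[addr] = "" // no constraint beyond what the state implies
			}
		}
	}

	// Step 3: install the full set of providers needed by the configuration
	// and the state together.
	return inst.Install(finalNeeds)
}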
The second problem is a more hairy one, related to historical implementation details in Terraform CLI. When the backend idea was first implemented in Terraform 0.9, we had to contend with a matrix of different possibilities of migrating from legacy remote state to backends, migrating from the local backend to a remote state backend, etc. As a consequence, the current codepaths for initializing backends are quite complicated and not decomposed in such a way that it would be easy to insert a provider installation step into the middle.
To address that, we'll likely need to contend with a long-anticipated refactoring of the backend mechanism itself, which will hopefully allow us to split it into some smaller parts that are each easier to maintain in their own right, and then to introduce provider installation and instantiation where needed.
Next Steps
As is often the case for these research efforts, my goal here was just to collect data for further discussion, rather than to form a specific plan of action.
None of the above is ready for implementation or scheduled for any particular Terraform release. As discussion continues, I'll hopefully be able to share more refined versions of these ideas in future articles.