I spent most of last year working on Terraform Stacks, and in the process got an opportunity to revisit a number of different historical design requests that had been blocked (at least in part) by Terraform's promise of backward compatibility.

Although Terraform Stacks is certainly not a free-for-all for breaking compatibility — it aims to remain compatible with most existing published Terraform modules — it does allow some amount of renegotiation because choosing to use Stacks instead of the traditional workflow is opt-in, and so small differences in behavior for Stacks will not affect those who do not choose to use Stacks.

Although I revisited a number of questions, the one I'm going to focus on today is the question of allowing the use of variables in the module source argument, which is relevant here because Stacks treats dependencies a little differently than traditional Terraform does, and so could potentially have differed in what it allowed here. (Spoiler alert: It doesn't actually differ in this way, and this article has some details on why.)

Why support variables in source addresses?

Participants in the GitHub issue on this subject shared various reasons why they'd like to be able to use variables in their module source addresses. After reading them all, I found that there seemed to be two main themes:

  • Specifying a Git repository as the source address and wanting to encode a username and password in that address, but needing those credentials to vary depending on who is executing Terraform or what context Terraform is being executed in.

    Another variation on this is when an airgapped environment needs to install from some sort of "mirror" of the packages because the primary source location is not reachable from that network. I group these together because what they both have in common is representing differing implementation details of how the module is to be fetched, rather than just describing the package that needs to be installed and letting the installer worry about how to achieve that. The source argument should ideally focus primarily on the "what" and not on the "how".

    (This one got broken into a separate issue.)

  • Instantiating the same root module multiple times but using different versions of dependencies in each one.

    This one is often counter-intuitive to those coming from a pure software background, because there we tend to be accustomed to having exactly the opposite goal: full code parity across all environments except when a deployment is in progress. Later in this article I'll discuss some reasons why Terraform needs can differ a little in this respect.

I understand and appreciate both of these needs, and would like to address them both, but what has always given me pause on both of them is that Terraform is just one of many languages that have a means for specifying external dependencies, and I don't know of any other mainstream example that solves these problems by encouraging or allowing users to dynamically modify the dependency addresses.

I have a sense that some readers think that my desire for consistency with other language ecosystems is unnecessary or misguided. This is a subjective question with no objectively correct answer, but the rest of this article is an attempt to summarize my thought process in the hope that a reader would understand my conclusions as reasoned rather than arbitrary, even if they might disagree with the specific reasons and would wish to make different tradeoffs.

The benefits of fixed dependency addresses

The obvious next question is: why don't other language ecosystems allow the dependency specifications to vary each time the program is run, or each time it's compiled, without modifying the source code?

I cannot see into the minds of the folks who made those decisions, but I can see some clear benefits from treating dependencies as static and fixed in code, rather than dynamically-varying:

  1. Dependency installation can be handled as an independent step from compile and run.

    In many language ecosystems, the dependency installer is an entirely separate program from the compiler or runtime.

    In Terraform it is not a separate tool, but it is a separate command: terraform init is the only part of Terraform that installs external dependencies, and it intentionally does so without actually executing them because it's often valuable to be able to review or scan what's just been installed to make sure it doesn't contain anything harmful before executing it.

  2. Dependency upgrades can be partially automated by tools such as Renovate or Dependabot.

    This is essentially an extension of the previous point: allowing entirely separate tools to interrogate and modify the dependencies of a codebase allows for general-purpose software such as security scanners and tools which detect new versions and automatically propose upgrading.

    These are considerably harder to build (and to run at scale) if they are forced to run arbitrary code or fetch dynamic data from elsewhere in order to do their work.

  3. Dependency selections can be tracked in a lock file.

    Most modern language ecosystems follow a "trust on first use"-type model where a developer must take care to review any newly-added dependency but after that the system "remembers" what was reviewed as part of a dependency lock file, so developers can be confident that they are still running exactly what they reviewed over time, and only need to pay the cost of re-reviewing when they explicitly choose to upgrade.

    Terraform cannot currently do this for module dependencies due to another unfortunate deviation from practices in other language ecosystems: Terraform allows (and, by not providing any strong alternative, implicitly encourages) modules to modify their own source code directories during their work, thereby invalidating any checksum calculated from the module package. There's another open issue about module dependency locking which I would like to address eventually.

    That is something we were able to begin rectifying in Terraform Stacks, by treating module package directories as read-only. I expect to continue that maneuver by implementing something like my old proposal Mechanism for creating and using a local temporary directory. Assuming this breaking change is successful, I think at least for Stacks it will become possible to track module source packages in the dependency lock file as requested.

  4. Source code contains everything needed to replicate the original "build".

    In language ecosystems that involve ahead-of-time compilation into a distributable executable artifact, it's meaningful to discuss the idea of "reproducible builds" where the source code contains all information required to produce a byte-for-byte identical executable or distribution package.

    For dynamically-executed languages this idea is necessarily looser, but it's still often desirable to have confidence that e.g. all developers working on a codebase are running the same code, including dependencies, and that developers can feel confident that they are running the same code as is being checked by pre-merge tests and lints during code review.

All of the benefits above are harder to realize (to differing extents) if the dependency selections are not defined statically, in a way that can be consumed by software other than the main language runtime or compiler.

All of these benefits also seem applicable (again, to differing extents) to Terraform. In some cases Terraform has separate existing design problems that make these benefits harder to realize than in other ecosystems, but I would like to fix those problems over time and I am nervous about making changes that would move Terraform further away from being able to realize these desirable benefits.
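As an illustration of one such design problem, the pattern below is a common way for a module to write into its own source directory, which is the behavior that defeats checksum-based locking as described in the third item above. This is only a sketch: it assumes the hashicorp/archive provider, and the data source name and paths are hypothetical.

```hcl
# Hypothetical module that builds a zip archive next to its own source files.
data "archive_file" "lambda" {
  type       = "zip"
  source_dir = "${path.module}/src"

  # Writing the output under path.module changes the contents of the
  # installed module package, so a checksum taken at install time
  # would no longer match once this has been evaluated.
  output_path = "${path.module}/build/lambda.zip"
}
```

If module package directories are treated as read-only, a pattern like this would need to write somewhere else instead.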

Overall, my position is that the more Terraform can follow the established practice of prominent language ecosystems, the more likely it is that Terraform can benefit from evolving industry standard practices and from general-purpose tools that are not written only for Terraform's special quirks.

What to do instead, then?

I've said both that I would like to solve the problems that motivated the original request and that I think directly doing what was requested would prevent Terraform from benefitting from consistency with other popular language ecosystems.

What's left between those two positions is to try to address the presented use-cases in a different way that does not decrease the consistency. Furthermore, I'd ideally like to solve these use-cases the same way that other ecosystems solve them, assuming that these problems are not somehow unique to Terraform.

The first use-case of providing credentials to Git definitely isn't specific to Terraform. All of the mainstream language ecosystems I have researched (including but not limited to: NodeJS/NPM, Python/pip, Go, and Rust/Cargo) support some way to specify a Git repository as a source address, and all of them seem to have resolved the credentials question in the same way: you configure your credentials out of band, in your Git configuration rather than in your dependency manifest.

This solution makes sense to me, because credentials are a property of who or what is installing the package, and don't change anything about which package is being installed. To put it another way: the credentials are not part of the identity or content of the dependency, but are rather an implementation detail of how that module gets fetched for local use.
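A minimal sketch of what that looks like from the Terraform side, using a hypothetical repository URL and module name: the module block names only the repository and revision, and carries no username or password.

```hcl
module "network" {
  # The address describes "what" to install. Authentication is left to
  # whatever Git configuration is active where terraform init runs,
  # for example a credential helper or an SSH key.
  source = "git::https://git.example.com/org/network.git?ref=v1.2.0"
}
```

The same configuration can then be initialized by different people or CI jobs, each using their own Git credentials, without any change to the source code.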

I do, of course, acknowledge the real challenges that individuals shared with getting Git configured properly in transient execution environments like CI systems, where any custom environment setup needs to be repeated every time a job runs. However, those challenges are not unique to Terraform, and so I expect that this common pain across many language ecosystems will lead to gradual improvement of these execution platforms. Terraform — a relatively niche language by comparison — is best positioned to benefit from such improvements by working as much as possible like the larger ecosystems that these products would prioritize.

That then leaves the other use-case, of varying dependencies between environments, where the situation is more subtle...

Are version variations a Terraform-specific problem?

So far I've been readily comparing Terraform to general-purpose programming environments like NodeJS or Go, which are typically used to develop stateless applications which scale horizontally and are immutable once deployed. (They may use mutable state, but it's dealt with by something external to the application, such as a separate database server.)

It's tempting to think of Terraform in a similar way: it runs some stateless code to determine a desired state, then interacts with external stateful services. But of course that's a pretty naïve framing, because whereas a traditional application uses an external stateful database, for many operators their Terraform configuration is (in some sense) the infrastructure it's describing. Changes to configuration are made with the intent of making specific corresponding changes to the real infrastructure being described, rather than the data store being something separate from the application logic.

Whenever I consider this, I find that the most compelling analogy is to a particularly-sticky gap that application developers often wish they didn't need to deal with: database schema or data migrations. As much as we'd like to believe that our stateless application code and our stateful database are separate autonomous systems, there are inevitably assumptions in the application code about the structure or content of the database.

While we can tolerate a compute cluster temporarily running a mix of different application versions (within reason), there is typically only one database that they all must share. If a newer version of the application expects a different database schema than an older one, the database can only be in one of those two states and so at least one of them will be broken.

One way that teams try to address this is to make the new version of the application tolerate the schema both ways, and then they can complete the rollout of the new application version first and then treat the database schema change as a separate step performed independently of the application release. The migration is probably driven by some code that's distributed as part of the application, but that code is triggered explicitly once the new code is running, rather than becoming active immediately once the new application version is deployed.

Under this framing then, we might consider that a Terraform codebase is like an application that consists entirely of database schema migration logic, without any accompanying stateless application code. There is only one set of production infrastructure, and so it can only either be changed or not changed.

A lot of teams using Terraform try to mitigate this by running a full copy of their production infrastructure that they can use essentially to "rehearse" changes made by Terraform. But this does assume that at any given time there is only one "old version" and only one "new version" and that the new version is gradually replacing the old version first in rehearsal (i.e. a staging environment) and then in production.

Other teams have more complicated environments. For example, a multinational organization might have its infrastructure partitioned by political region and operate each partition independently for legal or compliance reasons. In that case, what one region calls the "old version" might be the expected current version in another region.

Terraform's story for that today is considerably less satisfying: each region must have its own root module that includes a module block that calls whatever that region considers to be current at a particular time. Switching a region to a new release means changing that region's own root module.
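For example, with two hypothetical regions the layout might look like the following sketch, where the directory names, module address, and version numbers are all invented for illustration. The two blocks live in separate root modules, so they can share the same call name.

```hcl
# regions/eu/main.tf (hypothetical): this region has not yet upgraded
module "app" {
  source  = "tf.example.com/org/app/aws"
  version = "1.2.0"
}

# regions/us/main.tf (hypothetical): this region already runs the new release
module "app" {
  source  = "tf.example.com/org/app/aws"
  version = "2.0.0"
}
```

Moving the EU region to the new release means editing the version argument in that region's own root module.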

That isn't actually all that onerous in practice for simple cases, but it runs into another quirk of Terraform's design that makes it unlike many other language ecosystems: each module call is versioned independently of the others. Whereas most languages require that the entire program agree on a single version of each dependency, Terraform is quite happy to let you request v2.0.0 of a module in one module block but v1.2.0 of that same module in another.
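A sketch of what that permits, using a hypothetical module address and call names: both of these blocks can live in the same configuration, and each call gets its own installed copy of the module at the version it asked for.

```hcl
module "cluster_old" {
  source  = "tf.example.com/org/cluster/aws"
  version = "1.2.0"
}

module "cluster_new" {
  source  = "tf.example.com/org/cluster/aws"
  version = "2.0.0"
}
```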

While I don't know that I would've made that design decision had I been around for it and had known what I know now, I can't argue that it isn't useful: before I worked at HashiCorp, when I was a Terraform user rather than a developer, I used it to implement a multi-step workflow for transitioning between versions of cluster-based software like HashiCorp Consul and Elasticsearch, where during a transition I would temporarily have two module blocks each requesting different versions of the module but joined together into a single heterogeneous cluster. Once the "new" half of the cluster indicated it was healthy, I'd then remove the module block for the old module and apply again, thereby destroying all of the nodes belonging to the old version and returning to a homogeneous cluster again.

However, it does mean that in the more common case, where an org wishes to use the same version of a module in all calls and upgrade them all in lockstep, they end up needing to update every single module block to refer to the new version.

So maybe this is a Terraform-specific problem?

Reframing the Problem

It also turns out that, despite the reasons being different, a variation of this problem does exist for traditional application codebases: using as-yet-unpublished versions of dependencies while making cross-cutting code changes.

For example, when I'm working on Terraform I sometimes need to concurrently make a change to HCL, which is a separate library that provides the toolkit that the Terraform language is based on. The HCL change I've made is not yet published anywhere, so I cannot just change the fixed dependency in the Go module manifest file.

Go, in common with many other general-purpose language ecosystems, offers a way to solve this problem out-of-band, by creating a separate file that essentially overrides the dependency declarations just for my one working directory, without changing the file that's fixed in version control.

In Go's case, that concept is called workspaces and involves a go.work file. Of course, the noun "workspace" already means something else in Terraform, so we'd need to find different terminology to talk about it, but the concept still seems sound.

What's missing to support this in Terraform today is a way to talk symbolically about a module dependency so that the hypothetical "dependency override file" can talk about which dependencies it's intending to override. It would be challenging to do that in today's Terraform because each module block has its own separate source address, but Terraform does have its own module address syntax that's decoupled from any particular storage: module registry addresses.

A module registry address is special because instead of directly specifying the physical location of a package it captures two pieces of information:

  • A unique identifier to talk about a particular module package agnostic of how it would be installed, using DNS hostnames so that different organizations can populate that namespace independently without collisions.

  • The hostname of a default place where Terraform can ask for the available versions of that conceptual module and a physical location from which each version could be obtained.

Terraform currently doesn't really make much use of the first meaning when considering modules, but there's an analogous design pattern for provider plugins where the unique identifier allows overriding the installation strategy for subsets of providers, and I think we could pursue a similar principle for modules so that organizations can benefit from the abstraction of registry-shaped source addresses without actually needing to run a Terraform Registry.
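For reference, the provider-side pattern mentioned above is configured in the CLI configuration file with a provider_installation block. The sketch below follows the shape of that existing feature; the mirror path and the example.com namespace are placeholder values.

```hcl
provider_installation {
  # Providers whose addresses match this pattern are installed
  # from a local directory instead of from their origin registry.
  filesystem_mirror {
    path    = "/usr/share/terraform/providers"
    include = ["example.com/*/*"]
  }

  # Everything else is still installed directly from its registry.
  direct {
    exclude = ["example.com/*/*"]
  }
}
```

The provider address written in each configuration stays the same; only the installation strategy changes.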

A Dependency Override File

To make this more concrete, let's consider a hypothetical new file format for optionally overriding how Terraform interprets both module registry and provider registry addresses. I'm going to write it as HCL-based here because that's familiar to most Terraform users, but I would expect Terraform to accept a JSON variant of this too so it would be easier to generate mechanically when needed.

source_pkg "tf.example.com/org/app/aws" {
  # Force a specific version selection, but still use
  # the module registry at tf.example.com to find its
  # installation package.
  version = "1.0.0"
}

source_pkg "tf.example.com/org/app/gcp" {
  # Skip using the module registry altogether and just
  # select this branch from this git repository for
  # any call to this module.
  source = "git::https://git.example.com/org-app-gcp.git?ref=dev"
}

source_pkg "tf.example.com/org/app/azure" {
  # Skip installing anything at all and just use whatever
  # is in this local directory (resolved relative to
  # the file that contains these settings)
  source = "../org-app-azurerm"
}

# (and similar principle for provider packages)
provider_pkg "hashicorp/aws" {
  # any of the above options, with similar effects. This
  # would effectively generalize
  # "Development Overrides for Provider Developers"
  # https://developer.hashicorp.com/terraform/cli/config/config-file#development-overrides-for-provider-developers
  # to support non-development use-cases too, in a more
  # localized way than the existing network and filesystem
  # mirror mechanisms.
}

Module authors would then interact with these symbolic definitions by using registry-shaped addresses in their module blocks:

module "example" {
  source  = "tf.example.com/org/app/gcp"
  version = ">= 2.0.0"
}

By default Terraform would try to install the above by connecting to a module registry running on tf.example.com, but if the operator or execution environment supplied an override file like the above then Terraform would not attempt to contact that registry at all and would instead just behave as if the registry had returned git::https://git.example.com/org-app-gcp.git?ref=dev as the source location.

What I've described so far works as long as you intend to only use one version of each distinct module address at a time. For the rare cases where it's important to use multiple versions, the override format could potentially allow combining version with source to specify that the overridden source address is only applicable to a subset of the version numbers of a module:

source_pkg "tf.example.com/org/app/gcp" {
  version = "~> 1.0.0"
  source  = "git::https://git.example.com/org-app-gcp.git?ref=v1.0-dev"
}

source_pkg "tf.example.com/org/app/gcp" {
  version = ">= 2.0.0"
  source  = "git::https://git.example.com/org-app-gcp.git?ref=v2.0-dev"
}

I imagine then that Terraform would use the source_pkg block that best matches the version constraint in each module block referring to tf.example.com/org/app/gcp, so that a call constrained like ~> 1.0.0 would select the first block but an unconstrained call would select the second block.
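To sketch how that matching might play out against the two source_pkg blocks above, consider these two calls. The call names are hypothetical, and the selection behavior described in the comments is only the imagined behavior of this proposal, not something Terraform does today.

```hcl
module "legacy" {
  source  = "tf.example.com/org/app/gcp"
  version = "~> 1.0.0"
  # Would match the first source_pkg block, and so would be
  # installed from the ref=v1.0-dev branch.
}

module "current" {
  source = "tf.example.com/org/app/gcp"
  # No version constraint, so this would select the second
  # source_pkg block and be installed from ref=v2.0-dev.
}
```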

To make this practical to use, I imagine allowing each root module to have a default instance of this file format which terraform init would use when no other information is given. That file would be the first authority for the meaning of the abstract module and provider identifiers, so that teams that don't wish to run a network-based module or provider registry can place a file like this in their VCS repository instead.

But I would also imagine adding a new option to terraform init that can specify zero or more additional files to use, which would then supersede any matching declarations in the default file. That would allow teams to, for example, have a separate file for each environment and specify it when initializing, in a similar way to how some teams have a separate .tfvars file per environment and specify it when running terraform plan or terraform apply:

$ terraform init -override-deps=production.tfdeps.hcl

(We'd need to think carefully about how this situation would interact with the dependency lock file, since of course each distinct dependency override file has its own set of dependency selections. That seems solvable, though.)
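As a rough sketch of that layering, the two hypothetical files below declare the same module identifier. The proposal does not name the default file, and the repository URL and refs here are invented for illustration.

```hcl
# Default override file kept alongside the root module
# (hypothetical; used when terraform init is given no other file).
source_pkg "tf.example.com/org/app/aws" {
  source = "git::https://git.example.com/org-app-aws.git?ref=main"
}

# production.tfdeps.hcl, supplied with the hypothetical -override-deps
# option. Its declaration would supersede the matching one above.
source_pkg "tf.example.com/org/app/aws" {
  source = "git::https://git.example.com/org-app-aws.git?ref=v1.0.0"
}
```

Under this sketch, a plain terraform init would follow the main branch, while initializing with the production file would pin the same module identifier to a tagged release.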

This approach seems to solve the apparently-Terraform-specific problem of varying dependency selections based on what environment the configuration is being used in, and does so by adapting a pattern used to solve a different but similar problem in other language ecosystems.

It turns out that this more general problem is also a problem for Terraform -- module authors want to be able to vary their dependencies temporarily during development, too -- and so solving the problem in this way would also help to meet those needs.

Conclusion

In this article I've presented my understanding of the use-cases behind the request to allow including dynamic values in the source argument of a module block, enumerated some downsides of doing so, and then proposed some different approaches to meet the use-cases that make different design tradeoffs.

Overall my thesis here is that a relatively-niche language ecosystem like Terraform's benefits by being as consistent as possible with the practices followed in larger ecosystems, because general development tooling and industry practices are shaped by the decisions made in those ecosystems.

I believe that the problem of varying the credentials passed to Git when installing source packages from Git repositories is best solved by configuring Git itself and by cross-cutting improvements to CI systems to make such Git configuration more ergonomic, since those investments will have impact across many different language ecosystems rather than just Terraform.

On the other hand, the problem of varying dependency selections based on context is necessarily Terraform-specific both due to the stateful nature of Terraform and because Terraform has its own existing conventions for referring abstractly to dependencies. However, I think Terraform should still take design inspiration directly from designs such as Go Workspaces, and Cargo Workspaces for Rust, so that the differences are only in Terraform-specific syntax details and not in overall approach.

These two resolutions should meet the most important use-cases without the downsides of deviating significantly from the practices in other language ecosystems. The first is outside of Terraform's hands -- it's a matter of integrations between CI systems and Git -- while the second represents a concrete new set of features that could be offered in a future version of Terraform CLI.

Ultimately whether to pursue it is not my call, but I think it would be of great benefit to fill in some of these gaps in Terraform's dependency-management story.