The Theory Behind Terraform's "for each"

When first using Terraform, most people start by learning the syntax for requesting a single remote object. The example is often the venerable aws_instance, both because it is so widely used and because it is relatively simple:

resource "aws_instance" "example" {
  ami           = "ami-abc1234"
  instance_type = "t2.micro"
}

At this stage of learning, the terms resource, instance, and object can seem totally interchangeable. This single configuration block is named resource and it creates a single object in AWS, and (confusingly) its type contains the word "instance", though in a different sense from the one Terraform itself uses.

It usually isn't long before users discover the special count argument, which allows the same configuration block to generate multiple similar objects. At this point we introduce Terraform's idea of instance, which is used to distinguish the single configuration block from each of the possibly-multiple things it describes.

I expect that for many users the difference in meaning between "instance" and "object" remains obscure even after many years of use, because we intentionally minimize the exposure of the distinction in the Terraform UI. If you have used the special create_before_destroy lifecycle mode then you have indirectly benefited from the distinction: for a brief period during terraform apply, both a new object and an old object exist for the same instance, because the previous object (the "deposed" object, in Terraform's terms) has not yet been destroyed.

My main goal in writing this article is to describe the motivations for the new for_each feature that will serve as an alternative to count in a forthcoming version of Terraform. It is also an excuse to describe some of Terraform's related concepts in a slightly more theoretical way than we tend to engage with them day-to-day.

This is certainly not required reading for Terraform users, and indeed part of Terraform's mission is to (as much as possible) worry about these details so you don't have to. With that said, I know that some people enjoy theory for theory's sake and others have learning styles that prefer to build the practice on the theory, so I hope that both of those audiences will find this a useful overview. It may also prove useful for someone considering contributing code changes to Terraform Core itself.

Terraform Configuration as a Function

Before getting into the details, I think it's worth introducing my mental model of Terraform Configuration, since the rest of this article will assume and build upon it.

We often describe Terraform's language as declarative, which is distinguished from imperative. By this, we mean that Terraform configurations describe a desired result rather than the individual steps required to produce that result.

To put that in more practical terms, we can think of a Terraform configuration as a function which returns not actual infrastructure (whatever that might mean) but instead a data structure describing what ought to exist.

This characterization is not completely honest, though: the data structure it returns does include some additional information which implies an ordering of any steps required to achieve the result, as statements of the form "A is required by B". The Terraform language syntax is designed to allow those relationships to usually be defined implicitly, but you can also list them explicitly using the depends_on argument within certain blocks.
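
Statements of the form "A is required by B" can be read as edges in a dependency graph. As a rough sketch of how an ordering falls out of such statements (this is not Terraform's actual implementation, and the orderByDependencies helper and its input shape are made up for illustration), a depth-first walk is enough:

```javascript
// Hypothetical helper: given a map from each name to the names it depends on,
// return one order in which every dependency comes before its dependents.
function orderByDependencies(dependsOn) {
  const ordered = [];
  const visiting = new Set(); // names on the current path, for cycle detection
  const done = new Set();

  function visit(name) {
    if (done.has(name)) return;
    if (visiting.has(name)) {
      throw new Error("dependency cycle involving " + name);
    }
    visiting.add(name);
    for (const dep of dependsOn[name] || []) {
      visit(dep);
    }
    visiting.delete(name);
    done.add(name);
    ordered.push(name);
  }

  for (const name of Object.keys(dependsOn)) {
    visit(name);
  }
  return ordered;
}

// "A is required by B" means B depends on A.
console.log(orderByDependencies({ B: ["A"], A: [] }));
// [ 'A', 'B' ]
```

Any order that respects the edges is acceptable; the sketch simply returns the first one it finds.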

Another way this model falls short of reality is that the full data structure describing all desired objects cannot be evaluated all at once. The most obvious situation illustrating that fact is when an identifier assigned by the remote system must be used as part of the configuration of some other object:

resource "aws_security_group" "example" {
  name = "https_server"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "example" {
  ami           = "ami-abc1234"
  instance_type = "t2.micro"

  vpc_security_group_ids = [aws_security_group.example.id]
}

With this said, I think a more reasonable mental model is to say that the Terraform configuration is a function that returns a data structure containing other functions, each one defining a resource. In a conventional functional or imperative programming language, we might express the same thing in a different notation:

function all_infrastructure() {
  return {
    resources: [
      {
        type: "aws_security_group",
        name: "example",
        config: function () {
          return {
            name: "https_server",
            ingress: [{
              from_port: 443,
              to_port: 443,
              protocol: "tcp",
              cidr_blocks: ["0.0.0.0/0"],
            }],
          };
        },
      },
      {
        type: "aws_instance",
        name: "example",
        depends_on: ["aws_security_group.example"],
        config: function (sg) {
          return {
            ami: "ami-abc1234",
            instance_type: "t2.micro",
            vpc_security_group_ids: [sg.id],
          };
        },
      },
    ]
  }
}

Notice that the config function for the aws_instance resource takes an extra argument giving the final result of creating or updating the security group and uses it as part of its return value. The Terraform language infers the need for that extra requirement automatically by detecting the reference, but I made it explicit here to help illustrate that at a theoretical level each resource's configuration is itself a function that builds a configuration in terms of other objects in the configuration.
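
To make the "functions that take other objects as arguments" idea concrete, here is a small runnable sketch in the same notation. The fakeCreate and applyAll helpers are assumptions of mine standing in for the provider and for Terraform Core respectively, and the generated ids are placeholders rather than anything AWS would return:

```javascript
// Two resources in the article's notation. Each config function receives
// the finished objects for the addresses listed in depends_on, in order.
const resources = [
  {
    type: "aws_security_group",
    name: "example",
    depends_on: [],
    config: function () {
      return { name: "https_server" };
    },
  },
  {
    type: "aws_instance",
    name: "example",
    depends_on: ["aws_security_group.example"],
    config: function (sg) {
      return { ami: "ami-abc1234", vpc_security_group_ids: [sg.id] };
    },
  },
];

// Stand-in for a provider's create operation (an assumption for this sketch):
// returns the configuration plus an id "assigned by the remote system".
let nextId = 1;
function fakeCreate(type, config) {
  return Object.assign({ id: type + "-" + nextId++ }, config);
}

// Assumes the list is already ordered so that dependencies come first.
function applyAll(resourceList) {
  const objects = {};
  for (const r of resourceList) {
    const args = r.depends_on.map((addr) => objects[addr]);
    const config = r.config(...args);
    objects[r.type + "." + r.name] = fakeCreate(r.type, config);
  }
  return objects;
}

const result = applyAll(resources);
console.log(result["aws_instance.example"].vpc_security_group_ids);
// [ 'aws_security_group-1' ]
```

The instance's configuration cannot be computed until the security group's id exists, which is exactly why the whole data structure cannot be evaluated in one step.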

Implied Side-effects

Configuration is only one part of the story, of course. Terraform's task, after executing your configuration "functions", is to determine which actions must be taken in order to reach the described result.

Terraform's repertoire of actions derives from the core set of verbs commonly offered for objects in REST APIs: Create, Update, and Delete. The specific implementation of each of these actions depends on the resource type and is ultimately decided by the provider, but how does Terraform decide which actions are required in the first place?

Terraform maintains a sidecar data structure which is simply called the "state". The first time a user runs terraform apply, there isn't yet any state, which is equivalent to the state being empty. In this simple case, the only reasonable action for each resource in the configuration is to create the objects it requests. Terraform delegates to a provider to determine what exactly a "create" operation entails, which we could consider to be an impure function that takes a data structure describing the result of the resource's configuration function, performs various side-effects to create the requested object, and returns another data structure describing the object that was created:

// Pseudo-code for creating an aws_security_group object
function create_aws_security_group(config) {
  sg = ec2_sdk.create_security_group({
    name: config.name,
    // ec2_sdk is a stand-in for a real SDK client, not an actual API
    ingress_rules: config.ingress.map((x) => ({
      from_port: x.from_port,
      to_port: x.to_port,
      protocol: x.protocol,
      cidr_blocks: x.cidr_blocks,
    })),
    // etc, as required by the EC2 SDK
  })

  return {
    id: sg.id,
    name: config.name,
    ingress: config.ingress,
    // (and any other attributes defined for this resource type)
  }
}

The result of this function is a data structure that is logically a superset of the given configuration data structure, filling in any values that were determined as a result of the side-effect(s). This object is then recorded in the "state" for next time.

On a subsequent run of Terraform, the state lets Terraform know that there's already a remote object representing the requested instances, and Terraform will ask the provider to check for differences between the previous state and the configuration. If none are found then no action is required. If a change is detected then the provider will decide whether it can make that change in-place (via an update) or whether the object must be replaced (a destroy followed by a create, or vice-versa).

Assuming an in-place update is possible, the provider provides another impure function to handle an update:

function update_aws_security_group(prior_state, config) {
  result = ec2_sdk.update_security_group({
    id: prior_state.id
    // etc, etc
  });

  return {
    id: prior_state.id,
    name: config.name,
    ingress: config.ingress
    // etc, etc
  };
}

If we consider the absence of state to be a state in itself, we can say that actions in Terraform generalize as a function which takes both a configuration and a prior state (which might be null) and returns a new state. In order to show plans for approval this is actually two steps in practice, which can be thought of abstractly as two functions:

function plan_change(prior_state, config) {
  // compare prior_state and config, decide what the new
  // object will look like (planned_state) and whether this
  // change requires a destroy+create rather than a single
  // update.
  return { planned_state, replace_required };
}

function apply_change(prior_state, planned_state) {
  // call to a remote API to move from prior_state to
  // planned_state, and then build new_state to describe
  // the result.
  return new_state;
}

The distinction between planned_state and new_state is subtle, and is possible only because the Terraform language has a special feature: certain values within planned_state will be marked by the provider as "unknown", meaning that their concrete values will not be known until apply time. The closest analog to this in an imperative programming language would be a promise for a value determined during apply_change.
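
The following sketch illustrates the idea of unknown values using a sentinel. Terraform's real representation of unknowns is different; the UNKNOWN symbol, the function names, and the "sg-0123" id are all placeholders I have invented for the example:

```javascript
// Sentinel standing in for a value that will only be known after apply.
const UNKNOWN = Symbol("unknown");

function planCreateSecurityGroup(config) {
  // Values taken from the configuration are known at plan time. The id is
  // chosen by the remote system, so the plan can only mark it as unknown.
  return { id: UNKNOWN, name: config.name };
}

function applyCreateSecurityGroup(plannedState) {
  const assignedId = "sg-0123"; // placeholder for the id the remote API returns
  return { ...plannedState, id: assignedId };
}

// A new state agrees with its plan if every known planned value is
// unchanged; only unknown values may have been filled in.
function isConsistent(plannedState, newState) {
  return Object.keys(plannedState).every(
    (k) => plannedState[k] === UNKNOWN || plannedState[k] === newState[k]
  );
}

const planned = planCreateSecurityGroup({ name: "https_server" });
const applied = applyCreateSecurityGroup(planned);
console.log(planned.id === UNKNOWN); // true
console.log(applied.id); // sg-0123
console.log(isConsistent(planned, applied)); // true
```

The plan shown to the user can therefore be complete in shape while still leaving room for values that only the remote system can decide.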

The distinction between create, update, and delete here is now more implicit: if prior_state is null then we are creating, while if config is null then we are deleting. This gives us the same three operations but implemented through a common pipeline. In practice, the providers themselves still use separate functions to apply each action type due to an abstraction provided by the plugin SDK, but the plan step is shared.
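
As a minimal sketch of that common pipeline, the function below picks an action from a prior state and a configuration. The chooseAction name and the forceNew parameter are hypothetical: in reality each provider decides which attribute changes require replacement, and the comparison is more careful than the JSON-based one used here.

```javascript
// Decide which action a (prior_state, config) pair implies.
// forceNew lists attribute names that cannot be changed in place; that
// list is an assumption of this sketch, since providers decide it.
function chooseAction(priorState, config, forceNew = []) {
  if (priorState === null && config === null) return "no-op";
  if (priorState === null) return "create";
  if (config === null) return "delete";

  // Simple structural comparison, adequate only for this illustration.
  const changed = Object.keys(config).filter(
    (k) => JSON.stringify(config[k]) !== JSON.stringify(priorState[k])
  );
  if (changed.length === 0) return "no-op";
  return changed.some((k) => forceNew.includes(k)) ? "replace" : "update";
}

const prior = { id: "i-abc123", ami: "ami-abc1234", instance_type: "t2.micro" };
const same = { ami: "ami-abc1234", instance_type: "t2.micro" };
const resized = { ami: "ami-abc1234", instance_type: "t2.small" };
const reimaged = { ami: "ami-def5678", instance_type: "t2.micro" };

console.log(chooseAction(null, same)); // create
console.log(chooseAction(prior, null)); // delete
console.log(chooseAction(prior, same)); // no-op
console.log(chooseAction(prior, resized)); // update
console.log(chooseAction(prior, reimaged, ["ami"])); // replace
```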

Correlating Between Configuration and State

In the sequence above I intentionally skipped a crucial detail: how does Terraform know which blocks in the configuration correspond to which objects recorded in the state?

Each instance has an identity that persists from one run to the next, which is used in both the configuration and the state. Because these identifiers are crucial to Terraform's planning mechanism, the Terraform language defines a concise representation of them in the form of a resource address. This is a bit of a misnomer since the syntax actually identifies a specific instance rather than a resource as a whole, but in our examples so far resources and instances have been one-to-one because we've not been using count:

aws_security_group.example
aws_instance.example

If we were to update the aws_instance resource to include count = 2, the set of addresses in this configuration would change to reflect that resource and instance are no longer one-to-one:

resource "aws_instance" "example" {
  count = 2

  ami           = "ami-abc1234"
  instance_type = "t2.micro"

  vpc_security_group_ids = [aws_security_group.example.id]

  tags = {
    Name = "example-${count.index}"
  }
}

aws_security_group.example
aws_instance.example[0]
aws_instance.example[1]

Terraform must use an additional instance index to uniquely identify each of the two instances that the resource "aws_instance" "example" block now represents. With count, these indices are always consecutive integers from zero to the count value minus one.
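
The rule for building these addresses is simple enough to write down directly. This is only an illustration: the instanceAddresses helper, and its convention of passing null for a block without count, are mine rather than anything in Terraform:

```javascript
// Build the instance addresses for a resource block. A count of null stands
// for a block that does not set the count argument at all.
function instanceAddresses(type, name, count = null) {
  const base = type + "." + name;
  if (count === null) return [base];
  return Array.from({ length: count }, (_, idx) => base + "[" + idx + "]");
}

console.log(instanceAddresses("aws_security_group", "example"));
// [ 'aws_security_group.example' ]
console.log(instanceAddresses("aws_instance", "example", 2));
// [ 'aws_instance.example[0]', 'aws_instance.example[1]' ]
```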

When count is in use, we can say that the configuration-building function implied by this block now takes an additional argument giving that instance index:

{
  type: "aws_instance",
  name: "example",
  depends_on: ["aws_security_group.example"],
  config: function (idx, sg) {
    return {
      ami: "ami-abc1234",
      instance_type: "t2.micro",
      vpc_security_group_ids: [sg.id],
      tags: {
        Name: "example-" + idx,
      },
    };
  },
},

Now Terraform will execute this configuration "function" once for each index, and ask the provider to compare with the prior state of the corresponding instance. The index is used as part of the identifier to correlate configuration with state, so on subsequent runs Terraform can distinguish these two objects in the state and ensure the correct result from the configuration function is compared to the correct instance state.

If we increase to count = 3 and re-run Terraform, it will see that there is no aws_instance.example[2] in the state and know that it must be created, while the existing instances zero and one should remain unchanged. Likewise, if we decrease to count = 1, Terraform will treat that as intent to delete the object represented by aws_instance.example[1], while leaving instance zero unchanged.
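
The comparison described here amounts to a set difference between the indices in the state and the indices implied by the new count. A minimal sketch, with a planCountChange function that is purely illustrative:

```javascript
// Compare the instance indices recorded in the state with those implied by
// the new count, and report what would need to change.
function planCountChange(stateIndices, newCount) {
  const desired = Array.from({ length: newCount }, (_, i) => i);
  return {
    create: desired.filter((i) => !stateIndices.includes(i)),
    destroy: stateIndices.filter((i) => i >= newCount),
    keep: desired.filter((i) => stateIndices.includes(i)),
  };
}

console.log(planCountChange([0, 1], 3));
// { create: [ 2 ], destroy: [], keep: [ 0, 1 ] }
console.log(planCountChange([0, 1], 1));
// { create: [], destroy: [ 1 ], keep: [ 0 ] }
```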

So far so good! We can successfully correlate between configuration and state to produce a suitable set of actions. This approach has an important limitation though, which we will see in the following section.

More Complex Interactions

Although count was originally intended simply as a means to construct multiple similar objects for purposes like horizontal scaling, users constructing reusable modules found themselves with more complex needs. Someone in the community (sadly, I've lost track of who) realized that count can be set to a non-constant expression in order to express relationships in a more intuitive way.

A straightforward example is using a list of values to both choose a number of instances and set one or more unique properties for each instance using a single input variable:

variable "instance_names" {
  type = list(string)
  default = [
    "foo",
    "bar",
    "baz",
  ]
}

resource "aws_instance" "example" {
  count = length(var.instance_names)

  ami           = "ami-abc1234"
  instance_type = "t2.micro"

  vpc_security_group_ids = [aws_security_group.example.id]

  tags = {
    Name = var.instance_names[count.index]
  }
}

Because count is a special argument handled by Terraform itself, Terraform can evaluate this expression before it evaluates the rest of the configuration for the resource. This resource's implied definition function still takes a single index each time, with Terraform using the length of the list of names to decide how many times to call the function, which might now look like this:

{
  type: "aws_instance",
  name: "example",
  depends_on: ["count.index", "var.instance_names", "aws_security_group.example"],
  config: function (idx, names, sg) {
    return {
      ami: "ami-abc1234",
      instance_type: "t2.micro",
      vpc_security_group_ids: [sg.id],
      tags: {
        Name: names[idx],
      },
    };
  },
},

The result is three instances numbered with indices zero, one and two, each identical other than the Name tag. This sort of usage is attractive because it serves to de-emphasize the numeric indices assigned by Terraform and focus instead on the one-to-one relationship between the names and the instances.

This pattern runs into trouble if a new element is added into the middle of the list of names, though:

variable "instance_names" {
  type = list(string)
  default = [
    "foo",
    "boop",
    "bar",
    "baz",
  ]
}

The instances and their names are still connected only indirectly via their indices, and so this change is understood by Terraform not as adding a new name to the list but rather as renaming the instances with indices one and two, and then adding a new instance with index three.

Assuming these pet names have some meaning to the user, the result is confusing: what was formerly instance "bar" is now instance "boop", and to make matters worse there is still an instance "bar" but it is the one that was initially called "baz".

This situation is particularly problematic when the list values are used to populate arguments that cannot be updated in-place. In this case, Terraform will needlessly destroy and re-create existing instances rather than just adding one new one.
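
The effect is easy to reproduce with a small simulation that correlates the old and new lists by index, as count does. The planByIndex function is a sketch of the behavior described above, not Terraform's planning code:

```javascript
// Correlate old and new name lists by index and report the action implied
// for each instance address.
function planByIndex(oldNames, newNames) {
  const actions = {};
  const length = Math.max(oldNames.length, newNames.length);
  for (let i = 0; i < length; i++) {
    const addr = "aws_instance.example[" + i + "]";
    if (i >= oldNames.length) {
      actions[addr] = "create with Name=" + newNames[i];
    } else if (i >= newNames.length) {
      actions[addr] = "destroy";
    } else if (oldNames[i] !== newNames[i]) {
      actions[addr] = "change Name " + oldNames[i] + " -> " + newNames[i];
    } else {
      actions[addr] = "no change";
    }
  }
  return actions;
}

const actions = planByIndex(
  ["foo", "bar", "baz"],
  ["foo", "boop", "bar", "baz"]
);
for (const [addr, action] of Object.entries(actions)) {
  console.log(addr + ": " + action);
}
// aws_instance.example[0]: no change
// aws_instance.example[1]: change Name bar -> boop
// aws_instance.example[2]: change Name baz -> bar
// aws_instance.example[3]: create with Name=baz
```

One inserted name produces two changes and a create. If Name were an argument that cannot be updated in place, those two changes would each become a destroy and re-create.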

Because this issue arises only on subsequent updates, this problem has emerged as a bit of a "trap": users will write modules that use this pattern, but find out too late that they actually can't change the list without destroying an important remote object. Working around this requires the use of Terraform plumbing commands that can, if not used with care, cause data loss.

Custom Instance Keys with `for_each`

The key problem (pun not intended!) with this pattern is that the relationships between indices and configuration values are decided by Terraform itself, and always forced to be based on indices. In the real world, the situation is rarely this clean, since remote systems tend not to themselves track objects by index. This creates a mismatch between Terraform's own tracking mechanism and the one used by the remote system or by the user.

This problem would be solved if Terraform allowed the user to customize the correspondences between configuration values and instances. The for_each feature aims to achieve that by generalizing the idea of instance indices to instead be instance keys, and then allow the user to define which keys are used on a per-resource basis.

resource "aws_instance" "example" {
  for_each = {for n in var.instance_names: n => n}

  ami           = "ami-abc1234"
  instance_type = "t2.micro"

  vpc_security_group_ids = [aws_security_group.example.id]

  tags = {
    Name = each.key
  }
}

This more complex expression in the for_each argument is a new Terraform 0.12 feature for projecting between collection values. Those familiar with functional languages or certain imperative languages might recognize this as a map comprehension, though in the Terraform language we simply call it a "for expression". The effect of this expression is to construct a map from the list variable, with the same result as if the user had hand-written the following map:

  for_each = {
    foo = "foo"
    bar = "bar"
    baz = "baz"
  }
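
For readers more at home in the pseudo-code notation used earlier, a rough JavaScript analog of that for expression follows. The namesToMap name is just for illustration. One difference worth knowing: Terraform reports an error if two elements produce the same key, whereas this sketch would silently keep the last one.

```javascript
// Rough JavaScript analog of {for n in var.instance_names: n => n}.
function namesToMap(names) {
  return Object.fromEntries(names.map((n) => [n, n]));
}

console.log(namesToMap(["foo", "bar", "baz"]));
// { foo: 'foo', bar: 'bar', baz: 'baz' }
```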

Whereas count forces the instance keys to be consecutive integers starting at zero, for_each allows the user to select arbitrary strings as instance keys, leading to a new variant of resource address:

aws_instance.example["bar"]
aws_instance.example["baz"]
aws_instance.example["foo"]

When for_each is used, instead of using count.index to select an index we can use each.key and each.value to reference keys and values from the given map, which in this case are both equal. Continuing our functional pseudo-code, we might consider this new block to be similar to the following:

{
  type: "aws_instance",
  name: "example",
  depends_on: ["each.key", "aws_security_group.example"],
  config: function (key, sg) {
    return {
      ami: "ami-abc1234",
      instance_type: "t2.micro",
      vpc_security_group_ids: [sg.id],
      tags: {
        Name: key,
      },
    };
  },
},

As with count, the for_each is handled by Terraform itself as a special case. Terraform evaluates the expression and then behaves as if it were calling this configuration function once per element in the resulting collection, passing in the key and value for each element if requested.
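
Here is a sketch of that expansion step. The expandForEach helper, the each object it passes in, and the "sg-0123" id are all assumptions made for the example; Terraform Core's real evaluation is considerably more involved:

```javascript
// Hypothetical expansion step: call the block's config function once per
// element of the for_each map and index the results by instance address.
function expandForEach(resource, forEachMap, sg) {
  const instances = {};
  for (const [key, value] of Object.entries(forEachMap)) {
    const addr = resource.type + "." + resource.name + '["' + key + '"]';
    instances[addr] = resource.config({ key, value }, sg);
  }
  return instances;
}

const instanceBlock = {
  type: "aws_instance",
  name: "example",
  config: function (each, sg) {
    return {
      ami: "ami-abc1234",
      instance_type: "t2.micro",
      vpc_security_group_ids: [sg.id],
      tags: { Name: each.key },
    };
  },
};

const expanded = expandForEach(
  instanceBlock,
  { foo: "foo", bar: "bar" },
  { id: "sg-0123" } // placeholder for the already-created security group
);
console.log(Object.keys(expanded));
// [ 'aws_instance.example["foo"]', 'aws_instance.example["bar"]' ]
console.log(expanded['aws_instance.example["bar"]'].tags.Name); // bar
```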

Most importantly, though: the Terraform state uses the keys from the collection to identify each instance, so introducing our new name boop into the list simply establishes a new instance with that key:

aws_instance.example["bar"]
aws_instance.example["baz"]
aws_instance.example["boop"]
aws_instance.example["foo"]

This resource's definition does not refer to any individual indices in the list of names — indeed, the conversion to map discards any sense of ordering — and so adding a new name to the list simply declares a new instance, leaving all of the existing instances untouched.
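
Correlating by key rather than by position can be sketched as follows. As before, planByKey is illustrative only:

```javascript
// Correlate by instance key, as for_each does: any key missing from the
// state is created and any key missing from the configuration is destroyed.
function planByKey(stateKeys, configKeys) {
  const inState = new Set(stateKeys);
  const inConfig = new Set(configKeys);
  const addr = (k) => 'aws_instance.example["' + k + '"]';
  return {
    create: configKeys.filter((k) => !inState.has(k)).map(addr),
    destroy: stateKeys.filter((k) => !inConfig.has(k)).map(addr),
    unchanged: configKeys.filter((k) => inState.has(k)).map(addr),
  };
}

const plan = planByKey(["foo", "bar", "baz"], ["foo", "boop", "bar", "baz"]);
console.log(plan.create); // [ 'aws_instance.example["boop"]' ]
console.log(plan.destroy); // []
console.log(plan.unchanged.length); // 3
```

Where the new name appears in the list makes no difference to the result, because position never enters into the comparison.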

Other `for_each` Types

The example in the previous section showed what we expect will be the most common situation where for_each is set to a map value, allowing the user to concisely set both a string key and a potentially-complex value for each instance.

The for_each argument will also accept list values, in which case it is essentially a more readable form of count where the instances are still identified by numeric indices:

resource "aws_instance" "example" {
  for_each = var.instance_names

  ami           = "ami-abc1234"
  instance_type = "t2.micro"

  vpc_security_group_ids = [aws_security_group.example.id]

  tags = {
    Name  = each.value
    Index = each.key
  }
}

The above has all of the same caveats as using count, so it should be used sparingly, but may be useful in rare situations where the ordering of the instances is significant.

More interesting for the simple use-case we've used in this article is to use a set of strings value:

variable "instance_names" {
  type = set(string)
  default = [
    "foo",
    "bar",
    "baz",
  ]
}

resource "aws_instance" "example" {
  for_each = var.instance_names

  ami           = "ami-abc1234"
  instance_type = "t2.micro"

  vpc_security_group_ids = [aws_security_group.example.id]

  tags = {
    Name  = each.value
    Index = each.key
  }
}

By changing the type of var.instance_names to be set(string) rather than list(string) we tell Terraform that the ordering of these elements is not significant and that each value must be distinct, which is the same set of constraints as for the instance keys themselves. This allows us to use the variable directly as the for_each expression without the gotcha of using numeric index keys.

In fact, internally Terraform treats a set of strings here as if it were the for expression we saw in our first example above, constructing a map where the key and value for each element are equal. A set of strings is, therefore, just a shorthand for this common case.
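
One consequence is that each.key and each.value are the same string for every instance when a set is used. The sketch below shows this with a JavaScript Set; the eachPairs helper is hypothetical, and the sorting is only there to make the output predictable:

```javascript
// For a set of strings, produce the (key, value) pair that each instance
// would see. Duplicates collapse because a set holds distinct values only.
function eachPairs(names) {
  return [...new Set(names)].sort().map((n) => ({ key: n, value: n }));
}

const pairs = eachPairs(["foo", "bar", "baz", "foo"]);
console.log(pairs.map((p) => p.key)); // [ 'bar', 'baz', 'foo' ]
console.log(pairs.every((p) => p.key === p.value)); // true
```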

Conclusion

The for_each feature was too large to fit in the initial Terraform 0.12 release along with all of the other significant language changes, but the 0.12 development process did include a lot of groundwork for this feature such as making sure the state serialization format can deal with both integer and string instance keys.

We plan to complete the feature in a minor release in the v0.12 series, though the timeline for that will depend on how much post-release work is required for the other changes coming in 0.12.0. I think this will be a big help for anyone writing reusable modules that create abstractions, and will be a logical extension of the other expression-level improvements in 0.12.0.