JSON and HCL

The README for the original incarnation of HCL (which, for convenience, I'm going to refer to as "HCL 1" in this article) included the following statement:

HCL is also fully JSON compatible. That is, JSON can be used as completely valid input to a system expecting HCL. This helps makes systems interoperable with other systems.

In practice, the design of HCL 1 and its slightly-different usage in different applications meant that this statement came with some caveats, but for most purposes it is true.

Some people reading this statement, along with similar ones in application-specific docs like Terraform's, have then insisted that in order for it to be true it must be possible to write a generic tool to convert HCL syntax to JSON and vice-versa. In fact, Kevin van Zonneveld wrote json2hcl for that very use-case, and it broadly works! But not always, and those encountering its limitations will often conclude that HCL's claim of "full JSON compatibility" is unwarranted.

One of the key design goals of HCL 2 -- in many ways a spiritual successor of HCL rather than an incremental update -- was to ensure that it would still be "JSON compatible" in the same sense that HCL 1 claimed, even though HCL 2 includes lots of additional constructs for arbitrary expressions that have no straightforward mapping to JSON. However, a generic "JSON to HCL" or "HCL to JSON" tool (agnostic of any particular application's language) is still not possible.

This article, then, attempts to explain what we mean by "JSON compatible" and, therefore, why we make that claim even though a generic one-to-one conversion isn't possible.

Information Sets

In a similar vein to XML, HCL is not really a language on its own but rather a set of building blocks from which languages can be constructed. Both XML and HCL provide a higher-level vocabulary for describing languages, allowing many distinct languages to all share the same low-level lexical analysis and parsing implementation and then implement language-specific concepts in terms of the result.

In XML, languages are defined in terms of concepts such as elements, attributes, character nodes, etc. The specification of an XML-based language, such as Atom, describes the language concepts in terms of XML's own constructs, not in terms of the raw bytes that represent the data when serialized.

This set of concepts in XML is formally known as the XML Information Set, and for XML in particular it is actually a formal specification of its own to help provide a consistent set of terminology for referring to these constructs in other specifications.

Most other shared syntaxes do not have a formally-defined information set (infoset), but there is always at least an informal infoset. For example, JSON is also used to define higher-level languages such as Activity Streams, which use concepts from JSON's less-formal infoset: "object", "property", "array", "string", etc.

HCL 1's infoset is quite similar to JSON's, but makes some additions to the concepts, as well as using some slightly-different terminology:

An object contains a sequence of object items, which are comparable to JSON's properties.
Each object item has one or more keys (vs. JSON where each property has exactly one key).
An object item can either introduce a nested object or an attribute of the containing object.

A language defined in terms of the HCL 1 infoset can make use of any of these concepts. HCL 1's infoset is, some terminology differences aside, a superset of JSON's.

When defining a language on top of an existing language, our job as the language designer is to define how the underlying language infoset maps to our own infoset. In Terraform's case, for example, the application-level infoset includes concepts like "resource", "input variable", "output value"; as we design features for the Terraform language, we define how HCL 1's infoset concepts are combined to represent Terraform's infoset concepts.

The JSON support in HCL 1 follows a similar principle: it defines a mapping from the JSON infoset to the HCL 1 infoset. But because the JSON infoset is a subset of HCL's, there are necessarily structures that can be written in the native HCL syntax (which supports the full infoset) but cannot be expressed in JSON. If that is true, how can HCL 1 claim full JSON compatibility?

HCL 1's "Decoder"

The answer comes from another component of HCL 1: its decoder.

HCL's parser can interpret either native HCL syntax or JSON syntax into the HCL 1 infoset, but as we noted above the JSON syntax can only represent a subset of the infoset. Specifically:

Object items always have only one key, except for the following special case.
If a number of single-property objects are nested inside each other, the parser will chain them all together into a single object item with multiple keys, essentially flattening out the nested structure.
All object items introduce an attribute of the containing object, though that attribute may have an object as a value.
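The special case in the list above can be sketched in a few lines of Go. This is only an illustration of the rule as it is described here, not the HCL 1 parser's actual code: the Item type and the flatten function are hypothetical names, and the sketch works on the generic values produced by encoding/json rather than on HCL 1's own syntax tree.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Item is a hypothetical stand-in for an HCL 1 "object item": a chain of
// one or more keys plus the value that the chain leads to.
type Item struct {
	Keys  []string
	Value any
}

// flatten follows the simplified rule described above: for as long as the
// value is a JSON object with exactly one property, that property's name is
// appended to the key chain instead of being kept as a nested object.
func flatten(key string, value any) Item {
	item := Item{Keys: []string{key}, Value: value}
	for {
		obj, ok := item.Value.(map[string]any)
		if !ok || len(obj) != 1 {
			return item
		}
		// The map has exactly one entry, so this loop body runs once.
		for k, v := range obj {
			item.Keys = append(item.Keys, k)
			item.Value = v
		}
	}
}

func main() {
	src := `{"resource": {"aws_instance": {"web": {"ami": "abc", "count": 2}}}}`

	var root map[string]any
	if err := json.Unmarshal([]byte(src), &root); err != nil {
		panic(err)
	}
	for k, v := range root {
		item := flatten(k, v)
		// The three nested single-property objects collapse into one item
		// whose key chain is: resource aws_instance web
		fmt.Println(strings.Join(item.Keys, " "))
	}
}
```

Note that in this sketch an innermost object with only one property would also be absorbed into the key chain, so the parser alone cannot tell a label from an argument name. That hints at why an application has to account for the JSON-produced subset separately.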

An application using HCL 1 must therefore, in principle at least, define a mapping both from the full HCL 1 infoset and from this JSON-hampered subset of it in order to support both syntaxes.

In practice, the HCL 1 decoder hides this detail for most applications. To use it, an application defines a number of Go language struct types with types and tags that describe what structure is expected:

// The root struct for HashiCorp Consul's policy language.
type Policy struct {
	ID                    string                 `hcl:"id"`
	Revision              uint64                 `hcl:"revision"`
	ACL                   string                 `hcl:"acl,expand"`
	Agents                []*AgentPolicy         `hcl:"agent,expand"`
	AgentPrefixes         []*AgentPolicy         `hcl:"agent_prefix,expand"`
	Keys                  []*KeyPolicy           `hcl:"key,expand"`
	KeyPrefixes           []*KeyPolicy           `hcl:"key_prefix,expand"`
	Nodes                 []*NodePolicy          `hcl:"node,expand"`
	NodePrefixes          []*NodePolicy          `hcl:"node_prefix,expand"`
	Services              []*ServicePolicy       `hcl:"service,expand"`
	ServicePrefixes       []*ServicePolicy       `hcl:"service_prefix,expand"`
	Sessions              []*SessionPolicy       `hcl:"session,expand"`
	SessionPrefixes       []*SessionPolicy       `hcl:"session_prefix,expand"`
	Events                []*EventPolicy         `hcl:"event,expand"`
	EventPrefixes         []*EventPolicy         `hcl:"event_prefix,expand"`
	PreparedQueries       []*PreparedQueryPolicy `hcl:"query,expand"`
	PreparedQueryPrefixes []*PreparedQueryPolicy `hcl:"query_prefix,expand"`
	Keyring               string                 `hcl:"keyring"`
	Operator              string                 `hcl:"operator"`
}

The struct tags here give the decoder crucial information about what shape of data the application is looking for. That lets the decoder match the HCL 1 infoset objects it finds loosely, using a ruleset that, with some additional rules, understands the result of both a native syntax parse and a JSON syntax parse.

To use our terminology from earlier, the HCL 1 decoder defines a flexible (and lossy) mapping from the HCL 1 infoset to a subset of Go's type system infoset. This mapping is defined in such a way that many different input permutations can produce the same result, which allows the JSON parsing subset to be understood, but also allows for some non-canonical forms in the native syntax too. The exact structure used in the input is lost in the conversion, so the application cannot recognize these distinctions.
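To make the idea of a lossy mapping concrete, here is a minimal sketch using only Go's standard library. It is not the HCL 1 decoder's algorithm: ServicePolicy, serviceBody and decodeServices are hypothetical names, and the input is plain JSON. What it shows is two differently-shaped inputs decoding to the same Go value, after which the application cannot tell which shape was used.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// ServicePolicy is a hypothetical target type, standing in for the kind of
// struct an application would hand to a decoder.
type ServicePolicy struct {
	Name   string
	Policy string
}

type serviceBody struct {
	Policy string `json:"policy"`
}

// decodeServices accepts either one object keyed by service name or a list
// of such objects, and normalizes both shapes into the same slice.
func decodeServices(src []byte) ([]ServicePolicy, error) {
	var list []map[string]serviceBody
	if err := json.Unmarshal(src, &list); err != nil {
		// Not a list: fall back to treating the input as a single object.
		var single map[string]serviceBody
		if err := json.Unmarshal(src, &single); err != nil {
			return nil, err
		}
		list = []map[string]serviceBody{single}
	}

	var out []ServicePolicy
	for _, obj := range list {
		// Go maps are unordered, so names within one object are sorted
		// to keep the result deterministic.
		names := make([]string, 0, len(obj))
		for name := range obj {
			names = append(names, name)
		}
		sort.Strings(names)
		for _, name := range names {
			out = append(out, ServicePolicy{Name: name, Policy: obj[name].Policy})
		}
	}
	return out, nil
}

func main() {
	a, errA := decodeServices([]byte(`{"web": {"policy": "read"}, "db": {"policy": "write"}}`))
	b, errB := decodeServices([]byte(`[{"db": {"policy": "write"}}, {"web": {"policy": "read"}}]`))
	if errA != nil || errB != nil {
		panic("decode failed")
	}
	// Both inputs yield the same slice; the original shape is gone.
	fmt.Println(reflect.DeepEqual(a, b)) // prints "true"
}
```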

The json2hcl tool works because it exploits the lossiness of the HCL 1 decoder mapping: it can produce an HCL syntax representation of the subset of the information model that the JSON parser produces, and even though its result may not match the idiomatic way humans write the target language, the differences are lost during decoding anyway and so the generated HCL source can be loaded successfully.

The decoder's mapping is not sufficient for all possible languages built on the HCL 1 infoset, though. Terraform in particular does not use the decoder and instead works directly with the Go representation of the infoset, which presents some interesting challenges.

Low-level HCL 1 Processing

Terraform's language in versions prior to v0.12 (when HCL 2 was introduced) is implemented against HCL 1's low-level API, which allows it to exert more control over the interpretation of the input and thus produce better guidance for users about constructs that would be ambiguous to the decoder:

# Terraform is able to reject this because it can see that "foo" is an
# additional key on the locals block, which is not expected.
locals "foo" {
  bar = "baz"
}

# If using the HCL 1 decoder directly, the above would be indistinguishable
# to Terraform from the following, which would then generally lead to confusing
# errors downstream trying to use the local value.
locals {
  foo = {
    bar = "baz"
  }
}

Because of these extra constraints, json2hcl's assumptions don't necessarily apply, and it will sometimes produce HCL input that Terraform will reject because it is written in a way that is considered equivalent by the HCL decoder but does not produce an equivalent HCL infoset structure.

Another consequence of Terraform bypassing the decoder is that it must therefore itself allow for the fact that JSON input uses an unusual subset of the HCL infoset. Over the years this has led to various bugs that appear only when JSON syntax is used as input, some of which persisted right until the final release of Terraform that was using HCL 1.

A key design goal for HCL 2 was to ensure that, unlike HCL 1, language designers could exert the level of control Terraform needs over the mapping process while still having a well-defined, standard interpretation of JSON input. To achieve that, HCL 2 has a differently-shaped abstract infoset.

The HCL 2 infoset

While HCL 1's infoset was just some small adjustments (and renames) from the JSON infoset, HCL 2 has some more substantial differences:

The top-level construct is a body, which contains both arguments and blocks. Collectively, these are body content items.
Arguments are name/value pairs where the name is always a string and the value is an "expression".
Blocks each have a type name and zero or more labels. Each block also has its own nested body.
An expression is a mapping from a set of variables and functions to a value.

The above is a description of HCL 2's syntax-agnostic infoset. The HCL 2 native syntax is defined as a pretty direct mapping from input bytes to an equivalent native syntax infoset, but the approach for JSON is different: instead of trying to adapt JSON into the HCL 2 infoset at parse time, HCL 2 instead just parses the JSON into a representation of JSON's own infoset. The mapping from JSON infoset to HCL 2 infoset is deferred.
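As a rough aid to reading the rest of this article, the four abstract concepts listed above can be written down as Go types. This is only a sketch under simplifying assumptions: the names Body, Block and Expression are chosen here for illustration and are not the types of the real hcl package, and functions are left out of the expression signature to keep it short.

```go
package main

import "fmt"

// Expression maps a set of variables to a value. (Functions are omitted
// in this sketch.)
type Expression func(vars map[string]any) any

// Block has a type name, zero or more labels, and its own nested body.
type Block struct {
	Type   string
	Labels []string
	Body   Body
}

// Body holds arguments (name/expression pairs) and nested blocks.
type Body struct {
	Arguments map[string]Expression
	Blocks    []Block
}

func main() {
	root := Body{
		Arguments: map[string]Expression{
			// An argument's value is an expression, not a plain constant.
			"name": func(vars map[string]any) any {
				return fmt.Sprintf("%v-baz", vars["prefix"])
			},
		},
		Blocks: []Block{
			{Type: "network_interface", Labels: []string{"foo"}},
		},
	}

	// The expression only produces a value once variables are supplied.
	fmt.Println(root.Arguments["name"](map[string]any{"prefix": "dev"}))
	fmt.Println(root.Blocks[0].Type, root.Blocks[0].Labels)
}
```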

In the HCL 2 API, mapping from the syntax-specific infosets to the abstract infoset is always driven by a schema. This is a similar idea to HCL 1's decoder, but instead of describing the desired structure using the Go type system infoset, the implementer describes it using concepts from the HCL 2 abstract infoset:

content, diags := body.Content(&hcl.BodySchema{
    Attributes: []hcl.AttributeSchema{
        {Name: "name", Required: true},
    },
    Blocks: []hcl.BlockHeaderSchema{
        {
            Type:       "network_interface",
            LabelNames: []string{"name"},
        },
    },
})

Each input syntax has its own separate implementation of Body, which abstracts over the syntax-specific infoset it came from. Each implementation is responsible for mapping its syntax-specific infoset concepts onto syntax-agnostic infoset concepts, using the given schema as guidance to resolve any ambiguity.

The native syntax infoset maps one-to-one with the abstract infoset, so its implementation is relatively straightforward. The JSON infoset is a subset of the HCL abstract infoset though, so we need the information in the schema to choose which of possibly-several abstract infoset concepts a JSON infoset concept should map to.

This compromise allows Terraform to describe the exact structure it is expecting in terms of the abstract infoset and have that enforced for the native syntax while the JSON syntax implementation instead uses it to inform the mapping.

JSON Mapping Ambiguity

The key implication of this approach is that the meaning of particular JSON input can differ based on the schema used to decode it, even if the input itself is unchanged. A JSON input that is valid under one schema may be invalid under another.

Consider the following input, for example:

{
    "name": "baz",
    "network_interface": [
        {
            "foo": {}
        },
        {
            "bar": {}
        }
    ]
}

With the schema above, that would be interpreted like the following native syntax input:

name = "baz"

network_interface "foo" {}
network_interface "bar" {}

But if the schema instead had no labels defined for that network_interface block type, it would be interpreted as the following instead:

name = "baz"

network_interface {
    foo = {}
}
network_interface {
    bar = {}
}

...and if network_interface were declared as an attribute instead of a block:

name = "baz"

network_interface = [
  {
    foo = {}
  },
  {
    bar = {}
  },
]

And this is the subtlety of the claim of HCL being "fully JSON compatible": it doesn't mean that any JSON input can be mapped mechanically into HCL syntax, but rather that, for any specific schema, every abstract infoset result that can be obtained by mapping from the native syntax infoset can also be obtained by mapping from the JSON infoset.
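The first two interpretations of the network_interface value shown earlier can be reproduced with a small stdlib-only sketch. It is not how the hcl package implements its JSON body: Block, decodeBlocks and collect are hypothetical names, and the "schema" is reduced to a single number, the count of labels the block type expects. The attribute case, where the whole array is simply taken as the value, is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Block is a hypothetical, simplified result type.
type Block struct {
	Type   string
	Labels []string
	Body   map[string]any
}

// decodeBlocks interprets a JSON value as blocks of the given type, using
// the expected label count as the only piece of schema.
func decodeBlocks(typeName string, value any, labelCount int) ([]Block, error) {
	return collect(typeName, nil, value, labelCount)
}

func collect(typeName string, labels []string, value any, remaining int) ([]Block, error) {
	// A JSON array means "several blocks" at the current nesting level.
	if list, ok := value.([]any); ok {
		var out []Block
		for _, v := range list {
			blocks, err := collect(typeName, labels, v, remaining)
			if err != nil {
				return nil, err
			}
			out = append(out, blocks...)
		}
		return out, nil
	}
	obj, ok := value.(map[string]any)
	if !ok {
		return nil, fmt.Errorf("%s: expected a JSON object or array", typeName)
	}
	if remaining == 0 {
		// All labels consumed: whatever is left is the block body.
		return []Block{{Type: typeName, Labels: labels, Body: obj}}, nil
	}
	// Each property name at this level is one label. Keys are sorted only
	// because Go maps do not preserve the order of the JSON source.
	keys := make([]string, 0, len(obj))
	for k := range obj {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var out []Block
	for _, k := range keys {
		next := append(append([]string{}, labels...), k)
		blocks, err := collect(typeName, next, obj[k], remaining-1)
		if err != nil {
			return nil, err
		}
		out = append(out, blocks...)
	}
	return out, nil
}

func main() {
	var value any
	if err := json.Unmarshal([]byte(`[{"foo": {}}, {"bar": {}}]`), &value); err != nil {
		panic(err)
	}
	// Same input, two schemas: one label expected, then none.
	for _, labelCount := range []int{1, 0} {
		blocks, err := decodeBlocks("network_interface", value, labelCount)
		if err != nil {
			panic(err)
		}
		for _, b := range blocks {
			fmt.Printf("labels expected=%d: %s %q with %d body item(s)\n",
				labelCount, b.Type, b.Labels, len(b.Body))
		}
	}
}
```

With one label expected, "foo" and "bar" become labels of blocks with empty bodies; with none expected, the same property names end up inside the block bodies instead.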

A tool to convert from JSON syntax to native syntax is possible in principle, but in order to produce a valid result it is not sufficient to map directly from the JSON infoset to the native syntax infoset. Instead, the tool must map from JSON infoset to the application-specific infoset, and then back from there to the native syntax infoset. In other words, such a tool would need to be application-specific.

The current HCL 2 implementation lacks helpers to assist applications with implementing a mapping like this. Moving from a physical infoset to an abstract infoset is often a lossy operation, so to handle such a conversion automatically would require some special accommodations to ensure that important details are not lost in the mapping so that a reverse mapping back to a physical infoset is possible.

Such facilities may come in later releases of HCL 2, but for now the best approach is to let humans write HCL and let machines generate JSON, and let applications themselves handle the mapping from both syntaxes to the application's infoset, rather than trying to map between syntaxes ahead of time.

For Terraform in particular, the updated docs for Terraform v0.12 describe in detail the mapping from JSON syntax to Terraform's own infoset, using the HCL native syntax infoset concepts. As other applications migrate from HCL 1 to HCL 2, or use HCL 2 from their inception, they will hopefully also provide such documentation.