Running terraform in production: best practices and lessons learnt

Terraform lets you automate your cloud infrastructure. Maintaining that infrastructure with ease, however, requires attention to detail.


TL;DR: Terraform is cool, but like every tool managing the production infrastructure where your applications run, it's on you to be careful and thoughtful.

The beginning

It's very common nowadays to go to conferences and hear people mention terraform when talking about cloud infrastructure. As an automation geek, the first thing you do is download the terraform binary and start reading the online documentation, learning all the magic of HCL, the HashiCorp Configuration Language. It takes a while to appreciate its power and expressiveness. But it's exactly at this point that a few questions arise:

  • Can I really use this to manage my production infrastructure?
  • How can I use this with my team?
  • Is this secure? Is this safe to use?
  • Are there any unspoken rules I should know before starting to use it?

The idea

Terraform was created to:

  • Manage your infrastructure as code
  • Provide one syntax to rule them all (the "clouds")
  • Give idempotence to non-idempotent APIs (try saying that three times fast)

Flow

Terraform is just a command-line tool, and a pretty easy one to use: there is no global configuration, and only files in the working directory are taken into consideration by the program.
Most of the work is done with only three sub-commands:

  • terraform init checks out or initializes the remote state and installs the required providers and modules
  • terraform plan compares the actual state with the desired state and computes the difference
  • terraform apply is the same as plan but actually performs the changes needed to converge the current state of the infrastructure to the desired one.

All of these commands are safe to run multiple times: they are idempotent, so once the infrastructure matches the desired state, re-running them changes nothing. The output of every command is clear and self-explanatory.
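In practice, a typical working loop looks something like this (saving the plan to a file is optional, but it guarantees that what you reviewed is exactly what gets applied):

# initialize the working directory: backend, providers, modules
terraform init

# compute the diff between desired and actual state, saving it to a file
terraform plan -out=tfplan

# apply exactly the plan you just reviewed, nothing more
terraform apply tfplan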

API mindset

When first approaching a cloud provider, the easiest thing to do is to start using the web interface: it's clean and simple and gives instant feedback on the actions taken. But after using it on real-world projects for a while, having to point-and-click multiple times with no way to automate gets frustrating: this is where terraform shines.
But (yes, there is a but), the GUI usually reduces the complexity of the cloud provider: it wraps together many API objects.
With terraform, which talks to the REST API offered by your cloud provider, you are forced to understand which objects play a role in the action you are performing and which don't, which combine naturally and which don't.
There are no more buttons that create networks, internet gateways, security groups, routes, floating IPs, and finally the simple VM that you wanted. Now you'll have to create all these resources by yourself and associate them with each other.
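To make this concrete, here is a sketch of what a single "create a VM with a public IP" button might expand to with the aws provider; every name, CIDR and AMI below is made up, and security groups are left out for brevity:

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "main" {
  vpc_id = "${aws_vpc.main.id}"
}

resource "aws_subnet" "public" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.1.0/24"
}

resource "aws_route_table" "public" {
  vpc_id = "${aws_vpc.main.id}"

  # default route towards the internet
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.main.id}"
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = "${aws_subnet.public.id}"
  route_table_id = "${aws_route_table.public.id}"
}

resource "aws_instance" "web" {
  ami           = "ami-12345678"   # placeholder AMI id
  instance_type = "t2.micro"
  subnet_id     = "${aws_subnet.public.id}"
}

# the floating IP, attached last
resource "aws_eip" "web" {
  instance = "${aws_instance.web.id}"
}

Every resource that the GUI silently created for you is now explicit, reviewable, and version-controlled.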

Mind your state

Terraform has a concept called state: basically a file where it keeps track of all the resources it manages for you. The state file contains all the input data used to create the infrastructure, including passwords and tokens (in clear text). Moreover, terraform only manages the resources present in its state file: if you accidentally delete the file, you will no longer be able to manage those resources via terraform.

This is why it's critical to keep the state in a secure place, locked and encrypted, accessible only to the people and systems that need to run terraform. Keeping the state on the local disk therefore makes sense only in the initial exploration steps: when things start to get serious, you should really start using remote state. There are multiple backends available to safely store your infrastructure state; keep a few things in mind when choosing yours, as it should support:

  • versioning, so that you can roll back to any given moment in time
  • locking, because it's easy to see how things can go wrong when multiple people run terraform apply at the same time. With locking, only the first person running the command acquires the lock on the remote state. This is a core feature that will probably save you a headache.
  • encryption, because the state file stores passwords and tokens in clear text. It's therefore mandatory to have an encrypted and secured remote state.

It might sound scary and complex, but using remote state is actually nothing more than something like this:

terraform {
  backend "gcs" {
    bucket  = "example-foobar"
    prefix  = "prod"
  }
  required_version = "> 0.11.0"
}
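
And if you prefer an example that covers all three requirements above, here is a hypothetical s3 backend: versioning is enabled on the bucket itself, while encryption and locking are declared right in the configuration:

terraform {
  backend "s3" {
    bucket         = "example-foobar-state"   # hypothetical bucket with versioning enabled
    key            = "prod/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true                     # encrypt the state at rest
    dynamodb_table = "terraform-locks"        # DynamoDB table providing state locking
  }
}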

Keep your friends close and your credentials closer

To be able to manage your resources, terraform must have access to your authentication credentials, in the form of a username/password pair, API tokens, or a service account file.
While keeping those secrets in your .tf files is possible in principle, it's highly recommended not to do it at all. Most providers (which are the entities you're authenticating against; e.g. aws, gcp, azure) can retrieve credentials from environment variables, like AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for aws.
If your provider does not directly implement secrets discovery from environment variables, you can easily avoid committing your clear-text credentials by using terraform's embedded capability to read generic variables from the environment: any variable can be set via an environment variable named TF_VAR_ followed by the variable name. This feature is really helpful when setting up databases or encrypted object storage.
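As an illustration, a hypothetical database password can be declared as a regular variable with no value in the code, and supplied through the environment:

# variables.tf: declare the variable, but never assign the secret here
variable "db_password" {
  description = "Password for the production database"
}

# in your shell, before running terraform:
#   export TF_VAR_db_password="s3cr3t"

Nothing sensitive ever lands in the repository, yet terraform picks up the value transparently at plan and apply time.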

Know thy dependencies

Terraform is aware of infrastructural dependencies: for example, when defining a VM attached to a specific network and declaring the network, including its properties, terraform knows that it first needs to create the network and then the VM; the same goes for disks and IPs. Despite all the magic built in, however, terraform might sometimes still get the ordering wrong: for these special situations the depends_on keyword can be used to declare a dependency explicitly.
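A minimal sketch, with made-up resources, of declaring an ordering that terraform cannot infer on its own:

resource "aws_s3_bucket" "logs" {
  bucket = "example-app-logs"      # hypothetical bucket
}

resource "aws_instance" "app" {
  ami           = "ami-12345678"   # placeholder AMI id
  instance_type = "t2.micro"

  # no attribute above references the bucket, so terraform cannot
  # infer the dependency: it has to be declared explicitly
  depends_on = ["aws_s3_bucket.logs"]
}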
More on this topic can be found here.

Documentation

Everything you need is neatly written here.
It may seem overwhelming to start with, but it is complete: a rare feature for an open-source project.
The list of available providers can be found here instead.

If you like all the features but don't want to manage the complexity yourself, there is, as always, a managed solution: HashiCorp's Enterprise offering provides SCM integration, policy and user management.