Terraform, AWS, and Idempotency

A bug in Terraform? Or a misunderstanding of how a particular stanza works? Or maybe even our own automation around Terraform?

Hopefully this is only Part 1 of a series, because the story doesn’t have a satisfying ending yet, but it is still worth sharing. We encountered a transient error in Terraform that seemed to go away on its own, most likely the result of a race condition. This post walks through that failure. It’s a bit more technical than some of my other posts, so those who come for the leadership thoughts might find their eyes glazing over. Your mileage may vary.

A bit about our deployment process first. We use a blue/green deployment strategy in our environment (minus the database). Automation around Terraform is responsible for bringing up the second application stack and deploying code to it. During the creation of that second stack, we received an error message we had never encountered before.

Error: error creating EC2 Launch Template: IdempotentParameterMismatch: Client token already used before. 
  status code: 400, request id: 4c6edce7-7497-4884-ab63-f215f9b82f6e
  on ../terraform-asg/main.tf line 33, in resource "aws_launch_template" "launch_template":
  33: resource "aws_launch_template" "launch_template" {

There were a few things that needed to be researched here.

The IdempotentParameterMismatch Error

When an action is idempotent, it means it can be performed multiple times without changing the result beyond the initial application.

In practice, what this usually means is that if I run command X, the command is aware of whether it has already been run in this context, and it may skip the usual work and simply return a status or a result. A simple example is the mkdir command when you use the -p flag (which tells mkdir to create the full path if it doesn’t exist, and not to complain if it already does).

If I run that on my local workstation the following happens.

mkdir -p /tmp/test
ls -l /tmp/|grep test
drwxr-xr-x  2 jeff.smith  wheel    64 Jun  9 06:31 test
mkdir -p /tmp/test

The mkdir command created the path /tmp/test and we can see it was created successfully. When I run the command again, it completes successfully with no error message.

That makes this command idempotent. No matter how many times I run this command, I know in the end I’ll get a successful result and that the path /tmp/test will exist. Now contrast that with just mkdir.

mkdir /tmp/test2
ls -l /tmp/|grep test2
drwxr-xr-x  2 jeff.smith  wheel    64 Jun  9 06:35 test2
mkdir /tmp/test2
mkdir: /tmp/test2: File exists

Now with mkdir, on the second execution I get an error message, which is a different end result than my first execution. The directory exists, but the error code I get back is different. That makes this a non-idempotent operation. But why do we care? Well in the second example I need to do a lot more error handling for starters. And this is a basic example.

When you are creating infrastructure, a non-idempotent call that gets retried could launch more instances than you intended, which is what the check behind this IdempotentParameterMismatch error is designed to prevent.

When you make an AWS API call to create infrastructure, just about every endpoint (to my knowledge) does this in an asynchronous fashion. What this means is that your API call returns immediately, but the work of actually creating the infrastructure you requested is still in progress. Because of this, you typically need some sort of polling mechanism to determine when the operation has completed.

Several AWS API endpoints support idempotency, which allows you to specify a client token to uniquely identify this request. If you make the same API infrastructure creation call and use the same client token, instead of creating a new instance, it will return the status of the previously requested instance. When creating infrastructure programmatically this can be a big safety net to avoid creating many copies of the same infrastructure. And that’s where our error comes in.

The error is stating that we have already used the client token. Digging into the documentation a bit, a more specific meaning is that we changed parameters in the request but reused the client token, meaning the API doesn’t know what our intent actually is. Do we want new infrastructure with these parameters? Or are we expecting already existing infrastructure that matches those parameters? To be safe, it throws this error.

The user is always responsible for generating the client token. But in this case, the user is actually Terraform. That created some surprise and relief on our part, since it meant the problem most likely wasn’t in any of our wrapper automation. But we still needed to be sure. The first thing we did was try to find out which client token was used and whether it had actually been used twice.

Luckily, the error gave us a request ID, which we looked up in CloudTrail. In the requestParameters field of that request, we were able to find the ClientToken used.

"requestParameters": {
      "CreateLaunchTemplateRequest": {
        "LaunchTemplateName": "sidekiq-worker-green_staging02-launch_template20220510183630312700000005",
        "LaunchTemplateData": {
          "UserData": "",
          "SecurityGroupId": [
            {
              "tag": 1,
              "content": "sg-0743bcbd08cadba1d"
            },
            {
              "tag": 2,
              "content": "sg-6daf421e"
            },
            {
              "tag": 3,
              "content": "sg-0818108e24803a418"
            }
          ],
          "ImageId": "",
          "BlockDeviceMapping": {
            "Ebs": {
              "VolumeSize": 100
            },
            "tag": 1,
            "DeviceName": "/dev/sda1"
          },
          "IamInstanceProfile": {
            "Name": "asg-staging02-20220510183628901400000003"
          },
          "InstanceType": "m4.2xlarge"
        },
        "ClientToken": "terraform-20220510183630312700000006"
      }

It looks sufficiently random, and it follows the same format that Terraform often uses to generate unique values.

It definitely didn’t appear to be something we generated. We then decided to search all requests in that time frame to see if any of them had the same client token.

Sure enough, there was a second request made that reused the same Terraform module (which we wrote) to generate a second ASG and launch template.

"requestParameters": {
      "CreateLaunchTemplateRequest": {
        "LaunchTemplateName": "biexport-worker-green_staging02-launch_template20220510183630312700000005",
        "LaunchTemplateData": {
          "UserData": "",
          "SecurityGroupId": [
            {
              "tag": 1,
              "content": "sg-0743bcbd08cadba1d"
            },
            {
              "tag": 2,
              "content": "sg-6daf421e"
            },
            {
              "tag": 3,
              "content": "sg-0818108e24803a418"
            }
          ],
          "ImageId": "",
          "BlockDeviceMapping": {
            "Ebs": {
              "VolumeSize": 200
            },
            "tag": 1,
            "DeviceName": "/dev/sda1"
          },
          "IamInstanceProfile": {
            "Name": "asg-staging02-20220510183629183700000003"
          },
          "InstanceType": "m5.xlarge"
        },
        "ClientToken": "terraform-20220510183630312700000006"
      }

As you can see, there are differences in the request, but the client token remains the same. Now we’re starting to freak out and think maybe it is our code, but we still couldn’t see how.
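A search like this can also be scripted. The Go sketch below assumes the events have been exported as a JSON array whose entries carry the requestParameters structure shown above; that export shape, the helper names, and the third sample event with its token are assumptions made for illustration. The first two sample events reuse the launch template names and client token from the excerpts.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event models only the fields this search needs from each exported record.
type event struct {
	RequestParameters struct {
		CreateLaunchTemplateRequest struct {
			LaunchTemplateName string `json:"LaunchTemplateName"`
			ClientToken        string `json:"ClientToken"`
		} `json:"CreateLaunchTemplateRequest"`
	} `json:"requestParameters"`
}

// parseEvents decodes a JSON array of events (an assumed export format).
func parseEvents(data []byte) ([]event, error) {
	var events []event
	err := json.Unmarshal(data, &events)
	return events, err
}

// duplicateTokens returns every client token that appears in more than one
// request, mapped to the launch template names that used it.
func duplicateTokens(events []event) map[string][]string {
	byToken := make(map[string][]string)
	for _, e := range events {
		req := e.RequestParameters.CreateLaunchTemplateRequest
		if req.ClientToken == "" {
			continue // not a launch template request, or no token recorded
		}
		byToken[req.ClientToken] = append(byToken[req.ClientToken], req.LaunchTemplateName)
	}
	for token, names := range byToken {
		if len(names) < 2 {
			delete(byToken, token)
		}
	}
	return byToken
}

// sampleEvents: the first two entries come from the excerpts above; the third
// is a made-up event with a made-up token, added for contrast.
const sampleEvents = `[
  {"requestParameters": {"CreateLaunchTemplateRequest": {
    "LaunchTemplateName": "sidekiq-worker-green_staging02-launch_template20220510183630312700000005",
    "ClientToken": "terraform-20220510183630312700000006"}}},
  {"requestParameters": {"CreateLaunchTemplateRequest": {
    "LaunchTemplateName": "biexport-worker-green_staging02-launch_template20220510183630312700000005",
    "ClientToken": "terraform-20220510183630312700000006"}}},
  {"requestParameters": {"CreateLaunchTemplateRequest": {
    "LaunchTemplateName": "hypothetical-worker-launch_template",
    "ClientToken": "hypothetical-token"}}}
]`

func main() {
	events, err := parseEvents([]byte(sampleEvents))
	if err != nil {
		panic(err)
	}
	for token, names := range duplicateTokens(events) {
		fmt.Println("token reused:", token)
		for _, name := range names {
			fmt.Println("  by:", name)
		}
	}
}
```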

Generating the ClientToken

As I mentioned previously, generating the client token is the job of the user from AWS’ perspective. From our perspective, that user is Terraform. We’re not Go experts by any stretch of the imagination on our team (although we have a lot of interest and are looking for a few good projects to take it for a spin).

But in order to understand how the client token gets generated, we were going to have to look at the Terraform source code. After a little digging, we came across the code living in the terraform-plugin-sdk as a helper function.

func PrefixedUniqueId(prefix string) string {
    // Be precise to 4 digits of fractional seconds, but remove the dot before the
    // fractional seconds.
    timestamp := strings.Replace(
        time.Now().UTC().Format("20060102150405.0000"), ".", "", 1)

    idMutex.Lock()
    defer idMutex.Unlock()
    idCounter++
    return fmt.Sprintf("%s%s%08x", prefix, timestamp, idCounter)
}

In this function, the author generates a timestamp precise to four digits of fractional seconds, that is, to a ten-thousandth of a second. It’s possible that multiple calls could land in the same time window, but the value also gets a counter appended to it, formatted as eight hexadecimal digits.

The counter is guarded by a mutex, so the value of idCounter is shared across every call made within a single running process, and the mutex prevents two calls from incrementing it at the same time. Within that process, there should be no way for this function to generate the same client token twice. But that doesn’t mean that the function calling for the client token isn’t storing it and possibly reusing it.

Wrap Up

This is where our story ends for the moment. We started to look into how and where the client token was getting used, but since we felt strongly that this was going to be related to a Terraform issue of some sort, we had to shift gears for a solution. We weren’t going to run a custom patched version of Terraform. We weren’t going to upgrade on the spot. And we weren’t going to wait until a PR got approved, merged, and released, so that put us on a different remediation path.

Our current fix was to specify a depends_on argument for the two resources in conflict. On the other occasions when the error happened, we noticed it was always these same two resources in conflict, so the hope was that depends_on would prevent them from being created in parallel.

So far that hope has paid off and we haven’t seen the error in any environment again. But we plan to continue to research the issue out of nothing more than morbid curiosity. It might lead us to a bug in Terraform, a misunderstanding of how a particular stanza works, or maybe even our own automation around Terraform.

Who knows? If we find it, you’ll be sure to find a Part 2 to this article!