Our Journey Migrating to AWS IMDSv2

We’re heavy users of Amazon Elastic Compute Cloud (EC2) at Slack: we run roughly 60,000 EC2 instances across 17 AWS regions while operating hundreds of AWS accounts. A large number of teams own and manage our various instances.

The Instance Metadata Service (IMDS) is an on-instance component that can be used to gain insight into the instance’s current state. Since it first launched over 10 years ago, AWS customers have used this service to gather useful information about their instances. At Slack, IMDS is used heavily for instance provisioning, and also by tools that need to understand their running environments.

Information exposed by IMDS includes IAM credentials, metrics about the instance, security group IDs, and a whole lot more. This information can be highly sensitive: if an instance is compromised, an attacker may be able to use instance metadata to gain access to other Slack services on the network.

In 2019, AWS launched a new version of IMDS (IMDSv2) in which every request is protected by session authentication. As part of our commitment to high security standards, Slack moved our entire fleet and tooling to IMDSv2. In this article, we discuss the pitfalls of using IMDSv1 and our journey towards fully migrating to IMDSv2.

The v2 difference

IMDSv1 uses a simple request-and-response pattern that can amplify the impact of Server Side Request Forgery (SSRF) vulnerabilities: if an application deployed on an instance is vulnerable to SSRF, an attacker can exploit the application to make requests on their behalf. Since IMDSv1 supports simple GET requests, they can extract credentials using its API.

IMDSv2 eliminates this attack vector by using session-oriented requests. IMDSv2 works by requiring these two steps:

  1. Make a PUT request with the X-aws-ec2-metadata-token-ttl-seconds header, and receive a token that is valid for the TTL supplied in the request
  2. Use that token in an HTTP GET request with the X-aws-ec2-metadata-token header to make any follow-up IMDS calls
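The two steps above can be sketched in Python. This is a minimal illustration, not Slack's tooling: the endpoint and header names follow AWS's public IMDS documentation, and the HTTP transport is an injected callable so the flow can be shown without a live EC2 instance.

```python
# Sketch of the IMDSv2 two-step flow. `http` is any callable
# (method, url, headers) -> body, injected so this runs off-instance.

TOKEN_URL = "http://169.254.169.254/latest/api/token"
METADATA_URL = "http://169.254.169.254/latest/meta-data/"

def fetch_metadata(path, http, ttl_seconds=21600):
    # Step 1: PUT for a session token valid for the requested TTL.
    token = http("PUT", TOKEN_URL,
                 {"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)})
    # Step 2: GET the metadata path with the token attached.
    return http("GET", METADATA_URL + path,
                {"X-aws-ec2-metadata-token": token})
```

On a real instance the injected transport would be a thin wrapper around an HTTP client; here the point is simply that every read now requires the PUT handshake first.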

With IMDSv2, rather than simply making HTTP GET requests, an attacker needs to exploit vulnerabilities to make PUT requests with headers. Then they have to use the obtained token to make follow-up GET requests with headers to access IMDS data. This makes it much more difficult for attackers to reach IMDS via vulnerabilities such as SSRF.

Our journey towards IMDSv2

At Slack there are several instance provisioning mechanisms at play, such as Terraform, CloudFormation, and various in-house tools that call the AWS EC2 API. As an organization, we rely heavily on IMDS to get insights into our instances during provisioning and throughout their lifecycle.

We create AWS accounts per environment (Sandbox, Dev, and Prod), per service team, and sometimes even per application, so we have hundreds of AWS accounts.

We have a single root AWS organization account. All our child accounts are members of this organization. When we create an AWS account, the account creation process writes information about the account (such as the account ID, owner details, and account tags) to a DynamoDB table. Information in this table is accessible via an internal API called Archipelago for account discovery.

Figuring out the scale of the problem

Before migrating, we first needed to understand how many instances in our fleet used IMDSv1. For this we used the EC2 CloudWatch metric called MetadataNoToken, which counts how often the IMDSv1 API was used for a given instance.

We created an application called imds-cw-metric-collector to map the metrics and instance IDs we collected, so we could alert the relevant service teams and applications. The application used our internal Archipelago API to get a list of our AWS accounts, fetched the aforementioned MetadataNoToken metric, and talked to our instance provisioning services to collect information like owner IDs and Chef roles (for instances that use Chef for configuration). Our custom app sent all these metrics to our Prometheus monitoring system.
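The collector's aggregation step reduces to a simple loop. In this rough sketch, `list_accounts` and `get_metadata_no_token` are hypothetical stand-ins for the Archipelago API and the per-instance CloudWatch MetadataNoToken metric, not the real implementation:

```python
# Illustrative core of imds-cw-metric-collector. `list_accounts` and
# `get_metadata_no_token` are injected stand-ins for Archipelago and
# CloudWatch; the real service also joins in owner IDs and Chef roles.

def collect_imdsv1_usage(list_accounts, get_metadata_no_token):
    """Return {(account_id, instance_id): call_count} for instances
    that made at least one IMDSv1 (token-less) call."""
    usage = {}
    for account in list_accounts():
        for instance_id, count in get_metadata_no_token(account).items():
            if count > 0:  # only instances still making IMDSv1 calls
                usage[(account, instance_id)] = count
    return usage
```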

A dashboard aggregated these metrics to track all instances that made IMDSv1 calls. This information was then used to connect with service teams and work with them to update their services to use IMDSv2.

IMDSv1 usage dashboard

However, the list of EC2 instance IDs and their owners was only part of the equation. We also needed to understand which processes on those instances were making calls to the IMDSv1 API.

At Slack, for the most part, we use Ubuntu and Amazon Linux on our EC2 instances. For IMDSv1 call detection, AWS provides a tool called AWS ImdsPacketAnalyzer. We decided to build the tool and package it as a Debian package (*.deb) in our APT repository. This allowed service teams to install it on demand and inspect IMDSv1 calls.

This worked perfectly for our Ubuntu 22.04 (Jammy Jellyfish) and Amazon Linux instances. However, the ImdsPacketAnalyzer doesn’t work on our legacy Ubuntu 18.04 (Bionic Beaver) instances, so we had to resort to tools such as lsof and netlogs in some cases.

As a last resort, on some of our dev instances we simply turned off IMDSv1 and listed the things that broke.

Stop calling IMDSv1

Once we had a list of instances, and of the processes on those instances that were making IMDSv1 calls, it was time to get to work and update each to use IMDSv2 instead.

Updating our bash scripts was the easy part, as AWS provides very clear steps on switching from IMDSv1 to IMDSv2 for these. We also upgraded our AWS CLI to the latest version to get IMDSv2 support. However, doing this for services written in other languages was a bit more complicated. Fortunately, AWS has a comprehensive list of libraries to use to implement IMDSv2 in various languages. We worked with service teams to upgrade their applications to IMDSv2-supported versions of these libraries and roll them out across our fleet.

Once we had rolled out these changes, the number of instances using IMDSv1 dropped precipitously.

Turning off IMDSv1 for new instances

Stopping our services from using the IMDSv1 API only solved part of the problem. We also needed to turn off IMDSv1 on all future instances. To solve this, we turned to our provisioning tools.

First we looked at our most commonly used provisioning tool, Terraform. Our team provides a set of standard Terraform modules for service teams to use to create things such as AutoScaling groups, S3 buckets, and RDS instances. These common modules let us make a change in a single place and roll it out to many teams. Service teams that just want to build an AutoScaling group don’t need to know the nitty-gritty Terraform configuration to use one of these modules.

However, we didn’t want to roll out this change to all our AWS child accounts at the same time, as there were service teams actively working on switching to IMDSv2 at the time. Therefore we needed a way to exclude those teams and their child accounts. We came up with a custom Terraform module called accounts_using_imdsv1 as the solution. We were then able to use this module in our shared Terraform modules to either keep or disable IMDSv1, as in the example below:

module "accounts_using_imdsv1" {
  source = "../slack/accounts_using_imdsv1"
}

resource "aws_instance" "instance" {
  ami           = data.aws_ami.amzn-linux-2023-ami.id
  instance_type = "c6a.2xlarge"
  subnet_id     = aws_subnet.instance.id

  metadata_options {
    http_endpoint = "enabled"
    http_tokens   = module.accounts_using_imdsv1.is_my_account_using_imdsv1 ? "optional" : "required"
  }
}

We started with a large list of accounts marked as using IMDSv1 in the accounts_using_imdsv1 module, but we were slowly able to remove them as service teams migrated to IMDSv2.

Blocking instances with IMDSv1 from launching

The next step for us was to block launching instances with IMDSv1 enabled. For this we turned to AWS Service Control Policies (SCPs). We updated our SCPs to block launching IMDSv1-enabled instances across all our child accounts. However, similar to the AutoScaling group changes we discussed earlier, we wanted to exclude some accounts at the beginning while the service owners were working to switch to IMDSv2. Our accounts_using_imdsv1 Terraform module came to the rescue here too. We were able to use this module in our SCPs as below. We blocked the ability to launch instances with IMDSv1 support, and also blocked the ability to turn on IMDSv1 on existing instances.

 # Block launching instances with IMDSv1 enabled
  statement {
    effect = "Deny"

    actions = [
      "ec2:RunInstances",
    ]

    resources = [
      "arn:aws:ec2:*:*:instance/*",
    ]

    condition {
      test     = "StringNotEquals"
      variable = "ec2:MetadataHttpTokens"
      values   = ["required"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalAccount"
      values   = module.accounts_using_imdsv1.accounts_list_using_imdsv1
    }
  }

  # Block turning on IMDSv1 if it is already turned off
  statement {
    effect = "Deny"

    actions = [
      "ec2:ModifyInstanceMetadataOptions",
    ]

    resources = [
      "arn:aws:ec2:*:*:instance/*",
    ]

    condition {
      test     = "StringNotEquals"
      variable = "ec2:Attribute/HttpTokens"
      values   = ["required"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalAccount"
      values   = module.accounts_using_imdsv1.accounts_list_using_imdsv1
    }
  }
}

How effective are these SCPs?

SCPs are effective when it comes to blocking most IMDSv1 usage. However, there are some places where they don’t work.

SCPs don’t apply to the AWS organization’s root account; they only apply to child accounts that are members of the organization. Therefore, SCPs don’t prevent launching instances with IMDSv1 enabled, or turning on IMDSv1 on an existing instance, in the root AWS account.

SCPs also don’t apply to service-linked roles. For example, if an AutoScaling group launches an instance in response to a scaling event, under the hood the AutoScaling service uses a service-linked IAM role managed by AWS, and those instance launches aren’t affected by the above SCPs.

We looked at preventing teams from creating AWS Launch Templates that don’t enforce IMDSv2, but AWS Launch Template policy condition keys currently do not support ec2:Attribute/HttpTokens.

What other safety mechanisms are in place?

As there is no 100%-foolproof way to stop someone from launching an IMDSv1-enabled EC2 instance, we put in place a notification system using AWS EventBridge and Lambda.

We created two EventBridge rules in each of our child accounts using CloudTrail events for EC2 events. One rule captures requests to the EC2 API and the second captures responses from the EC2 API, telling us when someone makes an EC2:RunInstances call with IMDSv1 enabled.

Rule 1: Capturing the requests

{
  "detail": {
    "eventName": ["RunInstances"],
    "eventSource": ["ec2.amazonaws.com"],
    "requestParameters": {
      "metadataOptions": {
        "httpTokens": ["optional"]
      }
    }
  },
  "detail-type": ["AWS API Call via CloudTrail"],
  "source": ["aws.ec2"]
}

Rule 2: Capturing the responses

{
  "detail": {
    "eventName": ["RunInstances"],
    "eventSource": ["ec2.amazonaws.com"],
    "responseElements": {
      "instancesSet": {
        "items": {
          "metadataOptions": {
            "httpTokens": ["optional"]
          }
        }
      }
    }
  },
  "detail-type": ["AWS API Call via CloudTrail"],
  "source": ["aws.ec2"]
}
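The intent of both rules reduces to a single predicate over the CloudTrail event: did a RunInstances call request, or result in, httpTokens being "optional"? A hypothetical Python equivalent of that check:

```python
# Hypothetical equivalent of the two EventBridge patterns above:
# flag a RunInstances CloudTrail event whose request or response
# shows httpTokens set to "optional" (i.e. IMDSv1 still allowed).

def is_imdsv1_launch(event):
    detail = event.get("detail", {})
    if detail.get("eventName") != "RunInstances":
        return False
    # Rule 1: the request asked for optional tokens.
    req = detail.get("requestParameters", {}) \
                .get("metadataOptions", {}).get("httpTokens")
    # Rule 2: any launched instance came back with optional tokens.
    items = detail.get("responseElements", {}) \
                  .get("instancesSet", {}).get("items", [])
    resp = [i.get("metadataOptions", {}).get("httpTokens") for i in items]
    return req == "optional" or "optional" in resp
```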

These event rules have a target set up to point them at a central event bus living in an account managed by our team.

AWS EventBridge Targets

Events matching these rules are sent to the central event bus. The central event bus captures these events via a similar set of rules. Next it sends them through an Input Transformer to format the event like the following:

Input path:

{
  "account": "$.account",
  "instanceid": "$.detail.responseElements.instancesSet.items[0].instanceId",
  "region": "$.region",
  "time": "$.time"
}

Input template:

{
  "source": "slack",
  "detail-type": "slack.api.postMessage",
  "version": 1,
  "account_id": "<account>",
  "channel_tag": "event_alerts_channel_imdsv1",
  "detail": {
    "text": ":importantred: :provisioning: instance `<instanceid> (<region>)` in the AWS account `<account>` was launched with `IMDSv1` support"
  }
}
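Conceptually, the input transformer extracts values from the event using the input paths, then substitutes them for the `<name>` placeholders in the template. A simplified Python model of that behavior (it only handles the dotted paths and single `[0]` index used here, not EventBridge's full path syntax):

```python
# Simplified model of an EventBridge input transformer: resolve each
# input-path expression against the event, then substitute the values
# for <name> placeholders in the template string.

def resolve(event, path):
    value = event
    for part in path.lstrip("$.").split("."):
        if part.endswith("]"):                 # e.g. "items[0]"
            name, idx = part[:-1].split("[")
            value = value[name][int(idx)]
        else:
            value = value[part]
    return value

def transform(event, input_path, template):
    out = template
    for name, path in input_path.items():
        out = out.replace("<%s>" % name, str(resolve(event, path)))
    return out
```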

Finally, the transformed events get sent to a Lambda function in our account.

AWS EventBridge Targets

This Lambda function uses the account ID from the event and our internal Archipelago API to determine the Slack channel, then sends the event to Slack.

IMDSv1 Slack Alerts

The flow looks like the following:

IMDSv1 Slack Alert Flow

We also have a similar alert in place for when IMDSv1 is turned on for an existing instance.

IMDSv1 Enabled Slack Alert

What about the instances with IMDSv1 enabled?

Launching new instances with IMDSv2 is cool and all, but what about our thousands of existing instances? We needed a way to enforce IMDSv2 on them as well. And as we saw above, SCPs don’t entirely block launching instances with IMDSv1.

This is why we created a service called IMDSv1 Terminator. It’s deployed on EKS and uses an IAM OIDC provider to obtain IAM credentials. These credentials have access to assume a highly restricted role, created for this very purpose, in all our child accounts.

The policy attached to the role assumed by IMDSv1 Terminator in child accounts is as below:

{
    "Statement": [
        {
            "Action": "ec2:ModifyInstanceMetadataOptions",
            "Condition": {
                "StringEquals": {
                    "ec2:Attribute/HttpTokens": "required"
                }
            },
            "Effect": "Allow",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Sid": ""
        },
        {
            "Action": [
                "ec2:DescribeRegions",
                "ec2:DescribeInstances"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": ""
        }
    ],
    "Version": "2012-10-17"
}

Similar to our earlier metric collector application, this service also uses the internal Archipelago API to get a list of our AWS accounts, lists our EC2 instances in batches, and analyzes each one to check whether IMDSv1 is enabled. If it is, the service enforces IMDSv2 on the instance.
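A rough sketch of the remediation pass, under the assumption that `ec2` is a per-account client (in practice something like boto3 with the assumed role above) exposing just the two operations the policy permits; the shapes and names here are illustrative, not Slack's actual code:

```python
# Illustrative core of an IMDSv1 Terminator remediation pass. `ec2`
# stands in for a per-account EC2 client obtained via the restricted
# assumed role; `notify` posts the Slack alert for each remediation.

def remediate_account(ec2, notify):
    remediated = []
    for instance in ec2.describe_instances():
        options = instance.get("MetadataOptions", {})
        if options.get("HttpTokens") == "optional":  # IMDSv1 still allowed
            ec2.modify_instance_metadata_options(
                InstanceId=instance["InstanceId"], HttpTokens="required")
            notify(instance["InstanceId"])
            remediated.append(instance["InstanceId"])
    return remediated
```

The policy's `StringEquals` condition on `ec2:Attribute/HttpTokens` means this role can only ever flip instances *to* `required`, never back to `optional`, which keeps the service's blast radius small even if it misbehaves.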

When the service remediates an instance, we get notified in Slack.

IMDSv1 Terminator Slack Alert

Initially we saw hundreds of these messages for existing instances, but as they were remediated and only new instances were launched with IMDSv2, we stopped seeing them. Now if an instance does get launched with IMDSv1 support enabled, we have the comfort of knowing that it will be remediated and we’ll be notified.

This service also sends metrics to our Prometheus monitoring system about the IMDS status of our instances. We can easily visualize which AWS accounts and regions are still running IMDSv1-enabled instances, if there are any.

IMDSv1 Usage Dashboard

Some final words

Being able to enforce IMDSv2 across Slack’s vast network was a challenging but rewarding experience for the Cloud Foundations team. We worked with a large number of service teams to accomplish this goal, especially our SecOps team, who went above and beyond to help us complete the migration.

Want to help us build out our cloud infrastructure? We’re hiring! Apply now