How do I troubleshoot build pipeline timeout errors in EC2 Image Builder?

I want to troubleshoot build pipeline timeout errors in EC2 Image Builder.

Short description

The following are common reasons that your Image Builder build pipeline might fail with step timeout errors at the LaunchBuildInstance, BootstrapBuildInstance, or ApplyBuildComponents workflow steps:

  • The build instance can't connect to AWS Systems Manager.
  • The AWS Identity and Access Management (IAM) role has incorrect permissions.
  • The private subnet doesn't have internet access.
  • A duplicate root device name exists.

Resolution

To troubleshoot build pipeline timeout errors, see the following scenarios.

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

The timeout occurs when the build is verifying SSM Agent availability on the build instance

If the build pipeline timeout occurs when the build is verifying AWS Systems Manager Agent on the build instance, then you might receive the following errors:

"Workflow Execution ID: failed with reason: ExpectationNotMet. ssm:*CommandInvocations returned terminal state Failed in workflow step LaunchBuildInstance."

"Workflow Execution ID: failed with reason: An error occurred (InvalidInstanceId) when calling the SendCommand operation: Instances [[i-1a1b1c1d1e1f1g1h1]] not in a valid state for account in workflow step LaunchBuildInstance."

"Workflow Execution ID: failed with reason: ExpectationNotMet. ec2:DescribeInstanceStatus did not meet terminal states: [['passed']] after 100 attempts. Reason: Timeout. in workflow step LaunchBuildInstance."

These build pipeline timeout errors occur when the IAM role in the pipeline's infrastructure configuration doesn't grant the instance the required permissions. These errors also occur when SSM Agent can't reach the Systems Manager endpoints.

Instance doesn't have the required IAM permissions

If your build instance doesn't have the required IAM permissions, then add the AmazonSSMManagedInstanceCore managed policy to the IAM role that's specified in the Image Builder infrastructure configuration. Also, make sure that the AWSServiceRoleForImageBuilder role is allowed to use the AWS Key Management Service (AWS KMS) key that's specified in the block device mapping of the image recipe. For more information, see Allows key users to use the KMS key.
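One way to confirm the managed policy part of this check is to compare the policies attached to the role against the policy named above. The following is a minimal Python sketch, assuming boto3 is installed and credentials are configured. The helper names are illustrative, and the sample response is handwritten. The check covers attached managed policies only, so a role that grants the same permissions through an inline policy is still reported as missing the policy.

```python
REQUIRED_POLICIES = {"AmazonSSMManagedInstanceCore"}


def missing_policies(attached_policies, required=REQUIRED_POLICIES):
    """Return the required managed policy names that are not attached.

    attached_policies: list of dicts shaped like the AttachedPolicies
    entries in an IAM ListAttachedRolePolicies response.
    """
    attached_names = {policy["PolicyName"] for policy in attached_policies}
    return sorted(set(required) - attached_names)


def attached_policies_for_role(role_name):
    """Fetch every managed policy attached to a role (requires boto3)."""
    import boto3  # imported here so the pure check above has no dependency

    iam = boto3.client("iam")
    paginator = iam.get_paginator("list_attached_role_policies")
    policies = []
    for page in paginator.paginate(RoleName=role_name):
        policies.extend(page["AttachedPolicies"])
    return policies


# Example with a handwritten response instead of a live call.
sample = [{
    "PolicyName": "AmazonS3ReadOnlyAccess",
    "PolicyArn": "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
}]
print(missing_policies(sample))  # prints ['AmazonSSMManagedInstanceCore']
```

To check a real role, pass the result of attached_policies_for_role with the role name from your infrastructure configuration to missing_policies.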

SSM Agent can't reach the endpoints

If SSM Agent can't reach the endpoints, then take the following actions:

  • If you build in a public subnet with an internet gateway, then configure the subnet to automatically assign a public IPv4 address.
  • If you build in a private subnet with a NAT gateway, then make sure that the NAT gateway is in a public subnet.
  • If you build in a private subnet with Amazon Virtual Private Cloud (Amazon VPC) endpoints, then set up AWS PrivateLink endpoints for Systems Manager. For more information, see How do I create VPC endpoints so that I can use Systems Manager to manage private EC2 instances without internet access?
  • Make sure that the security group and network access control lists (network ACLs) allow inbound connections for ephemeral ports (1024–65535). They must also allow outbound connections on port 443. For private subnets with PrivateLink endpoints, the security group that's attached to the Amazon VPC endpoint must allow inbound connections on port 443. These inbound connections must be allowed from the subnet or Amazon VPC CIDR address.
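The port requirements in the last item can be checked against a security group's rules before you rerun the pipeline. The following is a minimal Python sketch that works on data shaped like the IpPermissions or IpPermissionsEgress lists in an EC2 DescribeSecurityGroups response. The function name and sample rule are illustrative, and the check doesn't evaluate the source or destination of a rule or any network ACL, so treat a True result as necessary but not sufficient.

```python
def allows_tcp_port(ip_permissions, port):
    """Return True if any rule in the list covers the given TCP port.

    ip_permissions: a list shaped like IpPermissions (inbound) or
    IpPermissionsEgress (outbound) for one security group.
    """
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol == "tcp":
            low = rule.get("FromPort", 0)
            high = rule.get("ToPort", 65535)
            if low <= port <= high:
                return True
    return False


# Handwritten outbound rules that allow only HTTP.
egress = [{
    "IpProtocol": "tcp",
    "FromPort": 80,
    "ToPort": 80,
    "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
}]
print(allows_tcp_port(egress, 443))  # prints False
```

Run the same function on the inbound rules of the security group that's attached to the Amazon VPC endpoint to confirm that port 443 is open there.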

The timeout occurs when the build downloads the AWS CLI

When you build on a private subnet and your recipe includes either the aws-cli-version-2-linux or aws-cli-version-2-windows component, the build timeout might occur at the ApplyBuildComponents step.

For a container build, the build timeout can occur at the BootstrapBuildInstance step. The timeout occurs when the AWS CLI doesn't exist on the instance's AMI and the bootstrap script tries to install the AWS CLI over the internet.

To resolve this timeout, take the following actions:

  • Allow internet connectivity on the subnet through a NAT gateway or internet gateway.
  • Use a custom AMI that has the AWS CLI installed.
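Before you use a custom AMI, you can confirm that the AWS CLI is present on an instance launched from it. The following is a minimal Python sketch meant to run on that instance. The function name is illustrative, and the which and run parameters exist only so that the check can be exercised without a real installation.

```python
import shutil
import subprocess


def aws_cli_version(which=shutil.which, run=subprocess.run):
    """Return the `aws --version` output, or None if `aws` is not on PATH."""
    path = which("aws")
    if path is None:
        return None
    result = run([path, "--version"], capture_output=True, text=True, check=False)
    # Fall back to stderr in case the version string is printed there.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = aws_cli_version()
    print(version if version else "AWS CLI not found on PATH")
```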

The timeout occurs when there's a duplicate root device name

When you use the CreateImageRecipe API to create a recipe, errors might occur when you name the root device either /dev/xvda or /dev/sda1. To prevent a duplicate root device name in the build instance, check the root device mapping in the source AMI. If a duplicate root device name exists, then a timeout occurs with the following error:

"Workflow Execution ID: failed with reason: ExpectationNotMet. ec2:DescribeInstanceStatus did not meet terminal states: [['passed']] after 100 attempts. Reason: Timeout. in workflow step LaunchBuildInstance."

Note: When you use image recipes from the AWS Management Console, you can't create a duplicate device name. Also, instance types that aren't built on the AWS Nitro System (Xen-based instance types) are the only instance types that don't fail because of a duplicate device name.

To confirm the root device name of the source AMI, run the describe-images command. Then, use that same device name for the root volume in the image recipe.
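The comparison between the source AMI and the recipe can be sketched in Python as follows, assuming boto3 is installed for the optional lookup. The function names are illustrative. The check is a heuristic: it treats only the two names mentioned in this article (/dev/xvda and /dev/sda1) as root device candidates.

```python
COMMON_ROOT_NAMES = {"/dev/xvda", "/dev/sda1"}


def mismatched_root_devices(source_root_device, recipe_device_names):
    """Return recipe device names that look like a root device but differ
    from the source AMI's root device name.

    source_root_device: the RootDeviceName value from EC2 DescribeImages.
    recipe_device_names: the deviceName values from the recipe's block
    device mappings.
    """
    return sorted(
        name
        for name in set(recipe_device_names)
        if name in COMMON_ROOT_NAMES and name != source_root_device
    )


def source_root_device_name(image_id, region_name=None):
    """Look up the root device name of an AMI (requires boto3)."""
    import boto3  # imported here so the pure check above has no dependency

    ec2 = boto3.client("ec2", region_name=region_name)
    images = ec2.describe_images(ImageIds=[image_id])["Images"]
    return images[0]["RootDeviceName"]


# Handwritten values: the AMI boots from /dev/xvda, the recipe names /dev/sda1.
print(mismatched_root_devices("/dev/xvda", ["/dev/sda1", "/dev/sdf"]))
# prints ['/dev/sda1']
```

An empty list means that the recipe doesn't name a second root device candidate. Any returned name should be changed to match the source AMI.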

The timeout occurs when the build is getting Image Builder components

If you build on a private subnet and Image Builder fails when components are downloaded, then the following error appears:

"failed with reason: failed to download the EC2 Image Builder Component, operation error imagebuilder: GetComponent, exceeded maximum number of attempts, 3, dial tcp i/o timeout."

To resolve the preceding error, check the configuration of the interface Amazon VPC endpoint for Image Builder. If the endpoint doesn't exist, then create it in the same VPC and subnet that's used in your Image Builder infrastructure configuration.
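The endpoint check can be sketched in Python against data shaped like the VpcEndpoints list in an EC2 DescribeVpcEndpoints response. The function name, VPC ID, and sample entry are illustrative. The service name format com.amazonaws.<region>.imagebuilder is an assumption to verify for your Region.

```python
def has_interface_endpoint(vpc_endpoints, vpc_id, service_name):
    """Return True if an available interface endpoint for service_name
    exists in vpc_id.

    vpc_endpoints: a list shaped like VpcEndpoints in an EC2
    DescribeVpcEndpoints response.
    """
    return any(
        endpoint.get("VpcId") == vpc_id
        and endpoint.get("ServiceName") == service_name
        and endpoint.get("VpcEndpointType") == "Interface"
        and endpoint.get("State", "").lower() == "available"
        for endpoint in vpc_endpoints
    )


# Handwritten response with placeholder IDs: only an S3 gateway endpoint exists.
endpoints = [{
    "VpcId": "vpc-0123456789abcdef0",
    "ServiceName": "com.amazonaws.us-east-1.s3",
    "VpcEndpointType": "Gateway",
    "State": "available",
}]
service = "com.amazonaws.us-east-1.imagebuilder"  # assumed name format
print(has_interface_endpoint(endpoints, "vpc-0123456789abcdef0", service))
# prints False
```

The sketch doesn't compare subnets. If the endpoint exists, also confirm that its subnet list includes the subnet from the infrastructure configuration.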

The timeout occurs when the build retrieves the mirrorlist

If you build on a private subnet with an Amazon Linux based AMI, then the following timeout error might occur:

"Could not retrieve mirrorlist; error was 12: Timeout was reached."

The Amazon Linux mirrorlist is stored on Amazon Simple Storage Service (Amazon S3). Confirm that an Amazon VPC gateway endpoint exists for Amazon S3. If the endpoint doesn't exist, then create it. The Amazon S3 prefix list is automatically added to the route table when you create an endpoint. However, as a best practice, confirm that the prefix list is added.
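To confirm that the prefix list was added, you can inspect the route table of the build subnet. The following is a minimal Python sketch that works on one entry shaped like RouteTables[n] in an EC2 DescribeRouteTables response. The function name and IDs are placeholders. The sketch doesn't verify that the prefix list belongs to Amazon S3, so compare the returned ID with the S3 prefix list for your Region.

```python
def prefix_list_routes(route_table):
    """Return (prefix_list_id, gateway_id) pairs for routes that send a
    prefix list to a VPC endpoint.

    route_table: one entry shaped like RouteTables[n] in an EC2
    DescribeRouteTables response.
    """
    return [
        (route["DestinationPrefixListId"], route["GatewayId"])
        for route in route_table.get("Routes", [])
        if "DestinationPrefixListId" in route
        and route.get("GatewayId", "").startswith("vpce-")
    ]


# Handwritten route table with placeholder IDs.
table = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationPrefixListId": "pl-0123456789abcdef0",
         "GatewayId": "vpce-0123456789abcdef0"},
    ]
}
print(prefix_list_routes(table))
```

An empty list means that the route table has no gateway endpoint route, which matches the mirrorlist timeout described above.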

If you're building on a non-Amazon Linux AMI, then the mirrorlist isn't stored on Amazon S3, and a build timeout might occur when the build retrieves the repository or mirrorlist. Make sure that your network firewall or proxy allows the repository address or URL. If the repository or mirrorlist requires internet access, then allow internet connectivity on the subnet through a NAT gateway.

The timeout occurs at the ApplyBuildComponents step

If a build timeout occurs at the ApplyBuildComponents step, then the following error appears:

"Workflow Execution ID: failed with reason: ExpectationNotMet. ssm:ListCommandInvocations did not meet terminal states: [['Success']] after 1440 attempts. Reason: Timeout. in workflow step ApplyBuildComponents."

To troubleshoot this error, take the following actions:

  • Analyze the logs that are sent to the infrastructure's Amazon S3 bucket. For more information, see the bulleted item for Amazon Simple Storage Service (Amazon S3) in the Review workflow runtime logs section of Troubleshoot pipeline builds.
  • Analyze the component logs that are on the Amazon Elastic Compute Cloud (Amazon EC2) instance that you use to build or test a new image. Before you check the logs, turn off the Terminate instance on failure feature in the troubleshooting settings section of the infrastructure configuration.
    Note: The detailedoutput.json log file describes the reason that the component failed or timed out. The application.log file provides debug-level troubleshooting information.
  • Check the timeoutSeconds parameter value that's specified in the YAML schema for your component document. The default value is 7200 seconds. Update this value for each step in the component as needed. A value of -1 sets an infinite timeout.
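To see which steps rely on the default timeout, you can list the effective timeoutSeconds value for each step. The following is a minimal Python sketch that assumes the component YAML is already parsed into a dictionary with phases and steps keys. The sample document, step names, and helper name are illustrative.

```python
DEFAULT_TIMEOUT_SECONDS = 7200  # default stated in this article


def step_timeouts(document):
    """Return {(phase, step): timeout} for every step in a component document.

    document: the component YAML already parsed into a dict. Steps
    without timeoutSeconds get the default value.
    """
    timeouts = {}
    for phase in document.get("phases", []):
        for step in phase.get("steps", []):
            key = (phase["name"], step["name"])
            timeouts[key] = step.get("timeoutSeconds", DEFAULT_TIMEOUT_SECONDS)
    return timeouts


# Handwritten document: one step sets a timeout, the other uses the default.
sample_document = {
    "schemaVersion": 1.0,
    "phases": [
        {"name": "build", "steps": [
            {"name": "InstallPackages", "action": "ExecuteBash",
             "timeoutSeconds": 600,
             "inputs": {"commands": ["echo placeholder"]}},
            {"name": "LongCompile", "action": "ExecuteBash",
             "inputs": {"commands": ["echo placeholder"]}},
        ]},
    ],
}

for (phase, step), timeout in step_timeouts(sample_document).items():
    label = "no limit" if timeout == -1 else f"{timeout}s"
    print(f"{phase}/{step}: {label}")
```

A step that appears with 7200s and is known to run long is a candidate for an explicit timeoutSeconds value.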

Related information

Why is my image build pipeline failing with the error "Step timed out while step is verifying the Systems Manager Agent availability on the target instance(s)" in Image Builder?

AWS OFFICIAL, updated 22 days ago