Hello, I am running an Elastic Kubernetes Service ...
# kubernetes
a
Hello, I am running an Elastic Kubernetes Service cluster with autoscaling via Karpenter, which spins up nodes in response to resource requests from deployments that are provisioned with Pulumi. Whenever a dev wants to get a test environment, automation is deployed that runs a
pulumi up
and creates a new environment, namespaced accordingly, in that cluster. However, the
pulumi up
sometimes fails because the cluster does not yet have enough capacity: a new node takes several minutes to spin up, and the resources time out before they become live. I get this error:
kubernetes:core/v1:Service (wox-parcel-api-svc):
    error: 2 errors occurred:
    	* resource pr-1312/wox-parcel-api-svc was successfully created, but the Kubernetes API server reported that it failed to fully initialize or become live: 'wox-parcel-api-svc' timed out waiting to be Ready
    	* Service does not target any Pods. Selected Pods may not be ready, or field '.spec.selector' may not match labels on any Pods

  kubernetes:core/v1:Service (home-value-api-svc):
    error: 2 errors occurred:
    	* resource pr-1312/home-value-api-svc was successfully created, but the Kubernetes API server reported that it failed to fully initialize or become live: 'home-value-api-svc' timed out waiting to be Ready
    	* Service does not target any Pods. Selected Pods may not be ready, or field '.spec.selector' may not match labels on any Pods

  kubernetes:core/v1:ServiceAccount (pr-1312-homevalue):
    error: 1 error occurred:
    	* resource pr-1312/pr-1312-homevalue was successfully created, but the Kubernetes API server reported that it failed to fully initialize or become live: Timeout occurred polling for 'pr-1312-homevalue'

  kubernetes:core/v1:ServiceAccount (pr-1312-woxapi):
    error: 1 error occurred:
    	* resource pr-1312/pr-1312-woxapi was successfully created, but the Kubernetes API server reported that it failed to fully initialize or become live: Timeout occurred polling for 'pr-1312-woxapi'

  kubernetes:core/v1:Service (market-insights-api-svc):
    error: 2 errors occurred:
    	* resource pr-1312/market-insights-api-svc was successfully created, but the Kubernetes API server reported that it failed to fully initialize or become live: 'market-insights-api-svc' timed out waiting to be Ready
    	* Service does not target any Pods. Selected Pods may not be ready, or field '.spec.selector' may not match labels on any Pods
I've tried increasing
customTimeouts
on the Pulumi-provisioned resources to wait for the new node to become live, but this hasn't fixed the issue. Is there some way to get Pulumi to retry, or wait for additional compute?
c
We had a similar issue waiting for instance refreshes after an ASG update. You could adapt this to whatever arbitrary check you need to confirm that instances are up:
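import * as pulumi from '@pulumi/pulumi';
import { AutoScalingClient, DescribeInstanceRefreshesCommand } from '@aws-sdk/client-auto-scaling';
import { fromNodeProviderChain } from '@aws-sdk/credential-providers';
import { backOff } from 'exponential-backoff';

// awsProfile and awsRegion are assumed to be defined elsewhere in the program.
// Polls the ASG until no instance refresh is in progress; skipped during previews.
async function waitForInstanceRefresh(name: string): Promise<boolean> {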
  if (!pulumi.runtime.isDryRun()) {
    const credentials = fromNodeProviderChain({ profile: awsProfile });
    const config = { credentials, region: awsRegion };
    const autoScalingClient = new AutoScalingClient(config);

    const refreshCommand = new DescribeInstanceRefreshesCommand({
      AutoScalingGroupName: name,
      MaxRecords: 100,
    });

    await backOff(
      async () => {
        const { InstanceRefreshes } = await autoScalingClient.send(refreshCommand);
        const inProgress = InstanceRefreshes?.filter((e) => e.Status === 'InProgress');
        if (inProgress && inProgress.length === 0) {
          return true;
        } else if (inProgress) {
          throw Error(`ASG refresh still in progress`);
        } else {
          throw Error('ASG client failed?');
        }
      },
      {
        retry: async (e, attemptNumber) => {
          await pulumi.log.info(`checking ASG refresh for ${name}: ${attemptNumber}`);
          return true;
        },
        numOfAttempts: 30,
        startingDelay: 10000,
        maxDelay: 60000,
        delayFirstAttempt: true,
        jitter: 'none',
      },
    );
  }
  return true;
}
then call this in an
apply
on some property before the thing you want to deploy, like
asg.name.apply((name) => waitForInstanceRefresh(name));
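To adapt the same pattern to a different readiness check, the polling loop can be separated from the check itself. The following is only a minimal, dependency-free sketch: `waitUntil`, `WaitOptions`, and the fixed delay are hypothetical names and choices for illustration, not part of Pulumi or of the snippet above, and the check you pass in (for example, one that asks the cluster whether enough nodes are Ready) is something you would have to write yourself.

```typescript
// Hypothetical helper: names and defaults are illustrative, not from Pulumi.
interface WaitOptions {
  attempts: number; // maximum number of times to run the check
  delayMs: number;  // fixed delay between checks, in milliseconds
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `check` until it resolves to true or the attempts are exhausted.
// Errors thrown by `check` are not caught; they propagate to the caller.
async function waitUntil(
  check: () => Promise<boolean>,
  { attempts, delayMs }: WaitOptions,
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    if (await check()) {
      return true;
    }
    // Only sleep when another attempt is still coming.
    if (attempt < attempts) {
      await sleep(delayMs);
    }
  }
  throw new Error(`condition not met after ${attempts} attempts`);
}

// Usage inside a Pulumi program (assumed; `nodesAreReady` is a check you supply):
//   someOutput.apply(() => waitUntil(nodesAreReady, { attempts: 30, delayMs: 10000 }));
```

As in the ASG example, you would probably want to guard the call with `pulumi.runtime.isDryRun()` so that previews do not block on the check.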