# automation-api
c
We have a game-breaking bug with pulumi automation where every once in a while the apply will hang with a few "unhandled rejection" errors. Normally we would just retry in these cases but the hanging makes it completely block further use of our application. Not quite sure if this is my issue or yours.
r
What version are you on? I know @lemon-agent-27707 recently made some fixes in this area.
l
And please provide any detail on the error, related code, etc, that you can.
c
@lemon-agent-27707 Error always looks like this:
```
unhandled rejection: CONTEXT(127): resource:svg-xss-44dfd983-b768-4c01-ab49-4315ec88a39f[kubernetes:apps/v1:Deployment]
STACK_TRACE:
Error
    at Object.debuggablePromise (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/runtime/debuggable.js:69:75)
    at Object.registerResource (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/runtime/resource.js:219:18)
    at new Resource (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/resource.js:215:24)
    at new CustomResource (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/resource.js:307:9)
    at new Deployment (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/apps/v1/deployment.ts:109:9)
    at new Deployment (/app/node_modules/@pulumi/kubernetesx/kx.js:166:9)
    at /app/index.ts:157:41
    at Object.exam (/app/index.ts:368:6)
    at Stack.<anonymous> (/app/node_modules/@pulumi/pulumi/runtime/stack.js:86:43)
    at Generator.next (<anonymous>)
```
It's written to stderr, and the apply will continue for a bit until it hangs. If I restart pulumi, it will usually work.
Using pulumi 2.25.
l
1. What version of the pulumi CLI and `@pulumi/pulumi` are you using?
2. Can you share some of your code? That error points to this line:

```
    at /app/index.ts:157:41
```
c
Here's the specific resource:
```typescript
const svg_xss = new k8x.Deployment("svg-xss-" + id, {
    metadata: {namespace: kubernetesWorkspace.metadata.name},
    spec: new k8x.PodBuilder({
        automountServiceAccountToken: false,
        imagePullSecrets: [{
            name: dockercfg.metadata.name,
        }],
        containers: [
            {
                env: {
                    PORT: svg_xss_port.toString(),
                    FLAG: task.flag,
                    SESSION_KEY: svg_xss_session_key.result,
                },
                name: "svg-xss-server-" + id,
                image: toContainerURL("svg_xss_server"),
                ports: {
                    http: svg_xss_port,
                },
            },
        ]
    }).asDeploymentSpec()
})
```
```typescript
function toContainerURL(containerName: string): pulumi.Output<string> {
    return registryEndpoint.apply(host => {
        return host + "/" + toContainerPath(containerName) + ":latest"
    })
}
```
Right out of our package.json:
```json
"dependencies": {
    "@jsier/retrier": "^1.2.4",
    "@pulumi/digitalocean": "^3.8.0",
    "@pulumi/dnsimple": "^2.7.0",
    "@pulumi/kubernetes": "^2.9.1",
    "@pulumi/kubernetesx": "^0.1.4",
    "@pulumi/pulumi": "^2.25.0",
    "@pulumi/random": "^3.2.0",
    "axios": "^0.21.1",
    "express": "^4.17.1",
    "reattempt": "^0.1.1",
    "redis": "^3.1.1",
    "rwlock": "^5.0.0"
}
```
l
Can you share the full error output?
c
That is literally the entire error. It's repeated twice:
Copy code
unhandled rejection: CONTEXT(127): resource:svg-xss-44dfd983-b768-4c01-ab49-4315ec88a39f[kubernetes:apps/v1:Deployment]
STACK_TRACE:
Error
    at Object.debuggablePromise (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/runtime/debuggable.js:69:75)
    at Object.registerResource (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/runtime/resource.js:219:18)
    at new Resource (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/resource.js:215:24)
    at new CustomResource (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/resource.js:307:9)
    at new Deployment (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/apps/v1/deployment.ts:109:9)
    at new Deployment (/app/node_modules/@pulumi/kubernetesx/kx.js:166:9)
    at /app/index.ts:157:41
    at Object.exam (/app/index.ts:368:6)
    at Stack.<anonymous> (/app/node_modules/@pulumi/pulumi/runtime/stack.js:86:43)
    at Generator.next (<anonymous>)
unhandled rejection: CONTEXT(127): resource:svg-xss-44dfd983-b768-4c01-ab49-4315ec88a39f[kubernetes:apps/v1:Deployment]
STACK_TRACE:
Error
    at Object.debuggablePromise (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/runtime/debuggable.js:69:75)
    at Object.registerResource (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/runtime/resource.js:219:18)
    at new Resource (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/resource.js:215:24)
    at new CustomResource (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/pulumi/resource.js:307:9)
    at new Deployment (/app/node_modules/@pulumi/kubernetesx/node_modules/@pulumi/apps/v1/deployment.ts:109:9)
    at new Deployment (/app/node_modules/@pulumi/kubernetesx/kx.js:166:9)
    at /app/index.ts:157:41
    at Object.exam (/app/index.ts:368:6)
    at Stack.<anonymous> (/app/node_modules/@pulumi/pulumi/runtime/stack.js:86:43)
    at Generator.next (<anonymous>)
But there's no "full error". Like I said, it hangs.
The apply keeps running for a bit whenever this happens, and then never finishes.
l
c
```typescript
while (true) {
    try {
        await stack.setConfig("input", {
            value: queuedInput,
            secret: false,
        })
        console.log("New state: " + queuedInput)
        console.log("Performing apply...")
        const applyOutput = await stack.up({
            onOutput: function (output) {
                console.log("[PULUMI]: " + output)
            },
        })
        console.log("Finished apply")
        appliedOutput = applyOutput.outputs
        appliedConfigTime = recvTime
        release()
        return
    } catch (err) {
        console.log("Error encountered during apply: " + err)
        // This can throw, in case no apply needed to be canceled
        try { await stack.cancel() } catch {}
    }
}
```
The "Error encountered during apply" bit never executes, so I assume it's not throwing an exception.
If it helps: as far as I can remember, the unhandled rejection error has always pointed specifically to a Deployment.
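(A hedged aside, not from the thread: while the root cause was still unknown, one way to keep a silently hanging `up()` from blocking this loop forever would be to race it against a timeout. `withTimeout` below is a sketch, not Automation API functionality.)

```typescript
// Sketch: wrap a promise so a silent hang surfaces as a rejection after
// `ms` milliseconds instead of blocking the caller indefinitely.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    });
    // Whichever settles first wins; always clear the timer afterwards.
    return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

For example, `await withTimeout(stack.up({ onOutput }), 30 * 60 * 1000)` inside the try block. Note this only unblocks the loop: the underlying update may still be running, so the existing `stack.cancel()` in the catch remains necessary.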
l
Is this a consistent error, or does it only happen sometimes?
c
Only happens sometimes. Usually one or two applies after the first apply. I don't think I've ever seen it happen on startup.
l
When did this start happening? Does this happen with earlier versions as well, or only after a recent upgrade?
c
We upgraded in an attempt to fix this
l
I see, so this has always happened?
c
Yes
l
> The apply keeps running for a bit whenever this happens, and then never finishes.

What do you mean by this? How did you observe it?
c
Well, it neither throws an exception, nor hits the `release()`/`return` block, nor outputs "Finished apply".
So I assumed that meant the `up()` call was still running
l
What version of node are you running, and can you share these two lines of code (with line numbers labeled if possible)?

```
    at /app/index.ts:157:41
    at Object.exam (/app/index.ts:368:6)
```
By any chance are you using our kubernetes helm support in your program?
c
I am not using Kubernetes Helm. 368 is just `}()`, the end of the pulumi program definition. 157 is the starting line of the Deployment I showed you earlier:

```typescript
const svg_xss = new k8x.Deployment("svg-xss-" + id, {
```
l
c
sure.
Actually, this goes through CI, so I don't have the ability to do that.
Let me figure this out
l
Before we go any deeper there, would you mind opening an issue that summarizes the problem? github.com/pulumi/pulumi/issues/new
c
l
I notice that your stack trace includes nested versions of `@pulumi/pulumi`: one at the top level, and one nested under `@pulumi/kubernetesx`. Do you have a package-lock.json or yarn.lock file so that we could see which versions are being resolved?
Might also be worth doing a run with `excessiveDebugOutput` set to `true`. Unfortunately this requires modifying the JS source of that variable: https://github.com/pulumi/pulumi/blob/036344679a0c2e24af3c9e4a0d7ea34adff5136b/sdk/nodejs/runtime/settings.ts#L35
Does pulumi still think the update is running at this point? Or has the update completed?
c
Sorry; internet went out
Here it is
@lemon-agent-27707 Pulumi continues to think the update is running, and the pod has to cancel it upon restart. It hangs for at least 24 hours if no one manually runs `kubectl rollout restart`.
l
Ok, looks like both versions of pulumi/pulumi are 2.25. Would be helpful to see the excessive debug output when you get a moment. Also out of curiosity, what version of node are you running?
c
I'm gonna modify the dockerfile so it uses a custom node_modules dir rn
@lemon-agent-27707 Hey, so I'm finally getting around to testing it like you said... We upgraded to pulumi 3.0, and I removed the process.on() block you mentioned. The issue seems to be gone.
I can't tell whether or not it's because we upgraded, or because of the code change I made
r
Great news! Curious to see what you’d encounter without the code change to figure out whether it’s the upgrade or the code change, but my bet is on the code change because I don’t believe we changed anything in 3.0 that would help here.
l
Ah, interesting.
Can you let me know if you still see this issue without modifying the uncaught handler? If that is definitively the issue I will put together a patch that removes it so you don't have to maintain a fork.
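(For context, a hypothetical sketch of the kind of process-wide handler being discussed; the thread never shows the actual code that was removed. A handler that only logs keeps Node alive, but it also means the rejection never propagates to the Automation API caller.)

```typescript
// Hypothetical: a log-only handler like this swallows the rejection.
// Node stays alive, nothing rethrows or exits, and the inline program's
// failure never reaches stack.up() -- the apply just appears to hang.
process.on("unhandledRejection", (reason) => {
    console.error("unhandled rejection:", reason);
});
```

Removing such a handler (or rethrowing/exiting inside it) lets the failure surface as a real error, which matches what was observed after the patch.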
c
@lemon-agent-27707 It's definitively the patch we made.
l
Can you try to precisely describe the difference in behavior before and after the patch? Is it that without the patch your program fails, and with it, it completes successfully?
Or is it that without the patch your `catch` executes?
c
Without the patch, every so often, pulumi applies hang without erroring and must be canceled and restarted. With the patch, it doesn't, and resources are created successfully.
l
It sounds like now in some cases your `catch` block is actually executing? Is that true, or did the failures go away altogether?
c
Failures went away altogether, I believe. But I'm not sure. I can check next time I see the logs.
I take that back. It's finally propagating back up, and it turned out to be a bug in our code - but one that we can fix, because we got the real error message this time.
l
Ah, glad to hear. Thanks for the report!