# general
happy-kitchen-33409
Hi guys, question on resolving `the stack is currently locked by 1 lock(s)`. We use the Automation API, and in some cases the Pulumi process is killed (due to a timeout etc.), so "wait" probably won't work. For `pulumi cancel`, the documentation says the operation is "_very dangerous_", and we've had some leaked resources that I suspect are because of `pulumi cancel` (the resources exist, but aren't tracked in the corresponding Pulumi stack). What's the recommended way to solve this problem (ideally automatically, without human interaction)? Thanks!
billowy-army-68599
Hi Menghan. The best way to really resolve this is to prevent the timeout from killing the process. You have leaked resources because the process was killed in the middle of provisioning and Pulumi lost track of them. `cancel` is the way to deal with locked stacks, but if the process is interrupted you're going to leak resources.
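For reference, a minimal sketch of the cancel-then-refresh path through the Pulumi Automation API for Python, assuming a project already on disk; the stack name and `work_dir` are illustrative. `cancel()` clears the lock, and `refresh()` only reconciles resources Pulumi already tracks, so it won't recover anything that was never recorded in state:

```python
from pulumi import automation as auto

# Assumes an existing Pulumi project on disk; stack name and path are illustrative.
stack = auto.select_stack(stack_name="dev", work_dir="./infra")

# Clears the lock left behind by a killed update. This is the dangerous part:
# only do it once you're sure no update is actually still running.
stack.cancel()

# Re-reads cloud state for the resources Pulumi *does* track. It cannot
# discover resources that were created but never written to the state file.
stack.refresh(on_output=print)
```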
happy-kitchen-33409
I see, thanks. We want to give Pulumi a long enough timeout to finish, but at the same time, not having a timeout at all is a bit concerning...
billowy-army-68599
Well, there are two timeouts to consider: a graceful timeout and a killed-process timeout. If you just kill the process in the middle of performing operations, Pulumi can't track the current state of a resource in the cloud. The way Pulumi works is that it sends an API request to the cloud provider and then waits/polls for responses from that API to determine whether the operation finished. If you send the API request and then kill Pulumi while the resource is still provisioning, it doesn't know whether it completed or not.
happy-kitchen-33409
A graceful timeout sounds good! Is there a way to send a "graceful kill" to Pulumi before the "hard kill"?
billowy-army-68599
Pulumi will honour standard signals, so if you send a SIGTERM it will finish what it's doing and then exit.
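Outside the Automation API, that graceful-then-hard sequence is plain process management. A sketch using only the Python standard library, assuming you launch the CLI yourself; the overall budget and grace period are assumptions to tune:

```python
import signal
import subprocess

# Illustrative only: run `pulumi up` as a child process with a hard deadline,
# but give it a chance to finish the in-flight step before resorting to kill.
proc = subprocess.Popen(["pulumi", "up", "--yes", "--stack", "dev"])

try:
    proc.wait(timeout=3600)           # overall budget (assumed: 1 hour)
except subprocess.TimeoutExpired:
    proc.send_signal(signal.SIGTERM)  # ask Pulumi to finish what it's doing and exit
    try:
        proc.wait(timeout=300)        # grace period (assumed: 5 minutes)
    except subprocess.TimeoutExpired:
        proc.kill()                   # last resort; this is what leaks resources
        proc.wait()
```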
salmon-gold-74709
I would just set the timeout very high, maybe 5x the expected run time, and try breaking up your stacks if they have long run times. Killing Pulumi will never be good - avoid it at all costs.
happy-kitchen-33409
@billowy-army-68599 more on the SIGTERM part: is there an API to send the signal via the Pulumi Automation API?
And is there an API to check how old a lock is? (So we can cancel only if the lock is old enough.)
salmon-gold-74709
@happy-kitchen-33409 Can you clarify why you want to kill the Pulumi process? You are almost guaranteed to have problems, however you do this. See what @billowy-army-68599 said in his first and second responses. If the problem is that Pulumi is taking a very long time to finish, I suggest you try solving that instead of killing Pulumi. If you want a timeout "just in case", why not set it to 5x what you expect the duration to be? Then it won't run forever, but it also won't lose resources.
happy-kitchen-33409
We did have a long enough timeout (at the time). But as we added resources to the Pulumi stack, Pulumi needed more time, and the old timeout was no longer long enough. And since I'm already in this rabbit hole, besides increasing the timeout (which I already did), I also want to see whether there are ways to handle this gracefully. For example, if the Pulumi process is still killed for reasons we didn't consider, how do we make it easier to recover? We run Pulumi in automated workflows, and requiring manual action to recover will become a maintenance burden.
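One recovery path worth knowing about for an interrupted update is to clear the pending operations recorded in the state and then refresh. A rough sketch of automating that with the Python Automation API (names are illustrative); it clears the "operation in progress" records but does not bring back resources that were never written to state - those still need `pulumi import` or manual cleanup:

```python
from pulumi import automation as auto

stack = auto.select_stack(stack_name="dev", work_dir="./infra")  # illustrative names

# Export the checkpoint, drop any operations that were in flight when the
# process died, and write the cleaned state back.
deployment = stack.export_stack()
if deployment.deployment.get("pending_operations"):
    deployment.deployment["pending_operations"] = []
    stack.import_stack(deployment)

# Reconcile the resources Pulumi tracks with what actually exists in the cloud.
stack.refresh(on_output=print)
```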
salmon-gold-74709
On the "graceful timeout" that @billowy-army-68599 mentioned - I think he was referring to a timeout managed within Pulumi, where if a cloud provider takes too long to respond, Pulumi times the operation out. That's the only really graceful timeout here. Killing Pulumi, even with SIGTERM, may leave some API operations 'in flight', so Pulumi doesn't know whether the resource operation completed.
If Pulumi is killed due to resource exhaustion, it could get various signals, not just SIGTERM, depending on your environment. Another option may be to move to Pulumi Deployments, where I would assume Pulumi has solved these problems.
Breaking up the stack is probably the best option to avoid ever-growing run times.
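If the Pulumi-managed "graceful timeout" is the knob you want, the `custom_timeouts` resource option bounds how long Pulumi waits on the provider for a single resource, rather than how long the whole process may run. A minimal sketch, assuming the Python SDK and an AWS resource purely as an example; the durations are illustrative:

```python
import pulumi
from pulumi_aws import s3

# Bound how long Pulumi waits for this resource's create/update/delete to
# complete; on expiry the step fails gracefully instead of the process being killed.
bucket = s3.Bucket(
    "example-bucket",
    opts=pulumi.ResourceOptions(
        custom_timeouts=pulumi.CustomTimeouts(create="30m", update="30m", delete="30m"),
    ),
)
```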
happy-kitchen-33409
We are looking into more ways to avoid killing Pulumi. Breaking up the stack probably won't help in our particular case, because the overall workflow timeout covers all the resources, whether they're in one Pulumi stack or several. On the recovery part, is my understanding right that there's no easy way to recover if Pulumi is killed (and there's probably good timing vs. bad timing...)? By Pulumi Deployments you mean Pulumi Cloud, right? Does it help with the "not-killing-Pulumi" part, or does it also help with recovery?
salmon-gold-74709
Not sure how you are running this, but if it's a hosted CI tool, it may be feasible to have smaller stacks, each in a single workflow with the original run-time limit. You may also want to look at a non-CI way of running Pulumi tasks - for example, AWS Batch is a hosted service designed to run containers that take a long time to complete, without having to manage servers. ECS/EKS on Fargate may also be a good option if you don't need the batch queue model, perhaps for more parallelism. Just stay away from Lambda, as it has a strict 15-minute timeout.
Other cloud providers should have similar services, of course. How are you running this? Pulumi Deployments is a Pulumi Cloud service: https://www.pulumi.com/docs/pulumi-cloud/deployments/ - it's a specialised service that only runs Pulumi tasks.
happy-kitchen-33409
We run Temporal on EKS, so we do have control over the timeout and other things. It's more about our business logic: we provision multiple resources to serve a customer request. We start a workflow which runs Pulumi (and some other things, database etc.) to handle a request, and this workflow has an overall timeout.
Thanks for the help @salmon-gold-74709 and @billowy-army-68599 - this has been very helpful. I will bring this to my team and discuss it with them.
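To make the earlier suggestions concrete for a Temporal setup like this, a rough sketch of giving the Pulumi step its own activity with a generous start-to-close timeout (the "5x expected" idea above), assuming the Temporal Python SDK; every name and duration here is illustrative, not a prescription:

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def pulumi_up(stack_name: str) -> None:
    # Would run `stack.up()` via the Automation API (omitted); killing this
    # activity mid-run is exactly the case that leaks resources.
    ...


@workflow.defn
class ProvisionWorkflow:
    @workflow.run
    async def run(self, stack_name: str) -> None:
        # Give the Pulumi step its own generous budget instead of letting the
        # overall workflow timeout cut it off mid-provision.
        await workflow.execute_activity(
            pulumi_up,
            stack_name,
            start_to_close_timeout=timedelta(hours=2),     # assumed ~5x a normal run
            retry_policy=RetryPolicy(maximum_attempts=1),  # don't blindly re-run a half-finished update
        )
```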
salmon-gold-74709
You're welcome - interesting hearing about your use case.