Handling Partial Failures/Timeouts With Kubernetes and Pending Operations
# kubernetes
c
Just got my first prod Kubernetes deployment out the door. 🎉 Now I'm running into a situation and not sure if it's easily resolved. If I deploy to a new environment and it fails, things can get into a stuck state. For example, the Ingress timed out. Now the deployment can't proceed, because it tries to recreate the Ingress that was already created but never marked as healthy. I've adjusted things to have a longer timeout, but that doesn't fix the same kind of failure I'd still deal with in the future.
• I can't access the cluster to try an import, and I don't see any import options for a Kubernetes object.
• The only thing I can see is to run a destroy on the stack and redeploy, but that's not ideal since everything is up except for Pulumi knowing the Ingress was successfully created. That impacts downtime.
• For now I had to ask the admin to delete the Ingress manually, and that won't fly in any future deployments.
How do I handle this? In this scenario the state tracking is actually causing a problem for me. And since state tracking is involved, it's causing others to question the approach, since kubectl apply and such don't do state tracking. I want to avoid having to shelve this work, but I have to figure out some way to handle this myself in the future. Would appreciate any guidance on how I can stabilize this with the limited access I have.
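For context, a minimal sketch of what the longer-timeout adjustment mentioned above could look like, assuming a Pulumi TypeScript program; the resource name, backend service, and the 30-minute value are hypothetical:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical Ingress with an extended create/update timeout so a slow
// ingress controller doesn't fail the whole deployment outright.
const appIngress = new k8s.networking.v1.Ingress("app-ingress", {
    spec: {
        defaultBackend: {
            service: { name: "app-svc", port: { number: 80 } },
        },
    },
}, {
    // Pulumi resource option: give the provider's await logic up to 30
    // minutes before declaring the resource failed.
    customTimeouts: { create: "30m", update: "30m" },
});
```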
r
+1, I don't have an answer, but I have run into similar issues as you describe with Kubernetes deployment and state management, and I'm worried about pitching it to the team without a good solution for the issues you're describing.
c
Yeah, I had major pushback because tracking state for Kubernetes is "not idiomatic" in general. I see the benefits, but today I'm slightly stuck again. I cancelled a deployment that was hung up due to a wrong image. Now I can't run a deploy because the resource was created but didn't go into state. To fix it I'll have to force-recreate it or get access to the cluster.
I just reran after having to delete. This is really problematic when I don't have access to the clusters via CLI/kubectl; I can only use a pipeline task. I might have to write some tasks to run commands to remove resources if this happens, but I'd really like to figure out how to avoid this. It's the only major issue I have remaining.
g
Here are a couple of options that should help you get unstuck:
1. You can skip the await logic on any k8s resource by including a `pulumi.com/skipAwait` annotation on the resource (sketched below). Pulumi will simply create/update a resource with this annotation, just like `kubectl apply`. (Note that doing this can cause problems with outputs not actually being ready for dependent resources to use, so I'd use this approach sparingly.)
2. You can use the `--target` option on `pulumi destroy` to delete only the specified resource. If you have access to run Pulumi operations, you could delete the ingress like this and then run another `pulumi update`. https://www.pulumi.com/docs/reference/cli/pulumi_destroy/
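A minimal sketch of option 1, assuming a Pulumi TypeScript program; the resource name, host, and backing service are hypothetical:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical Ingress that Pulumi creates without waiting for it to become
// healthy, mirroring kubectl apply's fire-and-forget behavior.
const ingress = new k8s.networking.v1.Ingress("app-ingress", {
    metadata: {
        annotations: {
            "pulumi.com/skipAwait": "true", // skip the provider's await logic
        },
    },
    spec: {
        rules: [{
            host: "app.example.com",
            http: {
                paths: [{
                    path: "/",
                    pathType: "Prefix",
                    backend: { service: { name: "app-svc", port: { number: 80 } } },
                }],
            },
        }],
    },
});
```

For option 2, the URN to pass to `pulumi destroy --target` can be listed with `pulumi stack --show-urns`.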
c
@gorgeous-egg-16927 Okay, so my failure scenario is something like a Deployment trying to pull an image and not finding the tagged image. It either times out or I cancel. The resource is stuck in pending; it is not a created object that is tracked at that point. Now I have the auto-generated objects that are unhealthy running in parallel to the original healthy ones. To me this is a pro for Pulumi. But now I need to ensure that the updated tagged image is used, and that won't work because the Deployment is already created. I'm not sure I want to destroy it, because there's a healthy one running alongside the auto-generated one that never got flipped over. That's the kind of scenario that seems problematic. The Ingress was a one-time issue I ran into, but a failed Deployment is much more likely. If I had access to the cluster directly I could just delete it, even if that's not ideal, and rerun. However, in my scenario everything is locked behind running it through a pipeline agent, so for any CLI-specific operations I'll have to write new pipeline tasks that accept input. I'm assuming I'll have to do that, but I want to make sure I'm not missing something.
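One way such a pipeline task could be sketched, using the Pulumi Automation API for Node.js; the stack name, working directory, and helper name are hypothetical, and the `target` option on destroy (mirroring the CLI's `--target` flag) assumes a recent SDK version:

```typescript
import * as auto from "@pulumi/pulumi/automation";

// Hypothetical pipeline task: destroy a single stuck resource by URN without
// touching the rest of the stack, i.e. the programmatic equivalent of
// `pulumi destroy --target <urn>`.
async function destroyTarget(stackName: string, workDir: string, urn: string): Promise<void> {
    const stack = await auto.LocalWorkspace.selectStack({ stackName, workDir });
    await stack.destroy({
        target: [urn],         // only this resource is destroyed
        onOutput: console.log, // stream engine output into the pipeline log
    });
}

// The URN would arrive as a pipeline parameter rather than a raw CLI argument.
destroyTarget("prod", "./infra", process.argv[2]).catch((err) => {
    console.error(err);
    process.exit(1);
});
```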
To be honest, what I kind of wanted was for Pulumi to fail that deployment and clean up all of the failed resources it did not end up using, rather than leaving it halfway deployed.
Even if that's not done automatically during the next deployment, that's what I would want to happen... Again, that's the auto-generated Deployment pods that have not yet been set as the target of the Ingress, because it never came up as ready.
Let me know if that changes the story a little. Overall I think with Pulumi I actually have a more robust methodology than kubectl, but I've got to figure out a couple of these edge cases so that folks have confidence in being able to fix things quickly.