# kubernetes
k
I was wondering if anyone has any methods for graceful rollbacks of failed Kubernetes deployments. If we update one of our deployment images to a new version and it doesn't roll out successfully, we would prefer that the change be automatically rolled back to the previous good state. Is the current solution just to do a revert commit and apply again?
b
when you say:
> we would prefer if that change was just automatically rolled back to the previous good state
Do you mean the Pulumi side of things, or the Kubernetes side of things? Pulumi's await logic means the old deployment itself will still be running
k
Probably both would be preferred, but in k8s at least you'd end up with both the old pods and the new pods, with the new ones stuck crashing
If it ends up failing the pulumi checks (say, if the new pods never get ready or keep crashing), our preference would be for it to go back to the known good state both in pulumi and in the k8s objects
n
there are a few k8s operators that handle things like this, perhaps you could look at knative or argocd
my impression is that the proper way to approach this kind of problem is to use a k8s operator that handles it
that being said, I'm sure you could automate it with strictly pulumi too; it's just slightly more risky as you'd be making the deployment an imperative process instead of a declarative one (what happens if the deployment runner crashes after a failed deploy but before cleanup?), so at the very least you'd need to ensure it's idempotent
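for example, something like this with the Automation API — just a sketch, and everything in it (the stack name, the `app:imageTag` config key, the tags) is a placeholder I made up, not anything pulumi prescribes:

```typescript
import { LocalWorkspace } from "@pulumi/pulumi/automation";

// Sketch of an imperative deploy-with-rollback driver. If `pulumi up`
// fails (e.g. the provider's await logic times out on unhealthy pods),
// re-run the update pinned to the last known-good image tag.
// Note the fragility: if this process dies between the failed up and
// the rollback up, nothing finishes the job — hence the whole flow
// needs to be safe to re-run (idempotent).
async function deployWithRollback(newTag: string, lastGoodTag: string) {
    const stack = await LocalWorkspace.createOrSelectStack({
        stackName: "prod", // placeholder stack name
        workDir: ".",      // directory containing the Pulumi program
    });

    await stack.setConfig("app:imageTag", { value: newTag });
    try {
        await stack.up({ onOutput: console.log });
    } catch (err) {
        console.error(`deploy of ${newTag} failed, rolling back`, err);
        await stack.setConfig("app:imageTag", { value: lastGoodTag });
        await stack.up({ onOutput: console.log });
    }
}
```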
b
@nutritious-petabyte-61303 I'm not sure what you're getting at here, Pulumi is declarative and idempotent
n
pulumi is declarative but the deployment is not
that being said, pulumi also has a k8s operator you could use: https://www.pulumi.com/docs/guides/continuous-delivery/pulumi-kubernetes-operator/
I'm not sure how idempotent pulumi truly is, but I know there are (or have been) some scenarios where it doesn't do k8s cleanup appropriately. Essentially pulumi can try applying a resource (inserting it into the cluster state but not the pulumi state) and then fail to deploy it, without cleaning it up from the k8s state since it was never recorded in the pulumi state
b
Again I'm not following, what do you mean the deployment is not? If that's happening, it's a bug and we need to fix it
n
Not sure if that was clear, but I've hit a few scenarios where there's drift between the actual k8s state and the state tracked by pulumi
b
Okay, if that's happening it's a bug. Please file an issue and let me know immediately, this is the first I'm hearing of it
n
Deployment is imperative because it's handled in steps (i.e. "wait for resource X to deploy before deploying resource Y"). This is unavoidable in most scenarios, but in the context of k8s it might not make as much sense
I'd consider the deployment declarative if the pulumi operator was installed in the cluster, and the deployment would amount to "deploy this version of the pulumi program using the operator"
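to illustrate, deploying would then just mean applying a `Stack` custom resource and letting the operator reconcile toward it — rough sketch only, the repo/stack values are made up and the exact spec schema depends on the operator version:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Pulumi Kubernetes Operator "Stack" CR: the operator watches this and
// runs the referenced Pulumi program in-cluster, reconciling on failure.
const appStack = new k8s.apiextensions.CustomResource("app-stack", {
    apiVersion: "pulumi.com/v1",
    kind: "Stack",
    spec: {
        stack: "myorg/myapp/prod",                     // placeholder
        projectRepo: "https://github.com/myorg/myapp", // placeholder
        branch: "refs/heads/main",
        // backend credentials (e.g. a Pulumi access token secret) omitted
    },
});
```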
Most commonly I see the client running the imperative deployment process, however, which means a failure during it (e.g. a network issue) can leave the deployment in a weird state
> Okay, if that's happening it's a bug. Please file an issue and let me know immediately, this is the first I'm hearing of it
It's been a few months since I last hit the issue and I no longer remember what it was specifically, but I'll be sure to do so the next time it happens.
As a rule of thumb though, I suspect that if pulumi recorded any resource pushed to the k8s state in its own state immediately, before running healthchecks, there wouldn't be any issue, as that represents what actually happens during a deploy (once a resource is pushed to the k8s state, it exists there regardless of whether it then fails or succeeds).
The disparities I've hit have always been caused by pulumi running healthchecks before updating its internal state to match what was pushed to k8s, and in some scenarios losing track of what was pushed (because it failed to deploy)
Also to be clear, it's been a few months since the last time I've had a broken deploy like this, so for all I know it might already be fixed.
k
I'm not sure this is exactly what I'm looking for, the operator seems more of a tool for running pulumi within kubernetes natively. It doesn't really seem to affect the actual object control
b
I think we have different definitions of imperative and declarative. My definition of declarative is that a DAG is evaluated and then it's not possible to manipulate that evaluation. It seems your version of declarative is focused on the Kubernetes reconciliation model. The fact that we check for successful healthchecks doesn't change the declarative nature of Pulumi at all, and you can even turn it off with a single flag in the provider
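for reference, the flag is the `pulumi.com/skipAwait` annotation on the resource — minimal sketch, the image and labels are placeholders:

```typescript
import * as k8s from "@pulumi/kubernetes";

const appLabels = { app: "myapp" };

// With skipAwait set, the Kubernetes provider won't wait for the rollout
// to become healthy before marking the resource created/updated.
const deployment = new k8s.apps.v1.Deployment("myapp", {
    metadata: {
        annotations: { "pulumi.com/skipAwait": "true" },
    },
    spec: {
        replicas: 2,
        selector: { matchLabels: appLabels },
        template: {
            metadata: { labels: appLabels },
            spec: {
                containers: [{ name: "myapp", image: "myrepo/myapp:v2" }],
            },
        },
    },
});
```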
n
you'd be looking for a canary deployment using a k8s operator that can handle the rollback if it needs to be done, or some pulumi magic I'm not familiar with. I'm fairly certain Knative or ArgoCD operators would work for that (though Knative is overkill)
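as a sketch of the canary idea, an Argo Rollouts `Rollout` (Argo Rollouts is ArgoCD's sibling argoproj project) can even be declared through Pulumi's CustomResource — the weights, pause duration, and labels below are made up:

```typescript
import * as k8s from "@pulumi/kubernetes";

const appLabels = { app: "myapp" };

// Argo Rollouts replaces the Deployment with a Rollout CR. The controller
// shifts traffic in steps and aborts back to the stable version if the
// new pods never become healthy.
const rollout = new k8s.apiextensions.CustomResource("myapp-rollout", {
    apiVersion: "argoproj.io/v1alpha1",
    kind: "Rollout",
    spec: {
        replicas: 3,
        selector: { matchLabels: appLabels },
        template: {
            metadata: { labels: appLabels },
            spec: {
                containers: [{ name: "myapp", image: "myrepo/myapp:v2" }],
            },
        },
        strategy: {
            canary: {
                steps: [
                    { setWeight: 20 }, // send 20% of traffic to the canary
                    { pause: { duration: "60s" } },
                    { setWeight: 100 },
                ],
            },
        },
    },
});
```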
@billowy-army-68599 yeah, pulumi is declarative from a programming-model standpoint, but the deployment process is not declarative, if that makes sense
(well it entirely depends on how you deploy, since the operator does exist)
generally anything that needs to track intermediate state cannot be declarative IMO
b
no, it doesn't make sense I'm afraid. We seem to disagree on this, but that's okay
n
the problem arises from the dependency on previous state
that makes it imperative
b
I respectfully disagree, and we'll leave it at that
n
yeah we're going a bit offtopic here debating what is or isn't imperative or declarative 😄