# kubernetes
n
(specifically what happened is that a container image tag was changed and a deployment of that change was interrupted; pulumi never committed the change to its own state even though it was already pushed to the cluster)
s
could this help? https://github.com/pulumi/pulumi/issues/8058
pulumi update --refresh
n
Perhaps. I've generally stayed away from refresh because the times I have used it, it's usually broken something else, whether that's due to state drift that's supposed to happen and that I haven't explicitly told Pulumi to ignore, or something else, I don't know. In either case, at this point in the deployment stack's lifecycle, I don't trust refresh enough to dare running it.
In future projects I'll likely try to always call refresh specifically for this reason, but it feels kind of stupid to have to do that just because pulumi doesn't do reliable state management
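(For context, by drift that should happen I mean something like a field an autoscaler manages; the way I'd expect to handle that is Pulumi's ignoreChanges resource option, roughly like the sketch below, where the resource and field names are just made up for illustration.)

```typescript
import * as k8s from "@pulumi/kubernetes";

// Hypothetical Deployment whose replica count is managed by an HPA,
// so drift on spec.replicas is expected and should not be reverted
// by a refresh/up cycle.
const app = new k8s.apps.v1.Deployment("app", {
    spec: {
        replicas: 1,
        selector: { matchLabels: { app: "app" } },
        template: {
            metadata: { labels: { app: "app" } },
            spec: {
                containers: [{
                    name: "app",
                    image: "registry.example.com/app:v1.2.3", // placeholder image
                }],
            },
        },
    },
}, {
    // Tell Pulumi to ignore changes to this property path when diffing.
    ignoreChanges: ["spec.replicas"],
});
```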
s
@nutritious-petabyte-61303 Sorry you ran into a problem. To help me better understand (so that I can file issues as appropriate), how was the container image tag changed?
n
Just a simple version bump, nothing special; it's just the unclean interruption of the deployment that led to the issue.
Seems like Pulumi only writes to its own state after successful deployments, and nothing catches a previously failed run that never got the chance to clean up after itself.
s
Thanks. Was it Pulumi that rebuilt the container image and changed the image tag, or did that happen outside of Pulumi?
n
Build was outside of Pulumi, I just changed the image value in the deployment container spec
s
Unless I’m misunderstanding something, if the build happened outside of Pulumi and the change to the image value happened outside of Pulumi, then there was never an opportunity for Pulumi to update its state. The ideal solution (but I understand this isn’t always possible) would have been to update your Pulumi program and have it change the image value for you; then the Pulumi state would have been updated. In this sort of scenario,
pulumi refresh
is really the only solution for updating Pulumi’s state to match what was changed outside of Pulumi.
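As a rough sketch of what I mean (the resource and config names here are made up): if the image tag flows in through Pulumi config, then bumping the tag and running pulumi up is what changes the Deployment, and the state gets updated as part of that same operation.

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";

// Hypothetical stack config value, e.g. set with: pulumi config set imageTag v1.2.3
const config = new pulumi.Config();
const imageTag = config.require("imageTag");

const app = new k8s.apps.v1.Deployment("app", {
    spec: {
        selector: { matchLabels: { app: "app" } },
        template: {
            metadata: { labels: { app: "app" } },
            spec: {
                containers: [{
                    name: "app",
                    // The image tag is driven by stack config, so changing it
                    // goes through `pulumi up` and through Pulumi's state.
                    image: `registry.example.com/app:${imageTag}`,
                }],
            },
        },
    },
});
```

A tag bump is then just pulumi config set imageTag with the new value, followed by pulumi up.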
n
You're misunderstanding; Pulumi manages the Kubernetes resource, and only the image is built outside of Pulumi. Or in other words, the image was already in the image registry, and Pulumi tracks which image version is mapped to the k8s deployment.
In terms of the timeline it'd roughly be:
• We update the image tag in the Pulumi stack config manually
• We run pulumi up
• Pulumi checks its own state and sees that the desired target image mismatches the current image
• Pulumi appropriately starts a deployment
• Pulumi updates the Deployment config in the k8s cluster to point to the new target image. THIS IS WHERE IT SHOULD ALSO UPDATE THE PULUMI STATE BUT DOES NOT.
• At this point the change has been committed to the k8s cluster but not to the Pulumi state
• The Pulumi deployment gets interrupted for any reason -> the k8s state has been modified but the Pulumi state has not
• On subsequent Pulumi runs, Pulumi is not aware of the change that was applied to the k8s cluster, as it never committed the change to its own state due to the process being interrupted
End result is the actual state diverging from what Pulumi thinks it is. If at this point we were to update the Pulumi stack config to point back to the old image to roll back the deployment, Pulumi will not try to apply any deployment because the config already matches its own state, even though the cluster state differs. The solution in my opinion would be to always commit in-progress state changes to the Pulumi state during, or ideally before, they're applied to the k8s cluster.
I've hit this specific state drift issue on multiple occasions over the years and it's the biggest reason we haven't fully automated the deployment process yet, because I know there are scenarios which are not possible to automatically recover from
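If/when we do automate it, the plan would probably look roughly like the sketch below using the Automation API (the stack name and working directory are placeholders), always refreshing before the up so an interrupted earlier run can't leave a hidden diff:

```typescript
import { LocalWorkspace } from "@pulumi/pulumi/automation";

// Hypothetical deployment driver: reconcile Pulumi's state with the cluster
// first, then apply the desired configuration on top of that.
async function deploy(): Promise<void> {
    // "prod" and "./infra" are placeholders for the stack and program directory.
    const stack = await LocalWorkspace.selectStack({
        stackName: "prod",
        workDir: "./infra",
    });

    // Pull the actual cluster state into the Pulumi state before diffing,
    // so an earlier interrupted `up` can't leave a silent mismatch.
    await stack.refresh({ onOutput: console.log });

    // Now deploy against the refreshed state.
    await stack.up({ onOutput: console.log });
}

deploy().catch((err) => {
    console.error(err);
    process.exit(1);
});
```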
@salmon-account-74572 I could be interested in contributing something on this front if it means getting the issue resolved; let me know what would be the appropriate way to organize that
s
@nutritious-petabyte-61303 Thanks for the additional detail/explanation. I thought the change to the Deployment had been made outside Pulumi, but your workflow explanation cleared that up. Yes, this definitely sounds like an issue and this behavior does not seem appropriate. Would you mind filing an issue on https://github.com/pulumi/pulumi-kubernetes? When you open the issue, mention that you’re open to contributing a fix. We are, of course, more than happy to have you contribute a fix!
n
@salmon-account-74572 I haven't opened an issue yet (bureaucracy, while important, isn't really my strong suit & easily gets put off), but I do want to mention I was recently doing some consulting for a client that was interested in using Pulumi as a core part of their new infrastructure management. Ultimately they opted to use Flux CD instead of Pulumi to manage the application deployments, as there wasn't enough confidence that Pulumi's state management would behave appropriately in a fully automated environment, and this specific scenario was cited as one of the major reasons.
I would've loved to be able to recommend e.g. using https://www.pulumi.com/docs/using-pulumi/continuous-delivery/pulumi-kubernetes-operator/, but unfortunately as things are right now, I really can't do that
s
I’ll get this feedback to the Engineering org; thank you (seriously, we really appreciate it). If I open an issue, would you be willing to update it with this feedback? This will really help with prioritizing any work that is needed.
n
I'll make an effort to, at least, but ultimately it depends on my other workload at the time. At the very least it'd give me a place to provide feedback if/when I next run into this, so it sounds like a good idea 👍
@salmon-account-74572 did you end up opening an issue yet?
s
No, I haven’t yet. Let me do that right now before something else interrupts me. 🙂
@nutritious-petabyte-61303 If you do get a moment to weigh in on the issue, that would be great (your first-hand experience with this issue is valuable).
n
Thanks! I'll see if I have time to do it right away, but if not I'll keep it around and update it the next time this becomes relevant for me