Running into this issue with a new EKS deployment:...
# kubernetes
s
Running into this issue with a new EKS deployment:
kubernetes:apps/v1:Deployment (web):
    error: 2 errors occurred:
    	* the Kubernetes API server reported that "my-application/web-v845050y" failed to fully initialize or become live: 'web-v845050y' timed out waiting to be Ready
    	* Attempted to roll forward to new ReplicaSet, but minimum number of Pods did not become live
If I watch the cluster I can see both pods in the RS stand up and become Ready, and the ReplicaSet/Deployment reports them all as up + Ready/Up-to-Date inside of 2 minutes. This is happening for every Deployment I have configured on this cluster. I tried the latest @pulumi/kubernetes Node module, I’m on the latest Pulumi CLI binary, and the cluster is EKS 1.19. I tried blowing everything up and redeploying. Nothing of note when describing the Deployment or ReplicaSet. It’s like the Pulumi client is just ignoring the state of the ReplicaSet. Any assistance would be appreciated; there’s very little out there on the Googles other than what I’ve already tried.
b
Could you open an issue for this? Ideally with the code you’re using to repro it.
s
Sure.
Of course I butchered the formatting.
There we go.
g
I took a look at the info you provided and nothing is jumping out to me. I did confirm that I was able to roll forward a Deployment on a local v1.19.6 cluster, so it may have something to do with the particular configuration you’re using. If you can, it would be helpful to see the actual Deployment spec that is getting sent to the cluster. https://www.pulumi.com/docs/reference/pkg/kubernetes/apps/v1/deployment/#deployment documents the conditions we’re checking to determine readiness, so you might see if any of those could be the issue.
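For anyone comparing against a live object, here is a rough TypeScript sketch of the two status-condition checks as I read them on that docs page: Available must be True, and for generation > 1, Progressing must be True with reason NewReplicaSetAvailable. The DeploymentLike type and the readinessProblems helper are hypothetical names for illustration. This is not Pulumi’s actual await code, and it skips the generation and ReplicaSet-revision checks.

```typescript
// Hypothetical, minimal shapes: only the fields this sketch reads.
interface Condition {
  type: string;
  status: string;
  reason?: string;
}

interface DeploymentLike {
  metadata: { generation: number };
  status?: { conditions?: Condition[] };
}

// Returns the reasons a Deployment fails the checks; an empty list means it passes.
function readinessProblems(d: DeploymentLike): string[] {
  const problems: string[] = [];
  const conditions = d.status?.conditions ?? [];
  const find = (type: string) => conditions.find((c) => c.type === type);

  // Check: an "Available" condition with status "True".
  if (find("Available")?.status !== "True") {
    problems.push("Available condition is not True");
  }

  // Check: only applies when there has been a rollout (generation > 1).
  if (d.metadata.generation > 1) {
    const progressing = find("Progressing");
    if (progressing?.status !== "True" || progressing?.reason !== "NewReplicaSetAvailable") {
      problems.push("Progressing condition is not True/NewReplicaSetAvailable");
    }
  }
  return problems;
}
```

The input could be the parsed output of kubectl get deployment NAME -o json; printing the returned list shows which check, if any, the object fails.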
s
Here ya go @gorgeous-egg-16927. Thanks for taking a look, let me know if you need more data.
Looks like all the conditions mentioned in the document are fulfilled on the Deployment object during the rollout.
It’s worth noting that:
• This is a multi-tenant cluster
• Other processes leverage the exact same code via CI to deploy to different namespaces
• This namespace is definitely set up the same as the others
• Other processes deploy fine
Checked Kube API server logs in CloudWatch logging groups, found this:
I0330 15:01:42.143571       1 deployment_controller.go:490] "Error syncing deployment" deployment="my-application/web-rov2l78t" err="Operation cannot be fulfilled on deployments.apps \"web-rov2l78t\": the object has been modified; please apply your changes to the latest version and try again"
This happened after the pods in the RS were rolled over
@gorgeous-egg-16927 Any updates on this? It’s blocking us from retiring some aging infrastructure.
g
@billowy-army-68599 was troubleshooting this yesterday, and I believe the problem was related to the kubectl.kubernetes.io/restartedAt annotation. I haven’t tested a workaround, but I expect some combination of removing that annotation and using the skipAwait annotation (https://www.pulumi.com/blog/improving-kubernetes-management-with-pulumis-await-logic/#new-annotations-to-customize-kubernetes-await-logic) would get you unblocked for now.
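In case it helps, a minimal sketch of where that annotation goes. pulumi.com/skipAwait is the annotation key from the blog post linked above; the Metadata type, the withSkipAwait helper, and the team annotation are hypothetical names for illustration, and I haven’t tested this against the failing cluster.

```typescript
// Hypothetical, trimmed-down shape of Kubernetes object metadata.
interface Metadata {
  name?: string;
  namespace?: string;
  annotations?: Record<string, string>;
}

// Returns a copy of the metadata with the skipAwait annotation added,
// preserving any annotations that were already set.
function withSkipAwait(metadata: Metadata): Metadata {
  return {
    ...metadata,
    annotations: { ...metadata.annotations, "pulumi.com/skipAwait": "true" },
  };
}

// Example: namespace taken from the error output; "team" is a made-up annotation.
const metadata = withSkipAwait({
  namespace: "my-application",
  annotations: { team: "web" },
});
console.log(metadata.annotations);
```

The result would be passed as the metadata argument of the Deployment resource from @pulumi/kubernetes. Keep in mind that this turns off the readiness wait for that resource entirely, so a rollout that really fails would no longer fail the update.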
s
Thanks @gorgeous-egg-16927, will report status later today.
f
Were you able to get unblocked w/ the suggestion above @stocky-student-96739?
s
@billowy-army-68599 @gorgeous-egg-16927 So I actually couldn’t find an annotation matching kubectl.kubernetes.io/restartedAt on the Deployment, ReplicaSet, or Pods. Adding the skipAwait annotation got us past the part where we were timing out, but I don’t feel it’s a proper solution, since it’s just firing and forgetting and not concerned with the ultimate state of the deployment.
@faint-table-42725 ^^
a
Hi all, I am seeing the same issue, also with EKS clusters. It is occurring consistently across 4 different clusters (so it is not isolated to a particular cluster). I commented on the GitHub issue before coming here, but was wondering if any progress has been made on this?
s
I saw there was a PR and an issue about fixing this: https://github.com/pulumi/pulumi-kubernetes/pull/1596 https://github.com/pulumi/pulumi-kubernetes/issues/1628 I’ve tried 3.5.0 and 3.4.1 and both exhibit the same behavior as before.
s
@stocky-student-96739 sorry I’m late to this. Could you add any additional specifics to https://github.com/pulumi/pulumi-kubernetes/issues/1628? I have tried reproing this, but after https://github.com/pulumi/pulumi-kubernetes/pull/1596 it’s not so easy for me to do so. If you are able to repro with 3.5.0, a dump of debug logs (e.g. pulumi --logflow --verbose=9 --debug --logtostderr up --yes >& /tmp/logs) would be very useful. Happy to set up time with you if you are concerned about sharing the detailed logs.
s
@sparse-park-68967 Thanks, will do that and post results here.
s
@stocky-student-96739 just checking in to see if you had a chance to repro with additional logs?
s
I haven’t yet, I’ll try to get that done this week.
Thanks for following up
s
sure thing!