12/05/2022, 8:11 AM
Hello Community, we are running pulumi-kubernetes-operator in an EKS node group. We are noticing that after some time the stacks' state changes to failed because the stack's state file gets locked. While investigating further, we found that the operator is leaving zombie (defunct) processes behind, e.g.:
ec2-user 31942 14939  0 Nov30 ?        00:00:00 [pulumi-resource] <defunct>
ec2-user 31943 14939  0 Nov30 ?        00:00:00 [pulumi-resource] <defunct>
ec2-user 31944 14939  0 Nov30 ?        00:00:00 [pulumi-resource] <defunct>
ec2-user 31945 14939  0 Nov30 ?        00:00:00 [pulumi-language] <defunct>
ec2-user 31946 14939  0 Nov30 ?        00:00:00 [pulumi-resource] <defunct>
ec2-user 31947 14939  0 Nov30 ?        00:00:00 [pulumi-resource] <defunct>
The count of these processes keeps increasing until it reaches the cgroups limit set at
Because of this, the node runs short of resources and the CNI (aws-node) keeps restarting due to liveness probe failures. What should we do to overcome this? Any help will be greatly appreciated; this is blocking our GA.


12/05/2022, 5:23 PM
This will keep happening, because the operator binary runs as PID 1 but doesn't reap zombie processes. The prerelease v1.11.0-rc.0 has a fix; you could try running that to see if it avoids hitting the PID limit.