Is there any correlation between the memory taken up by `pulumi up` and the number of objects in the K8s cluster?
# kubernetes
h
Is there any correlation between the memory taken up by the `pulumi up` process and the number of objects in the K8s cluster? I have a situation where `pulumi up` was crawling to a halt on an 8GB worker in CircleCI, and when I tried it locally I saw it taking up 20+GB of memory. Any tips on troubleshooting this further?
c
how many resources are in your stack? and which version of Pulumi are you using? one of the more recent minor versions fixed a performance issue that made preview/up extra slow, in case that's one of your symptoms, but I'm not sure whether that problem/fix was related to memory usage.
h
Pulumi CLI version 2.22.0, Pulumi-Kubernetes plugin version 2.8.2. We have a few thousand K8s objects.
The net result was that the `pulumi up` process timed out in CI and we had to manually repair the state file.
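(As an aside, a rough way to check how many resources Pulumi itself is tracking in a stack, assuming `jq` is available, is to count the entries in the exported state:)
```
# Counts the resource entries Pulumi tracks in this stack's state
pulumi stack export | jq '.deployment.resources | length'
```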
c
ah that's quite a lot of resources for a single stack. not sure if there's a good solution for this to be honest. I think it's recommended not to have that many resources in a stack. I try to limit mine to the low hundreds at most. It's unfortunate because the development experience becomes much more cumbersome when resources are split into smaller stacks, but the limits of computational efficiency sort of require it. Also there isn't an easy way to split a large stack into smaller stacks. You'd need to handle the state files manually for that ☹️
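(For reference, the manual state handling mentioned above generally goes through exporting and re-importing the checkpoint; a rough sketch with hypothetical file and stack names:)
```
# Export the current stack's state, split/edit it by hand, then import the pieces.
# URNs and parent references must stay consistent, so edit with care.
pulumi stack export --file big-stack.json
# ...manually move the relevant resource entries into new-stack.json...
pulumi stack init new-stack
pulumi stack import --stack new-stack --file new-stack.json
```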
h
Correction to my previous response: the total number of K8s objects is a few thousand, but only a small subset (<100) is managed via Pulumi.
> The net result was that the `pulumi up` process timed out in CI and we had to manually repair the state file.
This is still the symptom that I’m trying to understand better. Previously I thought it was the high memory usage that would grind the process to a halt, especially in a CI environment. But when I mimicked the CI job locally, I noticed that the memory usage is reasonably small (~2GB) and fairly steady. I still see that the `pulumi up` process gets stuck a few minutes after launching. By “stuck” I mean it does not progress in the terminal logs, and I do not see any further updates in the cluster state via `kubectl`.
After I killed the process almost an hour in, it showed the following:
```
error: 2 errors occurred:
	* the Kubernetes API server reported that "<redacted replicaset name>" failed to fully initialize or become live: Resource operation was cancelled for "<redacted replicaset name>"
	* Attempted to roll forward to new ReplicaSet, but minimum number of Pods did not become live
```
…whereas via `kubectl` I could see the ReplicaSet had achieved minimum availability, i.e. all `Pod`s were ready.
c
sounds to me like it's getting "stuck" waiting for a ReplicaSet to be ready. Are you absolutely positive it is Ready while Pulumi thinks it's not? You're not accidentally checking the wrong namespace or the wrong deployment/statefulset/etc, and the pods aren't flipping between Ready and NotReady? I haven't run into an issue with Pulumi incorrectly determining the status of a pod before
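(A few `kubectl` checks that can help rule those possibilities out; the deployment name, namespace, and label below are placeholders:)
```
# Confirm which cluster/context kubectl is actually pointed at
kubectl config current-context

# Compare the Deployment's selector with the pods it actually matches
kubectl get deploy <name> -n <namespace> -o jsonpath='{.spec.selector.matchLabels}'

# Watch the matching pods for any flapping between Ready and NotReady
kubectl get pods -n <namespace> -l <key>=<value> --watch
```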
h
I am fairly certain that I was looking at the right artifacts. I am re-running `pulumi up` after I ctrl-C’d it earlier, and it might be stuck again (I’ll know in a few more minutes). This time I’m running it with the `--profiling` flag to see if it sheds any light. I could also DM you the outputs if you’d like.
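(For what it's worth, a sketch of how the `--profiling` output can be inspected; the `perf` prefix is arbitrary, and the generated file names reflect my understanding of the CLI's behavior:)
```
# Writes Go-style CPU, memory, and trace profiles alongside the update
pulumi up --profiling perf

# Inspect the profiles with the Go toolchain
go tool pprof -top perf.cpu
go tool pprof -top perf.mem
go tool trace perf.trace
```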
c
sure, I'm open to a DM if you are still seeing the same problem! (also a disclaimer just in case: I'm not officially affiliated with the Pulumi company, I'm just a random participant in the forum)
h
Haha thanks for the disclaimer 🙂 And I really appreciate you spending time responding to my comments!
I’m still seeing the same problem btw, this time for a different resource than in the previous iteration. I’ll just paste the relevant snippets from my terminal here in case others find it useful. From the `pulumi up` output:
```
...
[1/2] Waiting for app ReplicaSet be marked available
[1/2] Waiting for app ReplicaSet be marked available (0/5 Pods available)
warning: [Pod proxy-0nrzpfgd-7d5d8c469d-drfd7]: containers with unready status: [proxy]
✨ updating...⠐
```
In a different terminal:
```
$ kubectl get po proxy-0nrzpfgd-7d5d8c469d-drfd7
NAME                              READY   STATUS    RESTARTS   AGE
proxy-0nrzpfgd-7d5d8c469d-drfd7   1/1     Running   0          12m
```
I’m running `pulumi up` with `-p 1`, so I presume it’s processing one resource at a time.
When I killed the command, it logged the following:
```
[1/2] Waiting for app ReplicaSet be marked available
[1/2] Waiting for app ReplicaSet be marked available (0/5 Pods available)
warning: [Pod proxy-0nrzpfgd-7d5d8c469d-drfd7]: containers with unready status: [proxy]
error: 2 errors occurred:
	* the Kubernetes API server reported that "default/proxy-0nrzpfgd" failed to fully initialize or become live: Resource operation was cancelled for "proxy-0nrzpfgd"
	* Minimum number of Pods to consider the application live was not attained
```
…whereas `kubectl` shows the following:
```
$ kubectl get deploy proxy-0nrzpfgd -n default
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
proxy-0nrzpfgd   10/10   10           10          249d
```
This is why I am inclined to believe `pulumi` gets stuck checking the status, but I can’t tell why or where just yet.
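(One thing that can help pinpoint where it hangs is Pulumi's verbose logging; a sketch, assuming a noisy one-off run is acceptable and that the provider's await/readiness checks show up at this verbosity:)
```
# Send detailed engine and provider logs to stderr and capture them in a file
pulumi up -p 1 --logtostderr -v=9 2> pulumi-debug.log
```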
c
hmm, this is very weird. Your kubectl is also definitely pointing to the same cluster as the CI?
actually, dumb question; it must be, or the matching pod name would be a ridiculous coincidence
is there any weird diff when you do `pulumi refresh`?
h
Yeah it’s pointing to the same cluster. I have 70 resources managed via Pulumi, and roughly a dozen of them are K8s `Deployment`s. I noticed today that in every `pulumi up` invocation, it successfully updates one `Deployment` but gets stuck on the next `Deployment`. Other resource types (`Secret`s, `ConfigMap`s, etc.) do not exhibit this behavior.
c
are there possibly other pods in your cluster that:
• aren't part of that deployment
• have the same labels
• are not Ready
sort of just grasping at straws now...
if the env you're working with is purely experimental I'd ask to see what happens when you destroy the resources and recreate them with Pulumi
h
This is not an experimental stack, unfortunately. One thing I’d add is that this setup worked fine for many months. I’m pretty sure there is no mismatch in the pods, labels, readiness probes, etc. Another observation from recent tests: although the Pulumi CLI gets stuck waiting for a Deployment (or a ReplicaSet, hard to tell from the logs) to progress and eventually has to be killed, the JSON state file shows the correct, updated state of the resource when I subsequently re-run the Pulumi CLI (i.e. the resource is NOT stuck in `pending_operations`).
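(For anyone checking the same thing, the in-flight operations can be inspected directly in the exported state, assuming `jq` is available:)
```
# Lists any operations the engine still considers in flight after an interrupted update
pulumi stack export | jq '.deployment.pending_operations'
```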
c
interesting, I ran into the same issue as you today when I used a transformation on a Kubernetes YAML file to set a deployment replica count. I canceled the operation, refreshed my state, and the next operation succeeded.
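(Roughly, the recovery sequence described there as CLI steps; `pulumi cancel` is only needed if the backend still thinks an update is in progress:)
```
pulumi cancel          # clear a stuck "update in progress" marker, if any
pulumi refresh --yes   # reconcile the state file against the live cluster
pulumi up              # retry; in the case above the second attempt succeeded
```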