Is there any correlation between the memory taken up by `pulumi up` and the number of objects in the K8s cluster?
# kubernetes
h
Is there any correlation between the memory taken up by the `pulumi up` process and the number of objects in the K8s cluster? I have a situation where `pulumi up` was crawling to a halt on an 8GB worker in CircleCI, and when I tried it locally I saw it taking up 20+GB of memory. Any tips on troubleshooting this further?
c
how many resources are in your stack? and which version of Pulumi are you using? one of the more recent minor versions fixed a performance issue that made preview/up extra slow, in case that's one of your symptoms, but I'm not sure whether that problem/fix was related to memory usage.
h
Pulumi CLI version 2.22.0, Pulumi-Kubernetes plugin version 2.8.2. We have a few thousand K8s objects.
The net result was that the `pulumi up` process timed out in CI and we had to manually repair the state file.
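(As an aside, a rough way to check how many resources Pulumi itself is tracking in a stack, assuming `jq` is available, is to count the entries in the exported state:)
```
# Counts the resource entries Pulumi tracks in this stack's state
pulumi stack export | jq '.deployment.resources | length'
```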
c
ah that's quite a lot of resources for a single stack. not sure if there's a good solution for this to be honest. I think it's recommended not to have that many resources in a stack. I try to limit mine to the low hundreds at most. It's unfortunate because the development experience becomes much more cumbersome when resources are split into smaller stacks, but the limits of computational efficiency sort of require it. Also there isn't an easy way to split a large stack into smaller stacks. You'd need to handle the state files manually for that ☹️
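(For reference, the manual state handling mentioned above generally goes through exporting and re-importing the checkpoint; a rough sketch with hypothetical file and stack names:)
```
# Export the current stack's state, split/edit it by hand, then import the pieces.
# URNs and parent references must stay consistent, so edit with care.
pulumi stack export --file big-stack.json
# ...manually move the relevant resource entries into new-stack.json...
pulumi stack init new-stack
pulumi stack import --stack new-stack --file new-stack.json
```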
h
Correction to my previous response: the total number of K8s objects is a few thousand, but only a small subset (<100) is managed via Pulumi.
> The net result was that the `pulumi up` process timed out in CI and we had to manually repair the state file.
This is still the symptom that I’m trying to understand better. Previously I thought it was the high memory usage that would grind the process to a halt, especially in a CI environment. But when I mimicked the CI job locally, I noticed that the memory usage is reasonably small (~2GB) and fairly steady. I still see that the `pulumi up` process gets stuck a few minutes after launching. By “stuck” I mean it does not progress in the terminal logs, and I do not see any further updates in the cluster state via `kubectl`.
After I killed the process almost an hour in, it showed the following:
```
error: 2 errors occurred:
	* the Kubernetes API server reported that "<redacted replicaset name>" failed to fully initialize or become live: Resource operation was cancelled for "<redacted replicaset name>"
	* Attempted to roll forward to new ReplicaSet, but minimum number of Pods did not become live
```
…whereas via `kubectl` I could see the ReplicaSet had achieved minimum availability, i.e. all `Pod`s were ready.
c
sounds to me like it's getting "stuck" waiting for a ReplicaSet to be ready. Are you absolutely positive it is Ready while Pulumi thinks it's not? You're not accidentally checking the wrong namespace or the wrong deployment/statefulset/etc, and the pods aren't flipping between Ready and NotReady? I haven't run into an issue with Pulumi incorrectly determining the status of a pod before
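(A few `kubectl` checks that can help rule those possibilities out; the deployment name, namespace, and label below are placeholders:)
```
# Confirm which cluster/context kubectl is actually pointed at
kubectl config current-context

# Compare the Deployment's selector with the pods it actually matches
kubectl get deploy <name> -n <namespace> -o jsonpath='{.spec.selector.matchLabels}'

# Watch the matching pods for any flapping between Ready and NotReady
kubectl get pods -n <namespace> -l <key>=<value> --watch
```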
h
I am fairly certain that I was looking at the right artifacts. I am re-running `pulumi up` after I ctrl-C’d it earlier, and it might be stuck again (I’ll know in a few more minutes). This time I’m running it with the `--profiling` flag to see if it sheds any light. I could also DM you the outputs if you’d like.
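(For what it's worth, a sketch of how the `--profiling` output can be inspected; the `perf` prefix is arbitrary, and the generated file names reflect my understanding of the CLI's behavior:)
```
# Writes Go-style CPU, memory, and trace profiles alongside the update
pulumi up --profiling perf

# Inspect the profiles with the Go toolchain
go tool pprof -top perf.cpu
go tool pprof -top perf.mem
go tool trace perf.trace
```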
c
sure, I'm open to a DM if you are still seeing the same problem! (also a disclaimer just in case: I'm not officially affiliated with the Pulumi company, I'm just a random participant in the forum)
h
Haha thanks for the disclaimer 🙂 And I really appreciate you spending time responding to my comments!
I’m still seeing the same problem btw, this time for a different resource than in the previous iteration. I’ll just paste the relevant snippets from my terminal here in case others find it useful. From the `pulumi up` output:
```
...
[1/2] Waiting for app ReplicaSet be marked available
[1/2] Waiting for app ReplicaSet be marked available (0/5 Pods available)
warning: [Pod proxy-0nrzpfgd-7d5d8c469d-drfd7]: containers with unready status: [proxy]
✨ updating...⠐
```
In a different terminal:
```
$ kubectl get po proxy-0nrzpfgd-7d5d8c469d-drfd7
NAME                              READY   STATUS    RESTARTS   AGE
proxy-0nrzpfgd-7d5d8c469d-drfd7   1/1     Running   0          12m
```
I’m running `pulumi up` with `-p 1`, so I presume it’s processing one resource at a time.
When I killed the command, it logged the following:
```
[1/2] Waiting for app ReplicaSet be marked available
[1/2] Waiting for app ReplicaSet be marked available (0/5 Pods available)
warning: [Pod proxy-0nrzpfgd-7d5d8c469d-drfd7]: containers with unready status: [proxy]
error: 2 errors occurred:
	* the Kubernetes API server reported that "default/proxy-0nrzpfgd" failed to fully initialize or become live: Resource operation was cancelled for "proxy-0nrzpfgd"
	* Minimum number of Pods to consider the application live was not attained
```
…whereas `kubectl` shows the following:
```
$ kubectl get deploy proxy-0nrzpfgd -n default
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
proxy-0nrzpfgd   10/10   10           10          249d
```
This is why I am inclined to believe `pulumi` gets stuck checking the status, but I can’t tell why or where just yet.
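(One thing that can help pinpoint where it hangs is Pulumi's verbose logging; a sketch, assuming a noisy one-off run is acceptable and that the provider's await/readiness checks show up at this verbosity:)
```
# Send detailed engine and provider logs to stderr and capture them in a file
pulumi up -p 1 --logtostderr -v=9 2> pulumi-debug.log
```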
c
hmm, this is very weird. Your kubectl is also definitely pointing to the same cluster as the CI?
actually, dumb question; it must be, or the matching pod name would be a ridiculous coincidence
is there any weird diff when you do `pulumi refresh`?
h
Yeah it’s pointing to the same cluster. I have 70 resources managed via Pulumi, and roughly a dozen of them are K8s `Deployment`s. I noticed today that in every `pulumi up` invocation, it successfully updates one `Deployment` but gets stuck on the next `Deployment`. Other resource types (`Secret`s, `ConfigMap`s, etc.) do not exhibit this behavior.
c
are there possibly other pods in your cluster that:
• aren't part of that deployment
• have the same labels
• are not Ready
sort of just grasping at straws now...
if the env you're working with is purely experimental I'd ask to see what happens when you destroy the resources and recreate them with Pulumi
h
This is not an experimental stack, unfortunately. One thing I’d add is that this setup worked fine for many months. I’m pretty sure there is no mismatch in the pods, labels, readiness probes, etc. Another observation from recent tests: although the Pulumi CLI gets stuck waiting for a Deployment (or a ReplicaSet, hard to tell from the logs) to progress and eventually has to be killed, the JSON state file shows the correct, updated state of the resource when I subsequently re-run the Pulumi CLI (i.e. the resource is NOT stuck in `pending_operations`).
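(For anyone checking the same thing, the in-flight operations can be inspected directly in the exported state, assuming `jq` is available:)
```
# Lists any operations the engine still considers in flight after an interrupted update
pulumi stack export | jq '.deployment.pending_operations'
```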
c
interesting, I ran into the same issue as you today when I used a transformation on a Kubernetes YAML file to set a deployment replica count. I canceled the operation, refreshed my state, and the next operation succeeded.
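(Roughly, the recovery sequence described there as CLI steps; `pulumi cancel` is only needed if the backend still thinks an update is in progress:)
```
pulumi cancel          # clear a stuck "update in progress" marker, if any
pulumi refresh --yes   # reconcile the state file against the live cluster
pulumi up              # retry; in the case above the second attempt succeeded
```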