# kubernetes
m
Hi everyone, I’m working on a client project where Pulumi is used to manage AWS & Kubernetes resources in a single Pulumi stack. We spin up EKS and its dependencies, then install a variety of Kubernetes resources in that cluster. There are ~420 resources in the stack, ~300 of which are Kubernetes resources. We’re experiencing slow Pulumi preview & up commands. Analysing system resources suggests that the network is the bottleneck, and running on a variety of internet connections produces differing performance.

I’ve configured Pulumi to direct traffic to a local Charles proxy while running a preview against a stack that is already deployed, where no changes are anticipated. I can see from this that there is one open connection to the Kubernetes API server; the body of this connection grows continuously for ~8 minutes (typically 8 minutes while I’m at home, but this depends on the connection) to a total body size of ~120MB. Presumably the Kubernetes provider is sending each query over the same connection (using gRPC?), which would mean they run in sequence.

I’ve looked at the events in CloudWatch for EKS while the connection is active, it looks as though Pulumi may be querying all the Kubernetes resources in the state file. If it is doing this, I’m not sure why it would need to do so, as Pulumi has refresh as a separate command. We don’t see similar querying for AWS resources.

Is anyone able to explain to me why Pulumi, or the Kubernetes provider, is making these requests? Happy to provide more details to identify it if necessary. Also, if there is any way to parallelise this, or stop it if it is indeed not necessary, then I’d very much like to know; I can’t see any such configuration parameters.
I’ve now noticed that if I set enableServerSideApply to false then these requests stop and my Pulumi preview takes far less time: 21s vs ~8 minutes on my current network connection. We can’t instantly move away from server-side apply as we’re using Patch resources, nor do we want to. While I’ve identified the traffic as being specific to SSA, my questions still stand: is this traffic there to do something akin to refreshing the state? And might it be possible to parallelise these requests with a code change? I shall delve into the provider code, but any pointers would be welcome.
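For reference, this is roughly how I’ve been toggling it while comparing timings. A minimal sketch using the Go SDK; the provider name and kubeconfig value are placeholders, and I believe the same setting can also come from stack config as `kubernetes:enableServerSideApply`:

```go
package main

import (
	"github.com/pulumi/pulumi-kubernetes/sdk/v3/go/kubernetes"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Explicit provider with server-side apply disabled, used only to
		// compare preview times against the SSA-enabled default.
		_, err := kubernetes.NewProvider(ctx, "k8s-csa", &kubernetes.ProviderArgs{
			Kubeconfig:            pulumi.String("~/.kube/config"), // placeholder
			EnableServerSideApply: pulumi.Bool(false),
		})
		return err
	})
}
```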
g
I’d be very interested to see more details about this. I would have expected the opposite, if anything, so I’m surprised to see this big performance difference between SSA and CSA.
m
I’ve done a bit of digging. I ran previews with `-v=9` and tailed the log file, piping it through `grep executing` to see which functions were executed when we see the traffic, as we have log statements such as this one. I can see we have cycles of Check, Diff, Check, Diff. What I found interesting, looking at the timestamps, is that the Diff function executed once every 1-2 seconds with SSA on, but with SSA off it varies, sometimes 60 executions per second. Delving further I see this, which is what I expect causes SSA to be slower, and that makes sense.

My thinking is to build the Kubernetes provider locally with a change to accept an environment variable that will force a client-side diff, essentially doing the same as this. If this improves the performance considerably and still leaves the benefits of conflict resolution (not sure if it would, as I don’t know the provider code at all well) then that would be good. If that experiment proves successful, perhaps it’s possible to change the logic so that it first attempts a client-side diff, then performs a server-side diff only if changes are detected; a rough sketch of what I mean is below. Do let me know if you think I’m barking up the wrong tree here. I will try to experiment as and when I have the time.
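Purely illustrative, not how the provider is structured today: a hypothetical two-phase diff where a cheap local comparison short-circuits the server-side dry-run. `diffResource` and `serverSideDryRunDiff` are names I’ve made up for the sketch.

```go
package provider

import (
	"context"
	"reflect"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// diffResource sketches a hypothetical two-phase diff: only pay for a round
// trip to the API server when a cheap local comparison already shows a change.
func diffResource(ctx context.Context, oldInputs, newInputs *unstructured.Unstructured) (bool, error) {
	// Phase 1: client-side comparison of the new inputs against the inputs
	// recorded from the previous update.
	if reflect.DeepEqual(oldInputs.Object, newInputs.Object) {
		// Nothing changed locally, so skip the expensive dry-run request.
		return false, nil
	}
	// Phase 2: confirm the change (and surface field-manager conflicts) with a
	// server-side dry-run; this is the slow, per-resource network call.
	return serverSideDryRunDiff(ctx, newInputs)
}

// serverSideDryRunDiff stands in for the dry-run apply the provider performs;
// it is a placeholder, not a real pulumi-kubernetes function.
func serverSideDryRunDiff(ctx context.Context, obj *unstructured.Unstructured) (bool, error) {
	return true, nil
}
```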
Given your response, I assume you haven’t had others complain about SSA being slow? That does make me think there may be something else about our setup causing the slowness. If you’d like to know more about the codebase and environment we’re running Pulumi in, I’m happy to send over whatever detail or logs you think might be pertinent.
g
Thanks for the additional information! https://github.com/pulumi/pulumi-kubernetes/issues/2427 was just opened yesterday, but I hadn’t seen reports of SSA performance issues before that. I’m looking into it.
> I’ve looked at the events in CloudWatch for EKS while the connection is active, it looks as though Pulumi may be querying all the Kubernetes resources in the state file. If it is doing this, I’m not sure why it would need to do so, as Pulumi has refresh as a separate command. We don’t see similar querying for AWS resources.
I bet this is indeed the issue. This behavior differs from our other providers because it’s making a server-side check for every resource instead of diffing against local state. It was implemented like this because it’s pretty common for other controllers to modify resources and cause drift from Pulumi’s state. I’ll do some testing to confirm. 🤔
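To make that concrete, an SSA diff is essentially a dry-run apply against the API server, along the lines of this sketch with client-go’s dynamic client. The GVR, manifest, and field manager here are just examples, not literally what the provider sends, but it shows why every resource costs its own network round trip:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (example only).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}
	manifest := []byte(`{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"name":"example","namespace":"default"}}`)

	// A dry-run server-side apply: the API server computes the merged result
	// and reports conflicts, but persists nothing. One request per resource.
	result, err := client.Resource(gvr).Namespace("default").Patch(
		context.Background(),
		"example",
		types.ApplyPatchType,
		manifest,
		metav1.PatchOptions{
			FieldManager: "pulumi-example",
			DryRun:       []string{metav1.DryRunAll},
		},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(result.GetName())
}
```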
m
It’s promising to see others are noticing this, and I’m glad you’re looking at it 🙂 I’m quite surprised it hadn’t been raised sooner; slowness has been an issue on this project for some time. Initial investigations had concluded that checkpoints and state file transfer (S3 backend) were the cause, but this seems far more of a slowdown.

I did try my idea to modify the provider to toggle server-side diff with an env var. What I wrote does not seem to respect the env var, but I was able to default the value to true. In doing so I was able to run Pulumi with the same increase in speed I’ve mentioned before. I was also able to update Patch resources, indicating we still get much of the benefit of server-side apply, though perhaps not such accurate diffs. All the logic I added was this little excerpt, at the very beginning of `tryServerSidePatch`:
```go
// Skip the server-side dry-run patch entirely when client-side diff mode is
// set, so no per-resource request is made to the API server during diff.
if k.clientSideDiffMode {
	return nil, nil, false, nil
}
```
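For completeness, the env-var wiring I was attempting looks roughly like this. Both the variable name `PULUMI_K8S_FORCE_CLIENT_SIDE_DIFF` and the `clientSideDiffMode` field are my own additions for the experiment, not anything the provider ships with:

```go
package provider

import (
	"os"
	"strconv"
)

// clientSideDiffModeFromEnv reads a hypothetical escape hatch that forces the
// provider to skip server-side dry-run diffs. The variable name is made up
// for this experiment.
func clientSideDiffModeFromEnv() bool {
	v, ok := os.LookupEnv("PULUMI_K8S_FORCE_CLIENT_SIDE_DIFF")
	if !ok {
		return false
	}
	mode, err := strconv.ParseBool(v)
	if err != nil {
		// Ignore unparsable values rather than failing provider configuration.
		return false
	}
	return mode
}
```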