How many operations can Pulumi perform in parallel...
# general
m
How many operations can Pulumi perform in parallel, and what system specs are most important, or what settings should I be supplying, to ensure that as many resources as possible are provisioned in parallel? I've got a stack with 3000+ resources, including some AWS resources that are quite expensive in terms of provisioning time, and I'm seeing my stack take 2+ hours to come up. Currently that deploy happens on a GitLab shared runner, but I'm looking at hosting my own runner. I'm happy to provision as much compute and memory as necessary to get the most resources provisioned in parallel, but obviously that's pointless if I've missed a setting I should change to let Pulumi work properly.
e
The default behaviour is to spin up a new goroutine (green thread) for each operation, but most of the bottlenecks are going to be network IO, not compute.
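Roughly, the pattern is something like this (just an illustrative Go sketch of goroutine-per-operation, not the actual engine code; the sleep stands in for a cloud API call):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// One goroutine per resource operation; nothing here caps the count,
	// so effective parallelism is bounded by network I/O, not by a pool size.
	ops := []string{"bucket-a", "bucket-b", "bucket-c"}

	var wg sync.WaitGroup
	for _, op := range ops {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			time.Sleep(200 * time.Millisecond) // stand-in for the cloud API call
			fmt.Println("created", name)
		}(op)
	}
	wg.Wait()
}
```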
m
If we're running on low memory or low CPU, is there any throttling of the number of threads? For network IO we'll be running on an EC2 instance in AWS, hitting AWS, so I'll monitor it, but we could go up to 10G there.
e
is there any throttling on the number of threads?
Not by default, but you can set the number of worker threads with -p:
-p, --parallel int                          Allow P resource operations to run in parallel at once (1 for no parallelism). Defaults to unbounded. (default 2147483647)
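So for example `pulumi up --parallel 50` would cap it at 50 concurrent operations. Conceptually that cap behaves like a counting semaphore around each operation; here's a rough Go sketch of the idea, not the engine's real implementation (the value 4 and the sleeps are made up for illustration):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const parallel = 4                   // what a --parallel 4 style cap would mean
	sem := make(chan struct{}, parallel) // counting semaphore with `parallel` slots

	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			sem <- struct{}{}                  // block until a slot is free
			defer func() { <-sem }()           // give the slot back when done
			time.Sleep(100 * time.Millisecond) // stand-in for a provider API call
			fmt.Printf("operation %d done\n", n)
		}(i)
	}
	wg.Wait()
}
```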
m
I'm just wondering what's happening when I'm running on a small instance (currently a GitLab shared runner), because it's very obvious that not all operations that could be done in parallel are being done in parallel.
e
it's very obvious that not all operations that could be done in parallel are being done in parallel.
So, A) I wouldn't trust the display logic to show this accurately, and B) there is one part of the engine that is single-threaded that all operations have to go through first, so things won't ever be perfectly parallelised. If you send 100 events all at once, all 100 will get some initial processing one by one, and only after that initial processing are they scaled out to worker threads for the actual cloud operations.
That initial processing should be pretty fast, especially relative to the HTTP requests that create resources in a cloud provider.
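If it helps to picture it, the shape is roughly: one loop drains incoming events one at a time (the single-threaded bit), and each slow cloud call is then handed off to its own goroutine. Again, just a sketch of the pattern, not the real engine:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	events := make(chan string)
	done := make(chan struct{})
	var workers sync.WaitGroup

	// Single-threaded stage: events come off the channel one at a time...
	go func() {
		for ev := range events {
			time.Sleep(time.Millisecond) // cheap serial bookkeeping per event
			workers.Add(1)
			// ...then each slow cloud operation is fanned out to its own goroutine.
			go func(name string) {
				defer workers.Done()
				time.Sleep(200 * time.Millisecond) // stand-in for the HTTP call
				fmt.Println("provisioned", name)
			}(ev)
		}
		workers.Wait() // all events dispatched; wait for the workers to finish
		close(done)
	}()

	for i := 0; i < 100; i++ {
		events <- fmt.Sprintf("resource-%d", i)
	}
	close(events)
	<-done
}
```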
m
Ah, I mean watching the AWS console and only seeing 5 S3 buckets provisioned at a time, when we know there will be about 50 by the time we're done. But perhaps that's a limit on the AWS side of things?
e
Interesting, we've recently had another user comment on exactly the same thing with AWS. I don't think it's an AWS issue, as the other user reports that with Terraform they could see more objects being created at a time, so I suspect something in our AWS plugin might be limiting things here. I'll talk to our providers team about it; definitely something for us to investigate.
m
Thanks! I will try with our own runner instead of the GitLab runner and see if I'm still getting the same behaviour
We did some initial experimentation with our CI job and different instance sizes. This quick-and-dirty analysis does seem to indicate that something is influenced by the number of CPU cores, but isn't actually making full use of them.
f
Yesterday we tested a 128-core instance (c6i.32xlarge), deploying the same stack Alisdair reported on above. That run took 12 minutes 22 seconds, so it shaved about a minute off the 16-core machine Alisdair posted above. Findings:
• Pulumi is not utilising the available cores effectively; they all hover around a few percent.
• Increasing the core count does reduce stack-up time, but the gains diminish quickly.