# general
p
We are facing some performance bottlenecks with Pulumi and need some help identifying what we can do to solve them.
• The main issue is that `preview` and `up` take 40-50 mins for our production environment (stack)
• This causes all sorts of problems for us
  ◦ We can't apply things immediately using Pulumi (it will take 40 mins)
  ◦ If for some reason there is an error in `up` or `preview`, another hour is needed
  ◦ Slower development cycle
  ◦ Merging the Pulumi PR into main is very slow (since we run previews for each environment (stack) before merging)
Some technical information about our production stack
• We have over 21000 resources in the stack, and fewer resources in non-production stacks
• We have a lot of edge devices (say 1000s) and we create a bunch of things for these edge devices (certs, keys, AWS IoT thing, etc). A lot of those 21000 resources come from these edge device AWS resources
• We run previews in CI, but `up` is run from a local laptop
• When deploying the production stack we use `PULUMI_SKIP_CHECKPOINTS=1`
• We are using the S3 backend
Some questions I have
1. Is 40-50 mins an expected time here?
2. What can we do to improve the speed of previews and applies?
w
do you need to refresh each time you are updating the stack?
you could also try increasing the parallelism flag to run more read operations concurrently if you can't turn off refresh
m
To give my 2ct: I am a huge fan of micro stacks. The Pulumi CLI has some good functionality to help split an existing monolithic stack into several stacks. You can then use StackReferences to access values from one stack in another stack. Not only does this give a performance boost and quicker feedback, it also enables a separation of concerns and even different ownership of the stacks (network folks, DBAs, etc.). As described here: https://www.pulumi.com/blog/iac-best-practices-structuring-pulumi-projects/ https://www.pulumi.com/blog/iac-best-practices-applying-stack-references/
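One way the edge-device resources could be spread across several smaller stacks is to assign each device to a stack by a stable hash of its ID. The following is only a minimal sketch, not something from the Pulumi docs: the `edge-devices` stack name prefix, the function names, and the shard count are all hypothetical, and it only covers the device-to-stack assignment, not the Pulumi program itself.

```python
import hashlib


def stack_for_device(device_id: str, shard_count: int, prefix: str = "edge-devices") -> str:
    """Map a device ID to a stack name; the same ID always maps to the same stack.

    The prefix and naming scheme are hypothetical, chosen for illustration.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Use a SHA-256 digest instead of the built-in hash(), which is salted
    # per process for strings and would move devices between runs.
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{prefix}-{shard:02d}"


def group_by_stack(device_ids, shard_count):
    """Group device IDs by the stack they are assigned to."""
    groups = {}
    for device_id in device_ids:
        groups.setdefault(stack_for_device(device_id, shard_count), []).append(device_id)
    return groups
```

Each stack's program would then create resources only for the devices in its own group, so a change to one device only needs a preview of one smaller stack. Note that changing the shard count later reassigns most devices, which would mean moving resources between stacks.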
p
1. Do you mean `pulumi refresh`? No, we don't need a refresh every time we update the stack. We just run `pulumi up` and hit yes
2. I will try with increased parallelism. Also, I will check what the current limit is
w
1. no, I mean when you are running preview. Have you tried running `pulumi preview --refresh=false`?
2. the default is 16
p
1. yes, we run with `refresh = false` in CI using the action
- name: Preview ${{ inputs.stack-name }}
  uses: pulumi/actions@v6
  with:
    pulumi-version: 3.193.0
    command: preview
    refresh: false
    stack-name: ${{ inputs.stack-name }}
    work-dir: ${{ inputs.work-dir }}
    comment-on-pr: true
    diff: true
w
so even with refresh set to false it is taking 40-50 mins to preview?
p
2. How high can I set the parallelism, practically? I am running with 64 right now
w
you can increase it until you start getting rate limited by aws
p
Yes, 40-50 mins with refresh = false
w
If neither of these help, then I think the answer would be to isolate into smaller stacks
@echoing-dinner-19531 is it expected for preview to take 50 mins even if it's not doing reads? seems pretty hefty even for 21k resources
e
Some parts of the system are currently sequential (it's an area of investigation to change that). So all 21000 resources have to go through one by one.
That could probably explain the slowness for so many resources, I don't think we've really got perf testing at that level of resources.
p
What is the typical number of resources you do your perf testing with, and what are the expected times for previews (and applies)? That will be helpful if we decide to break the stack down into smaller ones
e
The ones I know of are a few hundred resources and vary by runtime from a few seconds to maybe a minute (a lot of that can be due to package manager overhead). From talking to users I think a few thousand generally runs well, but 21000 is high.
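As a rough sizing aid, the numbers in this thread can be turned into a stack count. This is only back-of-the-envelope arithmetic: the per-stack targets of 2000 and 3000 resources are assumptions standing in for "a few thousand", not figures from Pulumi.

```python
import math


def stacks_needed(total_resources: int, target_per_stack: int) -> int:
    """Return the smallest number of stacks that keeps each at or below the target."""
    if total_resources < 0:
        raise ValueError("total_resources must not be negative")
    if target_per_stack < 1:
        raise ValueError("target_per_stack must be at least 1")
    return math.ceil(total_resources / target_per_stack)


# 21000 resources is the production stack size mentioned in the thread;
# the per-stack targets are assumed values for illustration.
print(stacks_needed(21000, 2000))  # 11
print(stacks_needed(21000, 3000))  # 7
```

This assumes resources can be divided evenly, which will not hold exactly if some shared resources have to live in a common stack.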