# general
g:
Help tracing a slow `pulumi up`… We’re trying to understand an extremely slow `pulumi up` run. When we run `pulumi refresh`, it takes about a minute to get the data back, and that feels reasonable given that we have a few hundred resources. But when we then run a `pulumi up` where only 1 resource changes, the run takes over 9 minutes. The primary API we are interfacing with is the PagerDuty API, via the Pulumi PagerDuty provider. I’ve run `pulumi up --tracing…` and I have some trace files. Using AppDash is a little confusing though, and I’m not sure how to get much use out of it. The one thing I see is the Profile View at the bottom; sorting it by `Cumulative Time(ms)` gives a MASSIVE output of `RegisterResource` calls, like thousands and thousands of lines. We don’t have thousands of resources. There are over 11,500 lines in this output: 932 `RegisterResourceOutputs` calls and 1864 `RegisterResource` calls.
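For context, the trace was collected and opened roughly like this (the file name is just an example, not our actual path):

```
# Collect a trace while running the update (trace file name is an example)
pulumi up --tracing=file:./up.trace

# Open the trace in the bundled AppDash viewer
# (in some CLI versions this command is hidden behind PULUMI_DEBUG_COMMANDS=true)
pulumi view-trace ./up.trace
```

Counting the Profile View export by span name gives: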
```
%  cat pulumi.cumulative.out | awk '{print $1}' | sort | uniq -c | sort -n 
   1 /pulumirpc.Engine/SetRootResource
   1 execNodejs
   1 locateModule
   1 pf.CheckConfig
   1 pf.Configure
   1 pf.ValidateProviderConfig
   1 pulumi-plan
   1 sdkv2.CheckConfig
   1 sdkv2.Configure
   1 sdkv2.GetPluginInfo
   1 sdkv2.ValidateProviderConfig
   2 /pulumirpc.LanguageRuntime/GetPluginInfo
   2 /pulumirpc.LanguageRuntime/GetRequiredPlugins
   2 /pulumirpc.LanguageRuntime/Run
   2 Cumulative
   2 Name
   3 newPlugin
   4 /pulumirpc.ResourceProvider/CheckConfig
   4 /pulumirpc.ResourceProvider/Configure
   4 /pulumirpc.ResourceProvider/DiffConfig
   4 /pulumirpc.ResourceProvider/GetPluginInfo
   4 Time
  12 /pulumirpc.ResourceMonitor/SupportsFeature
 130 sdkv2.Invoke
 260 /pulumirpc.ResourceMonitor/Invoke
 260 /pulumirpc.ResourceProvider/Invoke
 459 sdkv2.Check
 459 sdkv2.Diff
 928 /pulumirpc.ResourceProvider/Check
 928 /pulumirpc.ResourceProvider/Diff
 932 /pulumirpc.ResourceMonitor/RegisterResourceOutputs
1864 /pulumirpc.ResourceMonitor/RegisterResource
5240 /pulumirpc.Engine/Log
```
I don’t see much detail in the trace view though… is there something I’m missing in order to figure out why we see so many calls? Also, it feels like the calls get cumulatively slower as the run goes on. Looking at the `--debug --verbose=7` logs, I would swear that the `pulumi up` command is making S3 calls after each and every resource update. Is that possible?
Looking through the output, for example, if we grep for `state.go` and the noisy string `error copying`, we get this:
```
I0416 17:55:02.213363   33154 state.go:342] error copying .pulumi/stacks/ourapp/production.json.gz to .pulumi/stacks/ourapp/production.json.gz.bak: blob (key ".pulumi/stacks/ourapp/production.json.gz -> .pulumi/stacks/ourapp/production.json.gz.bak") (code=NotFound): NoSuchKey: The specified key does not exist.
I0416 17:55:03.660768   33154 state.go:342] error copying .pulumi/stacks/ourapp/production.json.gz to .pulumi/stacks/ourapp/production.json.gz.bak: blob (key ".pulumi/stacks/ourapp/production.json.gz -> .pulumi/stacks/ourapp/production.json.gz.bak") (code=NotFound): NoSuchKey: The specified key does not exist.
I0416 17:55:04.526660   33154 state.go:342] error copying .pulumi/stacks/ourapp/production.json.gz to .pulumi/stacks/ourapp/production.json.gz.bak: blob (key ".pulumi/stacks/ourapp/production.json.gz -> .pulumi/stacks/ourapp/production.json.gz.bak") (code=NotFound): NoSuchKey: The specified key does not exist.
I0416 17:55:05.615803   33154 state.go:342] error copying .pulumi/stacks/ourapp/production.json.gz to .pulumi/stacks/ourapp/production.json.gz.bak: blob (key ".pulumi/stacks/ourapp/production.json.gz -> .pulumi/stacks/ourapp/production.json.gz.bak") (code=NotFound): NoSuchKey: The specified key does not exist....
I0416 18:03:56.379596   33154 state.go:342] error copying .pulumi/stacks/ourapp/production.json.gz to .pulumi/stacks/ourapp/production.json.gz.bak: blob (key ".pulumi/stacks/ourapp/production.json.gz -> .pulumi/stacks/ourapp/production.json.gz.bak") (code=NotFound): NoSuchKey: The specified key does not exist.
I0416 18:03:57.366905   33154 state.go:342] error copying .pulumi/stacks/ourapp/production.json.gz to .pulumi/stacks/ourapp/production.json.gz.bak: blob (key ".pulumi/stacks/ourapp/production.json.gz -> .pulumi/stacks/ourapp/production.json.gz.bak") (code=NotFound): NoSuchKey: The specified key does not exist.
I0416 18:03:59.225002   33154 state.go:342] error copying .pulumi/stacks/ourapp/production.json.gz to .pulumi/stacks/ourapp/production.json.gz.bak: blob (key ".pulumi/stacks/ourapp/production.json.gz -> .pulumi/stacks/ourapp/production.json.gz.bak") (code=NotFound): NoSuchKey: The specified key does not exist.
```
You can see how that goes on for 9 minutes; there are over 450 lines like that.
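For reference, the filter was something like this (the log file name is an assumption):

```
# Hypothetical invocation: pull the checkpoint-copy errors out of the verbose log
grep 'state.go' pulumi-up-verbose.log | grep 'error copying'
```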
e:
> I would swear that the `pulumi up` command is making S3 calls after each and every resource update

Yeah, you might want to enable checkpoint skipping: https://www.pulumi.com/blog/pulumi-release-notes-80/#skip-checkpoints-experimental-flag. I don’t think this needs EXPERIMENTAL set anymore; just `PULUMI_SKIP_CHECKPOINTS=true` will skip all but the final checkpoint write, so it should be much faster, at the cost that if something causes your pulumi process to be hard-killed halfway through, you lose all track of what had been done so far.
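A minimal sketch of the invocation:

```
# Skip intermediate checkpoint writes; only the final state snapshot is persisted.
# Trade-off: if pulumi is hard-killed mid-run, in-flight progress is not recorded.
PULUMI_SKIP_CHECKPOINTS=true pulumi up
```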
g:
Ok, thank you. If we do lose track, a `pulumi refresh`/`pulumi up` should usually be fine, right?
```
Resources:
    ~ 1 updated
    933 unchanged

Duration: 9s
```
^^ that made a difference!
e:
Yeah, `refresh` will normally fix up most things, but if things had started to create and then you killed Pulumi, it’ll have totally lost track that it was even trying to create something.
b:
This is how long it takes for me using a GCS bucket. So the only way to speed this up is by setting `PULUMI_SKIP_CHECKPOINTS=true`?
```
Resources:
    ~ 1 to update
    616 unchanged

Do you want to perform this update? yes
Updating (live):
     Type                              Name                                   Status            Info
     pulumi:pulumi:Stack               upcoupons-live                                           
     └─ gcp:utils:UpfeatAlert          live-alert-backend-requests-component                    
 ~      └─ gcp:monitoring:AlertPolicy  live-alert-backend-requests            updated (12s)     [diff: ~conditions]

Resources:
    ~ 1 updated
    616 unchanged

Duration: 6m40s
```