# general
j
We're having a weird issue where our Pulumi stack previews and deploys fine on a local machine, but when running in GitLab CI both
pulumi preview
and
pulumi up
hang indefinitely and the Pulumi Service shows that the update failed but with no error message. It seems like some sort of disconnection happens near the end of both preview and up. Any tips or help?
Screenshot 2023-03-01 at 16.08.44.png
This is only affecting one of our stacks (environments) within the same project too which is even stranger!
b
this is often due to a stale AWS session token, how do you auth to AWS?
j
Thanks for the reply @billowy-army-68599! We auth by injecting
AWS_ACCESS_KEY_ID
and
AWS_SECRET_ACCESS_KEY
as CI variables (which are fixed IAM user access credentials). I've auth'd in the same way locally on macOS and can't replicate it.
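For reference, a quick sanity check of those credentials inside the runner (assuming the aws CLI is available in the job image) would be something like:
# confirm the injected keys are valid and show which IAM user they resolve to
# a stale or revoked credential would fail fast here, before pulumi ever runs
aws sts get-caller-identity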
b
where are you executing pulumi?
j
In CI - using a GitLab runner. Locally - using the macOS terminal.
Weirdly in CI it completes all of the updates in AWS successfully but then just seems to hang until the Pulumi service marks it as failed.
b
how many resources in the stack?
j
680 resources
b
j
Struggling to get a performance trace from the CI - I think because the Pulumi command never finishes
I do see this error, which appears to come from the Pulumi Service:
I0301 17:10:00.972234     232 log.go:71] error renewing lease: [403] The provided update token has expired.
That's with debug logging turned on
These are the last 2 lines of the log:
I0301 17:03:21.512658     232 log.go:71] Marshaling property for RPC[ResourceMonitor.RegisterResource(aws:autoscaling/notification:Notification,production-ecs-1-asg-notifications)]: topicArn={arn:aws:sns:eu-west-1:476250223542:production-ecs-terminations-b11fa72}
I0301 17:10:00.972234     232 log.go:71] error renewing lease: [403] The provided update token has expired.
You can see the 7-minute gap since the last AWS action. The command I'm running is:
timeout 1800 pulumi up --skip-preview --tracing=file:./up.trace --logtostderr --logflow -v=9 2> ./out.txt
I added the timeout as otherwise the
pulumi up
command never exits.
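If the command ever does exit cleanly, my understanding is that the resulting trace file can be inspected locally with something along the lines of:
# serve the captured trace in a local web viewer for inspection
pulumi view-trace ./up.trace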
b
could you send a support issue to support@pulumi.com - ignore the automated response
j
Will do now!
Done! #2457
We're at the point where we're considering an Enterprise plan because of the number of resources we have. This is bad timing as we should have done it before this happened - hindsight!
b
I see you’re using an individual account, what stage of the process are you in?
j
We've not kicked it off yet
b
good to know, i’ve taken the ticket and will take a look behind the scenes
j
Thank you - at the end of this we should go through a pricing discussion
b
could you send a separate email to lbriggs[at]pulumi.com for that?
we’ll get this sorted before we have that chat
j
Have done & thanks.
b
@jolly-agent-91665 quick q: is this an ongoing problem or did it start recently?
j
This started after we created a new set of stacks for our environments. It is only affecting our production stack though, and only on GitLab
It runs fine on macOS, so perhaps it's a resource thing? But it seems strange that it only affects this stack when the resources are (almost) in parity between staging, sandbox and production.
b
are the number of resources in staging/production and sandbox the same?
j
30 fewer resources on both of those, as they omit an SSH bastion.
Otherwise the exact same
b
can you do a deployment with an SSH bastion, just to eliminate a theory I have
j
Onto the staging stack?
b
or sandbox, whichever is preferable
j
Will do now
I'm running it now - I don't know if this is helpful, but I have an active preview running against the production stack (https://app.pulumi.com/will/infrastructure/production/previews/c89c78d5-37e2-4d68-a893-5ceaa7537a35) that's been hanging in the same way for >30 minutes, and this time it's hanging on both the Pulumi Service and the
pulumi preview
command.
b
i have an engineer investigating
okay, I think this is likely the same as https://github.com/pulumi/pulumi/issues/7094 - can you try setting these environment variables:
export PULUMI_EXPERIMENTAL=1
export PULUMI_SKIP_CHECKPOINTS=1
export PULUMI_OPTIMIZED_CHECKPOINT_PATCH=1
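in the CI job that would just mean exporting them before the update runs, roughly like this (reusing the command you posted above):
# enable the experimental checkpoint behaviour, then run the same update as before
export PULUMI_EXPERIMENTAL=1
export PULUMI_SKIP_CHECKPOINTS=1
export PULUMI_OPTIMIZED_CHECKPOINT_PATCH=1
timeout 1800 pulumi up --skip-preview --logtostderr --logflow -v=9 2> ./out.txt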
j
Will try that now
That didn't fix it (for the preview at least).
What we have done is double the size of the GitLab instance that runs the CI job, and that fixed it
b
Okay that’s good information. We are having a lovely discussion behind the scenes on this; if we have any concrete fixes we’ll let you know
j
I had a quick scan of the GitHub issue and it does seem like it could be related given that we have a comparable number of resources.
Thank you - I'm intrigued to say the least
b
@jolly-agent-91665 could you run an update locally, and capture:
• a performance trace
• verbose logging
• a profile of cpu/mem, which can be captured using
--profiling
you can use the support ticket to send them, DM me if there’s any issues sending this over securely
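a single local run that captures all three could look something like this (file names are just placeholders):
# capture a performance trace, verbose engine logs, and cpu/mem profiles in one run
pulumi up \
  --tracing=file:./up.trace \
  --profiling=pulumi-prof \
  --logtostderr --logflow -v=9 2> ./up-verbose.log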
j
Yeah I'll grab these tomorrow (it's 8pm here!)
b
appreciate it! thanks!