Pulumi Community #general


bulky-afternoon-9964

07/19/2024, 7:11 PM

Hey y'all, running into a scenario I'd like an opinion on: I have an EC2 instance that's running multiple deployments at a time in containers/VMs. What I've found is that while running `pulumi up`s, I'm also receiving dhcpclient updates that add secondary IP addresses to the instance's network interface. When a new network interface is added, it does a quick restart of the networking service, which updates `eth0` (the primary network interface) on the instance. This momentary hiccup causes all running deployments to drop. I'm guessing this is because of a socket connection used to encrypt/decrypt secrets during the deployment.

post-step event returned an error: failed to save snapshot: serializing deployment: serializing resources: failed to encrypt secret value: performing HTTP request: Post

etc... I followed that error to the problem described here: https://www.pulumi.com/docs/support/troubleshooting/#interrupted-update-recovery

The solution to prevent this seems to me to be either:
• Figure out a way to hot-link my primary interface that's non-disruptive (WIP)
• Try to detect this as an output of the pulumi deployment and try to re-run `pulumi up`

Any thoughts on how to address this problem? I do wonder if the exit code of the pulumi CLI would tell me if this deployment failed and needs to restart...
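For the "detect and re-run" option, a rough sketch of what a wrapper might look like, assuming the CLI exits non-zero when an update fails (worth confirming against your own runs). The function name `run_with_retry`, the attempt count, and the delay are all made up for illustration. A blind retry may also not be enough if the interrupted update left pending operations behind, which is what the troubleshooting link above covers.

```python
import subprocess
import time
from typing import Callable, Sequence


def _default_run(cmd: Sequence[str]) -> int:
    # Run the command and hand back its exit code.
    return subprocess.run(list(cmd)).returncode


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,
    delay_seconds: float = 10.0,
    run: Callable[[Sequence[str]], int] = _default_run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits 0 or attempts run out; return the last exit code."""
    code = 1
    for attempt in range(1, attempts + 1):
        code = run(cmd)
        if code == 0:
            return 0
        if attempt < attempts:
            # Give the network a moment to settle before trying again.
            sleep(delay_seconds)
    return code


# Example call (hypothetical settings); --yes skips the confirmation prompt:
# exit_code = run_with_retry(["pulumi", "up", "--yes"], attempts=3)
```

The `run` and `sleep` parameters are only there so the loop can be exercised without actually invoking the CLI.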

orange-policeman-59119

07/20/2024, 6:56 PM

Hey Graham, Pulumian here. My hunch, having been on the IaC core (CLI) and providers side of Pulumi, is that a very unstable network connection like this is going to cause woes even if we had a workaround for this first-order issue. If the network is this unreliable, the map of places we'd need to add retry behavior is a fractal, and many of those are going to be in upstream libraries or provider code, where it might not be trivial (or safe/idempotent) to retry.

I do think we should open an issue to make sure we retry some steps for serializing state files, like the error you shared, for resiliency and crash recovery. I can open that, or if you want to do so, I can ping the team on Monday.

That said, it seems like whatever process is adding IP addresses is running your interface down/up scripts instead of using the lower-level tools to update an interface. My advice is to solve that: ensure that you can add addresses without breaking your interface. If that poses a challenge or you don't control the code doing that, I recommend setting up a WireGuard tunnel to a reliable exit node in your network. The wg0 interface you create should be reliable even when the underlying network is not, which is a trick I've used in hotels, coffee shops, even on an airplane. That wg0 interface won't drop a socket, even if the transport underneath it restarts.
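As a rough illustration of the "lower-level tools" route: iproute2's `ip addr add` attaches an extra address to an interface that is already up, without cycling it. The sketch below just wraps that command; the helper names and the example address are hypothetical, and whether your DHCP client hooks can be pointed at something like this is an assumption I can't verify from here.

```python
import ipaddress
import subprocess
from typing import List


def add_address_command(cidr: str, interface: str) -> List[str]:
    """Build the iproute2 command that adds one address to a live interface."""
    # Validate input such as "10.0.0.25/24"; raises ValueError if malformed.
    ipaddress.ip_interface(cidr)
    return ["ip", "addr", "add", cidr, "dev", interface]


def add_address(cidr: str, interface: str = "eth0") -> None:
    # Needs root (or CAP_NET_ADMIN). check=True raises if `ip` reports an error,
    # for example when the address is already assigned.
    subprocess.run(add_address_command(cidr, interface), check=True)


# Example call (hypothetical address):
# add_address("10.0.0.25/24", "eth0")
```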
