# general
p
So I think this is more of a generic IaC issue than specifically a Pulumi issue, but hopefully that's still on topic enough for this channel. I've just run into an issue where I'm updating (replacing) the security group (sg) for a service, and also updating the service to use the new sg. The order of operations needs to be:
1. Create the new sg and the rule(s) in the other sg(s) it needs to talk to.
2. Update the service to use the new sg.
3. Delete the old sg.
Any other order will end in tears: trying to delete or replace the sg before the service has stopped using it will fail, because AWS sensibly won't let you delete an sg which is in use. If they weren't that sensible, deleting the in-use sg would cause an outage until the new sg was created and the service updated to use it, which, while it might only be a short time, is longer than doing it in the correct order, and thus undesirable. As far as I can tell the DependsOn option is no use here, because the service already has to depend on the sg (we pass sg.id into the service function call), and that means Pulumi will replace the sg first and then update the service, which fails as above. The workaround I've come up with is:
1. Remove the old sg from the Pulumi state so it is now unmanaged.
2. Run pulumi up, which will now create the new sg and then update the service.
3. Delete the old sg by hand.
But this means we're not doing things using IaC. One of the virtues of IaC is that you should be able to look at the git history of your code and see what was done, and this breaks that. What would be nice is if Pulumi better understood the relationship between parts of the infra and knew to operate in the order I specified at the top. Am I missing something? Does it do that? Or am I doomed to a series of manual out-of-band adjustments to my infra, removing one of the main reasons I think IaC is good?
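For concreteness, here is a minimal sketch of the kind of setup being described, in Pulumi TypeScript. The resource names, rules, and surrounding values (VPC, subnets, cluster, task definition) are placeholders and assumptions, not the actual code from the thread:

```typescript
import * as aws from "@pulumi/aws";

// Placeholder values standing in for the real VPC, subnets, cluster and task definition.
const vpcId = "vpc-0123456789abcdef0";
const subnetIds = ["subnet-0aaa", "subnet-0bbb"];
const clusterArn = "arn:aws:ecs:eu-west-1:111111111111:cluster/example";
const taskDefinitionArn = "arn:aws:ecs:eu-west-1:111111111111:task-definition/example:1";

// The security group the service uses.
const sg = new aws.ec2.SecurityGroup("service-sg", {
    vpcId: vpcId,
    ingress: [{ protocol: "tcp", fromPort: 443, toPort: 443, cidrBlocks: ["10.0.0.0/8"] }],
    egress: [{ protocol: "-1", fromPort: 0, toPort: 0, cidrBlocks: ["0.0.0.0/0"] }],
});

// Passing sg.id into the service is what makes the service depend on the SG, so
// DependsOn can't invert the order: per the thread, a change that forces the SG to be
// replaced gets applied before the service update, and deleting the old,
// still-attached SG fails.
const service = new aws.ecs.Service("svc", {
    cluster: clusterArn,
    taskDefinition: taskDefinitionArn,
    desiredCount: 1,
    launchType: "FARGATE",
    networkConfiguration: {
        subnets: subnetIds,
        securityGroups: [sg.id],
    },
});
```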
s
One workaround might be to update your code to create the new SG and update the service to use the new SG (but leave the code for the old SG in place). Run a `pulumi up`, and it will create the SG and update the service (and should do so in the correct order). Then, in a follow-up change, remove the code for the old SG and run `pulumi up` again. This will result in just deleting the old SG, but leaving the new SG and service unchanged (because they already match the desired state). It's not ideal, but I think it would work.
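As an illustration, a sketch of that two-step workaround against the hypothetical setup above (names and rules are still placeholders):

```typescript
// Step 1: keep the old SG in code, add the new SG, and point the service at the new one.
const oldSg = new aws.ec2.SecurityGroup("service-sg", {
    vpcId: vpcId,
    ingress: [{ protocol: "tcp", fromPort: 443, toPort: 443, cidrBlocks: ["10.0.0.0/8"] }],
});

const newSg = new aws.ec2.SecurityGroup("service-sg-v2", {
    vpcId: vpcId,
    ingress: [{ protocol: "tcp", fromPort: 443, toPort: 443, cidrBlocks: ["10.1.0.0/16"] }],
});

const service = new aws.ecs.Service("svc", {
    cluster: clusterArn,
    taskDefinition: taskDefinitionArn,
    desiredCount: 1,
    launchType: "FARGATE",
    networkConfiguration: {
        subnets: subnetIds,
        securityGroups: [newSg.id], // switched from oldSg.id
    },
});

// `pulumi up` now creates newSg and then updates the service, in that order.
// Step 2, in a follow-up commit: delete the oldSg block and run `pulumi up` again;
// the only change is destroying the old SG, since everything else already matches.
```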
p
Yeah, that does work. I'm not doing it here because the old sg wasn't created by code; I imported it and am now trying to modify it. But your suggestion works if the old sg was made by code, and it leaves a nice trail in the git history.
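(For context, bringing an existing SG under management like this is the standard `pulumi import` flow; the resource name and SG id below are placeholders:)

```
pulumi import aws:ec2/securityGroup:SecurityGroup old-sg sg-0123456789abcdef0
```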
d
Wouldn't it be a bug for the updates to apply after removal?
s
It could be considered a bug for Pulumi not to recognize the correct order of changes (create new SG, update service, remove old SG). Issues welcome! https://github.com/pulumi/pulumi
p
I'll consider opening an issue, but I think this is part of a bigger issue which I haven't quite framed in my head yet. There's something analogous with updating services that use a StatefulSet in Kubernetes, like Elasticsearch, where you need to roll nodes in sequence and wait for quorum to be re-established before moving on to the next one (or two, or ... depending on the quorum rules). Then there's something like a database move (say from one version or engine to another), where you'd like to snapshot the db, stand up the new db from that snapshot, test it, then shut down the old db and snapshot it again, apply the diff of the snapshots to the new db, and switch the services over to it. I don't think it ends there, either; there are certainly other scenarios I'm not thinking of now where the order and timing of updates and replaces is critical for a smooth transition.
I've been pondering this a while, and inshallah my subconscious will provide me with an elegant solution or at least better understanding soon.
s
A lot of what you’re alluding to here was, to my knowledge, intended to be addressed (in Kubernetes land) by _operators_: software that had domain-specific knowledge baked into it so that it (for example) knew to wait for quorum when rolling clustered nodes. In my mind, the industry recognizes the need for this sort of “higher-level” operational knowledge, but hasn’t found quite the ideal place for it to live/operate.
p
Yes, that is true, and some of the operators are quite good at it. But I'm not using k8s, as for small operations like ours ECS Fargate is significantly more cost- and effort-effective. Also, k8s operators only solve the issue within k8s; I'm sure there must be cases with k8s clusters where this kind of process would be useful but it's for the node pools or similar ancillary infrastructure.