a
For reference, this is the order of magnitude in terms of number of resources / execution time. The number of resources should roughly double within the next year
m
Without knowing more, this is a good place to start. Can you tell us more about your Pulumi program and the resources that are updating/being replaced?
s
It may also be worth evaluating whether you need all those resources in a single Pulumi project. If the resources have different/disparate lifecycles (e.g., resource A might get deleted/destroyed/recreated on a very different schedule than resource B), it might make sense to split them into separate Pulumi projects. We discuss some of the considerations in this blog post: https://www.pulumi.com/blog/iac-recommended-practices-structuring-pulumi-projects/
a
They are all very interconnected unfortunately (also the reason why I went with Pulumi instead of CFN). Basically it's 1 VPC per AWS region for ~15 regions, each with 4 CIDR ranges and 4 subnets, all VPC-peered with each other, with route tables to build one unique global address space. Security groups with inbound rules are created accordingly. Finally, there is one ECS cluster per region running a Traefik service with the appropriate task definition. They all source the image from a single shared ECR repo
On top of this I have an open-ended number of services; each one is deployed independently as a separate Pulumi project
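To give a rough idea of the shape, the global project does something like this (the region list, CIDRs, and resource names below are simplified placeholders, not the real code):

```typescript
import * as aws from "@pulumi/aws";

// Illustrative region list and CIDR layout -- not the real values.
const regions = ["eu-west-1", "us-east-1", "ap-southeast-1"] as const;

const providers = new Map<string, aws.Provider>();
const vpcs = new Map<string, aws.ec2.Vpc>();

// One explicit provider and one VPC per region, with non-overlapping CIDRs.
regions.forEach((region, i) => {
    const provider = new aws.Provider(`aws-${region}`, { region });
    providers.set(region, provider);
    vpcs.set(region, new aws.ec2.Vpc(`vpc-${region}`, {
        cidrBlock: `10.${i}.0.0/16`,
        enableDnsSupport: true,
        enableDnsHostnames: true,
    }, { provider }));
});

// Full-mesh peering: one requester-side connection per unordered pair of regions.
// The accepter side, route table entries, and security group ingress rules follow
// the same per-pair pattern in the real program.
for (let i = 0; i < regions.length; i++) {
    for (let j = i + 1; j < regions.length; j++) {
        const [a, b] = [regions[i], regions[j]];
        new aws.ec2.VpcPeeringConnection(`peer-${a}-${b}`, {
            vpcId: vpcs.get(a)!.id,
            peerVpcId: vpcs.get(b)!.id,
            peerRegion: b,
        }, { provider: providers.get(a) });
    }
}
```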
m
It sounds like you have a monolithic project that declares all your base infrastructure across all regions. Maybe you need one project per region, or more logical stacks: this sounds like it could be 15 stacks for one project.
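As a very rough sketch of the stack-per-region idea, the program would only ever see one region, and each stack's `Pulumi.<stack>.yaml` would carry that region's `aws:region` plus its own settings (the `cidrBlock` config key below is hypothetical, just to illustrate):

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Each stack (e.g. prod-eu-west-1, prod-us-east-1, ...) sets aws:region and its
// own CIDR in its stack config; the program itself is single-region.
const config = new pulumi.Config();
const cidrBlock = config.require("cidrBlock"); // hypothetical config key

const vpc = new aws.ec2.Vpc(`vpc-${pulumi.getStack()}`, {
    cidrBlock,
    enableDnsSupport: true,
    enableDnsHostnames: true,
});

// Exported so a separate peering project could consume it via a StackReference.
export const vpcId = vpc.id;
```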
s
If you restructure it to be 15 stacks for 1 project, then you'll have to factor out the VPC peering into its own project and that can be kind of messy (IME). Using a transit gateway with a hub VPC might simplify the architecture (and the associated Pulumi code), but that's a fairly major change. That might also allow a switch to AWSX VPC objects, which simplifies things quite a bit (no need to separately manage the VPC, subnets, routes, and route tables). All that being said, I can definitely see why the project is structured the way it is. What backend are you using for this project?
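For reference, the awsx VPC component collapses the VPC/subnet/route-table plumbing into something roughly like this (name, CIDR, and AZ count are illustrative):

```typescript
import * as awsx from "@pulumi/awsx";

// awsx creates and wires up the subnets, route tables, and gateways behind one component.
const vpc = new awsx.ec2.Vpc("regional-vpc", {
    cidrBlock: "10.0.0.0/16",
    numberOfAvailabilityZones: 3,
});

// Subnet IDs are exposed directly for downstream resources (ECS, peering routes, etc.).
export const publicSubnetIds = vpc.publicSubnetIds;
export const privateSubnetIds = vpc.privateSubnetIds;
```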
a
Thanks, I'll look at the resources you shared. I did something like the Transit Gateway example you shared for a different project. I don't think I'll go that route with this project since it has different requirements*, and I can honestly afford the trade-off of 2-hour deployments for a change, since changes to this set of global resources are infrequent and planned. I didn't go for different projects because everything is interconnected, from the VPCs down to the single ECS services, with a ton of computed properties like security group ingress rules that depend on things defined in all other regions at a lower level in the stack. This way I can also add or remove regions without touching 20 other Pulumi projects.

*The customer has a global database service: users can come, create a global cluster, and launch nodes in any region (async global replication with Raft). Nodes must be able to peer with each other via private IP for replication. So I have one Pulumi project for the global peered VPCs + the ECS clusters in each region + the Traefik service nodes in each region, which do the ingress for all customer nodes. Then there is a Pulumi project for each EC2 instance (in a 111 ASG), which are the instances where the ECS tasks are started. Then there is a Pulumi project for every ECS service, each service being a single customer node
For the backend we have a mix of the new “native” controller and some of the classic one. Code for the global stack is TypeScript; code for the two projects that create the EC2 instances and the ECS services is written in Go (to integrate with the customers’ codebase)
We went with Pulumi because it couldn’t be done with CloudFormation, Terraform or CDK
s
@adventurous-honey-48664 Thanks for the additional detail. What are you using to store the Pulumi state: Pulumi Cloud, or a DIY backend (S3/Azure Blob/Google Cloud Storage)?
a
S3
We can share a copy if it can help
s
I don’t think it’s necessary to share a copy of the state. Let me make some soft inquiries internally to see if this sort of performance concern warrants raising an issue (if you haven’t already opened one).
a
No we have not. What would be the place? GitHub?
s
Yes: https://github.com/pulumi/pulumi (you can hold off for a bit and let me have some internal conversations first)
How is the speed and latency between where you’re running the update and S3? Are you running the updates on an EC2 instance in the same region as the S3 bucket?
a
Thanks, good to know. A dev also reported some changes not being tracked correctly (e.g. they changed the UserData of an EC2 instance, and some other things in the stack, like security group ingress rules and IAM policies, got replaced and not re-attached to the proper resources), so I'll ask them to do a proper write-up and open an issue for that. It might just be an issue on our side (I assume each resource has some logical name tied to the physical resource via ARN or whatever; we are creating them in nested for loops, and I suspect we may be generating inconsistent resource names in some parts of the code)
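For context, the pattern I'm worried about is roughly this: if the logical names come out of loop indexes rather than the data, inserting or removing a region shifts the names and Pulumi sees deletes/creates instead of updates. A minimal sketch of the stable variant (regions, ports, and names here are illustrative, not our actual code):

```typescript
import * as aws from "@pulumi/aws";

// Illustrative per-region data -- stands in for values computed elsewhere in the program.
const regionCidrs: Record<string, string> = {
    "eu-west-1": "10.0.0.0/16",
    "us-east-1": "10.1.0.0/16",
    "ap-southeast-1": "10.2.0.0/16",
};

const sg = new aws.ec2.SecurityGroup("node-sg", { description: "replication traffic" });

// Derive logical names from the data (the region), not from loop indexes.
// A name like `ingress-${i}` changes whenever a region is added or removed,
// so Pulumi thinks unrelated rules were deleted and recreated; `ingress-${region}`
// keeps each rule's URN stable.
for (const [region, cidr] of Object.entries(regionCidrs)) {
    new aws.ec2.SecurityGroupRule(`ingress-${region}`, {
        type: "ingress",
        securityGroupId: sg.id,
        protocol: "tcp",
        fromPort: 7000,
        toPort: 7000,
        cidrBlocks: [cidr],
    });
}
```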
No, we are running the updates from our on-premises computer in Italy; the state bucket is in Ireland
Latency should be around 50ms
s
Bandwidth and latency between where the update is running and where the state is stored in S3 can have a pretty notable effect on update times. My first recommendation would be to try running the update from an EC2 instance in the same region (from within a VPC with the correct S3 VPC endpoints configured), and see what impact that has on update times. We use some optimized diff algorithms with Pulumi Cloud, but outside the US you’ll take a latency hit so I don’t know if that actually would help or not.
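If you go that route, the S3 piece is just a gateway endpoint on the VPC's route tables so state traffic stays on the AWS network, then `pulumi login s3://<bucket>` from the instance. A minimal sketch (the VPC ID, route table ID, and region below are placeholders):

```typescript
import * as aws from "@pulumi/aws";

// Gateway endpoint so reads/writes to the state bucket don't traverse the public internet.
const s3Endpoint = new aws.ec2.VpcEndpoint("s3-state-endpoint", {
    vpcId: "vpc-0123456789abcdef0",            // placeholder
    serviceName: "com.amazonaws.eu-west-1.s3", // match the bucket's region
    vpcEndpointType: "Gateway",
    routeTableIds: ["rtb-0123456789abcdef0"],  // placeholder
});
```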
a
OK, so if I had the state in a local file / on self-hosted Pulumi Cloud on my computer, it should improve?
I might run a test where I download the state from S3, run the update against a local state file, then re-upload the state to S3
s
I’d think that simply spinning up an EC2 instance and installing Pulumi on it would be safer than making local updates to your state file. And easier, too, especially with Pulumi. 🙂
a
Will try and get back to you, thanks
s
Sounds good!
a
this is from an EC2 in the same region as the state bucket. Huge improvement. Thanks!
s
No problem, happy to help!