# getting-started
m
Also, 2nd question - we're using a TEP architecture (tenant environment pair). Example:
• ourCompany-dev
• ourCompany-test
• ourCompany-stag
• ourCompany-prod
• customer1-prod
• customer2-prod
• customer3-prod
....etc
Ideally, when we kick off a prod deployment, it goes to all [tenant]-prod TEPs. This is because each customer is isolated in their own AWS account. There are 5/25/150 concurrent deployment limits on the Pulumi plans. We are using GHA (GitHub Actions). There are concurrency limiters in GHA based on workflow groups, so we could try to use that as a rate limiter for Pulumi (e.g. put all of our deployments into a single concurrency group that is limited at 20, for a 25 Pulumi limit), but we are trying to avoid this approach at all costs to prevent staggered deployments to prod. Are the 5/25/150 deployment limits account-wide?
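For readers following along, here is a minimal sketch of the "cap concurrency below the plan limit" idea described above. It is not the poster's pipeline and it does not call Pulumi or GitHub Actions: `TEPS`, `select_env`, `deploy_all`, and `fake_deploy` are all hypothetical names, and the deploy step is a placeholder sleep. It only shows how a worker cap bounds how many [tenant]-prod deployments are in flight at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical TEP names, mirroring the examples in the message above.
TEPS = [
    "ourCompany-dev", "ourCompany-test", "ourCompany-stag", "ourCompany-prod",
    "customer1-prod", "customer2-prod", "customer3-prod",
]


def select_env(teps, env):
    """Return every TEP whose environment suffix (after the last '-') is env."""
    return [t for t in teps if t.rsplit("-", 1)[-1] == env]


class PeakCounter:
    """Context manager that records the highest number of concurrent entries."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


def deploy_all(targets, deploy, limit):
    """Run deploy(target) for every target, at most `limit` at a time."""
    counter = PeakCounter()

    def run(target):
        with counter:
            return target, deploy(target)

    # max_workers is the cap: no more than `limit` deployments run at once.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        results = dict(pool.map(run, targets))
    return results, counter.peak


def fake_deploy(target):
    # Placeholder for the real deployment step; sleeps instead of deploying.
    time.sleep(0.01)
    return "ok"


prod_targets = select_env(TEPS, "prod")
results, peak = deploy_all(prod_targets, fake_deploy, limit=2)
print(sorted(results), "peak concurrency:", peak)
```

With a cap lower than the number of targets, some deployments necessarily wait for a free slot, which is the staggering the poster wants to avoid.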
g
not an answer to your Q, but we have the same architecture and are using ESC for it if you ever want to swap notes
m
@gifted-balloon-26385 that would be a huge yes. I read up a bit on ESC, seems like it can have "environments extend other environments", which basically gives us the compositional layers of abstraction we want to inherit from:
• Global application settings
• Per-tenant application settings
as horizontals, with the vertical being "per-service" or "per-module" settings
Do y'all programmatically provision Pulumi API tokens for each tenant, or how is that handled in your system? And what mechanism do you use for HMR - or does Pulumi auto-redeploy everything if its configuration changes? Also, is Pulumi stack creation programmatic for y'all too? I.e., if the stack ref for a new tenant does not exist, create it and set defaults. The biggest thing I'm looking forward to is being able to set AWS resource capacities on a per-tenant basis (i.e., EC2 sizes, fleet numbers, etc.)
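A conceptual sketch of the layering described above, where later layers override earlier ones. This is plain Python and is not the ESC API or its exact merge semantics; the layer names (`GLOBAL`, `TENANT_CUSTOMER1`, `SERVICE_API`) and every key in them are made up for illustration.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict where values in override win, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve(*layers):
    """Fold layers left to right, so the last layer has the highest priority."""
    result = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical layers: global defaults, one tenant's overrides, one service's settings.
GLOBAL = {
    "app": {"log_level": "info"},
    "ec2": {"instance_type": "t3.medium", "fleet_size": 2},
}
TENANT_CUSTOMER1 = {"ec2": {"fleet_size": 6}}
SERVICE_API = {"app": {"port": 8080}}

config = resolve(GLOBAL, TENANT_CUSTOMER1, SERVICE_API)
print(config)
```

Here the tenant layer only states what differs (a larger fleet), and everything else falls through from the global layer.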
Last question - how is ESC billed? is every variable in it considered a resource??? or is it per-env?
g
We’re currently using the enterprise team ACLs to manage tokens, but I have a feature request open to get OIDC support so we can better scope access during CI: https://github.com/pulumi/esc/issues/198
m
Yeah, I share an INFRA/CICD architectural template across a number of small to mid-sized companies where I'm fractional CTO, and my recommendation for most is KISS - keep it simple. As a result, the CICD system of choice is just GitHub Actions. Using ESC in combination with "echo >> ENV" via a private NPX package for CICD ENV setup seems attractive, but I've yet to consider the security implications of it.
All I should have to set on a per-repo basis is that repository's specific security access token
We make all of our customers AWS-account isolated for security, so sometimes a client's "prod" push will deploy to 3, or 5, or 8, etc. AWS accounts. Using GHA matrices with config injected from ESC would make life easy.
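A rough sketch of what "matrix from config" could look like: turn a tenant-to-accounts mapping into the JSON a GitHub Actions matrix can consume. The mapping is hard-coded here as a stand-in for whatever a config source would return; `TENANT_ACCOUNTS`, the account IDs, and `build_matrix` are hypothetical.

```python
import json

# Hypothetical tenant -> AWS account IDs mapping (placeholder values).
TENANT_ACCOUNTS = {
    "customer1": ["111111111111"],
    "customer2": ["222222222222", "333333333333"],
}


def build_matrix(tenant_accounts, env):
    """Build one matrix entry per (tenant, account) pair for the given env."""
    include = [
        {"tenant": tenant, "account": account, "env": env}
        for tenant, accounts in sorted(tenant_accounts.items())
        for account in accounts
    ]
    return {"include": include}


matrix = build_matrix(TENANT_ACCOUNTS, "prod")
matrix_json = json.dumps(matrix)

# In a workflow step, this line would be appended to the file named by
# $GITHUB_OUTPUT so a later job can read it.
print(f"matrix={matrix_json}")
```

A downstream job could then use `strategy.matrix: ${{ fromJSON(needs.<job>.outputs.matrix) }}` to fan out one job per entry; the job and output names are up to the workflow author.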
A single config source seems very attractive for CI + runtime.
But like i said above, have yet to think this through all the way
g
to your other Qs -
And what mechanism do you use for HMR - or does Pulumi auto-redeploy everything if its configuration changes?
not sure what you mean by this - we do rerun pulumi up whenever something changes in git but pulumi doesn’t modify unchanged resources
Also, is Pulumi stack creation programmatic for y'all too? I.e., if the stack ref for a new tenant does not exist, create it and set defaults.
right now we’re doing it manually which is fine for our pace of customer onboarding, but will likely eventually move to the pulumi automation API for this
Last question - how is ESC billed? is every variable in it considered a resource??? or is it per-env?
I think every resource counts as a resource for billing, independent of whether you’re using ESC. But I'm not certain; you’d have to ask the Pulumi team to be sure
m
HMR - hot module reload; i.e., if you make a config change in ESC, how do you enforce its automatic propagation to all affected resources?
I think that is supported if you use Pulumi Deployments, but we aren't.
g
ah yeah, that story isn’t great. right now i just re-trigger a CI run (you can do this manually in the GH UI if you add a workflow_dispatch trigger to your workflow). Relevant FR - if they add gitops support for ESC then you can do that automatically: https://github.com/pulumi/esc/issues/185
there’s also a webhook FR which would allow this too https://github.com/pulumi/esc/issues/188
m
I almost want to ez-bake a webhook dispatcher
oh
there we go 😂
i would ideally want to emit events for 1..n config changes, but i think i can do that with just a crude poll
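One possible shape for such a poll, as a sketch only: fingerprint each environment's resolved config and report which names changed since the last snapshot. Fetching the configs is left out entirely; the dicts below stand in for whatever a real fetch would return, and `fingerprint` and `detect_changes` are hypothetical helpers.

```python
import hashlib
import json


def fingerprint(config):
    """Stable SHA-256 digest of a JSON-serializable config."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def detect_changes(previous, current):
    """Compare the previous snapshot with current configs.

    previous: dict of name -> digest from the last poll.
    current:  dict of name -> config as fetched now.
    Returns (sorted names that are new or changed, new snapshot).
    """
    snapshot = {name: fingerprint(cfg) for name, cfg in current.items()}
    changed = sorted(
        name for name, digest in snapshot.items() if previous.get(name) != digest
    )
    return changed, snapshot


# First poll: nothing is known yet, so every environment counts as changed.
first_changed, snapshot = detect_changes(
    {}, {"customer1-prod": {"fleet": 2}, "customer2-prod": {"fleet": 4}}
)

# Second poll: only customer2-prod differs, so only it would emit an event.
second_changed, snapshot = detect_changes(
    snapshot, {"customer1-prod": {"fleet": 2}, "customer2-prod": {"fleet": 5}}
)
print(first_changed, second_changed)
```

Each name in the changed list could then trigger one dispatch (for example a workflow_dispatch run). Note this sketch does not report environments that disappear between polls.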
CI orchestrators make life easy but... there's a lot that can go very wrong, very fast there.
GHA still not having a central dashboard is 🤮
g
yeah i’m personally more in favor of the gitops approach because pulumi (without ESC) already has a nice flow of: you change a config and you can see the pulumi preview output in your PR before you confirm it’s good to merge. if ESC supports gitops, you could make a change in a config and then see the pulumi preview across e.g. 10+ stacks before you merge that change in. the webhook thing kinda forces you to cross your fingers every time you make a change
m
i mean, the github bot is nice yeah, but for our matrix jobs we just do a 0/1 return with continue-on-error: true, based on whether preview returns changes or not
easiest (and the only sane way actually) to ensure ENV parity across all customer AWS accounts is to just kick off a matrix preview job against all of them
then we use the output artifacts from those to coalesce a single report
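A small sketch of the coalescing step, assuming each matrix job leaves behind a stack name and a 0/1 exit code as described above (0 meaning the preview showed no changes). `PreviewResult` and `build_report` are hypothetical names, and the results are hard-coded instead of being read from artifacts.

```python
from dataclasses import dataclass


@dataclass
class PreviewResult:
    stack: str
    exit_code: int  # 0 = preview showed no changes, non-zero = changes detected


def build_report(results):
    """Summarize per-stack preview results into one text report.

    Returns (report_text, all_in_parity).
    """
    drifted = sorted(r.stack for r in results if r.exit_code != 0)
    in_parity = len(results) - len(drifted)
    lines = [f"{in_parity}/{len(results)} stacks in parity"]
    lines += [f"  drift: {stack}" for stack in drifted]
    return "\n".join(lines), not drifted


# Hard-coded stand-ins for what the matrix jobs would upload as artifacts.
results = [
    PreviewResult("customer1-prod", 0),
    PreviewResult("customer2-prod", 1),
    PreviewResult("customer3-prod", 0),
]
report, all_in_parity = build_report(results)
print(report)
```

One limitation of a plain 0/1 code: it cannot tell "preview found changes" apart from "preview itself failed", so a real report may want a third state.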
GHA, as a CICD system, is actually quite nice
g
what i’m saying is without the gitops feature, there’s no ESC branching so you can’t run preview on your stacks before committing the ESC change. e.g. right now, you make an ESC change, you could run pulumi preview on that, but because there’s no branching, if you have a different PR that merges and causes pulumi up to run, the latest ESC change will deploy
m
hmmm
GitOps really does not do well if your CICD isn't 1:1 with your targets unfortunately 😞
g
yea
so that's my only gripe with it at the moment 😄 but otherwise ESC has been great for us
m
and honestly having to make changes in git just to onboard a tenant would make our SaaS model impossible to maintain
That's why I define the deployment targets per branch (aka, prod-beta is all our beta customers, and prod is all our regular customers)
and just fire off matrix jobs for our customer groups in cicd
our CICD definition itself hasn't changed in quite a while
here's a really rough, somewhat inaccurate version of our approach
when a new tenant is added, we currently:
• create their AWS account
• create a CICD role in that account, inject it into the appropriate CICD workflows
  ◦ link it to OIDC, etc.
• create new stacks for it in all our projects (1 per AWS account per tenant)
• create/update all necessary global resource access permissions via IAM role/set in our "master" account.
• do any necessary account-specific to global infra linking (e.g. create their subdomain and register it to their account's primary ALB)
all our logs do coalesce into a central account unless customers opt for high security (i.e., they get all their own services and use none of our shared ones, their logs do not coalesce into our primary, etc. - costs extra)
pumping our CICD logs and outputs into CloudWatch has honestly been probably the #1 saving grace of trying to manage all of this
Pulumi ESC seems like it could make our lives a lot easier with layered environments.
I saw that aws:login fn thing in ESC and that seems like an acceptable starting point for security
zero chance in hell we're letting any of those AWS cicd roles access global config
the whole thing is set up in a way that if we grant any customer access to their AWS account, the system stays secure. which was a nightmare.