# getting-started
a
hi all. got asked this by a colleague - We've got an LLM project at a client where we'll have to build a few different components including networks (VPN in Ansible, VPC), users (DSs, MLEs, service accounts and associated IAM users, groups, roles, permissions), a data science workbench (data storage (S3), development compute (EC2), training + inference compute, SageMaker project), AT stack, RLHF stack. Maybe my language gave it away, but we've sort of settled on the idea of treating each of the above stacks as individual Pulumi projects which share stack references, effectively using the micro-stacks. Each of these stacks can then be (re)deployed independently into the correct environment and not interrupt any development work going on. The current structure of the project is as follows:
.
└── ml-proj
    ├── utils/
    ├── config/
    ├── networks
    │   ├── vpn/
    │   └── vpcs.py
    ├── users
    │   ├── humans.py
    │   └── machines.py
    ├── experiment_tracker/
    ├── uat/
    ├── rlhf/
    ├── workbench_common
    │   ├── sagemaker.py
    │   ├── storage.py
    │   └── compute.py
    ├── workbench_dev/
    └── workbench_prod/
I don't know enough about the tool to know for certain that what we've decided is actually the "correct" choice, so anyone able to validate my thinking would be great. Some outstanding questions are:
1. How does CI/CD work where we might want to deploy only one or some of the stacks? Do we write a script that simply traverses the subdirectories and runs `pulumi up` on each, knowing that Pulumi should be smart enough to not cycle resources? A major disaster point would be cycling the experiment tracker and thereby losing the results of previous experiments.
2. How do we share similar but subtly different infrastructures between `dev` and `prod`? The main one here is `workbench_*`, which can either be a dev workbench with one EC2 instance per data scientist to be used for development, or a prod workbench which requires only a single EC2 instance for training runs. Likewise, different compute types might be required for better latency in prod, etc. We've settled on inheriting from classes in `workbench_common` and extending or overriding them in order to do this.
3. How do we handle IAM users? Ideally, we'd want a single config file where we can list users by email and which groups they should be in (`data-science`, `ml-engineering`, `superuser`), which can then be shared via stack references.
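As a rough sketch of the single-config-file idea in question 3 (the filename, schema, and emails here are my own assumptions, not anything from the project):

```python
import json

# Hypothetical users.json kept alongside the `users` micro-stack;
# the schema (email -> list of groups) is an assumption for illustration.
USERS_JSON = """
{
  "alice@example.com": ["data-science"],
  "bob@example.com":   ["ml-engineering", "superuser"]
}
"""

VALID_GROUPS = {"data-science", "ml-engineering", "superuser"}

def load_users(raw: str) -> dict:
    """Parse the email -> groups mapping and reject unknown group names."""
    users = json.loads(raw)
    for email, groups in users.items():
        unknown = set(groups) - VALID_GROUPS
        if unknown:
            raise ValueError(f"{email}: unknown groups {unknown}")
    return users

users = load_users(USERS_JSON)
# In the users stack, each entry would drive one aws.iam.User plus its
# group memberships, and the resulting ARNs could be exported with
# pulumi.export(...) so other stacks can consume them via StackReference.
```

Validating the group names up front keeps a typo in the config file from silently creating an IAM user with no permissions.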
b
hi @ambitious-computer-3093 - this looks like a good pattern. I wrote something up last week that may help: https://leebriggs.co.uk/blog/2023/08/17/structuring-iac. To answer your questions:
How does CI/CD work where we might want to deploy only one or some of the stacks? Do we write a script that simply traverses the subdirectories and runs pulumi up on each, knowing that Pulumi should be smart enough to not cycle resources? A major disaster point would be cycling the experiment tracker and thereby losing the results of previous experiments.
There are two options here: have a stack per environment in CI/CD and make sure that environment only runs on PR/merge or tags, or use the Automation API to wrap the invocation and run it that way.
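If you do go with a traversal script, here's a minimal sketch; the directory list, ordering, and the `dry_run` flag are assumptions for illustration, not a recommended implementation:

```python
import subprocess
from pathlib import Path

# Micro-stack project directories, listed in rough dependency order
# (names taken from the tree above; the ordering is an assumption).
STACKS = ["networks", "users", "experiment_tracker", "workbench_dev"]

def deploy_all(root: str, stack: str, dry_run: bool = True) -> list:
    """Run `pulumi up` in each project dir; with dry_run, just collect commands."""
    commands = []
    for name in STACKS:
        # `--cwd` runs pulumi as if started in that directory;
        # `--yes` skips the interactive confirmation (only safe in CI).
        cmd = ["pulumi", "up", "--stack", stack, "--yes",
               "--cwd", str(Path(root) / name)]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # fail fast if one stack errors
    return commands

cmds = deploy_all("ml-proj", "dev")
```

For the experiment-tracker worry specifically: running `pulumi preview` first shows what would change, and setting `protect=True` on the tracker's resources makes Pulumi refuse to delete them until they're explicitly unprotected.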
How do we share similar but subtly different infrastructures between dev and prod? The main one here is workbench_* which can either be a dev workbench with one EC2 instance per data scientist to be used for development or a prod workbench which requires only a single EC2 instance for training runs. Likewise, different compute types might be required for better latency in prod, etc. We’ve settled on inheriting from classes in workbench_common and extending or overriding them in order to do this.
You'd use a stack and set configuration on the stack. You can be very flexible here if you write a component which has inputs that get taken from config.
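To make the "component inputs taken from config" idea concrete, here's a plain-Python sketch of the shape such inputs might take (the field names and instance types are assumptions; in a real program the per-stack values would live in `Pulumi.dev.yaml` / `Pulumi.prod.yaml` and be read with `pulumi.Config()`):

```python
from dataclasses import dataclass

@dataclass
class WorkbenchArgs:
    """Inputs a workbench component would take; one value set per stack."""
    instance_type: str
    one_instance_per_scientist: bool

# Per-stack values; in Pulumi these would come from stack config files
# rather than being hard-coded like this.
STACK_CONFIG = {
    "dev":  WorkbenchArgs(instance_type="t3.large",
                          one_instance_per_scientist=True),
    "prod": WorkbenchArgs(instance_type="p3.2xlarge",
                          one_instance_per_scientist=False),
}

def workbench_args(stack: str) -> WorkbenchArgs:
    return STACK_CONFIG[stack]
```

A single component that branches on these inputs avoids the dev/prod class hierarchy entirely: the same code deploys either shape depending on which stack is selected.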
a
awesome thanks and i will pass this on
b
How do we handle IAM users? Ideally, we’d want a single config file where we can list users by email and which groups they should be in (data-science, ml-engineering, superuser) which can then be shared via stack references.
Personally I’d avoid IAM if you can, I wrote something about the right way to auth to AWS here: https://leebriggs.co.uk/blog/2022/09/05/authenticating-to-aws-the-right-way
a
that's really awesome, thanks for your help
do you have any guidance on using ComponentResources in python for building up "modules" of resources for reuse and encapsulation?
b
no official guidance written except this https://www.pulumi.com/docs/concepts/resources/components/
a
perfect thanks!
s
Paging @busy-journalist-6936 who’s been working on some AI/ML-specific stuff and may have some additional input
b
I'm def into the microstack idea but have not practiced it myself before, though I am breaking up my pulumi python into smaller modules so may have more to say/share soon.