ambitious-computer-3093
08/24/2023, 11:50 PM
└── ml-proj
    ├── utils/
    ├── config/
    ├── networks/
    │   ├── vpn/
    │   └── vpcs.py
    ├── users/
    │   ├── humans.py
    │   └── machines.py
    ├── experiment_tracker/
    ├── uat/
    ├── rlhf/
    ├── workbench_common/
    │   ├── sagemaker.py
    │   ├── storage.py
    │   └── compute.py
    ├── workbench_dev/
    └── workbench_prod/
I don't know enough about the tool to be certain that what we've decided is actually the "correct" choice, so if anyone is able to validate my thinking, that would be great. Some outstanding questions:
1. How does CI/CD work where we might want to deploy only one or some of the stacks? Do we write a script that simply traverses the subdirectories and runs pulumi up on each, knowing that Pulumi should be smart enough to not cycle resources? A major disaster point would be cycling the experiment tracker and thereby losing the results of previous experiments.
2. How do we share similar but subtly different infrastructure between dev and prod? The main one here is workbench_*, which can be either a dev workbench with one EC2 instance per data scientist to be used for development, or a prod workbench which requires only a single EC2 instance for training runs. Likewise, different compute types might be required for better latency in prod, etc. We've settled on inheriting from classes in workbench_common and extending or overriding them in order to do this.
3. How do we handle IAM users? Ideally, we'd want a single config file where we can list users by email and which groups they should be in (data-science, ml-engineering, superuser), which can then be shared via stack references.
billowy-army-68599
> How does CI/CD work where we might want to deploy only one or some of the stacks? Do we write a script that simply traverses the subdirectories and runs pulumi up on each, knowing that Pulumi should be smart enough to not cycle resources? A major disaster point would be cycling the experiment tracker and thereby losing the results of previous experiments.
There are two options here: have a stack per environment in CI/CD and make sure that environment only runs on PR/merge or tags. The other is to use the Automation API to wrap the invocation and run it that way.
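For the Automation API route, a minimal sketch of that wrapper could look like the following, assuming each subdirectory in the tree above is its own Pulumi project (the project list, ordering, and CLI handling are all placeholder choices, not anything prescribed by Pulumi):

```python
"""Minimal Automation API wrapper: deploy only the requested projects."""
import sys
import pulumi.automation as auto

# Projects in rough dependency order; a real script might derive this
# from stack references instead of hard-coding it.
PROJECTS = ["networks", "users", "experiment_tracker", "workbench_dev"]

def deploy(env, only=None):
    for project in PROJECTS:
        if only and project not in only:
            continue
        stack = auto.create_or_select_stack(
            stack_name=env,           # e.g. "dev" or "prod"
            work_dir=f"./{project}",  # the project's subdirectory
        )
        # up() is effectively a no-op for resources whose inputs are
        # unchanged, so re-running over untouched projects won't cycle them.
        result = stack.up(on_output=print)
        print(f"{project}/{env}: {result.summary.result}")

if __name__ == "__main__":
    # e.g. python deploy.py dev workbench_dev
    deploy(sys.argv[1], sys.argv[2:] or None)
```

For the experiment tracker specifically, it may also be worth creating its stateful resources with `pulumi.ResourceOptions(protect=True)`, so an accidental destroy or replacement fails instead of deleting experiment history.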
> How do we share similar but subtly different infrastructure between dev and prod? The main one here is workbench_*, which can be either a dev workbench with one EC2 instance per data scientist to be used for development, or a prod workbench which requires only a single EC2 instance for training runs. Likewise, different compute types might be required for better latency in prod, etc. We've settled on inheriting from classes in workbench_common and extending or overriding them in order to do this.
You'd use a stack and set configuration on the stack. You can be very flexible here if you write a component whose inputs are taken from config.
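As a sketch of that component pattern (the type token, config keys, and defaults below are invented for illustration):

```python
import pulumi
import pulumi_aws as aws

class Workbench(pulumi.ComponentResource):
    """One component, shaped by per-stack config instead of subclassing."""

    def __init__(self, name, opts=None):
        super().__init__("mlproj:workbench:Workbench", name, None, opts)
        cfg = pulumi.Config("workbench")
        # dev stack: one instance per data scientist; prod stack: a single,
        # larger training instance. Both are just config values.
        count = cfg.get_int("instance_count") or 1
        instance_type = cfg.get("instance_type") or "t3.medium"
        ami = cfg.require("ami")
        self.instances = [
            aws.ec2.Instance(
                f"{name}-{i}",
                ami=ami,
                instance_type=instance_type,
                opts=pulumi.ResourceOptions(parent=self),
            )
            for i in range(count)
        ]
        self.register_outputs({})
```

Then per stack: `pulumi config set workbench:instance_count 8` in dev, `pulumi config set workbench:instance_type p3.2xlarge` in prod, and the same program does the right thing in each environment.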
billowy-army-68599
> How do we handle IAM users? Ideally, we'd want a single config file where we can list users by email and which groups they should be in (data-science, ml-engineering, superuser), which can then be shared via stack references.
Personally I'd avoid IAM if you can; I wrote something about the right way to auth to AWS here: https://leebriggs.co.uk/blog/2022/09/05/authenticating-to-aws-the-right-way
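If IAM users are kept anyway, the single-config-file idea maps fairly directly onto stack config plus exports; a sketch, with an invented `users:members` schema:

```python
"""Pulumi.<stack>.yaml might contain, under this invented schema:

config:
  users:members:
    - email: ada@example.com
      groups: [data-science, superuser]
    - email: alan@example.com
      groups: [ml-engineering]
"""
import pulumi
import pulumi_aws as aws

cfg = pulumi.Config("users")
members = cfg.require_object("members")

# One IAM group per team named in the question.
groups = {
    g: aws.iam.Group(g, name=g)
    for g in ("data-science", "ml-engineering", "superuser")
}

for m in members:
    user = aws.iam.User(m["email"], name=m["email"])
    aws.iam.UserGroupMembership(
        f"{m['email']}-membership",
        user=user.name,
        groups=[groups[g].name for g in m["groups"]],
    )

# Other projects can read this via a StackReference.
pulumi.export("user_emails", [m["email"] for m in members])
```

Per the linked post, federated access is probably the better long-term answer; this only shows the config-sharing mechanics the question asks about.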