06/03/2021, 3:54 PM
🗒️ Ok folks I’m excited about the demos coming up right now, but I’m out of time to get mine ready, so I’m sharing a few notes here on my progress. The idea was to build a component that wraps a SageMaker model-serving prediction endpoint with a Lambda to run custom code in.
Eventually I got something really simple working in a local Pulumi ComponentResource, with a model
trained on a small dataset of red wine quality.
Going through wrapping it into multi-lang now using the Python boilerplate…
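To make the shape of the component concrete, here is a rough sketch of what the ComponentResource could look like. This is my own reconstruction, not the actual demo code: the class name, parameters, instance type, and handler layout are all assumptions.

```python
# Sketch of a ComponentResource wrapping a SageMaker endpoint plus a Lambda
# for custom pre/post-processing. All names and shapes are illustrative.
import pulumi
import pulumi_aws as aws


class ServingEndpoint(pulumi.ComponentResource):
    """A SageMaker model-serving endpoint fronted by a Lambda."""

    def __init__(self, name, model_data_url, image_uri, role_arn, opts=None):
        super().__init__("custom:ml:ServingEndpoint", name, None, opts)
        child_opts = pulumi.ResourceOptions(parent=self)

        # The trained model artifact, pulled from S3.
        model = aws.sagemaker.Model(
            f"{name}-model",
            execution_role_arn=role_arn,
            primary_container=aws.sagemaker.ModelPrimaryContainerArgs(
                image=image_uri,
                model_data_url=model_data_url,
            ),
            opts=child_opts,
        )
        config = aws.sagemaker.EndpointConfiguration(
            f"{name}-config",
            production_variants=[
                aws.sagemaker.EndpointConfigurationProductionVariantArgs(
                    model_name=model.name,
                    variant_name="primary",
                    instance_type="ml.t2.medium",  # assumption
                    initial_instance_count=1,
                )
            ],
            opts=child_opts,
        )
        endpoint = aws.sagemaker.Endpoint(
            f"{name}-endpoint",
            endpoint_config_name=config.name,
            opts=child_opts,
        )
        # Lambda that invokes the endpoint and runs the custom code around it.
        self.handler = aws.lambda_.Function(
            f"{name}-fn",
            runtime="python3.9",
            handler="index.handler",
            role=role_arn,
            code=pulumi.AssetArchive({".": pulumi.FileArchive("./handler")}),
            environment=aws.lambda_.FunctionEnvironmentArgs(
                variables={"ENDPOINT_NAME": endpoint.name},
            ),
            opts=child_opts,
        )
        self.register_outputs({"endpoint_name": endpoint.name})
```

The Lambda gets the endpoint name via an environment variable and would call `sagemaker-runtime:InvokeEndpoint` from inside the handler.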
Pain points so far: predictably, permissions… I’ve had a very hard time tightening the AWS policies to get this right. The error messages pop up late in the process (such as in the middle of training) and are sometimes not obvious. The least obvious one I hit was the Lambda failing to read a bucket because the object was encrypted with a KMS key and the Lambda’s role had no KMS permissions, yet it was reporting Access Denied on GetObject.
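For anyone hitting the same wall: the fix is that the role needs `kms:Decrypt` on the key in addition to `s3:GetObject` on the objects. A minimal policy sketch (the ARNs below are placeholders, not the real ones from my stack):

```python
import json

# Minimal policy sketch for a Lambda role reading KMS-encrypted S3 objects.
# Both ARNs are placeholders.
BUCKET_ARN = "arn:aws:s3:::example-model-bucket"
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Reading the object itself...
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"{BUCKET_ARN}/*"],
        },
        {
            # ...AND decrypting it. Without this statement, S3 still reports
            # a plain AccessDenied on GetObject, which is the confusing part.
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": [KEY_ARN],
        },
    ],
}

print(json.dumps(policy, indent=2))
```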
In addition, Pulumi seemed to have an issue getting the cloud state back in sync after adding/removing KMS encryption on the object; something is up with the diff of s3.BucketObject. I’ll have to revisit that… But the net effect was very confusing.
For the multi-lang boilerplate so far: (1) adding a command to set the package name would be a great time saver; (2) I don’t seem to have a good heuristic on when inputs should be
or when they should be
so perhaps some guidance there. Maybe I will figure this out as I progress.
I’m still excited to finish. One of the places I’ve worked before tried to build their own model serving, and it was a minor disaster at scale; it would very likely have benefited from delegating it to AWS. Pulumi would have been handy as a tool to intermediate between the science team and the “ml-ops” team in getting it production-worthy.
Another learning here was around “strange” resources. SageMaker offers a Prediction Job, for example, which can be considered a resource; however, it’s append-only in AWS: you can’t modify or delete an existing job. Pulumi currently does not model it, and perhaps that’s exactly why. These things blur the line a bit between “infrastructure” and “live objects managed by the ML platform”, and I wonder what the right division of responsibilities is here.
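One way to think about it: an append-only resource could still be modeled if every change forces a replacement and “delete” merely stops tracking the job instead of calling a (nonexistent) delete API. This is a hypothetical sketch of that semantics, not how any real Pulumi provider is implemented:

```python
from dataclasses import dataclass

# Hypothetical model of an append-only resource (e.g. a SageMaker job):
# jobs can only be created; updates replace; deletes just forget.


@dataclass
class JobResource:
    name: str
    spec: dict


class AppendOnlyProvider:
    def __init__(self):
        self.jobs = []   # everything ever created cloud-side, forever
        self.state = {}  # what the IaC tool currently tracks

    def create(self, name, spec):
        job = JobResource(name, spec)
        self.jobs.append(job)  # append-only: nothing is ever removed
        self.state[name] = job
        return job

    def diff(self, name, new_spec):
        # Jobs cannot be updated in place, so any change means "replace".
        old = self.state.get(name)
        return {"replace": old is not None and old.spec != new_spec}

    def update(self, name, new_spec):
        # "Update" is really create-new-and-track-the-new-one.
        return self.create(name, new_spec)

    def delete(self, name):
        # The cloud-side job cannot be deleted; just stop tracking it.
        self.state.pop(name, None)
```

Under this model the old jobs linger in the cloud, which is exactly the blurry “who owns these live objects” question above.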
At least I’ve reached a kind of “works on my machine” milestone 🙂