g

glamorous-printer-66548

10/29/2018, 11:08 PM
@lemon-spoon-91807 I noticed that you recently changed how docker images are tagged in pulumi docker (https://github.com/pulumi/pulumi-docker/pull/31). Basically it seems they’re now always tagged with the image ID, which in my understanding changes whenever any layer changes. The problem I’m facing now is that this makes using `cacheFrom` almost impossible. In order to use `cacheFrom` effectively I need some stable tag (e.g. `latest`) or some other identifier that can be derived from logic in the code before the build runs. Prior to your change I was adding `latest` as a tag to all images, which in conjunction with `cacheFrom: true` enabled quite good caching. Any thoughts on how I can achieve good caching now?
l

lemon-spoon-91807

10/29/2018, 11:10 PM
Hey!
I'm actually looking at this right now
but i'm trying to figure out the right thing to do here.
i could definitely use some info from you
the part i'm trying to understand is this:
if your layer changes... why would you not want the ID to change?
g

glamorous-printer-66548

10/29/2018, 11:11 PM
well I’m ok if the ID changes
but at the same time I need some stable tag to use as the --cache-from source
Before using pulumi we did the following in our docker build bash scripts:
Whenever we built an image, we tagged it with 2 tags:
- `latest`
- `<git_sha>`
and then we pushed both tags. In our k8s deployment we would reference the docker image with the `<git_sha>` tag, but during docker build we would pull `<image_name>:latest` and pass --cache-from=<image_name>:latest
So I think if it’s possible to push two tags, where one of them is the image ID and the other one is some stable identifier like `latest`, that somewhat solves my problem.
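A minimal sketch of the two-tag workflow described above, written as a function that only lists the docker CLI invocations. The function name `twoTagBuildCommands` and the image name in the usage line are hypothetical; the original bash scripts are not shown in this thread, so this is an illustration of the idea rather than the actual scripts.

```typescript
// Hypothetical helper: lists the docker CLI calls for the "two tags" workflow.
// It only builds argument lists; running them is left to the caller.
function twoTagBuildCommands(imageName: string, gitSha: string): string[][] {
  const stable = `${imageName}:latest`;    // stable tag, used as the cache source
  const pinned = `${imageName}:${gitSha}`; // per-commit tag, referenced by the deployment
  return [
    // Pull the previous build so its layers are available locally.
    // On the very first build this pull fails and should be tolerated.
    ["docker", "pull", stable],
    // Build once, reuse layers from the pulled image, and apply both tags.
    ["docker", "build", "--cache-from", stable, "-t", stable, "-t", pinned, "."],
    ["docker", "push", stable],
    ["docker", "push", pinned],
  ];
}

// Usage: print the commands for an assumed image name and commit.
for (const cmd of twoTagBuildCommands("example/app", "abc123")) {
  console.log(cmd.join(" "));
}
```

The stable tag only serves as a predictable name to pull before the build; the deployment keeps referencing the per-commit tag.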
l

lemon-spoon-91807

10/29/2018, 11:15 PM
ok. let me take a look
i admit the cache-from code confuses me greatly
but i thought that's what it was trying to do
specifically, if you did something like:
cacheFrom: {stages: ["some_id"]}
g

glamorous-printer-66548

10/29/2018, 11:18 PM
nah unfortunately not
l

lemon-spoon-91807

10/29/2018, 11:18 PM
could you clarify? 🙂
g

glamorous-printer-66548

10/29/2018, 11:21 PM
`stages` are named stages in a multi-stage Dockerfile. A multi-stage Dockerfile contains multiple `FROM` clauses, see these docs: https://docs.docker.com/develop/develop-images/multistage-build/#use-multi-stage-builds . If I specified `latest` as a stage name, pulumi’s docker code would attempt to build the Dockerfile with `docker build . --target latest`, which would fail unless the Dockerfile contains a stage named `latest`. Or to cut a long story short: stages have nothing to do with tags.
I want basically multiple tags, but not multiple stages 🙂
l

lemon-spoon-91807

10/29/2018, 11:22 PM
oh. are you making a feature request effectively?
(sorry, trying to distinguish that from this being a report about a bug i may have introduced :))
g

glamorous-printer-66548

10/29/2018, 11:24 PM
aehm
Well, before your change I was able to always tag the images pulumi built with the stable tag `latest`.
After your change it will always tag things with something like `<image_id>` or `latest-<image_id>`. So basically the tag becomes unstable, which makes it unsuitable for caching.
l

lemon-spoon-91807

10/29/2018, 11:26 PM
i see
g

glamorous-printer-66548

10/29/2018, 11:26 PM
And every time I make a code change now it will build the entire image completely from scratch.
l

lemon-spoon-91807

10/29/2018, 11:26 PM
i think
i need to talk to someone
g

glamorous-printer-66548

10/29/2018, 11:26 PM
which takes a LOOOT of time 😛
l

lemon-spoon-91807

10/29/2018, 11:27 PM
And every time I make a code change now it will build the entire image completely from scratch.
g

glamorous-printer-66548

10/29/2018, 11:27 PM
yeah docker is confusing
FYI always tagging every build and push with `latest` and using the `latest` tag as --cache-from is a very simple strategy that speeds up builds in many cases, but a slightly more sophisticated and better strategy would probably be this:
1. Tag and push each image with the git_sha of HEAD.
2. Before the build, pulumi should attempt to pull `<image_name>:<git_sha_HEAD>`. If it could pull this successfully, use it as the --cache-from source. If not, attempt to pull `<image_name>:<git_sha_HEAD~1>` and use that as the --cache-from source. Repeat this process until an image could be pulled successfully (maybe stop after 5 iterations, or at HEAD~5).
This strategy should make it possible to reuse a lot of cached layers for most builds.
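The fallback search described above could be sketched roughly as follows. Everything here is an assumption for illustration: `findCacheSource`, `resolveSha` and `canPull` are hypothetical names, not part of pulumi-docker. The two callbacks are injected so the logic stays independent of how git and docker are actually invoked (for example, `resolveSha` might wrap `git rev-parse HEAD~2` and `canPull` might wrap `docker pull`).

```typescript
// Hypothetical sketch: walk back from HEAD until an image tagged with an
// ancestor commit's SHA can be pulled, and use that image as the cache source.
function findCacheSource(
  imageName: string,
  resolveSha: (rev: string) => string | undefined, // assumed wrapper around git
  canPull: (imageRef: string) => boolean,          // assumed wrapper around docker pull
  maxAncestors: number = 5,
): string | undefined {
  for (let n = 0; n <= maxAncestors; n++) {
    const rev = n === 0 ? "HEAD" : `HEAD~${n}`;
    const sha = resolveSha(rev);
    if (sha === undefined) {
      break; // ran out of history, e.g. a shallow clone
    }
    const ref = `${imageName}:${sha}`;
    if (canPull(ref)) {
      return ref; // newest ancestor that has a pushed image
    }
  }
  return undefined; // nothing found: build without a cache source
}

// Usage with in-memory stand-ins for git and the registry.
const history: { [rev: string]: string } = { "HEAD": "c3", "HEAD~1": "b2", "HEAD~2": "a1" };
const registry = ["example/app:a1"];
const source = findCacheSource(
  "example/app",
  (rev) => history[rev],
  (ref) => registry.indexOf(ref) !== -1,
);
console.log(source); // prints "example/app:a1"
```

Capping the search keeps the number of failed pulls bounded when no recent ancestor has been pushed.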
l

lemon-spoon-91807

10/29/2018, 11:39 PM
could you potentially open an issue with that suggestion?
g

glamorous-printer-66548

10/29/2018, 11:47 PM
ok sure
fyi in https://github.com/pulumi/pulumi-docker/issues/32 @white-balloon-205 discusses the same thing on an abstract level 🙂 : intelligently tagging and using --cache-from to speed up builds.
l

lemon-spoon-91807

10/30/2018, 12:13 AM
question that isn't quite clear to me
wouldn't this part be incorrect:
If not, attempt to pull <image_name>:<git_sha_HEAD~1>
wouldn't that discount any changes you made yourself?
This needs probably some additional thought for determining what tag to push if the git working directory was dirty at the time of build.
one thing we're considering is not to use a git hash, but our own hash of file system contents. we already have that concept for other parts of our system (for example, it's how we know what to update when node_modules changes)
however, for docker, we feel like it would need to be opt-in. because, after all, any docker build could produce a different output, even if the contents on disk stayed the same.
g

glamorous-printer-66548

10/30/2018, 12:23 AM
wouldn’t that discount any changes you made yourself?
Nope, if I use `<git_sha_HEAD~1>` as the cache source, docker will attempt to reuse as many layers as possible from the cached image, but it will still detect local source code changes or local Dockerfile changes (that differ from the cached image) and build the corresponding layers from scratch.
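To illustrate why an older image is still a safe cache source, here is a deliberately simplified model of layer reuse. It assumes each build step can be reduced to a string key standing for the instruction plus a checksum of any files it copies; `planBuild` and the step strings are hypothetical, and real Docker cache matching has more rules than this.

```typescript
// Simplified model, for illustration only: steps are reused from the cached
// image until the first step that differs; that step and all later ones are rebuilt.
function planBuild(
  cachedSteps: string[],
  currentSteps: string[],
): { reused: string[]; rebuilt: string[] } {
  let firstMiss = 0;
  while (
    firstMiss < currentSteps.length &&
    firstMiss < cachedSteps.length &&
    currentSteps[firstMiss] === cachedSteps[firstMiss]
  ) {
    firstMiss++;
  }
  return {
    reused: currentSteps.slice(0, firstMiss),
    rebuilt: currentSteps.slice(firstMiss),
  };
}

// Usage: the image from HEAD~1 is the cache source, and only the source
// files changed locally (checksum h2 became h3).
const fromAncestor = ["FROM node:10", "COPY package.json [h1]", "RUN npm install", "COPY src [h2]"];
const localBuild = ["FROM node:10", "COPY package.json [h1]", "RUN npm install", "COPY src [h3]"];
const plan = planBuild(fromAncestor, localBuild);
console.log(plan.reused.length, plan.rebuilt.length); // prints "3 1"
```

In this model the local change is never masked by the cache: the step whose inputs changed is rebuilt, and only the unchanged steps before it are taken from the older image.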
l

lemon-spoon-91807

10/30/2018, 12:24 AM
ok. so just so i understand as well, what is the reason for not using cache from stages like: `cacheFrom: {stages: ["some_id"]}`?
g

glamorous-printer-66548

10/30/2018, 12:32 AM
Well first of all, in order to use `cacheFrom: {stages: ...}` I would have to add multiple stages to each Dockerfile, which is some additional work (and shouldn’t be required to get good caching). Second, I think after your change, even when using `cacheFrom: {stages: ...}`, each stage will be tagged with something like `<stageName>-<image_id>`. So none of the stages has a predictable tag to pull from either.
In general, stages are a feature that imho you shouldn’t have to bother with to get good remote caching. It’s an unrelated docker feature that was created for a different purpose, not to improve caching.
l

lemon-spoon-91807

10/30/2018, 12:33 AM
ok. i think i have a lot to learn about this
i'm really hesitant about pulumi having any knowledge of things like git hashes
g

glamorous-printer-66548

10/30/2018, 12:34 AM
If anything multi-stage docker builds are even harder to optimize for good remote caching.
The problem is if you don’t use git hashes, but instead your own file hashing, how will you determine the hash of an ancestor commit or ancestor build?
l

lemon-spoon-91807

10/30/2018, 12:38 AM
I honestly don't know 🙂 i don't have any good answers as i don't really understand this space well enough.
g

glamorous-printer-66548

10/30/2018, 12:42 AM
Ok, now that I think about it: if you store a pulumi-specific hash in the checkpoint you could read it from there. The disadvantage would be that you probably can’t read the hash across stacks. This means that each stack’s docker images would be cached separately, which is suboptimal when you have a large number of stacks of the same project or when you want to quickly create ephemeral stacks like we do in some cases.