# general
g
@lemon-spoon-91807 I noticed that you recently made a change to how docker images are tagged in pulumi-docker (https://github.com/pulumi/pulumi-docker/pull/31). Basically it seems they’re now always tagged with the image ID, which in my understanding changes whenever any layer changes. The problem I’m facing now is that this kind of makes using `cacheFrom` almost impossible. In order to effectively use `cacheFrom` I need some stable tag (i.e. `latest`) or some other identifier that can be derived from logic in the code before the build runs. Prior to your change I was adding `latest` as a tag to all images, which in conjunction with `cacheFrom: true` enabled quite good caching. Any thoughts on how I can achieve good caching now?
l
Hey!
I'm actually looking at this right now
but i'm trying to figure out the right thing to do here.
i could definitely use some info from you
the part i'm trying to understand is this:
if your layer changes... why would you not want the ID to change?
g
well I’m ok if the ID changes
but at the same time I need some stable tag to use as the `--cache-from` source
Before using pulumi we did the following in our docker build bash scripts:
Whenever we built an image, we tagged it with 2 tags: `latest` and `<git_sha>`, and then we pushed both tags. In our k8s deployment we would reference the docker image with the `<git_sha>` tag, but during docker build we would pull and `--cache-from=<image_name>:latest`.
So I think if it’s possible to push two tags, where one of them is the image ID and the other one is some stable identifier like `latest`, that somewhat solves my problem.
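A rough sketch of that pre-pulumi workflow, for reference. The image name is a made-up placeholder, and the script prints the docker commands instead of executing them, so it can be read (and dry-run) without a docker daemon:

```shell
#!/bin/sh
# Sketch of the two-tag workflow described above: build with
# --cache-from latest, tag with both 'latest' and the git sha, push both.
# IMAGE is a hypothetical placeholder; commands are echoed, not run.
IMAGE="registry.example.com/myapp"
GIT_SHA="$(git rev-parse --short=12 HEAD 2>/dev/null || echo 0123abcd)"

BUILD_CMD="docker build --cache-from $IMAGE:latest -t $IMAGE:latest -t $IMAGE:$GIT_SHA ."

echo "docker pull $IMAGE:latest || true"   # warm the local layer cache
echo "$BUILD_CMD"                          # reuse cached layers where possible
echo "docker push $IMAGE:latest"           # stable tag: the cache source
echo "docker push $IMAGE:$GIT_SHA"         # immutable tag: referenced by k8s
```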
l
ok. let me take a look
i admit the cache-from code confuses me greatly
but i thought that's what it was trying to do
specifically, if you did something like:
cacheFrom: {stages: ["some_id"]}
g
nah unfortunately not
l
could you clarify? 🙂
g
`stages` are named stages in a multi-stage Dockerfile. A multi-stage Dockerfile contains multiple `FROM` clauses, i.e. check those docs: https://docs.docker.com/develop/develop-images/multistage-build/#use-multi-stage-builds . If I specified `latest` as a stage name, pulumi’s docker code would attempt to build the Dockerfile with `docker build . --target latest`, which would fail unless the Dockerfile contains a stage named `latest`. Or to cut a long story short: stages have nothing to do with tags.
I basically want multiple tags, but not multiple stages 🙂
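To illustrate the distinction for anyone reading along (the file name and its contents are made up): stage names are values for `docker build --target`, and are unrelated to what the resulting image gets tagged as.

```shell
#!/bin/sh
# A hypothetical two-stage Dockerfile: "build" and "runtime" are stage
# names usable with --target; they have nothing to do with image tags.
cat > Dockerfile.example <<'EOF'
# stage named "build"
FROM node:10 AS build
COPY . /src
RUN cd /src && npm ci && npm run build

# stage named "runtime"
FROM nginx:alpine AS runtime
COPY --from=build /src/dist /usr/share/nginx/html
EOF

# --target must name a stage that exists in the Dockerfile:
echo "docker build . -f Dockerfile.example --target build"
echo "docker build . -f Dockerfile.example --target runtime"
# 'docker build . -f Dockerfile.example --target latest' would fail:
# there is no stage named "latest".
```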
l
oh. are you making a feature request effectively?
(sorry, trying to distinguish that from this being a report about a bug i may have introduced :))
g
aehm
Well, before your change I was able to always tag the images pulumi built with the stable tag `latest`.
After your change it will always tag things with something like `<image_id>` or `latest-<image_id>`. So basically the tag becomes unstable, which makes it unsuitable for caching.
l
i see
g
And every time I make a code change now it will build the entire image completely from scratch.
l
i think
i need to talk to someone
g
which takes a LOOOT of time 😛
l
> And everytime I make a code change now it will build the entire image completely from scratch.
g
yeah docker is confusing
FYI, always tagging every build and push with `latest` and using the `latest` tag as `--cache-from` is a very simple strategy that speeds up builds in many cases, but a slightly more sophisticated and better strategy would probably be this: 1. tag and push each image with the git sha of HEAD; 2. before the build, pulumi should attempt to pull `<image_name>:<git_sha_HEAD>`. If it could successfully pull this, use it as `--cache-from`. If not, attempt to pull `<image_name>:<git_sha_HEAD~1>` and use this as `--cache-from`. Repeat this process until an image could be pulled successfully (maybe stop after 5 iterations, or at HEAD~5).
This strategy should make it possible to reuse a lot of cached layers for most builds.
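The ancestor-walk part of that strategy could look roughly like this. Everything here is a sketch: `pull_image` would wrap `docker pull` in real use, but it (and the git history) is stubbed out so the control flow can be followed without a daemon:

```shell
#!/bin/sh
# Sketch: try the image tagged with HEAD's sha, then HEAD~1, ... up to
# HEAD~5, and use the first tag that pulls as the --cache-from source.
IMAGE="registry.example.com/myapp"        # hypothetical image name

pull_image() {                            # stub; real use: docker pull "$1"
  case "$1" in *:sha2222) return 0 ;; *) return 1 ;; esac
}

find_cache_tag() {
  i=0
  while [ "$i" -le 5 ]; do
    # In real use: sha="$(git rev-parse "HEAD~$i")"
    sha="$(eval echo "\$SHA_$i")"         # stubbed history for the demo
    [ -n "$sha" ] || return 1
    if pull_image "$IMAGE:$sha"; then
      echo "$IMAGE:$sha"                  # first ancestor that was pushed
      return 0
    fi
    i=$((i + 1))
  done
  return 1                                # no cache source found
}

# Fake history: HEAD and HEAD~1 were never pushed, HEAD~2 was.
SHA_0=sha0000; SHA_1=sha1111; SHA_2=sha2222; SHA_3=sha3333

if CACHE_FROM="$(find_cache_tag)"; then
  echo "docker build --cache-from $CACHE_FROM ."
fi
```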
l
could you potentially open an issue with that suggestion?
g
ok sure
fyi in https://github.com/pulumi/pulumi-docker/issues/32 @white-balloon-205 discusses the same thing on an abstract level 🙂: intelligently tagging and using `--cache-from` to speed up builds.
l
question that isn't quite clear to me
wouldn't this part be incorrect:
> If not, attempt to pull <image_name>:<git_sha_HEAD~1>
wouldn't that discount any changes you made yourself?
This probably needs some additional thought for determining what tag to push if the git working directory was dirty at the time of the build.
one thing we're considering is not to use a git hash, but our own hash of the file system contents. we already have that concept for other parts of our system (for example, it's how we know what to update when node_modules changes)
however, for docker, we feel like it would need to be opt-in. because, after all, any docker build could produce a different output, even if the contents on disk stayed the same.
g
> wouldn’t that discount any changes you made yourself?
Nope. If I use `<git_sha_HEAD~1>` as the cache source, docker will attempt to reuse as many layers as possible from the cached image, but it will still detect local source code changes or local Dockerfile changes (anything different from the cached image) and build the corresponding layers from scratch.
l
ok. so just so i understand as well, what is the reason for not using cache-from stages, like: `cacheFrom: {stages: ["some_id"]}`?
g
Well, first of all, in order to use `cacheFrom: {stages: ...}` I would have to add multiple stages to each Dockerfile, which is some additional work (and shouldn’t be required to get good caching). Second, I think that after your change, even when using `cacheFrom: {stages: ...}`, each stage will be tagged with something like `<stageName>-<image_id>`. So none of the stages has a predictable tag to pull from either.
In general, stages are a feature that imho shouldn’t need to be involved to get good remote caching. It’s an unrelated docker feature that was created for a different purpose, not to improve caching.
l
ok. i think i have a lot to learn about this
i'm really hesitant about pulumi having any knowledge of things like git hashes
g
If anything, multi-stage docker builds are even harder to optimize for good remote caching.
The problem is: if you don’t use git hashes, but instead your own file hashing, how will you determine the hash of an ancestor commit or ancestor build?
l
I honestly don't know 🙂 i don't have any good answers as i don't really understand this space well enough.
g
Ok, now that I think about it: if you store a pulumi-specific hash in the checkpoint, you could read it from there. The disadvantage would be that you probably can’t read the hash across stacks. This means that each stack’s docker images would be cached separately, which is suboptimal when you have a large number of stacks of the same project, or when attempting to quickly create ephemeral stacks like we do in some cases.