
Support remote snapshotter to speed up image pulling #3731

Closed
ktock opened this issue Oct 8, 2019 · 27 comments


@ktock ktock commented Oct 8, 2019

In the container lifecycle, pulling an image is one of the biggest performance bottlenecks during container startup. One study shows that pulling accounts for 76% of container startup time [FAST '16].

I know there is a discussion in #2943, but it seems nobody has started the implementation yet. To move this forward, I implemented a patch based on that discussion, along with an example implementation of a remote snapshotter. Through the implementation I found challenges around the metadata snapshotter and namespaces, so I adjusted the design to address them. Could anyone give comments on it?

Overview of the design and implementation

The whole picture is described here.

To make containerd work with remote snapshotter, it needs to:

  • Skip downloading layers that a remote snapshotter can prepare without downloading their contents (I call these remote layers here).
  • Make remote snapshots without the "Unpack" operation.
  • Make these snapshots work with the metadata snapshotter so that containerd can bind them to namespaces.

(design figure)

I think we can achieve this by introducing an additional filter (Fig 1) during the fetching process (in the Client.fetch method). There are already some filters in that method, so adding a new filter seems relatively easy.

  • The filter takes the list of download candidates (blobs) and checks whether each one and its lower layers are remote layers.
  • If so, it filters the layer out of the download candidates and creates the snapshot as a remote snapshot.
  • Containerd doesn't unpack layers if the layers already exist as snapshots, so we can avoid unpacking them.

The filter talks with remote snapshotters to determine whether a layer is a remote layer (Fig 2); a sketch of the interaction follows the list.

  1. The filter Stat()s the layer using its ChainID.
  2. If the snapshot doesn't exist, the filter attempts to Prepare() the layer as a remote snapshot.
  3. If possible, the remote snapshotter prepares an active snapshot, automatically applying a "RemoteSnapshotLabel" label.
  4. The filter Stat()s the active snapshot and checks for the label to see whether it is a remote snapshot.
  5. Only if the snapshot has the label does the filter Commit() it immediately and filter this layer out of the download candidates.
  6. If the snapshot is a remote snapshot, the snapshotter mounts the remote unpacked layer on the snapshot and commits it.
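
For illustration, here is a minimal Go sketch of steps 1-5 against containerd's snapshots.Snapshotter interface. The package name, the temporary key suffix, and the label key are placeholders for this discussion (the real "RemoteSnapshotLabel" key would be defined by the snapshotter implementation), not part of any existing API.

```go
package remotefilter // illustrative package name

import (
	"context"

	"github.com/containerd/containerd/snapshots"
)

// Placeholder for the "RemoteSnapshotLabel" of step 3.
const remoteSnapshotLabel = "containerd.io/snapshot/remote"

// isRemoteLayer walks steps 1-5 for a single layer identified by its ChainID.
func isRemoteLayer(ctx context.Context, sn snapshots.Snapshotter, chainID, parent string) (bool, error) {
	// 1. Stat() the layer by ChainID; if a snapshot already exists we are done.
	if _, err := sn.Stat(ctx, chainID); err == nil {
		return true, nil
	}
	// 2. Otherwise attempt to Prepare() it as a remote snapshot.
	key := chainID + "-remote-prepare" // temporary key for the active snapshot
	if _, err := sn.Prepare(ctx, key, parent); err != nil {
		return false, err
	}
	// 3-4. Stat() the active snapshot and check for the remote-snapshot label.
	info, err := sn.Stat(ctx, key)
	if err != nil {
		return false, err
	}
	if _, ok := info.Labels[remoteSnapshotLabel]; !ok {
		// Not a remote snapshot: clean up and let the normal pull/unpack path handle it.
		_ = sn.Remove(ctx, key)
		return false, nil
	}
	// 5. Commit it immediately so the layer can be dropped from the download candidates.
	return true, sn.Commit(ctx, chainID, key)
}
```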

Question

I'm really keen to make the remote snapshotter real. The capability to boot containers without pulling is great.

  • Could I get your comments on the design and implementation? Is it acceptable for containerd?
  • If not, are there any considerations? Or are there possible alternative designs?
@ktock ktock added the kind/feature label Oct 8, 2019

@estesp estesp commented Oct 8, 2019

Can you comment on the design/steps in #2968? I think that was the outcome of earlier discussions as the main core containerd change required to get this work moving.

I'll let @dmcgowan comment in case there was another issue I'm not finding. I know we tried to open up some issues to drive the work/todo list as we had a contributor interested, but that fell through.


@ehotinger ehotinger commented Oct 8, 2019

I think the problem last time was finding what made sense to contribute for a framework. It's already possible to do it all, so what would you PR in? Just the client opts and a specific remote filesystem snapshotter?

The client interactions aren't a standard on any registry. Maybe adding some kind of filter options makes sense, but they aren't totally usable on their own. You have to have some external process creating all the mounts, providing the client information that it has to skip the mounts, etc. Or the snapshotter itself has to do the mounts, so it's case by case.

Originally I made #3055 just to enable the second scenario, but there are a lot of problems with it when you start using actual credentials. You don't want them to be cached in boltdb. How do you rotate creds, make sure they aren't expired, etc.

And it's hard to make the remote snapshotter implementation generic without being a total mess. Does it make sense to put NBD, CIFS, NFS, x, y, z all in one? The generic one is just 'check if something exists', but then the mounts themselves are maintained out of band from containerd so you need something extra.

Then, how do you achieve resiliency in a generic way, to ensure the mounts are available when the container runs, i.e. after a system restart when all the snapshots were previously created/cached? Does it make sense for the snapshotter to do that, i.e. right before creating the RW layer? So again it's a case-by-case implementation, or an external process?

Lastly, what about the cases where some layers are remote and others were previously unpacked? Generally local disk outperforms network, so the metadata has to be per-layer and snapshotters need to re-use existing snapshots. And there are cases where you may want to choose network over local disk too.

I went through this process about a year ago and PRed most of the things I thought made sense. I was never able to reach consensus with the maintainers here or get enough momentum for its support, because of a lot of these reasons. So to start I think you should try the same, and just PR things that 'make sense' as standalone.

I personally would hate to see containerd as a dumping ground for many different snapshotter and client implementations, instead we should just be trying to focus on how to make these problems generic.


@Random-Liu Random-Liu commented Oct 8, 2019

The client interactions aren't a standard on any registry.

I remember @dmcgowan mentioned that the client interaction can make sense for reusing existing snapshots across different namespaces, which could be useful for buildkit.


@dmcgowan dmcgowan commented Oct 8, 2019

The idea was to minimize the new interactions which needed to take place between the client and the snapshotter. Generally this could be solved by attempting to Prepare a layer before attempting a fetch. The Prepare method could be used to pull "already existing" (from the backend remote snapshotter perspective) snapshots into the namespace, however this would still be a new interaction with snapshotters and needs to be defined.


@ktock ktock commented Oct 9, 2019

@ehotinger Thank you for your comments.

what would you PR in? Just the client opts and a specific remote filesystem snapshotter?

I don't intend to make the PR immediately. Before that, I wanted to hear the maintainers' opinions based on the code above, so I opened this issue. After that, I intend to make PRs for:

  • A new filter to skip downloading layers that remote snapshotters say they can provide.
  • The client options to turn on the filtering.
  • A remote snapshotter implementation backed by a normal docker registry using the stargz format, which is compatible with current docker images.

The client interactions aren't a standard on any registry.

As others mentioned, the "preparing a layer before fetch" idea makes sense as a standard client interaction, and my design is also based on it. I think it is a reasonable separation of responsibilities that:

  • the "case-by-case" parts are in the remote snapshotter, and
  • containerd has responsibilities to skip downloading layers, make remote snapshots and ensure the existence.

'check if something exists', but then the mounts themselves are maintained out of band from containerd so you need something extra.
how do you achieve resiliency in a generic way, to ensure the mounts are available when the container runs

I think ensuring the mounts are available is the snapshotter's responsibility. If the snapshotter finds a layer is no longer available, it should return an error on Prepare(). But I think we need to discuss containerd's behavior on the failure of Prepare(), i.e. while creating the rootfs's RW layer.

Credentials

As discussed in #2968, I think using labels to pass the credentials down makes sense. But I think we need something for expiration, rotation, etc.

Generally local disk outperforms network so the metadata has to be per-layer and snapshotters need to re-use existing snapshots. And there's cases where you may want to choose network over local disk too.

I think it is a nice feature, but must it be in scope at the current stage? I think the performance-related things are the snapshotter's responsibility, and the containerd core doesn't need to care about them.

Considering the above, I think current issues are:

  • containerd side:
    • The way to manage credentials, i.e. expiration, rotation, etc.
    • resiliency: containerd's behavior on the failure of Prepare(), i.e. while creating the rootfs's RW layer (re-fetch or something?)
  • snapshotter side:
    • credential management
    • performance
    • resiliency: ensuring the mounts are available

@AkihiroSuda AkihiroSuda commented Oct 9, 2019

I think we can just let the remote snapshotter plugin deal with the credentials (typically via ~/.docker/config.json) and call it a day.

In the future, maybe we can consider porting over the BuildKit session so that the daemon-side plugin can invoke gRPC requests against the client to fetch credentials.
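
For concreteness, here is a minimal sketch of the ~/.docker/config.json lookup suggested above, as a snapshotter plugin might do it. The package and function names are illustrative, only the plain base64 "auths" entries are handled (no credential helpers), and error handling is trimmed.

```go
package remotecreds // illustrative package name

import (
	"encoding/base64"
	"encoding/json"
	"os"
	"path/filepath"
	"strings"
)

// dockerConfig mirrors the relevant part of the Docker CLI config file.
type dockerConfig struct {
	Auths map[string]struct {
		Auth string `json:"auth"` // base64("username:password")
	} `json:"auths"`
}

// credsFor returns the username/password stored for a registry host,
// e.g. "registry-1.docker.io".
func credsFor(host string) (user, pass string, err error) {
	home, err := os.UserHomeDir()
	if err != nil {
		return "", "", err
	}
	raw, err := os.ReadFile(filepath.Join(home, ".docker", "config.json"))
	if err != nil {
		return "", "", err
	}
	var cfg dockerConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", "", err
	}
	decoded, err := base64.StdEncoding.DecodeString(cfg.Auths[host].Auth)
	if err != nil {
		return "", "", err
	}
	user, pass, _ = strings.Cut(string(decoded), ":")
	return user, pass, nil
}
```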


@lukasheinrich lukasheinrich commented Oct 9, 2019

@ktock thanks a lot for putting this together. This is great! I have a question regarding the layer filter. Is there a technical reason why a layer can only be taken from remote when all lower layers are also remote?


@ktock ktock commented Oct 9, 2019

@lukasheinrich Thank you for your question!
The reason is the following:

For example, assume the filter is creating a remote snapshot. If the parent layer isn't a remote layer, the filter needs to download and unpack the parent contents and make the parent snapshot before it can make the target snapshot, because Prepare() requires parent layers. But currently unpacking is separated from filtering, as mentioned in Fig 1 above ("filtering" -> "downloading" -> "unpacking").

Of course, by implementing the downloading+unpacking in the filter we could achieve it, but slightly strange cases can occur where many layers are downloaded and unpacked during filtering before Unpack() is actually invoked, e.g. when the top-most layer is the only remote layer and all the others aren't.

Considering the above, the restriction "a layer can only be taken from remote when all lower layers are also remote" made sense to me. Are there any cases where this restriction is critical? Most images usually have their bigger layers lower in the stack, so even with the restriction we can skip downloading large parts of an image. (A short sketch of this bottom-up filtering follows.)
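
Continuing the earlier sketch (same illustrative package and the hypothetical isRemoteLayer helper), the bottom-up filtering could look roughly like this: the filter walks the chain from the lowest layer and stops at the first layer the snapshotter cannot provide, so Prepare() always has its parent snapshot available.

```go
// layer is a minimal stand-in for a download candidate in this sketch.
type layer struct {
	chainID string
}

// filterRemoteLayers returns the layers that still need to be downloaded and
// unpacked; everything below the first non-remote layer is served remotely.
func filterRemoteLayers(ctx context.Context, sn snapshots.Snapshotter, chain []layer) ([]layer, error) {
	parent := ""
	for i, l := range chain { // chain[0] is the bottom-most layer
		ok, err := isRemoteLayer(ctx, sn, l.chainID, parent)
		if err != nil {
			return nil, err
		}
		if !ok {
			// First non-remote layer: it and everything above it go through
			// the usual fetch + unpack path.
			return chain[i:], nil
		}
		parent = l.chainID
	}
	return nil, nil // every layer was remote; nothing left to download
}
```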


@ehotinger ehotinger commented Oct 9, 2019

It's very exciting to see all the effort and ideas here. This is great.

As others mentioning, "Preparing a layer before fetch" idea makes sense as standard client interaction and my design is also based on it. I think it is a reasonable separation of responsibilities that:

the "case-by-case" parts are in the remote snapshotter, and
containerd has responsibilities to skip downloading layers, make remote snapshots and ensure the existence.

As others mentioning, "Preparing a layer before fetch" idea makes sense as standard client interaction and my design is also based on it.

Sounds totally reasonable to me. But the reason I brought this up is that containerd isn't standalone when you do this. You always need custom stuff on top of containerd on a per-project basis. For example, how will you integrate this with CRI? It would be nice if I could just use a remote snapshotter with stargz, CIFS, NFS, whatever, and go through the same exact flow we have today.

I think we can just let the remote snapshotter plugin to deal with the credential (typically via ~/.docker/config.json) and call it for a day.

I don't think the concern is where this metadata is stored. It can be anywhere. I personally like it being associated with layers, because then you can have different credentials for each remote layer on a per-layer basis and potentially use different sources, and you aren't coupled directly to username/password/token (it can be easily expanded upon), but this does increase complexity. Let's focus on the interaction model. Prepare fails, a cred expired from a 'generic network file system': do I need to log in to a registry (stargz), get creds somehow for NFS, CIFS, etc.? Do I need to take these actions every time right before I run a container, etc.?

@crosbymichael crosbymichael added this to the 1.4 milestone Oct 9, 2019

@crosbymichael crosbymichael commented Oct 9, 2019

I placed this issue in the 1.4 milestone as it's a large focus for us and the community. :)


@dmcgowan dmcgowan commented Oct 10, 2019

My WIP of this is here https://github.com/dmcgowan/containerd/tree/prepare-snapshot-target
Mostly just figuring out the metadata store side right now and making sure it is backwards compatible.

Basically it allows an ErrAlreadyExists to be returned when passing in containerd.io/snapshot.ref. The backend snapshotter may also return that and the metadata store will handle it by calling a stat to get the snapshot info then adding it to the metadata store. The ErrAlreadyExists will be returned to the client at that point to re-check whether the target snapshot is already there.
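
For illustration, a rough client-side sketch of that flow, assuming the snapshotter and metadata store behave as described (Prepare() is called with a containerd.io/snapshot.ref label naming the target committed snapshot, and ErrAlreadyExists is returned when it is already present). The key naming and package layout are illustrative, not the final API.

```go
package remotefilter // illustrative package name

import (
	"context"

	"github.com/containerd/containerd/errdefs"
	"github.com/containerd/containerd/snapshots"
)

// prepareRemote tries to pull an "already existing" remote snapshot into the
// namespace by Prepare()ing with the target ref label, and reports whether the
// layer download can be skipped.
func prepareRemote(ctx context.Context, sn snapshots.Snapshotter, chainID, parent string) (bool, error) {
	key := "extract-" + chainID // temporary key for the active snapshot
	opt := snapshots.WithLabels(map[string]string{
		"containerd.io/snapshot.ref": chainID, // name of the target committed snapshot
	})
	_, err := sn.Prepare(ctx, key, parent, opt)
	switch {
	case err == nil:
		// No remote snapshot was found; fetch and unpack this layer as usual
		// into the active snapshot we just prepared.
		return false, nil
	case errdefs.IsAlreadyExists(err):
		// The backend (or the metadata store) already has the target snapshot;
		// re-check that it is really there, then skip downloading the layer.
		if _, err := sn.Stat(ctx, chainID); err != nil {
			return false, err
		}
		return true, nil
	default:
		return false, err
	}
}
```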


@ktock ktock commented Oct 10, 2019

@dmcgowan Thank you very much for your work on the metadata snapshotter.
This can reduce the complexity on the filter side, because it can make a remote snapshot simply by Prepare() without Commit().

One question:

  • On the filter side, it needs to check whether a committed snapshot is a remote snapshot or not. Can this be done by checking the containerd.io/snapshot.ref label?

@ktock ktock commented Oct 10, 2019

@ehotinger Thank you for your comments.

It would be nice if I could just use a remote snapshotter with stargz, CIFS, NFS, whatever, and go through the same exact flow we have today.

Currently, the filter and snapshotter implementations don't require additional actions except in the following cases:

  • Source preparation: setting up the remote store (NFS, CVMFS, etc.) and converting images on the registry (stargz, etc.).
  • Setting an option to turn on the filter (but I think we can enable the filter by default because it is backwards compatible).
  • Creds (auth)-related events.

I think the source preparation is indispensable and is each snapshotter's own matter, so the consideration here is the "creds (auth)-related events" you mentioned:

Prepare fails, cred expired from 'generic network file system', do I need to login to a registry (stargz), get creds somehow for NFS, CIFS, etc.? Do I need to take these actions every time right before I run a container, etc.

The handling of auth-related things is very case-by-case, depending on the remote snapshotter (store), and it is hard to find a general way to do it.
At the current stage, I think the reasonable solution would be:

  • Introducing a new error which indicates "Authentication Failed"; users deal with the error (updating the creds, etc.) using tools or following the procedure specified by the snapshotter.
  • In the future, we can integrate these processes as plugins into containerd if necessary.

@ktock ktock commented Oct 21, 2019

@dmcgowan

How is your WIP on the metadata snapshotter mentioned here going? 😄
Recently I integrated your metadata snapshotter with my filter implementation and remote snapshotter. It seems to work fine! Do you have any plans for the metadata snapshotter?

You can try the demo as described in the README.


@dprotaso dprotaso commented Oct 30, 2019

I have a few questions and I wonder what people's thoughts are:

1) What types of remote snapshotters are we looking to support?

I'm imagining two classes:

  1. read-only
  2. read/write

What would be the scope of each type? i.e. for r/w, does a local commit propagate to the remote?

2) For surfacing read-only data, is using the snapshotter interface appropriate?

When looking at the remote-snapshotter, I saw it's using overlayfs while the 'remote' fs is pluggable. That's great. But if I wanted to use a different union fs (i.e. btrfs), does that require me to write another snapshotter implementation?

Secondly, I wonder if it makes sense to integrate it as a look-aside remote cache. Meaning, instead of creating a snapshotter, we have logic to check the cache when 'unpacking'. If the cache has the layer content present, it can just mount the remote snapshot.

Other options could be to chain snapshotters together, similar to HTTP middleware, thus creating an inline cache. Though this would likely mean the snapshotter interface would need to change.


@dmcgowan dmcgowan commented Oct 30, 2019

@ktock I am working on the unit tests now. I stopped to get the filters branch in so I could use that. Thank you for testing it out!


@ktock ktock commented Oct 31, 2019

@dprotaso Thank you for your questions.

  1. What types of remote snapshotters are we looking to support?

I think the initial scope is "read-only", but it is possible to include "write" support in future work.
The ways to propagate local changes to the remote are very different among filesystems, so I think that is the remote snapshotter's responsibility (rather than the core of containerd's). And I think it isn't impossible for a remote snapshotter to propagate changes on an Active snapshot to the remote on each Commit(), using a filesystem-specific protocol.

  2. For surfacing read-only data, is using the snapshotter interface appropriate?

But if I wanted to use a different union fs (ie. btrfs) does that require me to write another snapshotter implementation?

Yes, currently you need to write your own implementation, but you can reuse the "remote" fs plugin for the filesystem-specific parts.
I also think it would be great for the remote snapshotter to support a kind of "union fs plugin".

Secondly, I wonder if it makes sense to integrate it as a look aside remote cache.

It sounds reasonable to me, but I personally prefer the integration as a snapshotter because the integration cost of a remote cache seems higher than simply adding the "filter" and plugging a remote snapshotter into containerd. I would also like to hear the maintainers' opinions here.


@lukasheinrich lukasheinrich commented Oct 31, 2019

Just noting some developments from Azure:

https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/
https://twitter.com/lukasheinrich_/status/1189794453365645312

@SteveLasker are there any plans to integrate that solution with what's discussed here?


@SteveLasker SteveLasker commented Oct 31, 2019

The Teleport work was initially started on Moby as we wanted to get this out last year. That became too problematic as there weren't enough extension points in moby to have a clean implementation, so engineering convinced us to move to a containerd solution.
That said, we're still refactoring some of the changes to align with the containerd snapshotter and we likely would have some PRs to do cleanup on our side. As @ehotinger mentions, the implementations are very cloud specific, as we each have varied storage, auth and networking solutions. If you look at the various registry implementations, we all follow the distribution-spec, but each have unique implementations. This is the beauty of the distribution-spec design.


@dmcgowan dmcgowan commented Nov 1, 2019

See #3793 for the change to core. Client changes to leverage this will be next.


@ktock ktock commented Nov 24, 2019

I posted an idea for the client changes in #3846 to enable skipping the download of layers which can be provided by the backend snapshotter. Could you give comments on it?
Additional client-side failure handling for better resiliency will come next.


@jblomer jblomer commented Dec 4, 2019

This is really fantastic, many thanks!

I'd have two questions:

  • How do you see the remote file system plugins of the remote snapshotter (CernVM-FS, NFS, etc.) being deployed? Can this be a plugin container, bundling the file system client and whatever else is required?

  • Can multiple remote file system plugins be active concurrently, and if so, can we decide which one gets priority for serving a layer (e.g. first CRFS, then NFS)?


@ktock ktock commented Dec 5, 2019

@jblomer Thank you for your great questions! So far there hasn't been enough discussion about these topics, so I would like to hear others' opinions as well, but my opinion is the following.

How do you see the remote file system plugins of the remote snapshotter (CernVM-FS, NFS, etc.) being deployed? Can this be a plugin container, bundling the file system client and whatever else is required?

The remote snapshotter sees filesystem plugins which are plugged into containerd. The current example implementation is a binary plugin loaded by containerd at runtime. A socket-based plugin, which can be deployed as a container, should also be recognizable (but I haven't tried it).

Can multiple remote file system plugins be active concurrently, and if so, can we decide which one gets priority for serving a layer (e.g. first CRFS than NFS)?

I think these kinds of operations need to be possible. Currently, the example implementation can recognize multiple file system plugins but the priority is undefined.

ktock added a commit to ktock/containerd that referenced this issue Dec 21, 2019
Currently, containerd gives the ChainID to backend snapshotters during unpacking, which enables snapshotters to search for existing snapshots and skip downloading the contents. But the ChainID isn't enough for a remote snapshotter (discussed in containerd#3731) to search for layer contents in a docker registry, because the docker registry API requires image refs and layer digests as well.

This commit solves this issue by adding a handler to inject basic information of
images through annotations to backend snapshotters during unpacking.

Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>

@ktock ktock commented Dec 23, 2019

I posted a new patch #3911 about image information propagation to snapshotters, which is indispensable for a remote snapshotter to find the proper layers in the remote store (registry, etc.). Could you give comments on it?


@ktock ktock commented Feb 1, 2020

The new discussion thread is at containerd/project#43. Your opinions and feedback are very welcome!


@ktock ktock commented Feb 19, 2020

I posted a new patch #4044 to enable using remote snapshotters without forcing snapshotter users to import snapshotter-specific handlers into the codebase.

The patch is indispensable for wide adoption of the remote snapshotter. Could you give comments on it?


@AkihiroSuda AkihiroSuda commented Mar 4, 2020
