Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runtime v2 (shim API) #2434

Merged
merged 5 commits into from Jul 18, 2018
Merged

Runtime v2 (shim API) #2434

merged 5 commits into from Jul 18, 2018

Conversation

@crosbymichael
Copy link
Member

crosbymichael commented Jul 3, 2018

This implements the shimv2 API proposed in #2426. The tests are using the new io.containerd.oci.v1 runtime under the new TaskManger implementing the v2 APIs.

Runtime v2

Runtime v2 introduces a first class shim API for runtime authors to integrate with containerd.
The shim API is minimal and scoped to the execution lifecycle of a container.

Binary Naming

Users specify the runtime they wish to use when creating a container.
The runtime can also be changed via a container update.

> ctr run --runtime io.containerd.runc.v1

When a user specifies a runtime name, io.containerd.runc.v1, they will specify the name and version of the runtime.
This will be trasnlated by containerd into a binary name for the shim.

io.containerd.runc.v1 -> containerd-shim-runc-v1

containerd keeps the containerd-shim-* prefix so that users can ps aux | grep containerd-shim to see running shims on their system.

Shim Authoring

This section is dedicated to runtime authors wishing to build a shim.
It will detail how the API works and different considerations when building shim.

Commands

Container information is provided to a shim in two ways.
The OCI Runtime Bundle and on the Create rpc request.

start

Each shim MUST implement a start subcommand.
This command will launch new shims.
The start command MUST accept the following flags:

  • -namespace the namespace for the container
  • -id the id of the container
  • -address the address of the containerd's main socket
  • -publish-binary the binary path to publish events back to containerd

The start command, as well as all binary calls to the shim, has the bundle for the container set as the cwd.

The start command MUST return an address to a shim for containerd to issue API requests for container operations.

The start command can either start a new shim or return an address to an existing shim based on the shim's logic.

delete

Each shim MUST implement a delete subcommand.
This command allows containerd to delete any container resources created, mounted, and/or run by a shim when containerd can no longer communicate over rpc.
This happens if a shim is SIGKILL'd with a running container.
These resources will need to be cleaned up when containerd looses the connection to a shim.
This is also used when containerd boots and reconnects to shims.
If a bundle is still on disk but containerd cannot connect to a shim, the delete command is invoked.

The delete command MUST accept the following flags:

  • -namespace the namespace for the container
  • -id the id of the container
  • -address the address of the containerd's main socket
  • -publish-binary the binary path to publish events back to containerd

The delete command will be executed in the container's bundle as its cwd.

Host Level Shim Configuration

containerd does not provide any host level configuration for shims via the API.
If a shim needs configuration from the user with host level information across all instances, a shim specific configuration file can be setup.

Container Level Shim Configuration

On the create request, there is a generic *protobuf.Any that allows a user to specify container level configuration for the shim.

message CreateTaskRequest {
	string id = 1;
	...
	google.protobuf.Any options = 10;
}

A shim author can create their own protobuf message for configuration and clients can import and provide this information is needed.

I/O

I/O for a container is provided by the client to the shim via fifo on Linux, named pipes on Windows, or log files on disk.
The paths to these files are provided on the Create rpc for the initial creation and on the Exec rpc for additional processes.

message CreateTaskRequest {
	string id = 1;
	bool terminal = 4;
	string stdin = 5;
	string stdout = 6;
	string stderr = 7;
}
message ExecProcessRequest {
	string id = 1;
	string exec_id = 2;
	bool terminal = 3;
	string stdin = 4;
	string stdout = 5;
	string stderr = 6;
}

Containers that are to be launched with an interactive terminal will have the terminal field set to true, data is still copied over the files(fifos,pipes) in the same way as non interactive containers.

Root Filesystems

The root filesytems for the containers is provided by on the Create rpc.
Shims are responsible for managing the lifecycle of the filesystem mount during the lifecycle of a container.

message CreateTaskRequest {
	string id = 1;
	string bundle = 2;
	repeated containerd.types.Mount rootfs = 3;
	...
}

The mount protobuf message is:

message Mount {
	// Type defines the nature of the mount.
	string type = 1;
	// Source specifies the name of the mount. Depending on mount type, this
	// may be a volume name or a host path, or even ignored.
	string source = 2;
	// Target path in container
	string target = 3;
	// Options specifies zero or more fstab style mount options.
	repeated string options = 4;
}

Shims are responsible for mounting the filesystem into the rootfs/ directory of the bundle.
Shims are also responsible for unmounting of the filesystem.
During a delete binary call, the shim MUST ensure that filesystem is also unmounted.
Filesystems are provided by the containerd snapshotters.

Other

Unsupported rpcs

If a shim does not or cannot implement an rpc call, it MUST return a github.com/containerd/containerd/errdefs.ErrNotImplemented error.

ttrpc

ttrpc is the only currently supported protocol for shims.
It works with standard protobufs and GRPC services as well as generating clients.
The only difference between grpc and ttrpc is the wire protocol.
ttrpc removes the http stack in order to save memory and binary size to keep shims small.
It is recommended to use ttrpc in your shim but grpc support is also in development.

Closes #2426

client.go Outdated
@@ -79,8 +79,12 @@ func New(address string, opts ...ClientOpt) (*Client, error) {
return nil, err
}
}
rt := "io.containerd.oci.v1"
if runtime.GOOS != "linxu" {

This comment has been minimized.

Copy link
@stevvooe

stevvooe Jul 3, 2018

Member

linux

Copy link
Contributor

mlaventure left a comment

Looks good so far, don't have the time to finish goign through it all atm, but I'll try to get another pass at it.

client.go Outdated
@@ -79,8 +79,12 @@ func New(address string, opts ...ClientOpt) (*Client, error) {
return nil, err
}
}
rt := "io.containerd.oci.v1"
if runtime.GOOS != "linxu" {

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

linux

func main() {
if err := shim.Run(oci.New); err != nil {
logrus.WithError(err).Error("shim run")
fmt.Fprintf(os.Stderr, "containerd-shim-process-v1: %s\n", err)

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

containerd-shim-oci-v1

@@ -871,70 +875,6 @@ func TestContainerKillAll(t *testing.T) {
}
}

func TestShimSigkilled(t *testing.T) {

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

Not handled anymore?

This comment has been minimized.

Copy link
@crosbymichael

crosbymichael Jul 3, 2018

Author Member

I looked at the test and it really wasn't testing much, killing the shim then killing the process. I don't think it was a valid test or it was and changed sometime in the past. Felt safe to remove.

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

It checks that the container process (redis) is correctly cleaned up if the shim crash/is sigkilled (e.g. by oom).
It doesn't actually kill the process, it checks that it died.

This comment has been minimized.

Copy link
@crosbymichael

crosbymichael Jul 3, 2018

Author Member

ok, my bad, i'll fix it. i missed the kill 0

This comment has been minimized.

Copy link
@estesp

estesp Jul 12, 2018

Contributor

Seems this test is still removed in the latest commit? or did I miss a move somewhere else?

This comment has been minimized.

Copy link
@crosbymichael

crosbymichael Jul 16, 2018

Author Member

ok, so I researched this more and the test is only valid for v1 when container was still running and didn't have to reconnect to a shim again. The reason is that it sets an exit handler that does the "cleanup on dead shim" code. However, this only runs if the shim process that containerd is doing a waidpid on returns and checks the shim pid.

I think its safe to remove this as it doesn't really work, it only works in the tests because the daemon isn't restarted. If we want to keep it, i can do some refactoring but it will take a little bit of work to make this work across daemon reboots with running shims. I can do that in a followup PR.

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 16, 2018

Contributor

Makes sense.

I haven't yet got the time to compile and play around with the new API to see how it actually behave.

Having a way to start a "cleanup shim" in such scenario may be a cleaner solution for v2 though (and if that fails, things are left to the user to clean).

if wg != nil {
defer wg.Done()
}

stats, err := cg.Stat(cgroups.IgnoreNotExist)
ctx := namespaces.WithNamespace(context.Background(), t.Namespace())

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

No timeout / cancel / Sync ? We could end up with several go routine getting the stats on the same task, no?

This comment has been minimized.

Copy link
@crosbymichael

crosbymichael Jul 3, 2018

Author Member

they are sync'd in the caller, hard to have a timeout for this because under load, collection could be slower

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

oh, right. I missed the waitgroup, my bad.

IoGID int
WorkDir string

id string

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

nit, but could we bundle exported variables together?

}
}()
// create base directories
for _, d := range paths {

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

why the separation between the base dir and the last dir in the path?

This comment has been minimized.

Copy link
@crosbymichael

crosbymichael Jul 3, 2018

Author Member

base should always be created while mkdir should return an exists error if something is already there

This comment has been minimized.

Copy link
@dmcgowan

dmcgowan Jul 6, 2018

Member

What is the reason behind doing them in 2 loops if the remove all happens on error anyway? Also what does it mean if Mkdir returns an exist on the second part, seems odd that it would both return an exists error and delete the directory.


outw, err := fifo.OpenFifo(ctx, stdout, syscall.O_WRONLY, 0)
if err != nil {
return nil, err

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

cleanup stdin if created?

}
outr, err := fifo.OpenFifo(ctx, stdout, syscall.O_RDONLY, 0)
if err != nil {
return nil, err

This comment has been minimized.

Copy link
@mlaventure

mlaventure Jul 3, 2018

Contributor

missing cleanup of in/outw

@jterry75
Copy link
Contributor

jterry75 commented Jul 4, 2018

@crosbymichael - This is a really good start. I can rebase my windows side changes on this and resubmit

@ehazlett ehazlett force-pushed the crosbymichael:shimv2 branch from 272ec84 to e52f795 Jul 5, 2018
@crosbymichael crosbymichael force-pushed the crosbymichael:shimv2 branch 5 times, most recently from e3c9050 to b238640 Jul 5, 2018
@codecov-io
Copy link

codecov-io commented Jul 5, 2018

Codecov Report

Merging #2434 into master will decrease coverage by 0.03%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2434      +/-   ##
==========================================
- Coverage   44.76%   44.73%   -0.04%     
==========================================
  Files          92       93       +1     
  Lines        9483     9490       +7     
==========================================
  Hits         4245     4245              
- Misses       4555     4562       +7     
  Partials      683      683
Flag Coverage Δ
#linux 48.96% <0%> (-0.02%) ⬇️
#windows 41.03% <ø> (ø) ⬆️
Impacted Files Coverage Δ
sys/reaper.go 0% <ø> (ø) ⬆️
sys/reaper_linux.go 0% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 02579c8...d53a96f. Read the comment docs.

@crosbymichael crosbymichael force-pushed the crosbymichael:shimv2 branch from b238640 to d8f0e8e Jul 5, 2018
if werr == nil {
err2 = os.RemoveAll(work)
if err2 == nil {
return err

This comment has been minimized.

Copy link
@dmcgowan

dmcgowan Jul 6, 2018

Member

wrapping this could be useful, I don't think the message includes the path

}

func (b *binary) Delete(ctx context.Context) (*runtime.Exit, error) {
cmd, err := shimCommand(ctx, b.runtime, b.containerdAddress, b.bundle, "-debug", "delete")

This comment has been minimized.

Copy link
@dmcgowan

dmcgowan Jul 6, 2018

Member

Why is this command not given the bundle id anywhere?

This comment has been minimized.

Copy link
@crosbymichael

crosbymichael Jul 6, 2018

Author Member

its not needed and comes on the Create request

This comment has been minimized.

Copy link
@dmcgowan

dmcgowan Jul 6, 2018

Member

What does this exec with the delete arg intended to do?

args = append(args, cmdArgs...)
name := getShimBinaryName(runtime)
if !filepath.IsAbs(name) {
_, err = exec.LookPath(name)

This comment has been minimized.

Copy link
@dmcgowan

dmcgowan Jul 6, 2018

Member

Currently name will never be absolute, but is it worth considering a case where name is in the same directory as self but not in the lookup path? In that case the name might be better found at os.Join(os.Dir(self), name). Unless there is a plan here to document setting PATH on containerd for the runtime binary installation location.

This comment has been minimized.

Copy link
@crosbymichael

crosbymichael Jul 6, 2018

Author Member

i think for now, we say that it must be in PATH. i'll remove this abs code

// newShim starts and returns a new shim
func newShim(ctx context.Context, bundle *Bundle, runtime, containerdAddress string, events *exchange.Exchange, rt *runtime.TaskList) (_ *shim, err error) {
if err := identifiers.Validate(bundle.ID); err != nil {
return nil, errors.Wrapf(err, "invalid task id")

This comment has been minimized.

Copy link
@ehazlett

ehazlett Jul 6, 2018

Member

nit: either Wrap or add the ID?

}()
select {
case <-time.After(1 * time.Second):
terminate(s.shimPid)

This comment has been minimized.

Copy link
@ehazlett

ehazlett Jul 6, 2018

Member

Should we return or log this error?


func (s *shim) Exec(ctx context.Context, id string, opts runtime.ExecOpts) (runtime.Process, error) {
if err := identifiers.Validate(id); err != nil {
return nil, errors.Wrapf(err, "invalid exec id")

This comment has been minimized.

Copy link
@ehazlett

ehazlett Jul 6, 2018

Member

include the ID?

}
return runtime.State{}, errdefs.ErrNotFound
}
var status runtime.Status

This comment has been minimized.

Copy link
@ehazlett

ehazlett Jul 6, 2018

Member

Do we need to check the StatusUnknown task type or have a runtime.UnknownStatus?

@crosbymichael crosbymichael force-pushed the crosbymichael:shimv2 branch 2 times, most recently from 263acdb to 7976ef8 Jul 9, 2018
@crosbymichael crosbymichael changed the title [wip] Runtime v2 (shim API) Runtime v2 (shim API) Jul 9, 2018
@crosbymichael crosbymichael force-pushed the crosbymichael:shimv2 branch 2 times, most recently from aeb0941 to 578358e Jul 9, 2018
@@ -547,32 +585,26 @@ func getTasksMetrics(ctx context.Context, filter filters.Filter, tasks []runtime
case "id":
return t.ID(), true
case "namespace":
return t.Info().Namespace, true
// return t.Info().Namespace, true

This comment has been minimized.

Copy link
@dmcgowan

dmcgowan Jul 9, 2018

Member

The formatting here is off

This comment has been minimized.

Copy link
@dmcgowan

dmcgowan Jul 9, 2018

Member

Should we add a default case and explanation for not using now?

@crosbymichael crosbymichael force-pushed the crosbymichael:shimv2 branch 4 times, most recently from 8deea2c to 9a1e7b9 Jul 11, 2018
@containerd containerd deleted a comment from crosbymichael Jul 11, 2018
@containerd containerd deleted a comment from crosbymichael Jul 11, 2018
@crosbymichael crosbymichael force-pushed the crosbymichael:shimv2 branch from 9a1e7b9 to cf332e0 Jul 12, 2018
if err != nil {
return nil, errors.Wrapf(err, "%s", out)
}
address := string(out)

This comment has been minimized.

Copy link
@dmcgowan

dmcgowan Jul 13, 2018

Member

Should this be trimmed? Also wondering if we should just take the stdout rather than combined output.

You previously expressed that this extra connect could cause some additional time, although mostly negligible. Could we still pass in the socket and if the returned address doesn't match just remove it and create a new connection, or would that end up being even longer?

@crosbymichael crosbymichael force-pushed the crosbymichael:shimv2 branch 3 times, most recently from bb10512 to 20cf7ef Jul 16, 2018
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
@crosbymichael crosbymichael force-pushed the crosbymichael:shimv2 branch from 20cf7ef to 7e49c60 Jul 17, 2018
Copy link
Member

dmcgowan left a comment

LGTM

option go_package = "github.com/containerd/containerd/runtime/v2/runc/options;options";

message Options {
bool no_pivot_root = 1;

This comment has been minimized.

Copy link
@stevvooe

stevvooe Jul 17, 2018

Member

I know this isn't popular, but let's document the intent of these fields.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
// each container and allows reattaching to the IO and receiving the exit status
// for the container processes.
service Task {
rpc State(StateRequest) returns (StateResponse);

This comment has been minimized.

Copy link
@stevvooe

stevvooe Jul 17, 2018

Member

It would be good to document what each method does.

@stevvooe
Copy link
Member

stevvooe commented Jul 17, 2018

I don't see anything wildly controversial here. It would be good to document the methods and fields used in GRPC.

@crosbymichael
Copy link
Member Author

crosbymichael commented Jul 17, 2018

@stevvooe I was going to do documentation in a separate PR. I'll take care of proto comments there as well. I wanted to make a document for shim authors, "Building a shim 101"

@Random-Liu
Copy link
Member

Random-Liu commented Jul 18, 2018

@crosbymichael Will take a look at this PR tomorrow. Sorry for the delay.

This readme specifies shim api and authoring docs.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
@crosbymichael
Copy link
Member Author

crosbymichael commented Jul 18, 2018

Update the description and added a readme for shim authors.

Runtime v2

Runtime v2 introduces a first class shim API for runtime authors to integrate with containerd.
The shim API is minimal and scoped to the execution lifecycle of a container.

Binary Naming

Users specify the runtime they wish to use when creating a container.
The runtime can also be changed via a container update.

> ctr run --runtime io.containerd.runc.v1

When a user specifies a runtime name, io.containerd.runc.v1, they will specify the name and version of the runtime.
This will be trasnlated by containerd into a binary name for the shim.

io.containerd.runc.v1 -> containerd-shim-runc-v1

containerd keeps the containerd-shim-* prefix so that users can ps aux | grep containerd-shim to see running shims on their system.

Shim Authoring

This section is dedicated to runtime authors wishing to build a shim.
It will detail how the API works and different considerations when building shim.

Commands

Container information is provided to a shim in two ways.
The OCI Runtime Bundle and on the Create rpc request.

start

Each shim MUST implement a start subcommand.
This command will launch new shims.
The start command MUST accept the following flags:

  • -namespace the namespace for the container
  • -id the id of the container
  • -address the address of the containerd's main socket
  • -publish-binary the binary path to publish events back to containerd

The start command, as well as all binary calls to the shim, has the bundle for the container set as the cwd.

The start command MUST return an address to a shim for containerd to issue API requests for container operations.

The start command can either start a new shim or return an address to an existing shim based on the shim's logic.

delete

Each shim MUST implement a delete subcommand.
This command allows containerd to delete any container resources created, mounted, and/or run by a shim when containerd can no longer communicate over rpc.
This happens if a shim is SIGKILL'd with a running container.
These resources will need to be cleaned up when containerd looses the connection to a shim.
This is also used when containerd boots and reconnects to shims.
If a bundle is still on disk but containerd cannot connect to a shim, the delete command is invoked.

The delete command MUST accept the following flags:

  • -namespace the namespace for the container
  • -id the id of the container
  • -address the address of the containerd's main socket
  • -publish-binary the binary path to publish events back to containerd

The delete command will be executed in the container's bundle as its cwd.

Host Level Shim Configuration

containerd does not provide any host level configuration for shims via the API.
If a shim needs configuration from the user with host level information across all instances, a shim specific configuration file can be setup.

Container Level Shim Configuration

On the create request, there is a generic *protobuf.Any that allows a user to specify container level configuration for the shim.

message CreateTaskRequest {
	string id = 1;
	...
	google.protobuf.Any options = 10;
}

A shim author can create their own protobuf message for configuration and clients can import and provide this information is needed.

I/O

I/O for a container is provided by the client to the shim via fifo on Linux, named pipes on Windows, or log files on disk.
The paths to these files are provided on the Create rpc for the initial creation and on the Exec rpc for additional processes.

message CreateTaskRequest {
	string id = 1;
	bool terminal = 4;
	string stdin = 5;
	string stdout = 6;
	string stderr = 7;
}
message ExecProcessRequest {
	string id = 1;
	string exec_id = 2;
	bool terminal = 3;
	string stdin = 4;
	string stdout = 5;
	string stderr = 6;
}

Containers that are to be launched with an interactive terminal will have the terminal field set to true, data is still copied over the files(fifos,pipes) in the same way as non interactive containers.

Root Filesystems

The root filesytems for the containers is provided by on the Create rpc.
Shims are responsible for managing the lifecycle of the filesystem mount during the lifecycle of a container.

message CreateTaskRequest {
	string id = 1;
	string bundle = 2;
	repeated containerd.types.Mount rootfs = 3;
	...
}

The mount protobuf message is:

message Mount {
	// Type defines the nature of the mount.
	string type = 1;
	// Source specifies the name of the mount. Depending on mount type, this
	// may be a volume name or a host path, or even ignored.
	string source = 2;
	// Target path in container
	string target = 3;
	// Options specifies zero or more fstab style mount options.
	repeated string options = 4;
}

Shims are responsible for mounting the filesystem into the rootfs/ directory of the bundle.
Shims are also responsible for unmounting of the filesystem.
During a delete binary call, the shim MUST ensure that filesystem is also unmounted.
Filesystems are provided by the containerd snapshotters.

Other

Unsupported rpcs

If a shim does not or cannot implement an rpc call, it MUST return a github.com/containerd/containerd/errdefs.ErrNotImplemented error.

ttrpc

ttrpc is the only currently supported protocol for shims.
It works with standard protobufs and GRPC services as well as generating clients.
The only difference between grpc and ttrpc is the wire protocol.
ttrpc removes the http stack in order to save memory and binary size to keep shims small.
It is recommended to use ttrpc in your shim but grpc support is also in development.

@Random-Liu
Copy link
Member

Random-Liu commented Jul 18, 2018

The api LGTM overall. One thing is that can we include required events? e.g. TaskExit (required), OOMEvents (optional/required?) etc.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
@crosbymichael crosbymichael force-pushed the crosbymichael:shimv2 branch from f15b41a to d53a96f Jul 18, 2018
@crosbymichael
Copy link
Member Author

crosbymichael commented Jul 18, 2018

@Random-Liu ok, updated the README with events.

@dmcgowan dmcgowan merged commit 94cfce6 into containerd:master Jul 18, 2018
3 checks passed
3 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
deploy/netlify Deploy preview ready!
Details
@thaJeztah
Copy link
Contributor

thaJeztah commented Jul 18, 2018

it's merged! 🎉 🙌👏👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

10 participants
You can’t perform that action at this time.