proposal: add tracing support #3057

Open · fuweid opened this issue Mar 1, 2019 · 17 comments

@fuweid (Member) commented Mar 1, 2019

Proposal

containerd is made up of several component subsystems, such as content, snapshotter, containers, and tasks. The containerd package provides a smart client with powerful functionality, acting as something like a backend-for-frontend. For instance, Client.Pull looks like a simple function call, but it already combines several calls to different components.

If we add tracing to both the containerd smart client and the daemon, it will be easy to see the whole request call chain. When we run into issues, the tracing data can help us locate the problem quickly.
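
For example, a rough sketch (not part of the design yet; the wrapper function and span name are made up) of what client-side tracing around Client.Pull could look like with an OpenCensus-style API:

```go
import (
	"context"

	"github.com/containerd/containerd"
	"go.opencensus.io/trace"
)

// pullWithTracing (hypothetical) wraps Client.Pull in a span so the whole
// call chain behind it (resolve, fetch content, unpack) shows up under one trace.
func pullWithTracing(ctx context.Context, client *containerd.Client, ref string) (containerd.Image, error) {
	ctx, span := trace.StartSpan(ctx, "containerd.Client.Pull")
	defer span.End()

	img, err := client.Pull(ctx, ref, containerd.WithPullUnpack)
	if err != nil {
		// Record the failure on the span before it is exported.
		span.SetStatus(trace.Status{Code: trace.StatusCodeUnknown, Message: err.Error()})
	}
	return img, err
}
```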

Document

To make it easier to discuss, I have put together a Google doc for this proposal.

https://docs.google.com/document/d/12l0dEa9W8OSZdUYUNZJZ7LengDlJvO5J2ZLategEdl4/edit?usp=sharing

Comments are welcome.

@cpuguy83 (Contributor) commented Mar 1, 2019

I agree tracing would be helpful.
I am concerned about integrating with jaeger on the client. How data is exported should be left up to the consumer of the client.

It would be tremendously helpful to also propagate tracing over gRPC to the daemon.
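
Roughly, propagation to the daemon could look like this if OpenCensus were used; the ocgrpc stats handlers inject and extract the span context on each RPC (the helper below is hypothetical, and the daemon would need the matching server handler):

```go
import (
	"github.com/containerd/containerd"
	"go.opencensus.io/plugin/ocgrpc"
	"google.golang.org/grpc"
)

// newTracedClient (hypothetical) dials containerd with a gRPC stats handler
// that injects the current span context into every outgoing request. The
// daemon would need &ocgrpc.ServerHandler{} on its gRPC server side.
// Note: supplying dial options replaces containerd's defaults, so the usual
// unix-socket dialer/insecure options may need to be added back as needed.
func newTracedClient(address string) (*containerd.Client, error) {
	return containerd.New(address, containerd.WithDialOpts([]grpc.DialOption{
		grpc.WithInsecure(),
		grpc.WithStatsHandler(&ocgrpc.ClientHandler{}),
	}))
}
```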

@fuweid (Member, Author) commented Mar 1, 2019

Hi @cpuguy83, thanks for the comment.

I am concerned about integrating with jaeger on the client. How data is exported should be left up to the consumer of the client.

Yes, absolutely, I agree. For the client, it should always be up to the user.
But for the ctr command, to make it easier to use, we could add an option to enable a Jaeger tracer. What do you think?
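
For example, a rough sketch (the flag and helper are made up, and the OpenCensus Jaeger exporter is just one possible backend) of how ctr could wire this up:

```go
import (
	"contrib.go.opencensus.io/exporter/jaeger"
	"github.com/urfave/cli"
	"go.opencensus.io/trace"
)

// Hypothetical global flag for ctr; not part of the current proposal.
var traceFlag = cli.BoolFlag{
	Name:  "trace",
	Usage: "export spans to a local Jaeger agent",
}

// setupTracing registers a Jaeger exporter when --trace is set.
func setupTracing(clicontext *cli.Context) error {
	if !clicontext.GlobalBool("trace") {
		return nil
	}
	exp, err := jaeger.NewExporter(jaeger.Options{
		AgentEndpoint: "localhost:6831",
		Process:       jaeger.Process{ServiceName: "ctr"},
	})
	if err != nil {
		return err
	}
	trace.RegisterExporter(exp)
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
	return nil
}
```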

It would be tremendously helpful to also propagate tracing over gRPC to the daemon.

I was thinking that the containerd components don't call each other, so instrumenting the client seemed good enough. However, it would be good to add it to the daemon as well.

@cpuguy83 (Contributor) commented Mar 1, 2019

What do you think?
💯

I was thinking that the containerd components don't call each other.

Where this becomes useful is that a client can receive tracing data propagated from the daemon.

@jterry75 (Contributor) commented Mar 1, 2019

@kevpar - FYI.

We have been looking into this for distributed tracing as well, because the runtime v2 shim calls happen out of process, and it would be amazing to flow activity IDs across the gRPC boundaries for log correlation.

@Random-Liu (Member) commented Mar 2, 2019

/cc @dashpole, who is working on tracing in Kubernetes and has also experimented with tracing down into containerd.

@dashpole commented Mar 2, 2019

I had an intern last fall who did a PoC of something similar. His containerd code changes are here: master...Monkeyanator:opencensus-tracing.

We ended up using OpenCensus instead of OpenTracing because it means we don't need to pick a tracing backend in open-source components such as containerd. You can then run the "OC Agent" to handle pushing traces to a tracing backend, and ship the components independently.
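
For illustration, registering the agent exporter is roughly this (a sketch assuming the contrib.go.opencensus.io/exporter/ocagent package; the function name is made up):

```go
import (
	"contrib.go.opencensus.io/exporter/ocagent"
	"go.opencensus.io/trace"
)

// registerAgentExporter ships spans to a local OpenCensus agent, which then
// forwards them to whatever backend the operator configures. The agent
// address falls back to the exporter package's built-in default.
func registerAgentExporter() error {
	exp, err := ocagent.NewExporter(
		ocagent.WithInsecure(),
		ocagent.WithServiceName("containerd"),
	)
	if err != nil {
		return err
	}
	trace.RegisterExporter(exp)
	return nil
}
```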

Since containerd uses a stripped-down version of gRPC for its shims, we explicitly added the trace context to the API rather than trying to add context propagation back in. If we actually add it upstream, we should do that properly. Other than that, adding tracing to containerd was pretty simple.
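
A rough sketch of what "explicitly adding the trace context to the API" can look like with OpenCensus binary propagation; the helper names, the request field, and the shim span name below are made up:

```go
import (
	"context"

	"go.opencensus.io/trace"
	"go.opencensus.io/trace/propagation"
)

// Daemon side: serialize the current span context into a byte field carried
// on the shim request (the field itself would have to be added to the API).
func spanContextForRequest(ctx context.Context) []byte {
	if span := trace.FromContext(ctx); span != nil {
		return propagation.Binary(span.SpanContext())
	}
	return nil
}

// Shim side: restore the remote parent so shim spans hang off the daemon's
// trace instead of starting a brand-new one.
func startShimSpan(ctx context.Context, scBytes []byte) (context.Context, *trace.Span) {
	if sc, ok := propagation.FromBinary(scBytes); ok {
		return trace.StartSpanWithRemoteParent(ctx, "shim.Create", sc)
	}
	return trace.StartSpan(ctx, "shim.Create")
}
```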

If you are curious how I think this should work in kubernetes, feel free to check out my proposal: kubernetes/enhancements#650.

[Image: pod_creation_trace]

We used tracing to help us debug a slow runc start issue our autoscaling team sent us, so I definitely think this has value. Here is the trace from the reproduction in just containerd:
[Screenshot: containerd trace of the slow runc start reproduction, 2018-11-20]

@cpuguy83 (Contributor) commented Mar 2, 2019

I've spent some time specifically thinking about how to propagate traces over ttrpc.
The gRPC way (through the metadata mechanism) is not great, since it necessarily requires allocating often-unnecessary objects on every request.

@fuweid (Member, Author) commented Mar 2, 2019

Thanks @dashpole and @cpuguy83 for the input.

I will update the proposal to include the shim part and the projects involved. :)

@crosbymichael (Member) commented Mar 4, 2019

Sounds good to me

@jterry75 (Contributor) commented Mar 5, 2019

It seems the industry is moving toward OpenCensus. @kevpar did some investigation on our side and thought it was the likely candidate for us to move to as well. The one thing we dislike about it is that it seems to only export a span when it ends. That means that if you hit a panic you lose all of the span's data, because there was no entry logging. OpenTracing, however, did show the entry trace down to the point of the panic.
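
To make the concern concrete, a small sketch (the function is made up) of why a panic drops the span in OpenCensus, and one way to still end it before re-panicking:

```go
import (
	"context"
	"fmt"

	"go.opencensus.io/trace"
)

// doWork only exports its span when End is called, so a panic would normally
// lose the span entirely. Recovering here lets us mark the span and end it
// before re-panicking, so at least the partial trace survives.
func doWork(ctx context.Context) {
	ctx, span := trace.StartSpan(ctx, "doWork")
	defer func() {
		if r := recover(); r != nil {
			span.SetStatus(trace.Status{
				Code:    trace.StatusCodeInternal,
				Message: fmt.Sprintf("panic: %v", r),
			})
			span.End()
			panic(r)
		}
		span.End()
	}()

	// ... actual work that might panic ...
	_ = ctx
}
```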

@kevpar (Contributor) commented Mar 5, 2019

OpenCensus does have the momentum. However, the logging support in OpenTracing is better, and the API is much more flexible. OpenCensus is working on a logging API, but they seem to be leaning towards unstructured logging, which is unfortunate.

@cpuguy83 (Contributor) commented Mar 5, 2019

FYI, I created a rough tracing package here: https://github.com/virtual-kubelet/virtual-kubelet/blob/master/trace/trace.go

In VK we use OpenCensus and logrus, with interfaces for tracing and logging. These are primarily for supporting logging through a span, not so much for providing multiple backing implementations of the tracing. Here is how it works:

```go
// StartSpan wraps the logger in ctx so entries go to both the span and logrus.
ctx, span := trace.StartSpan(ctx, "operation") // span name illustrative
defer span.End()
log.G(ctx).WithField("foo", "bar").Debug("whatever")
```

In this case, trace.StartSpan has injected its own logger, which wraps the logger already configured in ctx.
This outputs a structured message both to the span and to logrus (which is what we configure as the logger).


@cpuguy83 (Contributor) commented Mar 5, 2019

All this to say: I don't think we should worry about what OpenCensus does or doesn't do with respect to logging. Use whatever is most convenient and has the most community support.

@cpuguy83 (Contributor) commented Mar 7, 2019

+1, the doc looks good.

OC is very flexible, and it's very nice that you can write exporters, or even just use an agent exporter and avoid pulling a ton of dependencies in.

@jterry75 (Contributor) commented Mar 8, 2019

+1 for the doc. Thanks @fuweid
