A Shallow Dive Into Distributed Tracing
We at Kinvolk were excited to begin working recently with the amazing team over at LightStep on distributed tracing. I must admit that, while I was aware of the OpenTracing project and knew it was probably kind of important, I did not know a whole lot about the topic when we first started chatting about it at KubeCon in Barcelona.
Since then, through our engagement with LightStep, I’ve learned a little bit about the topic and I’d love to share some of those learnings.
Important note: I am a complete novice learning about the topic for the first time. I do not claim to be an expert in this field — hence why I’m calling this a “shallow dive”. I look forward to learning from all the smart people who I hope will point out what I’ve got wrong or oversimplified!
Observability
The first thing I learned diving into this space is that there is a broad concept of “observability”, which is traditionally viewed as a combination of three things (which are often conflated or misunderstood):
- Logs
- Metrics
- Tracing
Logs are pretty well understood - applications have been emitting logs for many years, with varying levels of structure, and tools like fluentd provide means of processing them in large volume.
Metrics - quantified measures of application performance over time - are super useful because they provide easily understandable and immediately actionable data. Projects such as Prometheus have emerged to track metrics at scale.
Tracing is perhaps the most powerful element of observability, and what we are going to dive into with the rest of this blog post.
I should mention there’s an open question about whether you should capture absolutely every data point coming from your system (see: LightStep’s Satellite Architecture), or whether it’s better to sample (e.g. capture one of every 100 data points). That seems to be an ideological debate that I am not sufficiently expert to weigh in on, but it would appear there are good people on both sides. In any case, the popular tracing libraries all support optional sampling so you can decide what is best for your application.
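To make the sampling idea a bit more concrete, here is a minimal sketch in Go of one possible way to keep roughly one trace in every N. It is my own illustration, not how any particular tracing library does it: the function name `sampleOneIn` and the choice of hashing the trace ID are assumptions. Hashing the ID (rather than rolling a random number) means every service that sees the same trace ID would make the same keep/drop decision.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sampleOneIn reports whether the trace with the given ID should be kept
// when keeping roughly one trace in every n. The decision is derived from a
// hash of the ID, so it is deterministic for a given trace ID.
// (Hypothetical helper for illustration; not part of any tracing library.)
func sampleOneIn(traceID string, n uint32) bool {
	if n <= 1 {
		return true // n of 0 or 1 is treated as "keep everything"
	}
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%n == 0
}

func main() {
	const total = 10000
	kept := 0
	for i := 0; i < total; i++ {
		if sampleOneIn(fmt.Sprintf("trace-%d", i), 100) {
			kept++
		}
	}
	fmt.Printf("kept %d of %d traces\n", kept, total)
}
```

The trade-off is the one described above: sampling reduces the volume of data you have to ship and store, at the cost of possibly dropping the one trace you later wish you had.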
Distributed Tracing
With the advent of microservices, a single request can lead to dozens or hundreds of separate API calls over network interfaces. While each of the microservices involved may be writing its own logs, making sense of the end-to-end chain that represents the processing related to a single request requires correlating and sequencing event logs (and potentially related metrics) across all the microservices involved.
How it Works
One of the barriers to understanding distributed tracing (like any specialist field) is that it has its own terminology. A good place to start is the fundamental concept of the span.
A span is a named, timed operation representing a chunk of the workflow. Spans can reference other spans in a hierarchical manner (i.e. parent/child). A root span has no parent, and might be (for example) a web request for a particular page. Child spans might be specific pieces of work that are performed to render that page. These references generally represent causal information, i.e. this span was triggered by work performed in this other span. OpenTelemetry goes even further, allowing more complex link relationships between spans (e.g. multiple parents).
A group of spans that reference each other makes up a trace.
This example might help to visualize these relationships:
In this example (from the OpenCensus documentation), a request is made to the /messages URL, which first triggers a user authorization step, a cache query and then (because there is a cache miss) a database lookup and populating the cache with the results. The auth, cache.Get, mysql.Query and cache.Put spans are all child spans of the /messages span, and all the spans together comprise a single trace.
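For readers who think better in code, the /messages example can be sketched as plain data. This is a simplified model of my own, not the data structure of any real tracer: the `Span` type, its fields and the timings are all made up for illustration. The point is just that each span records its parent, and the spans sharing a trace ID form a tree.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a hypothetical, simplified span: a named, timed operation that
// optionally points at its parent.
type Span struct {
	TraceID  string
	SpanID   int
	ParentID int // 0 means "no parent", i.e. a root span
	Name     string
	Start    time.Duration // offset from the start of the trace
	Duration time.Duration
}

// exampleTrace returns the spans of the /messages example.
// The timings are invented for the sketch.
func exampleTrace() []Span {
	ms := time.Millisecond
	return []Span{
		{TraceID: "t1", SpanID: 1, ParentID: 0, Name: "/messages", Start: 0, Duration: 100 * ms},
		{TraceID: "t1", SpanID: 2, ParentID: 1, Name: "auth", Start: 5 * ms, Duration: 15 * ms},
		{TraceID: "t1", SpanID: 3, ParentID: 1, Name: "cache.Get", Start: 20 * ms, Duration: 5 * ms},
		{TraceID: "t1", SpanID: 4, ParentID: 1, Name: "mysql.Query", Start: 25 * ms, Duration: 60 * ms},
		{TraceID: "t1", SpanID: 5, ParentID: 1, Name: "cache.Put", Start: 85 * ms, Duration: 10 * ms},
	}
}

// children returns the spans in trace whose parent is the given span ID.
func children(trace []Span, parentID int) []Span {
	var out []Span
	for _, s := range trace {
		if s.ParentID == parentID {
			out = append(out, s)
		}
	}
	return out
}

// printTree prints the spans below parentID, indented by depth.
func printTree(trace []Span, parentID, depth int) {
	for _, s := range children(trace, parentID) {
		fmt.Printf("%*s%s (%v)\n", depth*2, "", s.Name, s.Duration)
		printTree(trace, s.SpanID, depth+1)
	}
}

func main() {
	// Start from the "no parent" ID to print the root span and everything below it.
	printTree(exampleTrace(), 0, 0)
}
```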
In a distributed system, a trace typically encompasses more than one microservice. To enable this to work, a span context object is passed along with the regular in-process or RPC function calls, carrying all the information the tracing system needs to associate events with the current span.
One thing that might be obvious, but which I found quite cool, is that a span can be server or client side, enabling a distributed tracing system to present a coherent view from the front-end code running in a browser to the back-end server fulfilling the client request.
Another thing I really liked is that spans have tags (each a key/value pair), which allow for selector operations just like you would be used to with Kubernetes objects.
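As a rough sketch of what a tag selector could look like, here is a small Go example. The `TaggedSpan` type and the `matches` and `selectSpans` helpers are hypothetical, and the tag keys and values are just sample data; a real tracing backend would run this kind of query on its own storage. A span is selected when it carries every key/value pair in the selector, much like an equality-based label selector.

```go
package main

import "fmt"

// TaggedSpan is a hypothetical span reduced to a name and its tags.
type TaggedSpan struct {
	Name string
	Tags map[string]string
}

// matches reports whether the span carries every key/value pair in selector.
func matches(s TaggedSpan, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := s.Tags[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// selectSpans returns the spans that match the selector, in input order.
func selectSpans(spans []TaggedSpan, selector map[string]string) []TaggedSpan {
	var out []TaggedSpan
	for _, s := range spans {
		if matches(s, selector) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Sample data; the tag keys and values are illustrative.
	spans := []TaggedSpan{
		{Name: "/messages", Tags: map[string]string{"http.method": "GET", "error": "false"}},
		{Name: "mysql.Query", Tags: map[string]string{"db.type": "sql", "error": "true"}},
		{Name: "cache.Get", Tags: map[string]string{"error": "false"}},
	}
	for _, s := range selectSpans(spans, map[string]string{"error": "true"}) {
		fmt.Println("failed span:", s.Name)
	}
}
```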
Manual vs Automated Instrumentation
So we’ve established that traces comprise spans, which are created in a context which is shared around a distributed system. But how does the tracing system know when a span is created, and how is the context shared across function calls (local in-process, or remote across the network)?
The basic approach is for the programmer to add a few simple API calls to their code. For example, this would start a root span:
    func xyz() {
        ...
        // Start a root span; it is finished when xyz returns.
        sp := opentracing.StartSpan("operation_name")
        defer sp.Finish()
        ...
    }
And this would create a child span of an existing span:
    func xyz(parentSpan opentracing.Span, ...) {
        ...
        // ChildOf links the new span to its parent via the parent's context.
        sp := opentracing.StartSpan(
            "operation_name",
            opentracing.ChildOf(parentSpan.Context()))
        defer sp.Finish()
        ...
    }
If the span context needs to be serialized on the wire, this can be done like this:
    func makeSomeRequest(ctx context.Context) ... {
        if span := opentracing.SpanFromContext(ctx); span != nil {
            httpClient := &http.Client{}
            httpReq, _ := http.NewRequest("GET", "http://myservice/", nil)
            // Transmit the span's TraceContext as HTTP headers on our
            // outbound request.
            opentracing.GlobalTracer().Inject(
                span.Context(),
                opentracing.HTTPHeaders,
                opentracing.HTTPHeadersCarrier(httpReq.Header))
            resp, err := httpClient.Do(httpReq)
            ...
        }
        ...
    }
And so on.
As you can see, adding tracing in this way is straightforward, but it is not automatic.
The “holy grail” of tracing is that it should be possible to add it to any program without requiring any work on the part of the programmer. In practice, how achievable that goal is depends on the language and libraries that are used. The OpenTracing Registry is a good way to see if a specific library already has helpers for instrumentation. Some commercial solutions offer this kind of capability, and OpenTracing has an implementation of automated tracing for Java with its Special Agent.
OpenTracing? OpenCensus? No, OpenTelemetry!
This brings us onto the topic of implementations of distributed tracing.
Modern distributed tracing can trace (pun intended, sorry) its roots back to a Google white paper on its internal system, known as Dapper (co-created by Ben Sigelman, who went on to found LightStep). This introduced terms such as span and trace context for the first time, and inspired the open source project OpenTracing which defined an API that could be implemented by multiple plug-in “tracers”.
OpenTracing was adopted by the Cloud Native Computing Foundation, the home of Kubernetes and many other related projects, and came to be widely used by projects such as Jaeger (also in the CNCF, but originally developed at Uber) and by commercial solutions such as DataDog and LightStep.
In a parallel effort, Google evolved its internal distributed tracing and metrics solution with a project known as Census, which it open sourced early in 2018 as OpenCensus. Unlike OpenTracing, which defined an API that could have multiple independent implementations, OpenCensus defined both the API and the implementation. It also supported metrics as well as tracing, so it covered more functionality.
There were clearly pros and cons to each approach - OpenTracing enabled a vibrant ecosystem, whereas OpenCensus had a rich solution, proven in Google, that worked out of the box. OpenTracing and OpenCensus cannot be used together on the same system, leading to potential fragmentation of the tracing community.
Fortunately, the community recognized this issue and the teams got together to agree to focus their efforts on a new project. OpenTelemetry would combine the best aspects of OpenTracing and OpenCensus in one definitive standard, backed by all the major players in the industry.
We at Kinvolk are proud to be part of this important initiative, and grateful to LightStep for sponsoring and supporting our work in this area.
The Road Ahead
The immediate focus for our team working on OpenTelemetry is to help the community create a first release that meets the production needs of users _at least as well_ as both OpenTracing and OpenCensus. The community has defined the ambitious goal of achieving this milestone in September 2019, with the OpenTracing and OpenCensus projects being retired by November.
In short, the next few months are crucial for uniting the developer and user communities behind a single vision for the future of distributed tracing and metrics.
Our involvement, working with LightStep and led by our CTO and co-founder Alban Crequy, is to develop various language implementations, starting with Go and Python but eventually expanding to every major language, and to contribute to the special interest groups (SIGs) that are still defining the APIs.
We are also starting to look at how we can add auto-instrumentation into OpenTelemetry, to make it as simple to adopt as possible, and enable the ultimate vision of zero touch, complete observability for all cloud native applications.
This is exciting, technically challenging work that significantly advances the state of the art in open source — exactly the kind of project we at Kinvolk like to get involved in!