Tracing Discord’s Elixir Systems (Without Melting Everything)
At Discord, we want the experience of chatting with your friends, reacting to a message, or posting artisanal farm-to-channel memes to feel instantaneous. We’ve managed to achieve this at scale by leveraging Elixir’s powerful concurrency mechanisms to run each Discord server (which we call a “guild” internally) fully independently of the others.
Sometimes though, things go wrong, and a guild can’t keep up with its user activity. When this happens, the guild will feel laggy or possibly experience a complete outage. If the system degrades beyond the point it can self-heal, an on-call engineer has to intervene. Afterwards, they turn to our observability tools to understand the cause and how to stop it from recurring.
Our on-call engineer’s investigation begins by looking at metrics and logs. We have a wide array of instrumentation, including measurements of how frequently we process each user action type and how long processing takes. These often provide useful hints about bursty activity, like a flurry of hype and reactions on that sweet new game that just got shadow-dropped, but even if we find an inciting event, it’s tricky to gauge what the experience was for users. Think of it like your car’s dashboard: it can tell you what the engine temperature is, but not the consequences of it running hot.
If that doesn’t yield results, the on-call engineer turns to our custom-built tool called “guild timings.” Every time a guild processes an action, it records to an in-memory store how much of the current minute has been spent on each action type. This data is much more detailed than our metrics, but it’s emitted at such a high volume that we can’t feasibly store it all. As such, the data is rotated frequently for all but our largest guilds. Even if we retrieve it in time, it still won’t give us a good picture of the end-to-end experience, as it doesn’t capture downstream effects.
Other teams at Discord have derived enormous value from utilizing distributed tracing (a.k.a. Application Performance Monitoring), which allows them to see how long the constituent parts of an operation took. Adding tracing to our Elixir stack took a bit of work, though. Most tracing tools work by passing information about the operation via metadata layers like HTTP headers, but Elixir’s built-in communication tools don’t have an equivalent layer out of the box.
So… we had to build our own. Despite the fact that we were changing how our services communicate with one another, we managed to integrate it without downtime.
Setting the Stage: Elixir at Discord, and How It Powers Discord
Whenever you do something on Discord, your action turns into a “message” in our Elixir stack, which is then forwarded to connected clients. Elixir programs consist of lightweight processes (scheduled by the runtime, not the OS) that communicate via message passing, making concurrent programming a breeze. This model allows us to trivially distribute our programs across many nodes, too.
Processes are the building blocks of Discord’s architecture. Every guild runs as an Elixir process that uses message passing to fan out actions to all connected “sessions”; each session is itself an Elixir process that forwards actions to clients. Therefore, when we talk about a user action flowing through the Elixir stack, we’re actually talking about the act of messages being passed between processes.
To capture an end-to-end performance story, we needed a solution that allows us to follow a particular message’s path throughout the system, which is precisely what tracing provides us.
A Gentle Introduction to Tracing
When a user sends a Discord message, their client sends it over HTTP to our API service, which records it in the database and sends it to the Elixir stack via gRPC. As the API service processes the message, its tracing library times each step and records the measurements in a way that lets us visualize how each step affected overall execution.
Each region of code timed by a tracing library is called a span. Every time a span starts, the library links it to the currently active one (if any), building a tree of nested spans. This tree forms a trace, which we can use to see a timeline of events during execution.
Below is part of a trace from someone sending a message on Discord. As [code]discord_api[/code] processed the message, it recorded a span called [code]message_common.dispatch_message[/code] that took 1.69 ms. It then sent the message to [code]discord-guilds[/code], which spent 357 μs fanning it out to [code]discord-sessions[/code]. From there, each session recorded its own span while forwarding the message. All these spans linked together form our complete trace!
While the final structure can look complex, spans are quite simple to record in code. If you’ve ever written code like this, you’ve effectively created spans!
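The snippet the post refers to isn’t reproduced in this copy. A minimal hand-rolled version, sketched in Python (the function name and timing details are illustrative, not Discord’s actual code), might look like:

```python
import time

def dispatch_message():
    # Manually time a region of code -- effectively a hand-rolled span.
    start = time.monotonic()
    # ... do the actual dispatch work here ...
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"dispatch_message took {elapsed_ms:.2f} ms")
    return elapsed_ms
```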
A tracing library will handle a lot of the details for you, including managing timers, tracking their results, and even uploading them to your observability provider. Here’s how we’d convert our snippet to use one:
[code]def dispatch_message():
    start_span("dispatch_message")
    # …
    end_span()[/code]
Most tracing libraries track the current span and trace in some kind of global state. As new spans are created, the library checks this state and marks the current span as the new span’s parent. This global state is called context, and it is the critical link that allows us to build traces that span multiple services (no pun intended). When [code]discord_api[/code] interacts with [code]discord-guilds[/code], it passes an encoded version of the context in the gRPC headers, allowing [code]discord-guilds[/code] to link any spans it generates to [code]discord_api[/code]’s trace.
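To make the mechanics concrete, here’s a toy sketch (not the real OpenTelemetry API) of how a context stack links spans to parents, and how `inject`/`extract` carry the context across a service boundary. The `"trace-parent"` header key and the span names are illustrative:

```python
import contextlib

_context_stack = []   # the "context": tracks the currently active span
finished_spans = []

@contextlib.contextmanager
def span(name):
    # Link each new span to the currently active one, building the trace tree.
    parent = _context_stack[-1]["name"] if _context_stack else None
    current = {"name": name, "parent": parent}
    _context_stack.append(current)
    try:
        yield current
    finally:
        _context_stack.pop()
        finished_spans.append(current)

def inject(headers):
    # Sender side: serialize the active context into outgoing metadata
    # (gRPC headers in the real system; the key name is made up here).
    if _context_stack:
        headers["trace-parent"] = _context_stack[-1]["name"]

def extract(headers):
    # Receiver side: resume the caller's trace from incoming metadata.
    if "trace-parent" in headers:
        _context_stack.append({"name": headers["trace-parent"], "parent": None})

# discord_api side:
headers = {}
with span("dispatch_message"):
    inject(headers)

# discord-guilds side (a separate process in reality):
extract(headers)
with span("guild.fanout"):
    pass
```

Any span started on the receiving side ends up parented to the caller’s span, which is exactly what stitches the two services into one trace.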
This technique is extremely powerful. The resulting traces not only show the exact duration of each step, but also how execution flows. For example, the trace above shows that [code]discord-guilds[/code] broadcasts the user’s message asynchronously from [code]discord_api[/code]’s [code]dispatch_message[/code]. This is pretty obvious from the code, but in other cases, traces can help identify unexpected bottlenecks, like code you expected to run in parallel running serially.
Getting Started with Tracing Elixir
We decided to use OpenTelemetry’s Erlang/Elixir library to instrument our stack. Adding spans to our Elixir services was relatively straightforward: wrap key operations in calls to [code]with_span[/code], configure the library to report to our observability provider, and we’re done! This gave us spans within a single service, which was extremely useful on its own, but didn’t yet give us a clear picture of downstream operations.
To gain these end-to-end insights, we had to propagate trace context between services. As mentioned earlier, the API service already passed this context to the guilds service in gRPC headers. Linking with the API service’s spans just required pointing OpenTelemetry at the headers. Easy peasy!
Things were less simple for calls between our Elixir services. Elixir processes can send arbitrary messages to each other, but that arbitrary message is all that gets sent. In other words, there’s nowhere to stick metadata, like our trace context, unless it’s a part of the message itself. Capturing all steps of an operation requires that the trace context be attached to every message, but encoding it manually would be unwieldy and error-prone. We needed to develop a new solution.
Prototyping potential solutions revealed a few requirements:
- Sending and receiving messages that include trace context must be ergonomic and require as little additional code as possible. If developers need to think about packing or unpacking the trace context, it won’t be adopted for future code.
- The solution must work equally well for plain messages and for [code]GenServer[/code] operations (a common abstraction over message passing). We use both at Discord.
- We must be able to integrate the solution into our systems as part of our normal deployment process. Downtime to change the communication mechanism isn’t acceptable, so we need to support toggling it at runtime during rollout.
With these requirements in hand, we built Transport: our internal message passing library. It’s an incredibly simple library, so I want to share as much of its design here as possible; buckle in, we’re gonna read some code!
Adding Metadata to our Messages with Transport
Transport introduces a new primitive to our Elixir stack: the [code]Envelope[/code]. Whenever a service sends a message with Transport, the message is wrapped in an [code]Envelope[/code], and any extra metadata is attached to it. This primitive, plus a few small helper functions, gives us exactly what we need to fulfill our three requirements.
- Transport exposes two functions, [code]wrap_message[/code] and [code]handle_message[/code], which will wrap or unwrap a message in an [code]Envelope[/code] and handle trace context propagation.
- Transport provides drop-in replacements for [code]GenServer[/code]’s [code]call[/code] and [code]cast[/code] functions (which provide request-response and fire-and-forget semantics, respectively) that similarly wrap messages before performing the call or cast.
- An application is free to mix calls to [code]GenServer[/code] functions with calls to Transport functions; [code]handle_message[/code] will handle messages sent by both. This gave us the freedom to add runtime configuration options to control which mechanism each service used, allowing for a gradual, zero-downtime rollout.
Integrating Transport into a service is pretty straightforward. Services add [code]handle_message[/code] into their GenServer’s [code]handle_call[/code]/[code]handle_cast[/code]/[code]handle_info[/code] callbacks to automatically unpack [code]Envelope[/code]s and extract trace context before proceeding with message handling.
Here’s what it looks like in the guilds service:
[code]defmodule DiscordGuilds.Guild do
  alias Discord.Transport
  alias Discord.Transport.Envelope
  require OpenTelemetry.Tracer

  use GenServer

  # …

  def handle_call(message, from, state) do
    Transport.handle_message(message, fn %Envelope{message: message} ->
      span_name = span_name(:call, message)

      OpenTelemetry.Tracer.with_span span_name do
        do_handle_call(message, from, state)
      end
    end)
  end

  # handle_cast/handle_info follow nearly identical implementations

  # …
end[/code]
The [code]Envelope[/code] structure powering this API is shockingly uncomplicated. Here’s the entire definition, omitting a couple of bells and whistles we’ve added in for optimizations (oooooo spooky foreshadowing).
[code]defmodule Discord.Transport.Envelope do
  @enforce_keys [:message]
  defstruct [
    :message,
    trace_carrier: []
  ]

  def wrap_message(message) do
    %__MODULE__{
      message: message,
      # Captures the active context from global environment
      # (the "process dictionary" in Elixir terms)
      trace_carrier: :otel_propagator_text_map.inject()
    }
  end
end[/code]
When a caller passes a message to [code]wrap_message[/code], it’s wrapped in an [code]Envelope[/code], and a serialized version of the active trace context (if any) is added to the [code]trace_carrier[/code] field. This Envelope can be sent to other processes without any additional processing.
Processes that receive an [code]Envelope[/code] need to unpack its trace context so they can continue the trace, and then process the message. However, not all messages a process receives will be wrapped in an [code]Envelope[/code]; messages can still be sent by code that doesn’t use Transport!
To solve this, the [code]handle_message[/code] function accepts any message and normalizes it, wrapping plain messages in metadata-less [code]Envelopes[/code] or passing through messages already wrapped in one. Afterwards, it unpacks the trace context (if any), calls a user-provided handler with the [code]Envelope[/code], and then cleans up the context.
This normalization technique was critical for our rollout. Our Elixir services use plenty of message passing, and we couldn’t change all of them to use this new scheme overnight. Even if we could, we can’t apply new versions to all nodes instantaneously when we deploy services. Therefore, services needed to handle messages from both old and new versions, including ones that didn’t use Transport.
[code]defmodule Discord.Transport do
  alias Discord.Transport.Envelope

  def handle_message(%Envelope{} = envelope, func) do
    # Elixir trick: we do this dance to make sure that all fields are present
    # on the received Envelope if we receive it from another version of
    # software (provided no required fields are added).
    envelope = struct(Envelope, Map.from_struct(envelope))

    # Extract the context
    ctx_token = :otel_propagator_text_map.extract(envelope.trace_carrier)

    result = func.(envelope)

    # Tear down the context when we're done
    OpenTelemetry.Ctx.detach(ctx_token)

    result
  end

  def handle_message(other, func) do
    func.(%Envelope{message: other})
  end

  def call(destination, message, timeout \\ 5000) do
    message = Envelope.wrap_message(message)
    GenServer.call(destination, message, timeout)
  end

  def cast(destination, message) do
    message = Envelope.wrap_message(message)
    GenServer.cast(destination, message)
  end
end[/code]
A small aside: you might notice we don’t provide a drop-in replacement for Elixir’s [code]send[/code] function, despite having ones for [code]GenServer[/code]’s [code]call[/code] and [code]cast[/code]. Several of our services already have their own [code]send[/code] abstractions, so we opted to add calls to [code]Envelope.wrap_message[/code] there instead.
Capturing a span has a non-zero monetary cost and performance overhead. To limit this effect, teams that implement tracing often sample the operations they capture, typically by picking a flat percentage of operations (e.g. randomly selecting 1% of operations).
The simplest strategy, and the one we employ, is called head sampling. With this approach, we decide whether or not an operation will be traced when the first span (the “root” span) starts; if it’s sampled, all child spans will be sampled, and vice versa. This sampling decision is stored in each span’s entry in the trace context.
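The key property of head sampling is that the decision is made once and then inherited. A toy Python sketch (the rate and function names are illustrative; real tracers carry this flag in the trace context):

```python
import random

SAMPLE_RATE = 0.01  # trace ~1% of operations (rate is illustrative)

def start_root_span(rng=random.random):
    # Head sampling: the sampling decision is made exactly once,
    # when the root span starts.
    return {"sampled": rng() < SAMPLE_RATE}

def start_child_span(parent):
    # Children never re-roll the dice; they inherit the decision
    # carried in the trace context.
    return {"sampled": parent["sampled"]}
```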
Head sampling works great for a lot of request/response-based systems, but Discord’s fanout model means we can quickly explode the number of spans. Let’s imagine we sample someone sending a message to a guild with a million online users. The API service would capture the message creation, then the guilds service would capture the fanout, and then all one million sessions would capture forwarding the message to the client. To keep the volume manageable, we had to find a way to limit the number of spans while still capturing useful performance data.
Our first approach was to apply an additional layer of random sampling in the sessions service. If a session wasn’t selected by this second layer, no spans would be captured for it, even if it was part of a sampled trace (specifically, we would set the “sampled” flag in the trace context to false). This didn’t lose us much data, since the act of forwarding a message to clients should be basically the same across all sessions in a given fanout operation. Unfortunately, this flat sampling rate was a little aggressive, and traces from smaller fanouts often lacked spans from sessions. To fix this, we decided to adjust the rate based on fanout size, which meant we’d have to somehow pass that information to sessions.
This is where our shiny new [code]Envelope[/code] primitive came in handy. We added a new optional field to the definition, called [code]approximate_num_recipients[/code], which the guild sets before sending the message to sessions. Upon receiving the message, the sessions each check a lookup table to determine the rate for the second layer of sampling. Messages with a single recipient keep their sampling status 100% of the time, messages with 100 recipients 10% of the time, and so on, until we reach our final rate of 0.1% for messages with 10,000+ recipients.
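A lookup table like the one described might be sketched as follows. Only a few of the points (1 recipient → 100%, 100 → 10%, 10,000+ → 0.1%) come from the post; the intermediate threshold is an assumption:

```python
# Intermediate thresholds between the points named in the post are assumptions.
FANOUT_KEEP_RATES = [
    (10_000, 0.001),  # 10,000+ recipients: keep 0.1% of sampled sessions
    (1_000, 0.01),
    (100, 0.1),       # 100 recipients: keep 10%
    (0, 1.0),         # a single recipient: always keep
]

def session_keep_rate(approximate_num_recipients):
    # Look up the second-layer sampling rate from the Envelope's
    # approximate_num_recipients field.
    for threshold, rate in FANOUT_KEEP_RATES:
        if approximate_num_recipients >= threshold:
            return rate
    return 1.0
```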
This approach was hugely successful! We were able to get a good picture of how a fanout performed, without submitting zillions of spans to our providers.
Rolling out tracing to our stack gave us an immediate observability win, but instrumenting certain components added meaningful overhead. Tracing does a good job of capturing application performance, but it can’t observe its own overhead: the tracer doesn’t capture spans about its own execution, and even if it did, those spans would generate still more overhead. We needed different tools to diagnose the source of the problem.
After deploying our first set of traces to the guilds service, some of our busiest guilds (we’re talking the ones with millions of members) were unable to keep up with user activity. When this happened, we captured stacktraces of the slow processes every 100 ms or so. This low-cost technique is a great way to get an idea of what a process is doing. In our case, a large share of the stacktraces showed the process unpacking trace context!
Every time we send a message with Transport that contains trace context, OpenTelemetry encodes the context as a string, which the recipient must parse to unpack. Our sampling rate was less than 1%, so more than 99% of unpacking operations just told us not to capture child spans. Could we skip unneeded work? What if we only sent the context when capturing a sampled trace?
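That idea can be sketched in a few lines of Python (the real Transport is Elixir, and `encode_context` stands in for the propagator’s actual string encoding):

```python
import json

def encode_context(ctx):
    # Stand-in for the propagator's string encoding of the trace context.
    return json.dumps(ctx)

def wrap_message(message, ctx):
    # Only pay the encode (and the recipient's decode) cost when the trace
    # is actually sampled; >99% of contexts otherwise just say "don't record".
    if ctx is not None and ctx.get("sampled"):
        return {"message": message, "trace_carrier": encode_context(ctx)}
    return {"message": message, "trace_carrier": None}
```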
Skipping propagation of unsampled trace contexts modified our head sampling semantics slightly, but didn’t lose traces we otherwise would have captured. It created a new problem, though: intermediate operations could no longer tell if they were part of an unsampled trace and might independently opt themselves into sampling, producing incomplete traces. This is a reasonable trade-off; we can still answer end-to-end questions by querying for traces containing spans from all services. After making this change, we didn’t see a recurrence of the guild performance problems, and we continued instrumenting more components.
This came back to bite us when we deployed instrumentation of fanout to the sessions service. Capturing spans while forwarding messages to clients increased CPU usage by 10 percentage points. What gives? In general, capturing a span is quite cheap. Unfortunately, nothing a computer does is free, and even the cheapest operations can be time-sucks at scale. Fanouts can get quite large, so there’s bound to be a decent number of sessions that make the independent decision to start sampled traces.
We spent a long time trying to make span capture cheaper, but eventually we asked ourselves the same question: can we just avoid doing this work? The fix was simple: if a session receives a post-fanout message, we forbid it from capturing root spans. In other words, the session can continue existing traces, but not make new ones. This single change won back nearly all of our performance overhead in the sessions service.
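The rule can be expressed as a simple guard (a Python sketch with hypothetical names; the real check lives in the Elixir sessions service):

```python
def maybe_start_span(name, active_ctx, post_fanout):
    # Sessions may continue an existing trace, but post-fanout work is
    # forbidden from starting a brand-new root trace of its own.
    if active_ctx is None:
        if post_fanout:
            return None  # skip tracing entirely: no new roots here
        return {"name": name, "root": True, "parent": None}
    return {"name": name, "root": False, "parent": active_ctx["name"]}
```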
Our end-to-end picture wouldn’t be complete if we couldn’t link spans between the API service and the Elixir stack. As mentioned before, the API service already sent trace context as part of gRPC headers, so all we had to do was point OpenTelemetry at these headers. Unfortunately, doing so immediately doubled our CPU usage.
By sheer happenstance, one of the team’s engineers had just spent our annual internal Hack Week experimenting with OpenTelemetry’s (at the time, in progress) Erlang/Elixir profiler, which gives us aggregate information about how much CPU time is spent in each function (perhaps a topic for another post!). After enabling it on an affected node, we quickly saw that 75% of our time handling gRPC requests was spent unpacking context. Given our existing head sampling trade-off, could we build a sieve to avoid unpacking unnecessary trace contexts?
Trace contexts are encoded as a string of several hexadecimal sections (the W3C [code]traceparent[/code] format). Importantly, the last byte of this string is a set of flags on the trace, which we can read very quickly. If the least significant bit is 1, we know the trace is sampled, and we can perform the full, more expensive unpacking of the trace context. If it’s zero, we simply won’t pass the context along to the Elixir process. This technique brought our CPU usage back to nominal levels, and we were ready to call this project done!
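The flags fast path amounts to reading the trailing byte. A Python sketch (the header values are the example IDs from the W3C Trace Context spec):

```python
def is_sampled(traceparent):
    # Fast path: parse only the trailing flags byte of the W3C traceparent
    # header; bit 0 is the "sampled" flag. Skips the full, expensive parse.
    return int(traceparent[-2:], 16) & 0x01 == 1

# Example traceparent values (IDs taken from the W3C spec's examples):
sampled = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
not_sampled = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"
```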
Tracing has closed a large observability gap for us. We can now see what parts of the platform become slow, why they are slow, and what the downstream effects are. Integrating it wasn’t as easy as it might have been in a traditional HTTP microservice, but we found a way to do it that could survive Discord’s massive scale.
At the start of this post, we discussed how a guild can occasionally fail to keep up with its user activity. In a recent incident of this failure mode, our new tool helped us quantify the user impact. Traces from members of that guild showed us that it took almost sixteen minutes for their sessions to connect to the affected guild process. This meant that those users saw a message in their client saying that one of the guilds they’re a member of was offline due to an outage; they weren’t even able to click into the guild during that time.
This kind of session connect latency is extremely rare, thankfully, but our investment in tracing has been incredibly helpful for investigating these issues when they crop up. As we expand our instrumentation of the Elixir stack, we can continue to answer questions that we simply couldn’t before, and use the data to deliver better experiences for our users.
If you can imagine a place where talking “trace” is a perk of the work, check out our careers page from time to time! We’re always hiring!