Once your application has left the comfort of the REPL and is running on a server, the focus shifts from writing code to understanding the behavior of that code while it is alive. In a highly concurrent, distributed environment the usual “print‑and‑see” approach quickly becomes insufficient. This article walks you through the essential techniques for inspecting, debugging and profiling a production‑grade Elixir system without resorting to fragile step‑by‑step debuggers.
Why Observability Matters
- Fault tolerance is a guarantee, not a hope. Even the most carefully designed OTP trees can crash under unexpected load or data.
- Resource consumption grows unpredictably. Memory leaks, excessive reductions, or CPU‑bound loops may only surface after hours of traffic.
- Distributed nodes complicate the picture. A problem that appears on one node may be caused by a message sent from another.
Being able to answer “what went wrong?” quickly is the difference between a brief outage and a prolonged service degradation.
1. Debugging in a Concurrent World
Classic line‑by‑line debuggers assume a single thread of execution. When hundreds of processes run in parallel, stopping one of them freezes only a tiny slice of the system, leaving the rest blind to the pause. Instead of halting execution, we rely on instrumentation that records state without changing program flow.
1.1 IO.inspect/2 – A Quick‑and‑Dirty Probe
The simplest way to peek at a value is to wrap the expression in IO.inspect/2. Because it returns the original value, you can insert it anywhere without altering the surrounding pipeline.
defmodule Warehouse.Inventory do
  # Computes the total number of items across all categories.
  # Adding an inspect will reveal the intermediate map.
  def total_counts(items) do
    items
    |> Enum.reduce(%{}, fn {category, qty}, acc ->
      Map.update(acc, category, qty, &(&1 + qty))
    end)
    |> IO.inspect(label: "Inventory after aggregation")
  end
end
When total_counts/1 runs, you’ll see a nicely labeled output in the console, helping you verify that the aggregation behaves as expected.
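IO.inspect/2 also accepts the usual Inspect options, which helps when the value is large. A small self-contained sketch (using a throwaway list rather than the inventory data above):

```elixir
# :label tags each output line; :limit truncates long collections so
# the console stays readable. IO.inspect returns its input, so the
# pipeline itself is unchanged.
1..100
|> Enum.to_list()
|> IO.inspect(label: "raw", limit: 5)
|> Enum.map(&(&1 * 2))
|> IO.inspect(label: "doubled", limit: 5)
|> Enum.sum()
```

Other options worth knowing include pretty: true and charlists: :as_lists.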
1.2 IEx.pry/0 – Live Interaction from the Shell
For a more interactive experience, IEx.pry/0 temporarily hands control of the current process over to the IEx shell. While the process is paused, you can examine the local bindings and evaluate arbitrary expressions in their context. Note that pry only takes effect when the code runs under IEx (for example via iex -S mix).
defmodule ChatServer do
  # IEx.pry/0 is a macro, so the module must require IEx first.
  require IEx

  def handle_message(%{user: user, text: text} = msg) do
    # Pause execution whenever a message contains the word "debug"
    if String.contains?(text, "debug"), do: IEx.pry()
    broadcast(user, text)
    {:ok, msg}
  end
end
When a client sends a “debug” message, the server process stops at the pry point, and the IEx session gains access to the local bindings (user, text, msg). You can evaluate arbitrary Elixir code against them to explore the current state, then call respawn() to leave the pry session.
1.3 Automated Tests – Your First Line of Defense
Unit and integration tests surface many bugs before they ever reach production. A well‑structured test suite reduces the need for ad‑hoc debugging in the field.
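As a sketch, here is what a minimal ExUnit test for the total_counts/1 function from section 1.1 might look like (the Inventory module is repeated, minus the inspect call, so the example is self-contained):

```elixir
ExUnit.start()

# The Inventory module from section 1.1, repeated here without the
# IO.inspect probe so this script stands on its own.
defmodule Warehouse.Inventory do
  def total_counts(items) do
    Enum.reduce(items, %{}, fn {category, qty}, acc ->
      Map.update(acc, category, qty, &(&1 + qty))
    end)
  end
end

defmodule Warehouse.InventoryTest do
  use ExUnit.Case, async: true

  test "sums quantities per category" do
    items = [{:books, 3}, {:toys, 2}, {:books, 1}]
    assert Warehouse.Inventory.total_counts(items) == %{books: 4, toys: 2}
  end
end
```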
2. Structured Logging with Logger
In production you should replace informal IO.inspect calls with the robust Logger framework. Logger supports multiple back‑ends, log levels, metadata and runtime configuration changes.
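The runtime-configuration point is easy to demonstrate: the log level can be raised or lowered on a live node without a redeploy, for example from a remote shell during an incident.

```elixir
require Logger

# Lower the threshold while investigating…
Logger.configure(level: :debug)
Logger.debug("now visible")

# …and restore it when you are done.
Logger.configure(level: :info)
```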
2.1 Setting Up a Logger Backend for JSON Logs
Suppose you run a financial transaction service and need to ship logs to an external log aggregation system. You can pull in a JSON logging library such as logger_json (or write your own backend; the LoggerJSONBackend module below stands in for whichever you choose) and configure it in config/runtime.exs:
import Config

config :logger,
  level: :info,
  backends: [{LoggerJSONBackend, :json}]

config :logger, :json,
  metadata: [:request_id, :module],
  format: "$time $metadata $message\n"
Then, throughout your code, emit structured events:
defmodule Payments.Gateway do
  require Logger

  def charge(user_id, amount) do
    # Generate the id once so both log lines share it and can be
    # correlated later. UUID.uuid4/0 comes from the uuid package.
    request_id = UUID.uuid4()

    Logger.info("Attempting charge",
      request_id: request_id,
      user_id: user_id,
      amount: amount
    )

    # …charge logic…

    Logger.info("Charge successful",
      request_id: request_id,
      user_id: user_id,
      amount: amount
    )
  end
end
The resulting JSON lines can be ingested by tools like Elastic, Splunk, or Grafana Loki, making post‑mortem analysis far more efficient.
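One refinement worth knowing: instead of passing request_id to every Logger call, you can attach it once as process metadata with Logger.metadata/1; Logger then includes it in every subsequent message from that process, provided the backend's :metadata list contains :request_id. A minimal sketch:

```elixir
require Logger

# Attach the identifier once per request-handling process. The id is
# derived from :erlang.unique_integer/1 here purely to keep the
# example dependency-free; UUID.uuid4() works just as well.
request_id = "req-#{:erlang.unique_integer([:positive])}"
Logger.metadata(request_id: request_id)

Logger.info("Attempting charge", user_id: 42, amount: 1999)
Logger.info("Charge successful", user_id: 42, amount: 1999)
```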
2.2 Custom Backend Example – Sending Logs to a Remote HTTP Endpoint
Below is a minimal custom backend that forwards each log entry to a remote webhook. This is useful for alerting systems that expect HTTP payloads.
defmodule RemoteLogBackend do
  @behaviour :gen_event

  def init(_args) do
    # :httpc ships with OTP but requires the :inets application
    # (and :ssl for HTTPS endpoints) to be started.
    {:ok, %{url: "https://log-collector.example.com/ingest"}}
  end

  # Match on the group leader to ignore events forwarded from other
  # nodes, as the built-in backends do.
  def handle_event({level, gl, {Logger, msg, ts, meta}}, state) when node(gl) == node() do
    body =
      Jason.encode!(%{
        level: level,
        timestamp: inspect(ts),
        message: IO.chardata_to_string(msg),
        # Metadata values are arbitrary terms; inspect/1 makes them
        # safely JSON-encodable.
        metadata: Map.new(meta, fn {k, v} -> {k, inspect(v)} end)
      })

    # :httpc expects the URL and content type as charlists.
    :httpc.request(:post, {to_charlist(state.url), [], 'application/json', body}, [], [])
    {:ok, state}
  end

  def handle_event(_, state), do: {:ok, state}
  def handle_call(_, state), do: {:ok, :ok, state}
  def handle_info(_, state), do: {:ok, state}
  def code_change(_, state, _), do: {:ok, state}
  def terminate(_, _), do: :ok
end
Register the backend in your configuration:
config :logger, backends: [:console, RemoteLogBackend]
3. Interacting with a Running Node
One of the most powerful features of the BEAM VM is the ability to attach to a live node and execute code against it. Two common patterns are:
- Remote shells (iex --name … --cookie …) for ad‑hoc inspection.
- Running :observer or custom GUI tools that visualize system state.
3.1 Attaching a Remote Shell
Start your production service (e.g., a catalog app) as a release:
_build/prod/rel/catalog/bin/catalog start
Open a hidden IEx node that shares the same cookie and attaches a remote shell to the running service; the --remsh flag makes every expression evaluate on the target node rather than on your local one:
iex --hidden --name monitor@127.0.0.1 --cookie secret_cookie --remsh catalog@127.0.0.1
From this shell you can query process information:
iex(catalog@127.0.0.1)1> :erlang.system_info(:process_count)
iex(catalog@127.0.0.1)2> Process.list() |> Enum.take(5)
iex(catalog@127.0.0.1)3> Process.info(self(), :dictionary)
These calls give you an instant snapshot of the VM’s health without stopping the service.
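You can build small helpers on top of these primitives. The module below (an illustrative sketch; the NodeDoctor name is made up) ranks processes by resident memory, which is usually the first question during an incident:

```elixir
defmodule NodeDoctor do
  # Returns the n processes currently using the most memory, as
  # {pid, bytes, registered_name} tuples (the name is [] when the
  # process is unregistered).
  def top_by_memory(n \\ 5) do
    Process.list()
    |> Enum.map(fn pid ->
      case Process.info(pid, [:memory, :registered_name]) do
        # The process may have exited between list/0 and info/2.
        nil -> nil
        info -> {pid, info[:memory], info[:registered_name]}
      end
    end)
    |> Enum.reject(&is_nil/1)
    |> Enum.sort_by(fn {_pid, memory, _name} -> -memory end)
    |> Enum.take(n)
  end
end

NodeDoctor.top_by_memory() |> IO.inspect(label: "Top 5 by memory")
```

The same shape works for :message_queue_len or :reductions; swap the key in the Process.info/2 call.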
3.2 Using :observer Across Nodes
The built‑in :observer GUI provides a visual overview of process counts, memory usage, and ETS tables. To monitor a remote node, the target must include the :runtime_tools application. Add it to mix.exs:
defmodule Catalog.MixProject do
  # …project/0 and deps/0 omitted…

  def application do
    [
      extra_applications: [:logger, :runtime_tools]
    ]
  end
end
After recompiling the release, start the observer on the monitor node:
iex --hidden --name observer@127.0.0.1 --cookie secret_cookie -S mix
iex(observer@127.0.0.1)1> :observer.start()
In the observer window, select Nodes → catalog@127.0.0.1. You’ll now see live charts of memory, CPU, and process mailboxes, all without installing any extra tooling on the production host.
3.3 Web‑Based Observability – Wobserver
When a graphical environment isn’t available (e.g., on a headless VM), a web‑based observer like Wobserver can be added to your release. It runs a small Phoenix endpoint and mirrors the functionality of :observer over HTTP.
# In mix.exs
defp deps do
  [
    {:wobserver, "~> 0.2"},
    # …other deps…
  ]
end

# Wobserver runs as its own OTP application and starts automatically
# once it is part of the release, so your application callback needs
# no extra call (and must still return the supervisor's {:ok, pid}):
def start(_type, _args) do
  children = [...]
  opts = [strategy: :one_for_one, name: MyApp.Supervisor]
  Supervisor.start_link(children, opts)
end
Now you can point a browser at http://host:4001/ (Wobserver's default port, configurable in its application environment) and explore the same metrics you’d get from the desktop observer.
4. Tracing Execution Flow
Tracing gives you visibility into the exact sequence of function calls, message passes, and state changes. BEAM provides two major tracing facilities:
- :sys.trace/2 – Low‑overhead per‑process tracing.
- :dbg / :erlang.trace/3 – System‑wide tracing with pattern matching.
4.1 Tracing a Single GenServer with :sys.trace/2
Imagine a Cache.Server GenServer that stores user profiles. To watch every call it receives, you can enable tracing from a remote shell:
iex(monitor@127.0.0.1)1> pid = GenServer.whereis(:user_cache)
iex(monitor@127.0.0.1)2> :sys.trace(pid, true)
Now each incoming handle_call/3 and outgoing reply prints to the shell:
*DBG* :user_cache got call {:fetch, 42} from #PID<0.123.0>
*DBG* :user_cache sent {:ok, %User{id: 42, name: "Alice"}} to #PID<0.123.0>
When you’re done, turn tracing off with :sys.trace(pid, false). Remember that tracing adds I/O overhead; use it sparingly on a production node.
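The :sys module offers more than tracing: :sys.get_state/1 snapshots the internal state of any OTP-compliant process, and :sys.replace_state/2 patches it in place. A self-contained sketch using a throwaway Agent:

```elixir
# Any OTP-compliant process works; an Agent is the smallest example.
{:ok, pid} = Agent.start_link(fn -> %{hits: 0} end)

# Read the state without sending an application-level message.
IO.inspect(:sys.get_state(pid), label: "before")

# Patch the state directly. Occasionally handy mid-incident,
# dangerous as a habit, since it bypasses the process's own API.
:sys.replace_state(pid, fn state -> %{state | hits: state.hits + 1} end)
IO.inspect(:sys.get_state(pid), label: "after")
```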
4.2 System‑Wide Tracing with :dbg
For more comprehensive analysis—e.g., tracking all calls to the Catalog.Product module across many processes—use the :dbg library. First, start a dedicated tracer node:
iex --name tracer@127.0.0.1 --cookie secret_cookie --hidden
Then configure the tracer:
iex(tracer@127.0.0.1)1> :dbg.tracer()
iex(tracer@127.0.0.1)2> :dbg.n(:"catalog@127.0.0.1")
iex(tracer@127.0.0.1)3> :dbg.p(:all, [:call])
iex(tracer@127.0.0.1)4> :dbg.tp(Catalog.Product, [{:_, [], [{:return_trace}]}])
Explanation of the commands:
- :dbg.tracer() – Starts the tracing engine on the current node.
- :dbg.n/1 – Adds the target node (the running service) to the set of traced nodes.
- :dbg.p(:all, [:call]) – Requests tracing of :call events for all processes.
- :dbg.tp/2 – Sets a trace pattern for every function in Catalog.Product; the {:return_trace} action in the match spec makes return values show up in the output as well.
When a client requests a product price, the tracer node prints lines like:
(<1234.567.0>) call 'Elixir.Catalog.Product':price(%Product{id: 99, name: "Laptop", price: 1299})
(<1234.567.0>) returned from 'Elixir.Catalog.Product':price/1 -> 1299
When you have collected enough data, stop tracing with:
iex(tracer@127.0.0.1)5> :dbg.stop_clear()
4.3 The Recon Library – A Handy Toolbox
Recon aggregates many common tracing, inspection and statistics functions into a single dependency. For example, :recon.proc_count/2 ranks processes by an attribute such as reductions, memory or message-queue length, and :recon_trace.calls/2 traces calls with a built-in rate limiter so a careless trace pattern cannot flood a production node.
defmodule Diagnostics do
  @moduledoc false

  # Ranks the remote node's processes by reduction count, a rough
  # proxy for CPU usage, via Recon's :recon.proc_count/2.
  def busiest(node) do
    :rpc.call(node, :recon, :proc_count, [:reductions, 10])
    |> Enum.map(fn {pid, reductions, info} ->
      "#{inspect(pid)} → #{reductions} reductions (#{inspect(info)})"
    end)
  end
end
Running Diagnostics.busiest(:"catalog@127.0.0.1") from a remote shell gives you a quick view of CPU hotspots without any manual tracing setup.
5. Benchmarking and Profiling
Understanding performance involves measuring execution time and identifying hot paths. Elixir ships with a few simple tools, while the Erlang ecosystem supplies dedicated profilers.
5.1 Quick Timing with :timer.tc/1
Wrap a function call with :timer.tc/1 to receive the elapsed microseconds along with the function’s return value.
{time_us, _result} = :timer.tc(fn ->
  Catalog.search("smartphone")
end)

IO.puts("Search took #{time_us / 1_000} ms")
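A single :timer.tc sample is noisy; averaging a few runs gives a steadier number. A small illustrative helper (the Timing module is made up for this sketch, not part of the standard library):

```elixir
defmodule Timing do
  # Runs `fun` `n` times and returns the average elapsed time in
  # milliseconds. Not a replacement for a real benchmark tool, but
  # good enough for a first impression in a shell.
  def avg_ms(fun, n \\ 10) do
    total_us =
      Enum.reduce(1..n, 0, fn _, acc ->
        {us, _result} = :timer.tc(fun)
        acc + us
      end)

    total_us / n / 1_000
  end
end

Timing.avg_ms(fn -> Enum.sum(1..100_000) end) |> IO.inspect(label: "avg ms")
```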
5.2 Structured Benchmarks with Benchee
For more systematic micro‑benchmarks, add Benchee to your dependencies in mix.exs (e.g. {:benchee, "~> 1.0", only: :dev}) and define the suite in a plain script file, since Benchee is driven by a direct call to Benchee.run/2 rather than a behaviour:
# bench/search_bench.exs
Benchee.run(%{
  "search_by_name" => fn -> Catalog.search("camera") end,
  "search_by_category" => fn -> Catalog.search_by(:electronics) end
})
Running mix run bench/search_bench.exs prints a nicely formatted table with average runtimes and standard deviations, plus memory usage when the memory_time option is enabled.
5.3 Profiling with cprof, eprof, fprof
The Erlang VM ships three built‑in profilers, each with a different focus:
- cprof – Counts the number of function calls.
- eprof – Measures execution time per function.
- fprof – Provides a detailed call‑graph with time and memory breakdown.
Invoke them through their Mix tasks, passing the expression to profile with -e:
mix profile.cprof -e "Catalog.search(\"camera\")"
mix profile.eprof -e "Catalog.search(\"camera\")"
mix profile.fprof -e "Catalog.search(\"camera\")"
Each task runs the expression and prints its report straight to the terminal; run mix help profile.eprof (and friends) to see the available sorting and filtering switches.
6. Common Pitfalls to Avoid
- Leaving IO.inspect in production. It writes to standard output, bypasses log rotation, and can reveal sensitive data.
- Running tracing continuously. Traces generate a lot of output and can degrade throughput. Enable them only for the duration of an investigation.
- Neglecting OTP supervision trees. Directly killing a process without a supervisor can lead to orphaned workers.
- Hard‑coding node names and cookies. Store them in environment variables or configuration files; otherwise deployments become fragile.
- Not including :runtime_tools in releases. Without it you won’t be able to start :observer or remote tracing on the production node.
7. Summary
- Debugging concurrent Elixir systems relies on instrumentation (IO.inspect, IEx.pry) and exhaustive test suites, not step‑by‑step breakpoints.
- Replace ad‑hoc prints with structured Logger calls, possibly routing logs to JSON back‑ends or remote HTTP collectors.
- Remote shells and :observer (or web‑based alternatives like Wobserver) give you live access to VM metrics, process info, and memory statistics.
- Tracing (:sys.trace/2, :dbg) lets you watch function calls and message flow, but must be used judiciously to avoid performance impact.
- Benchmarking with :timer.tc or Benchee and profiling with cprof/eprof/fprof are essential for spotting bottlenecks.
- Avoid common mistakes such as leaving debug prints, over‑tracing, and forgetting required OTP applications in releases.
Armed with these tools and practices, you’ll be able to keep a finger on the pulse of any Elixir system, diagnose failures swiftly, and maintain performance even as your application scales across many nodes.