When you build a concurrent system with Elixir, the real magic happens not only in the code that processes data, but also in the way you organize those processes. Supervisors, child specifications, the Registry, and dynamic supervision together give you a fault‑tolerant architecture where errors are contained, resources are cleaned up, and the whole system can be started or stopped reliably.

In this article we will walk through the essential concepts, illustrate each of them with fresh examples from a fictitious “smart‑home” domain, and explore the trade‑offs you may encounter when choosing restart strategies or temporary workers.

Why Supervision Matters

Imagine you are writing a platform that controls temperature sensors, light switches, and alarm notifications for a set of houses. Each house has its own HouseController process, which stores the latest state and talks to individual device processes. If one temperature sensor crashes, you don’t want the whole platform to go down – you only want that sensor to be restarted, leaving the other devices and the house controller untouched.

Supervisors make that possible. They form a supervision tree, a hierarchy in which each supervisor is responsible for the lifecycle of its direct children. When a child stops unexpectedly, the supervisor decides what to do – restart just that child, restart a group of related children, or let the exit pass without a restart – based on its strategy and the child’s restart value.

Child Specifications and the :type Field

A child spec tells a supervisor how to start a child process. It is a map that requires two keys, :id and :start, and accepts several optional ones (such as :type, :restart and :shutdown). The three you will see most often are:

  • :id – a unique identifier used by the supervisor.
  • :start – a tuple {module, function, args} that the supervisor will invoke to start the child.
  • :type – tells the supervisor whether the child is a :worker (a generic process) or a :supervisor (a nested supervisor).

The :type field is optional; it defaults to :worker. When you write a child-spec map by hand for a nested supervisor, set type: :supervisor so OTP treats that child as a supervisor (among other things, its :shutdown then defaults to :infinity instead of 5000 ms). Modules defined with use Supervisor generate a child_spec/1 that sets the type for you.

Example: A Supervisor That Starts a Sensor Pool


defmodule SmartHome.SensorPool do
  use Supervisor

  @pool_size 4

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  @impl true
  def init(_args) do
    children = for idx <- 1..@pool_size do
      %{
        id: {:sensor_worker, idx},
        start: {SmartHome.SensorWorker, :start_link, [idx]},
        type: :worker,
        restart: :transient
      }
    end

    Supervisor.init(children, strategy: :one_for_one)
  end
end

Here each SensorWorker is a plain worker, so we leave :type as the default :worker. If we wanted to embed another supervisor (for example, a MetricsSupervisor that aggregates sensor stats), we would add type: :supervisor to that child map.
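
A hand-written child spec for that hypothetical MetricsSupervisor might look like this:


%{
  id: :metrics_supervisor,
  start: {SmartHome.MetricsSupervisor, :start_link, []},
  # Tell the parent this child is itself a supervisor; its :shutdown
  # then defaults to :infinity.
  type: :supervisor
}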

Choosing a Restart Strategy

A supervisor’s strategy decides what to do when one of its children terminates. Elixir’s Supervisor module supports three strategies; a fourth, legacy one appears below for completeness:

  1. :one_for_one – Restart only the crashed child. This is the most common strategy and isolates failures best.
  2. :one_for_all – If any child crashes, the supervisor terminates all of its children and then restarts them all. Use it when children are tightly coupled (e.g., they share a socket); see the sketch after this list.
  3. :rest_for_one – When a child crashes, the supervisor terminates that child and every child started after it (its “younger siblings”), then restarts them in order. This works well when later processes depend on earlier ones.
  4. :simple_one_for_one – A legacy strategy for dynamically added, identical workers. It has been superseded by DynamicSupervisor and should not be used in new code.
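
Here is a minimal :one_for_all sketch for the shared-socket case from point 2; Chat.SocketOwner and Chat.ProtocolHandler are hypothetical modules:


defmodule Chat.ConnectionSupervisor do
  use Supervisor

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  @impl true
  def init(_args) do
    children = [
      # Both processes share one TCP socket; if either crashes,
      # restart the pair so they never disagree about the connection.
      %{id: :socket_owner, start: {Chat.SocketOwner, :start_link, []}},
      %{id: :protocol_handler, start: {Chat.ProtocolHandler, :start_link, []}}
    ]

    Supervisor.init(children, strategy: :one_for_all)
  end
end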

Let’s see a concrete scenario where :rest_for_one shines.

Example: A Streaming Service with a Decoder and a Renderer


defmodule Media.StreamSupervisor do
  use Supervisor

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  @impl true
  def init(_args) do
    children = [
      %{id: :decoder, start: {Media.Decoder, :start_link, []}, restart: :transient},
      %{id: :renderer, start: {Media.Renderer, :start_link, []}, restart: :transient}
    ]

    # The renderer needs a working decoder.  If the decoder dies,
    # we also want to stop the renderer and restart both.
    Supervisor.init(children, strategy: :rest_for_one)
  end
end

If the Decoder crashes, the supervisor will first terminate the Renderer (because it was started later) and then restart Decoder followed by Renderer. The Renderer never runs with a dead Decoder.

Temporary vs Transient vs Permanent Workers

Every child spec also contains a :restart option that determines when the supervisor should attempt a restart. The three values are:

  • :permanent (default) – The child is always restarted, regardless of why it exited.
  • :transient – The child is restarted only if it exits abnormally (i.e., the exit reason is not :normal, :shutdown or {:shutdown, term}).
  • :temporary – The child is never restarted, even if it crashes. It’s useful for “fire‑and‑forget” tasks such as handling an incoming TCP connection.

Choosing the right restart mode allows you to keep the supervision tree from thrashing.

Example: A One‑Shot Email Sender


defmodule Notifications.EmailWorker do
  use GenServer, restart: :temporary

  def start_link(email) do
    GenServer.start_link(__MODULE__, email, name: via(email))
  end

  defp via(email), do: {:via, Registry, {Notifications.Registry, {:email_worker, email}}}

  @impl true
  def init(email) do
    send(self(), {:send, email})
    {:ok, email}
  end

  @impl true
  def handle_info({:send, email}, _state) do
    # Imagine we call an external API that might raise.
    EmailAPI.send(email)
    {:stop, :normal, nil}
  end
end

The :temporary restart value means that if EmailAPI.send/1 raises, the process simply dies – we will retry the next time we need to send an email, rather than having a supervisor keep restarting a worker that is doomed to fail again.
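
To actually run one of these workers you start it under a supervisor on demand. The sketch below assumes the Notifications.Registry used by via/1 is running and that the notifications supervisor from the architecture diagram later in the article is a DynamicSupervisor named SmartHome.NotificationsSupervisor:


# Assumed to be running in the application's supervision tree:
#   {Registry, keys: :unique, name: Notifications.Registry}
#   {DynamicSupervisor, strategy: :one_for_one, name: SmartHome.NotificationsSupervisor}
#
# Fire off a one-shot worker.  The {module, arg} tuple expands through the
# child_spec/1 generated by `use GenServer, restart: :temporary`, so the
# child is started as a temporary worker.
DynamicSupervisor.start_child(
  SmartHome.NotificationsSupervisor,
  {Notifications.EmailWorker, "alice@example.com"}
)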

The Registry: Naming Without Pids

In a long-running system you rarely want to hold on to raw PIDs: processes crash and restart, and a stored PID then points at a dead process. The Registry module gives you a key-value store that tracks the PID of each process you register and removes the entry automatically when that process dies. By using {module, identifier} tuples as keys you avoid collisions across unrelated parts of the system.

Creating a Registry for Smart‑Home Devices


# A Registry is started directly in a supervision tree; it defines its own
# child_spec/1, so listing it as a tuple is enough:
children = [
  {Registry, keys: :unique, name: SmartHome.DeviceRegistry}
]

Supervisor.start_link(children, strategy: :one_for_one)

Now any process can register itself like this:


defmodule SmartHome.LightSwitch do
  use GenServer

  def start_link(id) do
    GenServer.start_link(__MODULE__, %{}, name: via(id))
  end

  defp via(id), do: {:via, Registry, {SmartHome.DeviceRegistry, {:light, id}}}

  @impl true
  def init(state), do: {:ok, state}
end

Clients retrieve the PID with Registry.lookup/2, which returns a list of {pid, value} tuples (an empty list if nothing is registered under that key):


[{pid, _value}] = Registry.lookup(SmartHome.DeviceRegistry, {:light, "kitchen"})

Because the lookup happens right before a request, you always get the most recent PID – even after a restart.
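
You can also skip the explicit lookup and address the process through its :via tuple on every call; the Registry then resolves the name at call time. The :toggle message is hypothetical – it assumes SmartHome.LightSwitch implements a matching handle_call/3 clause:


# The name is resolved when the call is made, so a restarted (and
# re-registered) process is picked up transparently.
GenServer.call(
  {:via, Registry, {SmartHome.DeviceRegistry, {:light, "kitchen"}}},
  :toggle
)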

Dynamic Supervision: Starting Workers On‑Demand

When the number of children cannot be known ahead of time (think: a new house joins the system, or a user opens a new chat room), you need a supervisor that can add children dynamically. DynamicSupervisor provides exactly that.

Setting Up a Dynamic Supervisor for House Controllers


defmodule SmartHome.HouseSupervisor do
  use DynamicSupervisor

  def start_link(_opts \\ []) do
    DynamicSupervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  @impl true
  def init(_args) do
    DynamicSupervisor.init(strategy: :one_for_one)
  end

  # Public API --------------------------------------------------------------
  def start_house_controller(house_id) do
    child_spec = %{
      id: {:house_controller, house_id},
      start: {SmartHome.HouseController, :start_link, [house_id]},
      restart: :transient,
      type: :worker
    }

    DynamicSupervisor.start_child(__MODULE__, child_spec)
  end
end

Clients call SmartHome.HouseSupervisor.start_house_controller/1 whenever a new house joins the network. The dynamic supervisor starts a fresh HouseController process, which registers itself in the DeviceRegistry (or another registry you define) so the rest of the system can find it by house id.
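
SmartHome.HouseController is referenced but not shown, so here is a minimal sketch of how it might register under the {:controller, house_id} key that the next example looks up; the state shape is an assumption:


defmodule SmartHome.HouseController do
  use GenServer

  def start_link(house_id) do
    GenServer.start_link(__MODULE__, house_id, name: via(house_id))
  end

  defp via(house_id),
    do: {:via, Registry, {SmartHome.DeviceRegistry, {:controller, house_id}}}

  @impl true
  def init(house_id) do
    # Assumed state shape: the house id plus a map of known device PIDs.
    {:ok, %{house_id: house_id, devices: %{}}}
  end
end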

Finding or Starting a House Controller

Often you want a function that returns the PID of a controller, creating it if it does not exist yet. The pattern looks like this:


defmodule SmartHome.ControllerCache do
  # Returns a PID for the given house, starting the controller if needed.
  def get_controller(house_id) do
    case Registry.lookup(SmartHome.DeviceRegistry, {:controller, house_id}) do
      [{pid, _}] -> pid
      [] ->
        case SmartHome.HouseSupervisor.start_house_controller(house_id) do
          {:ok, pid} -> pid
          {:error, {:already_started, pid}} -> pid
        end
    end
  end
end

The function first attempts a registry lookup. If no entry exists, it asks the dynamic supervisor to start the child. If two processes race past the lookup and both call DynamicSupervisor.start_child/2, the calls are serialized inside the supervisor, and the controller’s unique name registration makes the second attempt return {:error, {:already_started, pid}} – which the code handles, so both callers end up with the same PID.
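
Used from client code, repeated calls for the same house resolve to the same controller process (the house id here is just an illustration):


pid1 = SmartHome.ControllerCache.get_controller("house-42")
pid2 = SmartHome.ControllerCache.get_controller("house-42")
true = pid1 == pid2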

Graceful Shutdown with the :shutdown Option

When you stop a supervisor, it asks each child to shut down gracefully. A child can specify how long it is allowed to take using the :shutdown key in its child spec:

  • Integer (milliseconds) – the supervisor sends the child an exit signal with reason :shutdown, waits up to that many milliseconds, and then kills it with reason :kill if it still has not terminated. The default for workers is 5000.
  • :brutal_kill – the child is killed immediately with reason :kill.
  • :infinity – wait forever; this is the default for children of type :supervisor and is useful for processes that must finish cleaning up resources such as file handles.

Example:


%{
  id: :persistor,
  start: {SmartHome.Persistor, :start_link, []},
  shutdown: 10_000,   # Allow up to 10 seconds to flush data to disk.
  restart: :permanent,
  type: :worker
}
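
Note that the :shutdown timeout only helps if the child traps exits – otherwise the :shutdown exit signal terminates it immediately and terminate/2 is never invoked. Here is a minimal sketch of what SmartHome.Persistor might do; the buffer and flush logic are assumptions:


defmodule SmartHome.Persistor do
  use GenServer

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, %{buffer: []}, name: __MODULE__)
  end

  @impl true
  def init(state) do
    # Trap exits so the supervisor's :shutdown signal triggers terminate/2
    # instead of killing the process outright.
    Process.flag(:trap_exit, true)
    {:ok, state}
  end

  @impl true
  def terminate(_reason, state) do
    # Hypothetical flush; it must complete within the 10-second window
    # granted by the :shutdown value above.
    flush_to_disk(state.buffer)
    :ok
  end

  defp flush_to_disk(_buffer), do: :ok
end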

Putting It All Together: A Mini‑Architecture Diagram

Below is a textual representation of the supervision tree we have built:


SmartHome.SystemSupervisor
│
├─ SmartHome.DeviceRegistry   (Registry – a child like any other, but not itself a supervisor)
│
├─ SmartHome.SensorPool       (Supervisor, 4 SensorWorker children)
│
├─ SmartHome.HouseSupervisor  (DynamicSupervisor)
│   └─ (zero or more HouseController workers, started on demand)
│
└─ SmartHome.NotificationsSupervisor (DynamicSupervisor)
    └─ (zero or more EmailWorker workers, temporary, started on demand)
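
In code, a top-level supervisor wiring these branches together might look like the sketch below. SmartHome.SystemSupervisor is assumed here, and SmartHome.NotificationsSupervisor is assumed to be a DynamicSupervisor defined along the same lines as SmartHome.HouseSupervisor:


defmodule SmartHome.SystemSupervisor do
  use Supervisor

  def start_link(_opts \\ []) do
    Supervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  @impl true
  def init(_args) do
    children = [
      # Start the registry first so the other branches can register names.
      {Registry, keys: :unique, name: SmartHome.DeviceRegistry},
      SmartHome.SensorPool,
      SmartHome.HouseSupervisor,
      # The Notifications.Registry used by EmailWorker would also be
      # started here in a full application.
      SmartHome.NotificationsSupervisor
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end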

Each branch isolates its own failure domain:

  • If a SensorWorker crashes, only that worker is restarted (:one_for_one).
  • If a HouseController crashes, the dynamic supervisor restarts just that controller; the rest of the system stays up.
  • If an EmailWorker raises, it is not restarted (temporary), preventing a tight‑loop of crashes.

Common Pitfalls and How to Avoid Them

  1. Skipping the Registry lookup. Hard-coding a PID after a first lookup will break as soon as the process restarts. Always resolve the PID right before you need it.
  2. Choosing the wrong restart mode. Marking a worker as :permanent when it runs a one-off task leads to endless restarts. Use :temporary for short-lived jobs.
  3. Over-using :one_for_all. This strategy can cause a cascade of restarts when only one worker truly misbehaved. Prefer :one_for_one unless the children are strongly coupled.
  4. Neglecting graceful shutdown. If a worker holds external resources (e.g., a TCP socket) and you don’t set :shutdown, the supervisor may kill it abruptly, leaving the resource in an undefined state.
  5. Relying on spawn_link instead of OTP-compliant processes. Plain spawn_link processes miss out on built-in crash reporting, graceful shutdown, and supervision hooks. Prefer GenServer.start_link/3, Task.start_link/1, and friends.

Best‑Practice Checklist

  • Every long‑running process should be started via a supervisor.
  • Use Registry for name‑based lookups; never store PIDs in static structures.
  • Pick the most restrictive restart strategy that still supports your system’s semantics.
  • Mark “fire‑and‑forget” jobs as :temporary to avoid unnecessary restarts.
  • Define a sensible :shutdown timeout for stateful workers that need to persist data.
  • Prefer DynamicSupervisor for collections of workers whose cardinality is unknown at compile time.

Summary

Supervision trees give you fine‑grained control over how failures ripple through a system. By combining:

  • Explicit :type definitions in child specs,
  • Appropriate restart strategies (:one_for_one, :one_for_all, :rest_for_one),
  • Restart modes (:permanent, :transient, :temporary),
  • The Registry for dynamic name‑to‑PID resolution,
  • And DynamicSupervisor for on‑demand worker creation,

you can construct an Elixir application where errors are isolated, resources are cleaned up, and the entire system can be gracefully started or stopped with a single call.

Remember: let processes crash when they encounter unexpected conditions, and let the supervisor hierarchy take care of recovery. When you need to handle a known error, do it explicitly in the process’s callbacks. With these patterns in place, your concurrent applications will become both resilient and easier to reason about.