When you write Elixir code you quickly discover that the language is built on a runtime called the BEAM. One of the defining strengths of the BEAM is its ability to keep a system running even when parts of it fail. In this article we’ll explore the foundations of fault‑tolerance in Elixir, from the low‑level runtime errors that can crash a process to the high‑level supervision trees that automatically bring those processes back to life.
Why Fault‑Tolerance Matters
Imagine a web‑service that aggregates live sensor data from dozens of IoT devices. The service is expected to be online 24/7, but occasional network glitches, hardware hiccups, or bugs in a data‑parsing routine are inevitable. If any one part of the system crashes and brings the whole application down, users will experience an outage.
Fault‑tolerance means designing the system so that the failure of a single component does not cascade into a full‑scale crash. Instead, the component is isolated, its failure is detected, and it is replaced (or its work is delegated) without human intervention. This “let‑it‑crash” philosophy is a cornerstone of the Erlang/Elixir ecosystem.
Runtime Errors: The Building Blocks of Failure
Before we can manage failures we need to understand how they are represented. In the BEAM there are three distinct categories of runtime errors:
- :error – raised when the VM encounters an exception (e.g., a bad match, an undefined function, an arithmetic error).
- :exit – a deliberate termination of a process, emitted by exit/1 or by the VM when a process finishes normally.
- :throw – a non‑local return value used for flow‑control tricks (generally discouraged).
Each error carries a type (one of the three atoms above) and a value, which may be a struct, a plain term, or a tuple that includes a stack trace.
Illustrating the three error types
Below we use a small “order‑router” module to provoke each error type. The examples are deliberately simple so that the focus stays on the error mechanism itself.
defmodule OrderRouter do
# 1️⃣ Raise a regular exception (type :error)
def divide_price(total, count) do
total / count # division by zero will raise
end
# 2️⃣ Explicitly exit the current process (type :exit)
def abort(reason) do
exit(reason)
end
# 3️⃣ Throw a value that can be caught far up the call stack (type :throw)
def abort_transaction() do
throw({:abort, "transaction failed"})
end
end
Running the functions produces:
iex> OrderRouter.divide_price(100, 0)
** (ArithmeticError) bad argument in arithmetic expression
iex> OrderRouter.abort(:database_unavailable)
** (exit) :database_unavailable
iex> OrderRouter.abort_transaction()
** (throw) {:abort, "transaction failed"}
Handling Errors with try
The try … catch … after special form lets you intercept any of the three error types. The syntax resembles classic try/catch blocks, but with a twist: each catch clause receives both the error type and the value, which you can pattern‑match on. Keep in mind that for the :error type the value is the raw Erlang term (for example :badarith), not an Elixir exception struct, so the helper below normalizes it with Exception.normalize/3 before matching.
defmodule ErrorDemo do
  # Helper that runs a supplied function inside a try/catch block
  def safe_invoke(fun) do
    try do
      fun.()
    catch
      :error, value ->
        # `catch :error, value` receives the raw error term (e.g. :badarith),
        # so normalize it into an exception struct before matching on it
        case Exception.normalize(:error, value, __STACKTRACE__) do
          %ArithmeticError{} = err -> {:error, "Math went wrong: #{err.message}"}
          err -> {:error, Exception.message(err)}
        end

      :exit, reason ->
        {:exit, "Process exited with reason: #{inspect(reason)}"}

      :throw, {:abort, msg} ->
        {:abort, "Aborted: #{msg}"}

      type, value ->
        {:unknown, {type, value}}
    after
      IO.puts("Cleanup: closing DB connection, releasing locks …")
    end
  end
end
Testing the wrapper with the OrderRouter functions yields predictable, nicely formatted results:
iex> ErrorDemo.safe_invoke(fn -> OrderRouter.divide_price(10, 0) end)
Cleanup: closing DB connection, releasing locks …
{:error, "Math went wrong: bad argument in arithmetic expression"}
iex> ErrorDemo.safe_invoke(fn -> OrderRouter.abort(:out_of_memory) end)
Cleanup: closing DB connection, releasing locks …
{:exit, "Process exited with reason: :out_of_memory"}
iex> ErrorDemo.safe_invoke(fn -> OrderRouter.abort_transaction() end)
Cleanup: closing DB connection, releasing locks …
{:abort, "Aborted: transaction failed"}
Notice how the after block runs no matter which branch is taken, making it perfect for releasing resources.
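When you only care about exceptions (the :error kind), the more idiomatic tool is rescue: it normalizes the raw error term into an exception struct for you, so there is no manual Exception.normalize/3 step. A minimal sketch (the module name is just for illustration):

defmodule RescueDemo do
  # rescue only handles exceptions and hands you the exception struct
  # directly, so no manual normalization is needed
  def safe_divide(a, b) do
    {:ok, a / b}
  rescue
    e in ArithmeticError -> {:error, Exception.message(e)}
  end
end

# iex> RescueDemo.safe_divide(10, 0)
# {:error, "bad argument in arithmetic expression"}

Unlike catch :error, a rescue clause will not intercept exits or throws, which keeps the intent of the code obvious.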
Processes as Isolated Fault Domains
In traditional object‑oriented languages, where threads share a single heap, a crash in one component can leave shared state corrupted or take the whole OS process down with it, affecting unrelated parts of the system. In the BEAM, each process runs in its own isolated heap. If a process raises an exception and terminates, the rest of the system carries on untouched.
Let’s see the isolation in action with a “sensor‑collector” scenario.
defmodule SensorCollector do
# Simulated sensor that may raise an error
def start(id) do
spawn(fn ->
Process.flag(:trap_exit, false) # default: do not trap exits
loop(id)
end)
end
defp loop(id) do
# Randomly crash to simulate hardware failure
if :rand.uniform() < 0.2 do
raise "sensor #{id} died unexpectedly"
else
IO.puts("sensor #{id} reports #{:rand.uniform(100)}")
Process.sleep(500)
loop(id)
end
end
end
# Spawn three independent sensors
pid1 = SensorCollector.start(:a)
pid2 = SensorCollector.start(:b)
pid3 = SensorCollector.start(:c)
If any of the sensors crashes, the other two continue to emit readings. The console output demonstrates that a single failure does not bring the entire collector down.
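To see the isolation for yourself, you can check after a few seconds which of the spawned processes are still alive – a small addition to the session above:

# Give the sensors time to run (and for some of them to crash), then check
# which ones survived; the survivors keep reporting regardless.
Process.sleep(3_000)

Enum.each([a: pid1, b: pid2, c: pid3], fn {id, pid} ->
  IO.puts("sensor #{id} alive? #{Process.alive?(pid)}")
end)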
Links: Propagating Crash Notifications
Process isolation is powerful, but often you actually *want* to know when a sibling crashes. That’s where links come in. A link creates a bidirectional monitoring relationship: when one linked process terminates abnormally, the other receives an exit signal.
By default, receiving an exit signal (with a reason other than :normal) will cause the receiving process to terminate as well. This “fail fast” behavior is useful when a group of processes jointly implement a feature – a failure in one part means the whole feature should be restarted.
Creating a link with spawn_link/1
defmodule LinkedWorker do
def start do
spawn_link(fn ->
Process.sleep(200)
raise "boom!"
end)
end
end
parent = self()
IO.puts("Parent PID: #{inspect(parent)}")
linked = LinkedWorker.start()
# The parent and linked process are now connected.
When the linked child raises, the parent receives an exit signal and terminates as well (unless it is trapping exits, which we’ll discuss next).
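One practical note: if you run the snippet above directly in IEx, the shell’s evaluator process is the parent, so the crash takes it down too (IEx restarts it, but you lose your bindings). To observe the propagation without disturbing the shell, wrap the experiment in an intermediate process – a quick sketch:

# The intermediate parent links to a crashing child; the exit signal
# travels over the link and terminates the parent as well.
parent =
  spawn(fn ->
    spawn_link(fn -> raise "boom!" end)
    Process.sleep(:infinity) # only reached if the link did not kill us
  end)

Process.sleep(500)
IO.puts("parent alive? #{Process.alive?(parent)}") # => false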
Trapping Exits: Turning Crashes into Messages
If you want to keep a process alive after a linked child crashes, you can trap exits. When a process sets the flag :trap_exit to true, exit signals are delivered as regular messages in the format {:EXIT, from_pid, reason} rather than causing the process to die.
defmodule SupervisorLite do
def start do
spawn(fn ->
# Enable exit trapping
Process.flag(:trap_exit, true)
# Start a linked child that will fail
child = spawn_link(fn ->
Process.sleep(100)
exit(:unexpected_failure)
end)
# Wait for the exit notification
receive do
{:EXIT, ^child, reason} ->
IO.puts("Child #{inspect(child)} terminated: #{inspect(reason)}")
# Continue doing work
loop()
end
end)
end
defp loop do
IO.puts("SupervisorLite is still alive")
Process.sleep(500)
loop()
end
end
pid = SupervisorLite.start()
Now the supervising process stays alive, logs the failure, and can decide how to react – for example, by restarting the child.
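In fact, a hand‑rolled restart loop is only a small step further. The sketch below is not the real Supervisor behaviour – just an illustration of the idea built on the same trap‑exit mechanics:

defmodule RestartingParent do
  # Spawns a trapping parent that keeps `child_fun` running: whenever the
  # linked child exits abnormally, a fresh one is started in its place.
  def start(child_fun) do
    spawn(fn ->
      Process.flag(:trap_exit, true)
      keep_alive(child_fun)
    end)
  end

  defp keep_alive(child_fun) do
    child = spawn_link(child_fun)

    receive do
      {:EXIT, ^child, :normal} ->
        :ok

      {:EXIT, ^child, reason} ->
        IO.puts("Child terminated (#{inspect(reason)}), restarting …")
        keep_alive(child_fun)
    end
  end
end

# RestartingParent.start(fn -> Process.sleep(1_000); exit(:boom) end)
# restarts the child roughly once a second.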
Monitors: One‑Way Crash Notifications
Links are always two‑way: both parties know about each other’s fate. Sometimes you only need a unidirectional “watcher”. Monitors provide exactly that: a process can monitor another without being linked back.
defmodule MonitorDemo do
def watch do
pid = spawn(fn ->
Process.sleep(150)
exit(:boom)
end)
ref = Process.monitor(pid)
receive do
{:DOWN, ^ref, :process, ^pid, reason} ->
IO.puts("Observed termination: #{inspect(reason)}")
after
500 ->
IO.puts("No termination observed")
end
end
end
MonitorDemo.watch()
The monitoring process receives a {:DOWN, ref, :process, pid, reason} tuple when the target terminates. The monitored process itself is unaffected – it can keep running even if the monitor dies.
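If at some point you no longer care about the watched process, you can remove the monitor; the :flush option also discards a :DOWN message that may already be waiting in the mailbox:

pid = spawn(fn -> Process.sleep(:infinity) end)
ref = Process.monitor(pid)

# … later, once the outcome no longer matters …
Process.demonitor(ref, [:flush]) # no :DOWN message will be delivered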
Supervisors: Automatic Recovery Engines
Links, trapping exits, and monitors give you the primitives to detect failures. To turn these lower‑level tools into something that can be reused across a whole application, Elixir provides the Supervisor behaviour. A supervisor is itself a process that:
- starts child processes (workers) under its control,
- links to those children (so it gets notified of their termination),
- traps exits, turning crash notifications into messages,
- decides—according to a restart strategy—whether to restart a child.
Let’s build a miniature supervision tree for a “payment gateway”. The system consists of two workers:
- Gateway.Server – a GenServer that receives payment requests.
- Gateway.Database – a GenServer that persists transaction logs in memory.
Defining the workers
defmodule Gateway.Server do
use GenServer
# Public API
def start_link(_arg) do
IO.puts("[Gateway.Server] starting")
GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
end
def charge(amount) do
GenServer.call(__MODULE__, {:charge, amount})
end
# Callbacks
@impl true
def init(state) do
{:ok, state}
end
@impl true
def handle_call({:charge, amount}, _from, state) do
# Simulate a random crash
if :rand.uniform() < 0.1 do
raise "simulated payment processor failure"
else
# Forward to DB worker
:ok = Gateway.Database.log({:charge, amount})
{:reply, {:ok, amount}, state}
end
end
end
defmodule Gateway.Database do
use GenServer
def start_link(_arg) do
IO.puts("[Gateway.Database] starting")
GenServer.start_link(__MODULE__, [], name: __MODULE__)
end
def log(entry) do
GenServer.call(__MODULE__, {:log, entry})
end
@impl true
def init(log) do
{:ok, log}
end
@impl true
def handle_call({:log, entry}, _from, log) do
# Persist in memory, could be a real DB call
{:reply, :ok, [entry | log]}
end
end
Building the supervisor
defmodule Gateway.Supervisor do
use Supervisor
# Called by the application start‑up
def start_link(_arg) do
Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
children = [
# Simple child spec map
%{
id: Gateway.Server,
start: {Gateway.Server, :start_link, [:ignore]},
restart: :transient,
shutdown: 5000,
type: :worker
},
%{
id: Gateway.Database,
start: {Gateway.Database, :start_link, [:ignore]},
restart: :permanent,
shutdown: 5000,
type: :worker
}
]
# The chosen strategy here is :one_for_one – only the failing child is restarted.
Supervisor.init(children, strategy: :one_for_one)
end
end
Running the system is as easy as launching the supervisor:
iex> Gateway.Supervisor.start_link(:ignore)
[Gateway.Server] starting
[Gateway.Database] starting
{:ok, #PID<0.164.0>}
iex> Gateway.Server.charge(42)
{:ok, 42}
If the Gateway.Server crashes (e.g., due to the simulated failure), the supervisor receives the exit signal, spawns a fresh Gateway.Server process, and the system continues serving requests.
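Because both workers are registered under their module names, you can watch a restart happen from the shell – kill the server and look it up again (a hypothetical session; your PIDs will differ):

iex> old_pid = Process.whereis(Gateway.Server)
#PID<0.165.0>
iex> Process.exit(old_pid, :kill)
true
iex> Process.whereis(Gateway.Server)
#PID<0.172.0>  # a fresh process, started by the supervisor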
Child Specification – The Blueprint for Workers
A supervisor does not magically know how to start a child. It relies on a child specification, a map that tells the supervisor:
- :id – an arbitrary term used to reference the child.
- :start – a tuple {module, function, args} that the supervisor invokes to start the process. The started process must be linked to the caller, which is exactly what GenServer.start_link/3 does.
- :restart – the restart policy (:permanent, :transient, or :temporary).
- :type – :worker or :supervisor (important for nested supervision trees).
- :shutdown – a timeout in milliseconds or :brutal_kill, used when the supervisor itself is shutting down.
Since Elixir 1.5 you can list a module (or a {module, arg} tuple) directly among the children: the supervisor calls the module’s child_spec/1 function – generated automatically by use GenServer and defaulting to start_link/1 – to build a compliant spec map, and the Supervisor.child_spec/2 helper lets you override individual keys. This reduces boilerplate and helps keep specifications in sync with the implementation.
children = [
{Gateway.Server, []},
{Gateway.Database, []}
]
Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
Elixir expands each tuple into a child spec map behind the scenes, using default values for the omitted keys.
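You can inspect that expansion yourself with Supervisor.child_spec/2, which also lets you override individual keys (output shown roughly as IEx prints it):

iex> Supervisor.child_spec({Gateway.Server, []}, restart: :transient)
%{
  id: Gateway.Server,
  restart: :transient,
  start: {Gateway.Server, :start_link, [[]]}
}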
Restart Strategies – When and How to Bring Back a Child
The :strategy option given to Supervisor.init/2 controls the overall behaviour when a child terminates. The most common strategies are:
- :one_for_one – only the crashing child is restarted. This is the most common choice for simple trees.
- :one_for_all – if any child crashes, *all* children are terminated and then restarted. Useful when children depend on shared state.
- :rest_for_one – children are started in the order given; if a child crashes, that child and all later children are restarted.
- :simple_one_for_one – historically used for many homogeneous workers started on demand (e.g., a pool of connection handlers); it is deprecated in favor of DynamicSupervisor, which we use later in this article.
Choosing the right strategy is a design decision. In the “payment gateway” example, :one_for_one makes sense because the two workers do not depend on each other’s restarts: the server can be brought back on its own without touching the database worker’s in‑memory log.
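If the server instead needed to be restarted whenever the database goes down, :rest_for_one with the database listed first would be a reasonable alternative – a hypothetical variation of the init/1 callback above:

# Hypothetical variation: a crash in Gateway.Database also restarts
# Gateway.Server (which depends on it), but not the other way around.
children = [Gateway.Database, Gateway.Server]
Supervisor.init(children, strategy: :rest_for_one)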
Common Pitfalls and How to Avoid Them
- Relying on PIDs after a restart – Because a restarted child receives a new PID, any cached PID becomes stale. Use registered names (via the name: option) or a lookup function to retrieve the current PID, as shown in the sketch after this list.
- Infinite restart loops – A child that crashes immediately on start will cause the supervisor to restart it endlessly, eventually hitting the maximum restart intensity and terminating the whole tree. Mitigate this by fixing the underlying bug, adding a back‑off strategy, or setting restart: :temporary for children that aren’t critical.
- Blocking the supervisor – Do not perform long‑running work inside a supervisor’s init/1 callback. It should only declare the children and return quickly.
- Misusing throw for flow control – While throw can be caught, it is a “goto‑like” feature. Prefer returning tuples ({:ok, value} or {:error, reason}) or using raise for genuine exceptional situations.
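For the first pitfall, a small lookup helper is often enough. The hypothetical module below resolves the current PID by registered name every time it is needed instead of caching it:

defmodule Gateway.Lookup do
  # Hypothetical helper: never cache the PID of a supervised process –
  # resolve it by its registered name whenever you actually need it.
  def server_pid do
    Process.whereis(Gateway.Server) ||
      raise "Gateway.Server is not running"
  end
end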
Putting It All Together – A Mini‑Application
Below is a complete self‑contained example that assembles the concepts we’ve covered. The scenario simulates a “chat room” service where each room is a GenServer. A supervisor oversees the dynamic creation of rooms and automatically restarts any that crash.
defmodule ChatRoom do
use GenServer
# Public API -------------------------------------------------------
def start_link(name) do
GenServer.start_link(__MODULE__, %{name: name, members: []}, name: via_tuple(name))
end
def join(room_name, user) do
GenServer.cast(via_tuple(room_name), {:join, user})
end
def send_message(room_name, user, text) do
GenServer.cast(via_tuple(room_name), {:msg, user, text})
end
# Registry helper --------------------------------------------------
defp via_tuple(name) do
{:via, Registry, {ChatApp.Registry, name}}
end
# Callbacks --------------------------------------------------------
@impl true
def init(state), do: {:ok, state}
@impl true
def handle_cast({:join, user}, %{members: members}=state) do
IO.puts("[#{state.name}] #{user} joined")
{:noreply, %{state | members: [user | members]}}
end
@impl true
def handle_cast({:msg, user, text}, state) do
# Introduce a random crash to show supervision
if :rand.uniform() < 0.05 do
raise "room #{state.name} hit an unexpected error!"
else
broadcast(state, "[#{user}] #{text}")
{:noreply, state}
end
end
defp broadcast(%{members: members}, msg) do
Enum.each(members, fn member ->
IO.puts("→ #{member} receives: #{msg}")
end)
end
end
defmodule ChatApp.RoomSupervisor do
use DynamicSupervisor
def start_link(_arg) do
DynamicSupervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
DynamicSupervisor.init(strategy: :one_for_one)
end
def start_room(name) do
spec = %{id: name, start: {ChatRoom, :start_link, [name]}, restart: :transient}
DynamicSupervisor.start_child(__MODULE__, spec)
end
end
defmodule ChatApp.Application do
use Application
def start(_type, _args) do
children = [
{Registry, keys: :unique, name: ChatApp.Registry},
{ChatApp.RoomSupervisor, []}
]
opts = [strategy: :one_for_one, name: ChatApp.Supervisor]
Supervisor.start_link(children, opts)
end
end
Run the application:
iex> {:ok, _} = ChatApp.Application.start(:normal, [])
iex> ChatApp.RoomSupervisor.start_room("elixir")
{:ok, #PID<0.221.0>}
iex> ChatRoom.join("elixir", "alice")
[elixir] alice joined
:ok
iex> ChatRoom.send_message("elixir", "alice", "Hello world!")
→ alice receives: [alice] Hello world!
:ok
If the random crash is triggered, the dynamic supervisor detects the termination, starts a fresh ChatRoom process, and future messages are handled again. Because the room is registered via Registry, callers never need to store PIDs—they simply reference the room by name.
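A hypothetical follow‑up session might look like this – note that the restarted room comes back with a fresh (empty) member list, so previous members would have to rejoin:

iex> ChatRoom.send_message("elixir", "alice", "a message that happens to trigger the crash")
:ok
# … the crashed room’s error report is printed here …
iex> ChatRoom.join("elixir", "bob")
[elixir] bob joined
:ok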
Summary
- Fault‑tolerance in Elixir is built on lightweight, isolated processes.
- Runtime errors come in three flavors (:error, :exit, :throw) and can be caught with try … catch … after.
- Links propagate crash notifications; trapping exits converts those signals into ordinary messages.
- Monitors provide a one‑way “watch” mechanism without the bidirectional coupling of links.
- A Supervisor ties everything together: it starts children, links to them, traps exits, and restarts them according to a configurable strategy.
- Child specifications describe how a supervisor should start, restart, and identify each worker.
- Common mistakes include holding on to stale PIDs, creating endless restart loops, and using throw for regular control flow.
By mastering these primitives you gain the ability to design systems that keep running even when parts fail. The “let‑it‑crash” mantra may feel counter‑intuitive at first, but once you see a supervisor automatically resurrect a misbehaving component, the value of this approach becomes crystal clear. Happy fault‑tolerant coding!