When you write Elixir code you quickly discover that the language is built on a runtime called the BEAM. One of the defining strengths of the BEAM is its ability to keep a system running even when parts of it fail. In this article we’ll explore the foundations of fault‑tolerance in Elixir, from the low‑level runtime errors that can crash a process to the high‑level supervision trees that automatically bring those processes back to life.
Why Fault‑Tolerance Matters
Imagine a web‑service that aggregates live sensor data from dozens of IoT devices. The service is expected to be online 24/7, but occasional network glitches, hardware hiccups, or bugs in a data‑parsing routine are inevitable. If any one part of the system crashes and brings the whole application down, users will experience an outage.
Fault‑tolerance means designing the system so that the failure of a single component does not cascade into a full‑scale crash. Instead, the component is isolated, its failure is detected, and it is replaced (or its work is delegated) without human intervention. This “let‑it‑crash” philosophy is a cornerstone of the Erlang/Elixir ecosystem.
Runtime Errors: The Building Blocks of Failure
Before we can manage failures we need to understand how they are represented. In the BEAM there are three distinct categories of runtime errors:
- :error – raised when the VM encounters an exception (e.g., a bad match, an undefined function, an arithmetic error).
- :exit – a deliberate termination of a process, emitted by exit/1 or by the VM when a process finishes normally.
- :throw – a non‑local return value used for flow‑control tricks (generally discouraged).
Each error carries a type (one of the three atoms above) and a value, which may be a struct, a plain term, or a tuple that includes a stack trace.
Illustrating the three error types
Below we use a small “order‑router” module to provoke each error type. The examples are deliberately simple so that the focus stays on the error mechanism itself.
defmodule OrderRouter do
# 1️⃣ Raise a regular exception (type :error)
def divide_price(total, count) do
total / count # division by zero will raise
end
# 2️⃣ Explicitly exit the current process (type :exit)
def abort(reason) do
exit(reason)
end
# 3️⃣ Throw a value that can be caught far up the call stack (type :throw)
def abort_transaction() do
throw({:abort, "transaction failed"})
end
end
Running the functions produces:
iex> OrderRouter.divide_price(100, 0)
** (ArithmeticError) bad argument in arithmetic expression
iex> OrderRouter.abort(:database_unavailable)
** (exit) :database_unavailable
iex> OrderRouter.abort_transaction()
** (throw) {:abort, "transaction failed"}
Handling Errors with try
The try … catch … after special form lets you intercept any of the three error types. The syntax resembles classic try/catch blocks, but with a twist: each catch clause receives both the error type and the value, which you can pattern‑match on. Keep in mind that for the :error type the value is the raw Erlang term (for example :badarith), not an Elixir exception struct, so the helper below normalizes it with Exception.normalize/3 before matching.
defmodule ErrorDemo do
  # Helper that runs a supplied function inside a try/catch block
  def safe_invoke(fun) do
    try do
      fun.()
    catch
      :error, value ->
        # `catch :error, value` receives the raw error term (e.g. :badarith),
        # so normalize it into an exception struct before matching on it
        case Exception.normalize(:error, value, __STACKTRACE__) do
          %ArithmeticError{} = err -> {:error, "Math went wrong: #{err.message}"}
          err -> {:error, Exception.message(err)}
        end

      :exit, reason ->
        {:exit, "Process exited with reason: #{inspect(reason)}"}

      :throw, {:abort, msg} ->
        {:abort, "Aborted: #{msg}"}

      type, value ->
        {:unknown, {type, value}}
    after
      IO.puts("Cleanup: closing DB connection, releasing locks …")
    end
  end
end
Testing the wrapper with the OrderRouter functions yields predictable, nicely formatted results:
iex> ErrorDemo.safe_invoke(fn -> OrderRouter.divide_price(10, 0) end)
Cleanup: closing DB connection, releasing locks …
{:error, "Math went wrong: bad argument in arithmetic expression"}
iex> ErrorDemo.safe_invoke(fn -> OrderRouter.abort(:out_of_memory) end)
Cleanup: closing DB connection, releasing locks …
{:exit, "Process exited with reason: :out_of_memory"}
iex> ErrorDemo.safe_invoke(fn -> OrderRouter.abort_transaction() end)
Cleanup: closing DB connection, releasing locks …
{:abort, "Aborted: transaction failed"}
Notice how the after block runs no matter which branch is taken, making it perfect for releasing resources.
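When you only care about exceptions (the :error kind), the more idiomatic tool is rescue: it normalizes the raw error term into an exception struct for you, so there is no manual Exception.normalize/3 step. A minimal sketch (the module name is just for illustration):

defmodule RescueDemo do
  # rescue only handles exceptions and hands you the exception struct
  # directly, so no manual normalization is needed
  def safe_divide(a, b) do
    {:ok, a / b}
  rescue
    e in ArithmeticError -> {:error, Exception.message(e)}
  end
end

# iex> RescueDemo.safe_divide(10, 0)
# {:error, "bad argument in arithmetic expression"}

Unlike catch :error, a rescue clause will not intercept exits or throws, which keeps the intent of the code obvious.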
Processes as Isolated Fault Domains
In traditional object‑oriented languages, where threads share a single heap, a crash in one component can leave shared state corrupted or take the whole OS process down with it, affecting unrelated parts of the system. In the BEAM, each process runs in its own isolated heap. If a process raises an exception and terminates, the rest of the system carries on untouched.
Let’s see the isolation in action with a “sensor‑collector” scenario.
defmodule SensorCollector do
# Simulated sensor that may raise an error
def start(id) do
spawn(fn ->
Process.flag(:trap_exit, false) # default: do not trap exits
loop(id)
end)
end
defp loop(id) do
# Randomly crash to simulate hardware failure
if :rand.uniform() < 0.2 do
raise "sensor #{id} died unexpectedly"
else
IO.puts("sensor #{id} reports #{:rand.uniform(100)}")
Process.sleep(500)
loop(id)
end
end
end
# Spawn three independent sensors
pid1 = SensorCollector.start(:a)
pid2 = SensorCollector.start(:b)
pid3 = SensorCollector.start(:c)
If any of the sensors crashes, the other two continue to emit readings. The console output demonstrates that a single failure does not bring the entire collector down.
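To see the isolation for yourself, you can check after a few seconds which of the spawned processes are still alive – a small addition to the session above:

# Give the sensors time to run (and for some of them to crash), then check
# which ones survived; the survivors keep reporting regardless.
Process.sleep(3_000)

Enum.each([a: pid1, b: pid2, c: pid3], fn {id, pid} ->
  IO.puts("sensor #{id} alive? #{Process.alive?(pid)}")
end)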
Links: Propagating Crash Notifications
Process isolation is powerful, but often you actually *want* to know when a sibling crashes. That’s where links come in. A link creates a bidirectional monitoring relationship: when one linked process terminates abnormally, the other receives an exit signal.
By default, receiving an exit signal (with a reason other than :normal) will cause the receiving process to terminate as well. This “fail fast” behavior is useful when a group of processes jointly implement a feature – a failure in one part means the whole feature should be restarted.
Creating a link with spawn_link/1
defmodule LinkedWorker do
def start do
spawn_link(fn ->
Process.sleep(200)
raise "boom!"
end)
end
end
parent = self()
IO.puts("Parent PID: #{inspect(parent)}")
linked = LinkedWorker.start()
# The parent and linked process are now connected.
When the linked child raises, the parent receives an exit signal and terminates as well (unless it is trapping exits, which we’ll discuss next).
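One practical note: if you run the snippet above directly in IEx, the shell’s evaluator process is the parent, so the crash takes it down too (IEx restarts it, but you lose your bindings). To observe the propagation without disturbing the shell, wrap the experiment in an intermediate process – a quick sketch:

# The intermediate parent links to a crashing child; the exit signal
# travels over the link and terminates the parent as well.
parent =
  spawn(fn ->
    spawn_link(fn -> raise "boom!" end)
    Process.sleep(:infinity) # only reached if the link did not kill us
  end)

Process.sleep(500)
IO.puts("parent alive? #{Process.alive?(parent)}") # => false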
Trapping Exits: Turning Crashes into Messages
If you want to keep a process alive after a linked child crashes, you can trap exits. When a process sets the flag :trap_exit to true, exit signals are delivered as regular messages in the format {:EXIT, from_pid, reason} rather than causing the process to die.
defmodule SupervisorLite do
def start do
spawn(fn ->
# Enable exit trapping
Process.flag(:trap_exit, true)
# Start a linked child that will fail
child = spawn_link(fn ->
Process.sleep(100)
exit(:unexpected_failure)
end)
# Wait for the exit notification
receive do
{:EXIT, ^child, reason} ->
IO.puts("Child #{inspect(child)} terminated: #{inspect(reason)}")
# Continue doing work
loop()
end
end)
end
defp loop do
IO.puts("SupervisorLite is still alive")
Process.sleep(500)
loop()
end
end
pid = SupervisorLite.start()
Now the supervising process stays alive, logs the failure, and can decide how to react – for example, by restarting the child.
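In fact, a hand‑rolled restart loop is only a small step further. The sketch below is not the real Supervisor behaviour – just an illustration of the idea built on the same trap‑exit mechanics:

defmodule RestartingParent do
  # Spawns a trapping parent that keeps `child_fun` running: whenever the
  # linked child exits abnormally, a fresh one is started in its place.
  def start(child_fun) do
    spawn(fn ->
      Process.flag(:trap_exit, true)
      keep_alive(child_fun)
    end)
  end

  defp keep_alive(child_fun) do
    child = spawn_link(child_fun)

    receive do
      {:EXIT, ^child, :normal} ->
        :ok

      {:EXIT, ^child, reason} ->
        IO.puts("Child terminated (#{inspect(reason)}), restarting …")
        keep_alive(child_fun)
    end
  end
end

# RestartingParent.start(fn -> Process.sleep(1_000); exit(:boom) end)
# restarts the child roughly once a second.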
Monitors: One‑Way Crash Notifications
Links are always two‑way: both parties know about each other’s fate. Sometimes you only need a unidirectional “watcher”. Monitors provide exactly that: a process can monitor another without being linked back.
defmodule MonitorDemo do
def watch do
pid = spawn(fn ->
Process.sleep(150)
exit(:boom)
end)
ref = Process.monitor(pid)
receive do
{:DOWN, ^ref, :process, ^pid, reason} ->
IO.puts("Observed termination: #{inspect(reason)}")
after
500 ->
IO.puts("No termination observed")
end
end
end
MonitorDemo.watch()
The monitoring process receives a {:DOWN, ref, :process, pid, reason} tuple when the target terminates. The monitored process itself is unaffected – it can keep running even if the monitor dies.
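If at some point you no longer care about the watched process, you can remove the monitor; the :flush option also discards a :DOWN message that may already be waiting in the mailbox:

pid = spawn(fn -> Process.sleep(:infinity) end)
ref = Process.monitor(pid)

# … later, once the outcome no longer matters …
Process.demonitor(ref, [:flush]) # no :DOWN message will be delivered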
Supervisors: Automatic Recovery Engines
Links, trapping exits, and monitors give you the primitives to detect failures. To turn these lower‑level tools into something that can be reused across a whole application, Elixir provides the Supervisor behaviour. A supervisor is itself a process that:
- starts child processes (workers) under its control,
- links to those children (so it gets notified of their termination),
- traps exits, turning crash notifications into messages,
- decides—according to a restart strategy—whether to restart a child.
Let’s build a miniature supervision tree for a “payment gateway”. The system consists of two workers:
- Gateway.Server – a GenServer that receives payment requests.
- Gateway.Database – a GenServer that persists transaction logs in memory.
Defining the workers
defmodule Gateway.Server do
use GenServer
# Public API
def start_link(_arg) do
IO.puts("[Gateway.Server] starting")
GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
end
def charge(amount) do
GenServer.call(__MODULE__, {:charge, amount})
end
# Callbacks
@impl true
def init(state) do
{:ok, state}
end
@impl true
def handle_call({:charge, amount}, _from, state) do
# Simulate a random crash
if :rand.uniform() < 0.1 do
raise "simulated payment processor failure"
else
# Forward to DB worker
:ok = Gateway.Database.log({:charge, amount})
{:reply, {:ok, amount}, state}
end
end
end
defmodule Gateway.Database do
use GenServer
def start_link(_arg) do
IO.puts("[Gateway.Database] starting")
GenServer.start_link(__MODULE__, [], name: __MODULE__)
end
def log(entry) do
GenServer.call(__MODULE__, {:log, entry})
end
@impl true
def init(log) do
{:ok, log}
end
@impl true
def handle_call({:log, entry}, _from, log) do
# Persist in memory, could be a real DB call
{:reply, :ok, [entry | log]}
end
end
Building the supervisor
defmodule Gateway.Supervisor do
use Supervisor
# Called by the application start‑up
def start_link(_arg) do
Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
children = [
# Simple child spec map
%{
id: Gateway.Server,
start: {Gateway.Server, :start_link, [:ignore]},
restart: :transient,
shutdown: 5000,
type: :worker
},
%{
id: Gateway.Database,
start: {Gateway.Database, :start_link, [:ignore]},
restart: :permanent,
shutdown: 5000,
type: :worker
}
]
# The chosen strategy here is :one_for_one – only the failing child is restarted.
Supervisor.init(children, strategy: :one_for_one)
end
end
Running the system is as easy as launching the supervisor:
iex> Gateway.Supervisor.start_link(:ignore)
[Gateway.Server] starting
[Gateway.Database] starting
{:ok, #PID<0.164.0>}
iex> Gateway.Server.charge(42)
{:ok, 42}
If the Gateway.Server crashes (e.g., due to the simulated failure), the supervisor receives the exit signal, spawns a fresh Gateway.Server process, and the system continues serving requests.
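Because both workers are registered under their module names, you can watch a restart happen from the shell – kill the server and look it up again (a hypothetical session; your PIDs will differ):

iex> old_pid = Process.whereis(Gateway.Server)
#PID<0.165.0>
iex> Process.exit(old_pid, :kill)
true
iex> Process.whereis(Gateway.Server)
#PID<0.172.0>  # a fresh process, started by the supervisor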
Child Specification – The Blueprint for Workers
A supervisor does not magically know how to start a child. It relies on a child specification, a map that tells the supervisor:
- :id – an arbitrary term used to reference the child.
- :start – a tuple {module, function, args} that the supervisor invokes to start the process. The started process must be linked to the caller, which is exactly what GenServer.start_link/3 does.
- :restart – the restart policy (:permanent, :transient, or :temporary).
- :type – :worker or :supervisor (important for nested supervision trees).
- :shutdown – a timeout in milliseconds or :brutal_kill, used when the supervisor itself is shutting down.
Since Elixir 1.5 you can list a module (or a {module, arg} tuple) directly among the children: the supervisor calls the module’s child_spec/1 function – generated automatically by use GenServer and defaulting to start_link/1 – to build a compliant spec map, and the Supervisor.child_spec/2 helper lets you override individual keys. This reduces boilerplate and helps keep specifications in sync with the implementation.
children = [
{Gateway.Server, []},
{Gateway.Database, []}
]
Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
Elixir expands each tuple into a child spec map behind the scenes, using default values for the omitted keys.
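You can inspect that expansion yourself with Supervisor.child_spec/2, which also lets you override individual keys (output shown roughly as IEx prints it):

iex> Supervisor.child_spec({Gateway.Server, []}, restart: :transient)
%{
  id: Gateway.Server,
  restart: :transient,
  start: {Gateway.Server, :start_link, [[]]}
}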
Restart Strategies – When and How to Bring Back a Child
The :strategy option given to Supervisor.init/2 controls the overall behaviour when a child terminates. The most common strategies are:
- :one_for_one – only the crashing child is restarted. This is the most common choice for simple trees.
- :one_for_all – if any child crashes, *all* children are terminated and then restarted. Useful when children depend on shared state.
- :rest_for_one – children are started in the order given; if a child crashes, that child and all later children are restarted.
- :simple_one_for_one – historically used for many homogeneous workers started on demand (e.g., a pool of connection handlers); it is deprecated in favor of DynamicSupervisor, which we use later in this article.
Choosing the right strategy is a design decision. In the “payment gateway” example, :one_for_one makes sense because the two workers do not depend on each other’s restarts: the server can be brought back on its own without touching the database worker’s in‑memory log.
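If the server instead needed to be restarted whenever the database goes down, :rest_for_one with the database listed first would be a reasonable alternative – a hypothetical variation of the init/1 callback above:

# Hypothetical variation: a crash in Gateway.Database also restarts
# Gateway.Server (which depends on it), but not the other way around.
children = [Gateway.Database, Gateway.Server]
Supervisor.init(children, strategy: :rest_for_one)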
Common Pitfalls and How to Avoid Them
- Relying on PIDs after a restart – Because a restarted child receives a new PID, any cached PID becomes stale. Use registered names (via the name: option) or a lookup function to retrieve the current PID, as shown in the sketch after this list.
- Infinite restart loops – A child that crashes immediately on start will cause the supervisor to restart it endlessly, eventually hitting the maximum restart intensity and terminating the whole tree. Mitigate this by fixing the underlying bug, adding a back‑off strategy, or setting restart: :temporary for children that aren’t critical.
- Blocking the supervisor – Do not perform long‑running work inside a supervisor’s init/1 callback. It should only declare the children and return quickly.
- Misusing throw for flow control – While throw can be caught, it is a “goto‑like” feature. Prefer returning tuples ({:ok, value} or {:error, reason}) or using raise for genuine exceptional situations.
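For the first pitfall, a small lookup helper is often enough. The hypothetical module below resolves the current PID by registered name every time it is needed instead of caching it:

defmodule Gateway.Lookup do
  # Hypothetical helper: never cache the PID of a supervised process –
  # resolve it by its registered name whenever you actually need it.
  def server_pid do
    Process.whereis(Gateway.Server) ||
      raise "Gateway.Server is not running"
  end
end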
Putting It All Together – A Mini‑Application
Below is a complete self‑contained example that assembles the concepts we’ve covered. The scenario simulates a “chat room” service where each room is a GenServer. A supervisor oversees the dynamic creation of rooms and automatically restarts any that crash.
defmodule ChatRoom do
use GenServer
# Public API -------------------------------------------------------
def start_link(name) do
GenServer.start_link(__MODULE__, %{name: name, members: []}, name: via_tuple(name))
end
def join(room_name, user) do
GenServer.cast(via_tuple(room_name), {:join, user})
end
def send_message(room_name, user, text) do
GenServer.cast(via_tuple(room_name), {:msg, user, text})
end
# Registry helper --------------------------------------------------
defp via_tuple(name) do
{:via, Registry, {ChatApp.Registry, name}}
end
# Callbacks --------------------------------------------------------
@impl true
def init(state), do: {:ok, state}
@impl true
def handle_cast({:join, user}, %{members: members}=state) do
IO.puts("[#{state.name}] #{user} joined")
{:noreply, %{state | members: [user | members]}}
end
@impl true
def handle_cast({:msg, user, text}, state) do
# Introduce a random crash to show supervision
if :rand.uniform() < 0.05 do
raise "room #{state.name} hit an unexpected error!"
else
broadcast(state, "[#{user}] #{text}")
{:noreply, state}
end
end
defp broadcast(%{members: members}, msg) do
Enum.each(members, fn member ->
IO.puts("→ #{member} receives: #{msg}")
end)
end
end
defmodule ChatApp.RoomSupervisor do
use DynamicSupervisor
def start_link(_arg) do
DynamicSupervisor.start_link(__MODULE__, :ok, name: __MODULE__)
end
@impl true
def init(:ok) do
DynamicSupervisor.init(strategy: :one_for_one)
end
def start_room(name) do
spec = %{id: name, start: {ChatRoom, :start_link, [name]}, restart: :transient}
DynamicSupervisor.start_child(__MODULE__, spec)
end
end
defmodule ChatApp.Application do
use Application
def start(_type, _args) do
children = [
{Registry, keys: :unique, name: ChatApp.Registry},
{ChatApp.RoomSupervisor, []}
]
opts = [strategy: :one_for_one, name: ChatApp.Supervisor]
Supervisor.start_link(children, opts)
end
end
Run the application:
iex> {:ok, _} = ChatApp.Application.start(:normal, [])
iex> ChatApp.RoomSupervisor.start_room("elixir")
{:ok, #PID<0.221.0>}
iex> ChatRoom.join("elixir", "alice")
[elixir] alice joined
:ok
iex> ChatRoom.send_message("elixir", "alice", "Hello world!")
→ alice receives: [alice] Hello world!
:ok
If the random crash is triggered, the dynamic supervisor detects the termination, starts a fresh ChatRoom process, and future messages are handled again. Because the room is registered via Registry, callers never need to store PIDs—they simply reference the room by name.
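A hypothetical follow‑up session might look like this – note that the restarted room comes back with a fresh (empty) member list, so previous members would have to rejoin:

iex> ChatRoom.send_message("elixir", "alice", "a message that happens to trigger the crash")
:ok
# … the crashed room’s error report is printed here …
iex> ChatRoom.join("elixir", "bob")
[elixir] bob joined
:ok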
Summary
- Fault‑tolerance in Elixir is built on lightweight, isolated processes.
- Runtime errors come in three flavors (:error, :exit, :throw) and can be caught with try … catch … after.
- Links propagate crash notifications; trapping exits converts those signals into ordinary messages.
- Monitors provide a one‑way “watch” mechanism without the bidirectional coupling of links.
- A Supervisor ties everything together: it starts children, links to them, traps exits, and restarts them according to a configurable strategy.
- Child specifications describe how a supervisor should start, restart, and identify each worker.
- Common mistakes include holding on to stale PIDs, creating endless restart loops, and using throw for regular control flow.
By mastering these primitives you gain the ability to design systems that keep running even when parts fail. The “let‑it‑crash” mantra may feel counter‑intuitive at first, but once you see a supervisor automatically resurrect a misbehaving component, the value of this approach becomes crystal clear. Happy fault‑tolerant coding!