
Taming Complex Behaviour with Explicit State Machines

AYA · Published March 18, 2026

There is a category of bug that every experienced engineer recognises on sight. It lives in a cluster of boolean fields: isLoading, hasError, isComplete, isRetrying. The fields were added one at a time, each solving a real problem. Months later the combination space has exploded and nobody is sure which combinations are supposed to be possible. A bug report arrives. You open the code and realise the system briefly entered a state the author never intended (isLoading: true and isComplete: true simultaneously), and that impossible state is precisely what caused the crash.

This article is about why that pattern is structurally fragile, how explicit state machines prevent it, and what the practical difference looks like in real code.

The Real Cost of Ad-Hoc Flags

Consider a simple network request lifecycle. You need to show a loading indicator while the request is in-flight, display results on success, and handle errors gracefully with a retry option. Straightforward enough.

A first pass often produces something like this:

// Fragment from a real codebase -- names changed
let isLoading = false;
let hasError  = false;
let data      = null;
let errorMsg  = "";

Each field answers a yes/no question. But behaviour is not a collection of yes/no questions. It is a sequence of exclusive situations. The loader is either idle, or fetching, or showing results, or showing an error. It cannot logically be fetching and showing results at the same time. Yet nothing in this model stops isLoading from being true while data holds a result. The problem grows with every flag, because each new boolean doubles the space. The four booleans from the opening example (isLoading, hasError, isComplete, isRetrying) produce 16 theoretical combinations. At most four or five of those combinations are meaningful. The other eleven or twelve are impossible states that your code silently allows.
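In TypeScript, the exclusive situations can be written as a discriminated union. Each variant then carries only the data that makes sense in that situation. This is a minimal sketch that assumes the request returns a list of strings. `RequestState` and `summarize` are illustrative names and do not come from the fragment above.

```typescript
// One variant per exclusive situation. `data` exists only on "success" and
// `message` only on "error". No variant is both fetching and successful.
type RequestState =
  | { status: "idle" }
  | { status: "fetching" }
  | { status: "success"; data: string[] }
  | { status: "error"; message: string };

// Inside each case the compiler narrows `state`. Reading `state.data`
// outside the "success" branch is therefore a type error, not a runtime surprise.
function summarize(state: RequestState): string {
  switch (state.status) {
    case "idle":
      return "Nothing requested yet";
    case "fetching":
      return "Loading";
    case "success":
      return `Loaded ${state.data.length} items`;
    case "error":
      return `Failed: ${state.message}`;
  }
}

const loaded: RequestState = { status: "success", data: ["a", "b"] };
console.log(summarize(loaded)); // prints: Loaded 2 items
```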

The bugs that emerge from those impossible states are notoriously hard to reproduce. They often require specific timing: a fast user clicking retry just as a previous request resolves, or a network timeout arriving a few milliseconds after a success callback fires. Unit tests rarely exercise these race conditions. They surface in production, on a customer’s device, at an inconvenient time.
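The timing problem does not need real concurrency. An unlucky ordering of callbacks is enough. The sketch below simulates that ordering synchronously. It assumes hypothetical `onRequestStart`, `onSuccess` and `onTimeout` handlers, each of which updates only the flags it cares about, in the style of the fragment above.

```typescript
// Hypothetical loader state in the style of the flag fragment above.
interface LoaderFlags {
  isLoading: boolean;
  hasError: boolean;
  data: string[] | null;
  errorMsg: string;
}

const flags: LoaderFlags = { isLoading: false, hasError: false, data: null, errorMsg: "" };

function onRequestStart(): void {
  flags.isLoading = true;
}

// Success handler: stores the result and clears loading. It never touches the error flags.
function onSuccess(result: string[]): void {
  flags.data = result;
  flags.isLoading = false;
}

// Timeout handler: records an error and clears loading. It never touches `data`.
function onTimeout(): void {
  flags.hasError = true;
  flags.errorMsg = "Request timed out";
  flags.isLoading = false;
}

// The unlucky ordering: the timeout fires just after the success callback has run.
onRequestStart();
onSuccess(["row 1"]);
onTimeout();

// The UI now holds a result and an error at once, a combination nobody designed.
console.log(flags.hasError && flags.data !== null); // true
```

Neither handler is wrong on its own. The invalid combination comes entirely from the order in which they ran. That is why tests that call each handler in isolation do not catch it.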

Beyond bugs, the cognitive cost compounds. Every engineer who reads the code has to hold the full matrix of flag combinations in mind to reason about any change. Reviewers miss implicit invariants. Features take longer to add because every addition must be checked against an ever-growing set of possible flag combinations. The codebase becomes fragile not because of any single bad decision, but because the underlying model does not match reality.

What a State Machine Is

A finite state machine is a formal model with three things:

A finite set of states, each representing a discrete situation the system can be in.
A finite set of events, things that happen and might trigger a change.
A set of transitions: rules that say “when in state X and event Y occurs, move to state Z”.
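These three ingredients map directly onto TypeScript. Below is a minimal sketch using the request lifecycle from the previous section. The state and event names and the `run` helper are illustrative. Formal definitions usually also name an initial state, so the sketch includes one.

```typescript
// 1. A finite set of states.
type ReqState = "Idle" | "Fetching" | "Succeeded" | "Failed";
// 2. A finite set of events.
type ReqEvent = "FETCH" | "SUCCESS" | "FAILURE" | "RETRY";

// 3. Transitions: for each state, which events move it and where to.
//    Pairs that are not listed leave the state unchanged.
const TRANSITIONS: Record<ReqState, Partial<Record<ReqEvent, ReqState>>> = {
  Idle: { FETCH: "Fetching" },
  Fetching: { SUCCESS: "Succeeded", FAILURE: "Failed" },
  Succeeded: { FETCH: "Fetching" },
  Failed: { RETRY: "Fetching" },
};

// Formal definitions also fix a starting state.
const INITIAL: ReqState = "Idle";

// Feeds events through the transitions one at a time. At every step the
// machine is in exactly one ReqState.
function run(events: readonly ReqEvent[]): ReqState {
  let state: ReqState = INITIAL;
  for (const event of events) {
    state = TRANSITIONS[state][event] ?? state;
  }
  return state;
}

console.log(run(["FETCH", "FAILURE", "RETRY", "SUCCESS"])); // Succeeded
console.log(run(["SUCCESS"])); // Idle (SUCCESS is ignored while Idle)
```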

The crucial property is that the system is always in exactly one state. Not two, not zero. Exactly one. Impossible combinations cannot exist because the model does not have a concept of combination. You are either fetching or you are not. You are either in an error state or you are not. The states are mutually exclusive by construction.

This is not a new idea. Digital circuit designers have used state machines since the 1950s. Many protocol specifications (TCP, Bluetooth, USB) describe their behaviour as state machines. The model is mature, well understood, and directly applicable to software.

A Concrete Example: Connection Lifecycle

Take a slightly more complex scenario: a WebSocket connection with reconnection logic. This is a real problem that trips up many implementations. Let us model it explicitly.

The states:

Disconnected: no connection, no active attempt
Connecting: connection attempt in progress
Connected: live connection, messages flowing
Reconnecting: previous connection lost, waiting to retry
Failed: too many retries, giving up

The events:

CONNECT: user or system requests a connection
OPEN: underlying socket reports it is open
CLOSE: socket closed unexpectedly
ERROR: a hard error occurred
RETRY: reconnect timer fires
MAX_RETRIES_EXCEEDED: retry count exhausted
DISCONNECT: deliberate disconnect requested

The transitions (a representative subset):

Current State	Event	Next State
Disconnected	CONNECT	Connecting
Connecting	OPEN	Connected
Connecting	ERROR	Reconnecting
Connected	CLOSE	Reconnecting
Connected	DISCONNECT	Disconnected
Reconnecting	RETRY	Connecting
Reconnecting	MAX_RETRIES_EXCEEDED	Failed
Failed	CONNECT	Connecting
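The table can also be encoded directly as data rather than as control flow. This sketch makes two simplifications. First, events are plain strings instead of the `{ type }` objects used in the reducer below. Second, `TABLE` and `step` are hypothetical names.

```typescript
type State = "Disconnected" | "Connecting" | "Connected" | "Reconnecting" | "Failed";
// Assumption: events as plain strings to keep the sketch short.
type EventType =
  | "CONNECT" | "OPEN" | "CLOSE" | "ERROR"
  | "RETRY" | "MAX_RETRIES_EXCEEDED" | "DISCONNECT";

// One entry per row of the transition table above.
const TABLE: Record<State, Partial<Record<EventType, State>>> = {
  Disconnected: { CONNECT: "Connecting" },
  Connecting: { OPEN: "Connected", ERROR: "Reconnecting" },
  Connected: { CLOSE: "Reconnecting", DISCONNECT: "Disconnected" },
  Reconnecting: { RETRY: "Connecting", MAX_RETRIES_EXCEEDED: "Failed" },
  Failed: { CONNECT: "Connecting" },
};

// Looks up the next state. Pairs missing from the table leave the state unchanged.
function step(state: State, event: EventType): State {
  return TABLE[state][event] ?? state;
}

console.log(step("Connected", "CLOSE")); // Reconnecting
console.log(step("Connecting", "CLOSE")); // Connecting (not in the table, ignored)
```

Typing the table as `Record<State, ...>` means every state must appear as a key. Adding a state to the union without deciding its transitions therefore fails to compile.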

Now encode this in TypeScript. One clean approach is a simple reducer:

type State =
  | "Disconnected"
  | "Connecting"
  | "Connected"
  | "Reconnecting"
  | "Failed";

type Event =
  | { type: "CONNECT" }
  | { type: "OPEN" }
  | { type: "CLOSE" }
  | { type: "ERROR" }
  | { type: "RETRY" }
  | { type: "MAX_RETRIES_EXCEEDED" }
  | { type: "DISCONNECT" };

function transition(state: State, event: Event): State {
  switch (state) {
    case "Disconnected":
      if (event.type === "CONNECT") return "Connecting";
      return state;

    case "Connecting":
      if (event.type === "OPEN")  return "Connected";
      if (event.type === "ERROR") return "Reconnecting";
      return state;

    case "Connected":
      if (event.type === "CLOSE")      return "Reconnecting";
      if (event.type === "DISCONNECT") return "Disconnected";
      return state;

    case "Reconnecting":
      if (event.type === "RETRY")                return "Connecting";
      if (event.type === "MAX_RETRIES_EXCEEDED") return "Failed";
      return state;

    case "Failed":
      if (event.type === "CONNECT") return "Connecting";
      return state;
  }
}

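This function is pure. It takes a state and an event and returns the next state. The logic for every situation is in one place. Suppose you are in Connecting and a CLOSE event arrives, a pair the table does not list. The function returns the current state unchanged. That is a safe default: events that do not apply to the current state are ignored rather than corrupting it, so no impossible combination can sneak in. Whether ignoring that particular pair is actually right is a separate question, since a socket can close before it ever opens. Raising questions like that is exactly what the transition table is for.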

The contrast with the boolean approach is sharp. In the flag-based version, you might have a setLoading(false) call in an error handler that races with a setData(result) call in a success handler. Each handler updates its own fields, and nothing checks whether the resulting combination makes sense. The state machine version has a single transition function. In JavaScript's single-threaded event loop, two events arriving near-simultaneously are still handled one after the other. The first produces a new state, and the second transitions from that new state. The history is coherent.
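The sketch below replays two near-simultaneous events in both possible orders, starting from Connecting. It uses a data-table encoding equivalent to the `transition()` reducer above, with events as plain strings. `TABLE`, `step` and `replay` are hypothetical names.

```typescript
type State = "Disconnected" | "Connecting" | "Connected" | "Reconnecting" | "Failed";
// Assumption: events as plain strings, equivalent to the reducer's { type } objects.
type EventType =
  | "CONNECT" | "OPEN" | "CLOSE" | "ERROR"
  | "RETRY" | "MAX_RETRIES_EXCEEDED" | "DISCONNECT";

const TABLE: Record<State, Partial<Record<EventType, State>>> = {
  Disconnected: { CONNECT: "Connecting" },
  Connecting: { OPEN: "Connected", ERROR: "Reconnecting" },
  Connected: { CLOSE: "Reconnecting", DISCONNECT: "Disconnected" },
  Reconnecting: { RETRY: "Connecting", MAX_RETRIES_EXCEEDED: "Failed" },
  Failed: { CONNECT: "Connecting" },
};

function step(state: State, event: EventType): State {
  return TABLE[state][event] ?? state;
}

// Processes events strictly one after another, as an event loop would.
function replay(start: State, events: EventType[]): State {
  let state: State = start;
  for (const event of events) {
    state = step(state, event);
  }
  return state;
}

console.log(replay("Connecting", ["OPEN", "ERROR"])); // Connected (ERROR is ignored once Connected)
console.log(replay("Connecting", ["ERROR", "OPEN"])); // Reconnecting (OPEN is ignored while Reconnecting)
```

The two orders end in different states. Both, however, are defined states reached through listed transitions, and each outcome can be read straight off the table.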

What This Solves in Practice

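Impossible states become unrepresentable. With booleans, nothing stops isConnecting and isConnected from both being true, and the type checker accepts it. With the state machine above, that combination has no representation at all, because a State value is exactly one of five strings. The type system enforces it.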
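The same guarantee extends to code that consumes the state. A common TypeScript idiom is an exhaustiveness check with `never`. If a sixth state is added to the union and a `switch` forgets to handle it, compilation fails. `assertNever` and `statusLabel` are hypothetical helpers, not part of the article's code.

```typescript
type State = "Disconnected" | "Connecting" | "Connected" | "Reconnecting" | "Failed";

// Accepts only `never`. If a case below is missing, `state` is not narrowed
// to `never` in the default branch, and this call fails to type-check.
function assertNever(value: never): never {
  throw new Error(`Unhandled state: ${String(value)}`);
}

// Hypothetical UI label for each state.
function statusLabel(state: State): string {
  switch (state) {
    case "Disconnected":
      return "Offline";
    case "Connecting":
      return "Connecting";
    case "Connected":
      return "Online";
    case "Reconnecting":
      return "Reconnecting";
    case "Failed":
      return "Connection failed";
    default:
      return assertNever(state);
  }
}

console.log(statusLabel("Connected")); // Online
```

Deleting the `Failed` case turns the `assertNever(state)` call into a compile error, because `state` would then have type `"Failed"` rather than `never`.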

Edge cases are explicit. When you write the transition table you are forced to ask: what should happen if ERROR fires while we are already Reconnecting? You answer it deliberately instead of leaving it to chance. That question probably never got asked in the flag-based version.
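Because the states and events are finite, the unanswered questions can be listed mechanically. The sketch below enumerates every (state, event) pair that the connection table does not mention. `STATES`, `EVENTS`, `TABLE` and `unhandledPairs` are hypothetical names, and events are plain strings for brevity.

```typescript
type State = "Disconnected" | "Connecting" | "Connected" | "Reconnecting" | "Failed";
type EventType =
  | "CONNECT" | "OPEN" | "CLOSE" | "ERROR"
  | "RETRY" | "MAX_RETRIES_EXCEEDED" | "DISCONNECT";

const STATES: readonly State[] = ["Disconnected", "Connecting", "Connected", "Reconnecting", "Failed"];
const EVENTS: readonly EventType[] = [
  "CONNECT", "OPEN", "CLOSE", "ERROR", "RETRY", "MAX_RETRIES_EXCEEDED", "DISCONNECT",
];

// The eight rows from the transition table above.
const TABLE: Record<State, Partial<Record<EventType, State>>> = {
  Disconnected: { CONNECT: "Connecting" },
  Connecting: { OPEN: "Connected", ERROR: "Reconnecting" },
  Connected: { CLOSE: "Reconnecting", DISCONNECT: "Disconnected" },
  Reconnecting: { RETRY: "Connecting", MAX_RETRIES_EXCEEDED: "Failed" },
  Failed: { CONNECT: "Connecting" },
};

// Every pair without an explicit entry is currently "ignored by default".
function unhandledPairs(): string[] {
  const pairs: string[] = [];
  for (const state of STATES) {
    for (const ev of EVENTS) {
      if (TABLE[state][ev] === undefined) pairs.push(`${state} + ${ev}`);
    }
  }
  return pairs;
}

const gaps = unhandledPairs();
console.log(`${gaps.length} of ${STATES.length * EVENTS.length} pairs have no explicit transition`);
console.log(gaps.indexOf("Reconnecting + ERROR") !== -1); // true
```

For the eight-row table above, this reports 27 of 35 pairs. Each of those pairs deserves a deliberate decision rather than a silent default.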

Testing becomes straightforward. The transition function is pure: you pass in a (state, event) pair and assert the output. You can enumerate all meaningful transitions and test each one. You can also write property tests that feed arbitrary sequences of events and check invariants, such as Failed only ever being entered from Reconnecting. Checking that the current state is always one of the defined states is trivial, because the type system already guarantees it.
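The sketch below shows both kinds of test using plain assertions and no test framework. The random events come from a small seeded linear congruential generator. That choice is an assumption, made so that failures are reproducible. A property-testing library such as fast-check would normally generate the sequences and shrink failing cases for you. Names other than the states and events are hypothetical.

```typescript
type State = "Disconnected" | "Connecting" | "Connected" | "Reconnecting" | "Failed";
type EventType =
  | "CONNECT" | "OPEN" | "CLOSE" | "ERROR"
  | "RETRY" | "MAX_RETRIES_EXCEEDED" | "DISCONNECT";

const STATES: readonly State[] = ["Disconnected", "Connecting", "Connected", "Reconnecting", "Failed"];
const EVENTS: readonly EventType[] = [
  "CONNECT", "OPEN", "CLOSE", "ERROR", "RETRY", "MAX_RETRIES_EXCEEDED", "DISCONNECT",
];

const TABLE: Record<State, Partial<Record<EventType, State>>> = {
  Disconnected: { CONNECT: "Connecting" },
  Connecting: { OPEN: "Connected", ERROR: "Reconnecting" },
  Connected: { CLOSE: "Reconnecting", DISCONNECT: "Disconnected" },
  Reconnecting: { RETRY: "Connecting", MAX_RETRIES_EXCEEDED: "Failed" },
  Failed: { CONNECT: "Connecting" },
};

function step(state: State, event: EventType): State {
  return TABLE[state][event] ?? state;
}

// 1. Example-based tests: one row per (state, event, expected next state).
const cases: Array<[State, EventType, State]> = [
  ["Disconnected", "CONNECT", "Connecting"],
  ["Connecting", "OPEN", "Connected"],
  ["Connected", "CLOSE", "Reconnecting"],
  ["Reconnecting", "MAX_RETRIES_EXCEEDED", "Failed"],
  ["Connecting", "CLOSE", "Connecting"], // unlisted pair: ignored
];
for (const [from, ev, expected] of cases) {
  const actual = step(from, ev);
  if (actual !== expected) {
    throw new Error(`${from} + ${ev}: expected ${expected}, got ${actual}`);
  }
}

// 2. Property test: random event sequences must never break the invariants.
function makeRandom(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0; // LCG step, modulo 2^32
    return s / 4294967296; // uniform in [0, 1)
  };
}

const random = makeRandom(42);
for (let trial = 0; trial < 1000; trial++) {
  let state: State = "Disconnected";
  for (let i = 0; i < 50; i++) {
    const ev = EVENTS[Math.floor(random() * EVENTS.length)];
    const nextState = step(state, ev);
    if (STATES.indexOf(nextState) === -1) throw new Error(`Unknown state ${nextState}`);
    if (nextState === "Failed" && state !== "Failed" && state !== "Reconnecting") {
      throw new Error(`Entered Failed from ${state}`);
    }
    if (nextState === "Connected" && state !== "Connected" && state !== "Connecting") {
      throw new Error(`Entered Connected from ${state}`);
    }
    state = nextState;
  }
}
console.log("all checks passed");
```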

Debugging improves dramatically. When something goes wrong, you want to know two things: what state the system was in, and what event caused the transition to the bad state. If your state machine logs every transition as a (state, event, next_state) triple, you have a complete, readable audit trail. Debugging a boolean cluster gives you a snapshot of several flags; debugging a state machine gives you a history.
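The sketch below implements that audit trail as a thin wrapper. It records every dispatched event as a `(from, event, to)` entry, including events that were ignored. `createLoggedMachine` and the record shape are hypothetical. In production the entries would go to your logging system rather than an in-memory array.

```typescript
type State = "Disconnected" | "Connecting" | "Connected" | "Reconnecting" | "Failed";
type EventType =
  | "CONNECT" | "OPEN" | "CLOSE" | "ERROR"
  | "RETRY" | "MAX_RETRIES_EXCEEDED" | "DISCONNECT";

const TABLE: Record<State, Partial<Record<EventType, State>>> = {
  Disconnected: { CONNECT: "Connecting" },
  Connecting: { OPEN: "Connected", ERROR: "Reconnecting" },
  Connected: { CLOSE: "Reconnecting", DISCONNECT: "Disconnected" },
  Reconnecting: { RETRY: "Connecting", MAX_RETRIES_EXCEEDED: "Failed" },
  Failed: { CONNECT: "Connecting" },
};

function step(state: State, event: EventType): State {
  return TABLE[state][event] ?? state;
}

interface TransitionRecord {
  from: State;
  event: EventType;
  to: State; // equal to `from` when the event was ignored
  at: number; // timestamp in milliseconds
}

// Wraps the pure step function; the only mutable parts are `current` and the log.
function createLoggedMachine(initial: State) {
  let current: State = initial;
  const entries: TransitionRecord[] = [];
  return {
    send(event: EventType): State {
      const to = step(current, event);
      entries.push({ from: current, event, to, at: Date.now() });
      current = to;
      return to;
    },
    state(): State {
      return current;
    },
    trail(): readonly TransitionRecord[] {
      return entries.slice();
    },
  };
}

const conn = createLoggedMachine("Disconnected");
const incoming: EventType[] = ["CONNECT", "OPEN", "CLOSE", "RETRY", "ERROR"];
for (const e of incoming) conn.send(e);

for (const r of conn.trail()) {
  console.log(`${r.from} --${r.event}--> ${r.to}`);
}
// Disconnected --CONNECT--> Connecting
// Connecting --OPEN--> Connected
// Connected --CLOSE--> Reconnecting
// Reconnecting --RETRY--> Connecting
// Connecting --ERROR--> Reconnecting
```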

Onboarding new engineers is faster. The transition table is documentation. A new hire can read it and understand the full lifecycle in minutes. Equivalent documentation for the boolean approach would need to enumerate all valid combinations, a list that lives nowhere and is accurate for exactly as long as nobody makes a change.

When to Reach for a State Machine

Not every piece of logic benefits from an explicit state machine. A function that formats a date, a component that toggles a dropdown: these do not need this treatment. The signal that a state machine is warranted is usually the presence of two or more booleans that are logically related, or behaviour that depends on sequencing (what happened before the current event matters).

Good candidates: authentication flows, payment checkout sequences, media player controls, form wizard steps, real-time connection management, device pairing flows, background job lifecycle, file upload progress with retry. These are all sequences of exclusive situations where things go wrong when two situations overlap.

Libraries like XState provide a full statechart implementation with extended state (context), hierarchical states, parallel regions, and tooling. For many problems, though, the minimal reducer pattern above is sufficient and adds no dependencies.

The key step is not the library. It is the discipline of naming your states before writing code. Draw the diagram. Write the transition table. Ask where each event is valid and where it should be ignored. That exercise surfaces assumptions you did not know you were making, and those assumptions are exactly where the bugs live.
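Once the table has been written down as data, the diagram can be regenerated from it whenever the table changes, so the two never drift apart. The sketch below emits Graphviz DOT text, which the `dot` command-line tool can render (for example with `dot -Tsvg`). `toDot` is a hypothetical helper.

```typescript
type State = "Disconnected" | "Connecting" | "Connected" | "Reconnecting" | "Failed";
type EventType =
  | "CONNECT" | "OPEN" | "CLOSE" | "ERROR"
  | "RETRY" | "MAX_RETRIES_EXCEEDED" | "DISCONNECT";

const STATES: readonly State[] = ["Disconnected", "Connecting", "Connected", "Reconnecting", "Failed"];
const EVENTS: readonly EventType[] = [
  "CONNECT", "OPEN", "CLOSE", "ERROR", "RETRY", "MAX_RETRIES_EXCEEDED", "DISCONNECT",
];

const TABLE: Record<State, Partial<Record<EventType, State>>> = {
  Disconnected: { CONNECT: "Connecting" },
  Connecting: { OPEN: "Connected", ERROR: "Reconnecting" },
  Connected: { CLOSE: "Reconnecting", DISCONNECT: "Disconnected" },
  Reconnecting: { RETRY: "Connecting", MAX_RETRIES_EXCEEDED: "Failed" },
  Failed: { CONNECT: "Connecting" },
};

// One DOT edge per table entry, labelled with the event that triggers it.
function toDot(): string {
  const lines: string[] = ["digraph connection {", "  rankdir=LR;"];
  for (const from of STATES) {
    for (const ev of EVENTS) {
      const to = TABLE[from][ev];
      if (to !== undefined) {
        lines.push(`  "${from}" -> "${to}" [label="${ev}"];`);
      }
    }
  }
  lines.push("}");
  return lines.join("\n");
}

console.log(toDot());
```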

For the Non-Technical Reader

Here is the short version.

Software behaviour is often described in terms of conditions: “if the user has clicked submit and the form is not loading and there is no error, then show the result.” Each condition is a separate switch, and the number of possible combinations of switches grows exponentially. Most of those combinations mean nothing, but the code does not know that.

State machines reframe the question. Instead of conditions, you describe situations: the system is either waiting for input, processing it, showing a result, or showing an error. At any moment it is in exactly one of these situations, and you define precisely what causes it to move from one situation to another. The impossible combinations simply do not exist in the model.

The practical outcome: fewer production bugs, faster debugging when bugs do occur, and code that is easier to change safely. For organisations building software over years rather than months, that compounds significantly.

Closing

The case for explicit state machines is not academic. It is pragmatic: the pattern forces you to think about your problem more carefully before writing code, and that investment pays back through fewer incidents, faster feature development, and a codebase that is still maintainable three years later.

If you are working on a system where complex lifecycle behaviour is a concern and want a second opinion on the architecture, we are happy to look at it.
