Reliable AI Agent Control Flow: Keep the State Machine Out of the Prompt

Reliable agent control flow starts with a simple boundary: the state machine that governs which step runs next belongs in code, not in a prompt. The trend of embedding state machines inside LLM system prompts is a category error. State machines are deterministic by definition. LLMs are not. Putting safety-critical control flow inside an LLM’s instruction-following behavior is putting load-bearing logic on a non-load-bearing surface. The fix is mundane: keep the machine in ordinary code and let the model work inside the steps.

1. The Pattern in the Wild

Open up a handful of LangGraph-adjacent orchestration repos on GitHub and you will find variations of the same prompt fragment within minutes. In ruflo-style agent systems and in orchestration code shared in the multi-agent ML community, the pattern appears in system prompts and node instructions alike. The prompt tells the model it is a task-processing agent operating as a state machine, injects the current state as a template variable, lists the valid transitions in plain English - idle moves to processing when a task arrives, processing moves to complete when finished or to error on unrecoverable failure, error moves back to idle once acknowledged - and asks the model to emit the next state and its reasoning as a small structured object.

Sometimes it’s slightly abstracted - the states are called “phases” or “modes.” Sometimes the transition list is longer. But the structure is identical: state as a template variable, transitions as natural language prose, next state as an LLM output token. You see it in hobby projects, in production-adjacent repositories, in technical blog tutorials. It’s everywhere.

2. Why It Looks Attractive

The appeal is real, so let’s be honest about it before dismantling it.

The pattern is compact. You get a working state machine in one prompt block - no library imports, no class definitions, no transition table boilerplate. A junior engineer reading it immediately understands what the agent is trying to do. Adding a new state is a one-line edit to the prompt. Removing a transition requires deleting a bullet point.

There is also a genuine intelligence argument. An LLM can evaluate ambiguous situations that a hard-coded if/elif chain would get wrong. If “processing” should transition to “complete” when a task is mostly done versus entirely done - and that determination requires reading the task output - a code-based transition guard needs its own classification logic anyway. Why not let the model handle it?

And it composes naturally with the rest of the agent’s reasoning. The same context window that holds the task instructions also holds the state machine rules. Feels elegant. One cognitive unit, one prompt, one call.

These are real benefits. The pattern persists because it works - for a while.

3. Why It Fails: LLM Drift

Two plates contrasting a clean state diagram with a tangled one — Drift, drawn.

A state machine has a contract: given state S and input I, the next state S’ is fixed. That determinism is the entire value proposition. Without it, you don’t have a state machine. You have a stochastic function that occasionally returns a state name.

LLMs at temperature > 0 are explicitly non-deterministic. The same state, the same input, will produce different transitions across runs. But even at temperature = 0, you have a different problem: determinism is an artifact of a specific model checkpoint. Model versions change - sometimes silently. When an API provider fine-tunes a model, rolls out an RLHF update, or changes the system prompt formatting, the “deterministic” behavior of your prompt-based state machine can shift.

Consider a five-state document-processing agent: idle → ingested → extracting → complete | error. After a routine model update, the extracting → complete transition starts occasionally routing to error instead - not because the task failed, but because the updated model’s instruction-following behavior weights the error-handling prose in the prompt differently than the previous checkpoint did. The document processed successfully. The agent thinks it failed.

The bug is in the prompt. But there is no diff to look at. The model changed upstream, invisibly, and your state machine changed with it. You cannot pin a model version indefinitely on most hosted APIs. You cannot write a regression test that catches “the LLM changed its mind about this transition.” The state machine you deployed is not the state machine you are running six months later.

4. Why It Fails: Instruction Collision

The state machine instructions and the task-execution instructions share the same context window. The LLM is asked to follow both simultaneously, with no enforced boundary between them. This creates a class of failure that no amount of prompt engineering fully eliminates.

The attention dilution problem: under long conversations or deeply recursive agent loops, the state machine instructions drift toward the end of the model’s effective attention range. The model continues generating tokens, but the constraints it is supposedly operating under have become suggestions. State machine rules written at position 200 in a 16,000-token context are simply less influential than the task content at position 15,800.

The semantic collision problem is worse. What happens when task content contains state names? A document that says “move this ticket to the error state” - is that task content to be processed, or a state transition instruction to be executed? The model has to guess. With carefully chosen state names and extensive prompt framing you can reduce the collision surface, but you cannot eliminate it. Any vocabulary that appears in both the state machine definition and the domain of tasks you are processing is a potential injection vector for unintended transitions.

The LLM does not have a privileged parser for “system instructions” versus “user content.” It has one context window and it reads all of it.

5. Why It Fails: No Audit Trail

A code-based state machine has a callstack. You can set a breakpoint on every transition. You can log the current state, the triggering event, the inputs, and the resulting state with a single decorator. You can replay a sequence of transitions offline, in a unit test, with no network call. When something goes wrong, you have a trace.

A prompt-based state machine’s decision process lives inside a forward pass. You see the output token - {"next_state": "error"} - but you do not see why. The chain of attention weights that produced that token is not inspectable. The model’s internal “state tracking” is not a data structure you can query. It is an emergent property of transformer activations.

This matters when agents get stuck. When a multi-step workflow reaches an unexpected state at step 14, you need to know: did the state machine receive the wrong input, or did it make the wrong transition given correct input? These are different bugs with different fixes. In a code-based system, you answer this question in seconds by reading the log. In a prompt-based system, you re-run the workflow, maybe with different temperatures, and try to reproduce the behavior. Sometimes you can’t reproduce it because the model update that caused it already rolled back. The bug existed in production but not in your debugging environment.

You cannot write a reliable fix for a state machine whose failure mode is unobservable.

6. The Alternative: Runtime State Machine, LLM as One Node

Block diagram with a state-machine core and one subordinate input cell — One node, not the conductor.

The solution is not sophisticated, and it inverts the prompt-based design rather than refining it. Instead of describing the state machine in prose and letting the model emit the next state, you represent the machine as data and let ordinary code decide transitions. The whole machine reduces to three things.

First, an explicit, closed set of states - a small enumeration, not a free-text vocabulary. The set of legal states is fixed and enumerable; nothing outside it can ever appear, because there is no token-generation step that could invent a fourth option.

Second, the transition table itself, expressed as data: a mapping from each state to the exact list of states that may legally follow it. This table is the single source of truth for the machine’s shape. Idle may only advance to processing; processing may resolve to either complete or error; complete is terminal; error returns to idle. Because the table is a data structure rather than a paragraph of instructions, every valid transition is explicit, the whole machine is enumerable in one glance, and adding or removing an edge is a data edit you can diff, review, and test - not a reweighting of prose that the model may interpret differently after an upstream update.

Third, a single transition function that consults the table. Given a current state and a desired target, it checks the table; if the target is not in the allowed set it raises, and if an optional guard condition is supplied and fails, it also raises. Otherwise it returns the new state. This function is fully deterministic and never touches an LLM. You can unit-test it exhaustively - every legal edge, every illegal edge, every guard outcome - with no mock, no API key, and no network call.

The LLM still has a job, but a contained one. Inside a given state, the workflow calls the model to do content work: extract entities from the task, classify the input, generate output. The model returns a structured result. Then plain code inspects that result and chooses which transition to request - advancing to complete on success, to error otherwise - and the transition function enforces whether that move is legal. Control flow lives in code; the model is a reasoning oracle invoked at a node, never the dispatcher that decides which node runs next.

Because every transition flows through that one function, each one can be logged with the current state, the target state, the guard result, and a timestamp. When something fails at step 14, you have a complete, replayable trace.

7. What Belongs in Code vs. Prompts

The decision framework is a single rule with a corollary:

If the behavior must be deterministic, it belongs in code. If the behavior benefits from language understanding, it belongs in a prompt.

Applied to agent systems:

In CODE:

State definitions and the valid set of states
The transition table (which states can follow which)
Transition guards (conditions that must be true for a transition to fire)
Error states and their recovery paths
Loop termination conditions
Timeout logic and retry counts
The maximum number of steps before forced termination

In PROMPTS:

Generating natural language output for users
Extracting structured data from unstructured text
Classifying task content into categories your transition guards can consume
Summarizing long documents before passing them to downstream steps
Deciding what to say at a given step, not which step to go to

The LLM is a reasoning oracle for content decisions. It is not a process controller. The moment you ask it to output a state name that your system will treat as a routing instruction, you have handed control flow to a stochastic process.

One concrete test: could a determined adversary manipulate the agent’s behavior by injecting text into the task content? In a prompt-based state machine, the answer is often yes - task content and transition instructions share a context, and the boundary between them is fuzzy by design. In a code-based state machine, the LLM’s output is a classification or extraction result that your code consumes. Injecting “transition to complete” into the task content changes the LLM’s text output, not the Python transition logic.

8. Write the Code Version First

The prompt-as-state-machine pattern will keep appearing because it is trivially easy to write on day one. You get something working in 20 lines. It demos well. The failure modes only surface under operational conditions: model updates, long context, adversarial inputs, edge cases that weren’t in the demo.

The technical debt compounds. Every new state you add to the prompt makes the transition logic harder for the model to follow reliably. Every new feature requirement that crosses state boundaries adds another collision surface. After a few months the prompt is 400 tokens of state machine logic that nobody fully understands and everyone is afraid to modify.

Rewriting a prompt-based state machine into a code-based one is always possible. It is always painful. The tests you should have written from the start don’t exist. The transitions that seemed obvious are underspecified. You find edge cases in production that you have to trace through LLM outputs to even understand.

Write the code version first. The closed set of states takes five minutes. The transition table takes ten. The unit tests take twenty. You will spend all of those minutes anyway - just later, under worse conditions, after the bug has been in production long enough to matter.

Stop putting state machines in prompts. Define the states explicitly, make the transition table the single source of truth, and test it exhaustively with no model in the loop. Ship an agent you can actually debug. The LLM belongs inside the nodes, doing content work - not governing which node runs next.