Testing AI Agents: Five Dimensions Line Coverage Misses

Why Your Green Test Suite Means Nothing

Testing AI agents well is not about line coverage - and a green suite proves less than you think. You build an LLM-based routing agent. You write unit tests. You hit 95% line coverage. CI turns green. You deploy. Within a week, the agent produces nonsensical routing decisions under load, crashes on unusual task strings, and gets into routing loops the loop detector misses because two threads updated its state simultaneously.

The unit tests were not lying. The functions worked in isolation. Autonomous systems do not fail in isolated functions. They fail at five seams unit tests are structurally incapable of testing:

Integration seams - modules compose incorrectly; type contracts or initialization order that mocks silently paper over
Adversarial inputs - the distribution of real user input is not the distribution you imagined
Concurrency - shared state (history, registries, loop detectors) is not thread-safe; races corrupt it under load
Failure mode cascades - components fail; a system that crashes instead of degrading gracefully pages you at 3 AM
Distribution boundary violations - invariants that hold on tested examples silently break on the ones you didn’t think of

This post gives you a five-dimension template - one suite per dimension - that actually covers them.

The Five Dimensions at a Glance

Five-panel grid of small schematic diagrams — Five dimensions at a glance.

Dimension	Tests for	Key tool
Integration	Cross-module composition with real dependencies	pytest + real in-memory objects
Adversarial	Malformed inputs, boundary violations, injection	pytest + parametrize fixtures
Concurrency	Races, lock correctness, shared-state corruption	threading + pytest
Failure modes	Graceful degradation cascade	pytest + unittest.mock
Property-based	Distribution boundary invariants	hypothesis

Each file is independent. Each catches a different failure class. Run all five in CI on every push.

Dimension 1: Integration

Unit tests test a function. Integration tests test a pipeline. The most common class of bugs in multi-module systems is not “function X has wrong logic” - it is “X works correctly but Y expects a different type,” or “A requires B’s setup to have been called first.” Mocks hide all of this: they don’t enforce the real interface, don’t care about argument order, and return exactly what you told them to.

For a routing agent, an integration test uses a real in-memory registry and real scorer instead of mocks, exercising the full route-score-record pipeline. Build the router around an actual registry seeded with a handful of capabilities, route a real task through it, and then assert on the observable consequences: the returned capability is one the registry actually holds, the router’s history grew by exactly one entry, and that entry records the task and the selection that were truly made. No component is mocked, so a type mismatch or an initialization-order assumption between modules surfaces here instead of hiding behind a mock’s canned return value.

A second integration test guards a subtler contract: a capability registered at runtime must become routable immediately. Register a new capability, then route many tasks and assert the new capability eventually gets selected. The detail that makes this test honest is that the tasks must vary. An earlier version reused the same task string every iteration, and because the deterministic hash scorer maps a fixed string to a fixed index, the newly registered capability was never reachable - yet the test passed for the wrong reason. Varying the task strings spreads the hash across all indices, including the new one, so the assertion actually exercises what it claims to. The same pattern extends to further invariants - that routing only ever selects registered capabilities, that history stays ordered, that the loop detector fires on a repeated task - each a different assertion over the same real pipeline.

Dimension 2: Adversarial

When you write a happy-path test, you write task = "summarize this document". Real users send tasks with embedded newlines, 100,000-character strings, prompt-injection attempts, SQL metacharacters, null bytes, whitespace-only inputs, empty strings, and emoji. Adversarial testing is not about breaking your system - it is about asserting that “how it behaves” means “returns an error or a safe result,” not “crashes, leaks state, or executes the injection.”

Drive the router with a fixed roster of hostile inputs, each exercising a different boundary: a blunt prompt-injection string instructing the system to exfiltrate secrets, a newline-embedded fake system directive, a SQL-injection-styled payload, a hundred-thousand-character string, a run of control and null bytes, whitespace-only input, the empty string, and a long run of emoji. For every one of these, the contract is the same: either the router returns a capability that genuinely exists in the registry, or it rejects the input with an explicit typed error. What it must never do is return something that is not a registered capability - that would mean injected text leaked into the return value - or crash with an unexpected exception type, or quietly mutate shared registry state as a side effect. Three independent assertions, run across the whole hostile roster, cover those three failure modes: no unexpected exception, result is a known capability (catches injection becoming the return value), and the registry is not mutated (catches side-effect writes).

Dimension 3: Concurrency

Two parallel timelines colliding at a marked resource — Where parallel paths collide.

A routing agent serving multiple users will have multiple threads calling route() simultaneously. CPython’s GIL does not protect you: it releases between bytecodes, and the read-check-write sequence of a loop detection check is not atomic. Concurrent appends to an unprotected list drop entries.

The decisive test is a counting test. Spin up a couple dozen threads, have each route a fixed batch of distinct tasks, join them all, and then assert that the router’s history holds exactly the product of threads and tasks-per-thread. If the count comes up short, you have a concurrent-write race on the history list - two threads read the same length, both append at the same index, and one entry vanishes. A non-thread-safe implementation fails this reliably under load; the fix is a lock around every shared-state write, and the test tells you the moment you forget one. The same harness extends to two further concurrency tests: that concurrent routing raises no exceptions under load, and that registry reads stay consistent while a writer thread mutates the registry, catching stale-snapshot reads.

Dimension 4: Failure Mode Cascade

A production routing agent has at minimum three independently failing components: the capability registry, the scoring model, and the loop detector. A naive implementation lets any failure crash the routing call. The correct design is a degradation hierarchy: registry fails, fall back to a hardcoded default set; scorer fails, use uniform weights; loop detector fails, continue without loop protection and log a warning.

The most important test here forces total failure. Patch all three components - the registry, the scorer, and the loop detector - to raise the moment they are touched, then route a task and assert two things: the call does not propagate an exception, and the result it returns comes from the documented fallback set. If the router raises instead of degrading, the test fails loudly. Around that worst-case test sit the single-component variants - registry down falls back to a default capability set, scorer down falls back to uniform selection, loop detector down continues routing without protection, a partial registry failure uses whatever capabilities remain available - each patching exactly one component to fail and asserting the specific fallback that should kick in.

The total-failure test is the most important: failures correlate (a database outage takes down both registry and scorer at once), so a system that handles each in isolation but crashes when two fail simultaneously will crash in production.

Dimension 5: Property-Based Testing

First implementations fail on edge cases their authors did not anticipate. Property-based testing closes that gap by specifying invariants and letting Hypothesis generate thousands of inputs to break them. You assert a contract; Hypothesis is the adversary.

For a routing agent, the invariants: routing a non-empty task always returns a capability from the registry; history size never decreases; the loop detector always returns a boolean; the selected capability is always a string.

The central property: for any non-empty task, routing either returns a capability the registry actually holds or rejects the input with a typed error - nothing in between. State that as a property, let Hypothesis generate hundreds of text inputs against it (skipping the whitespace-only cases that the contract explicitly rejects), and the framework hunts for a counterexample. The other invariants follow the same shape: history length is monotonically non-decreasing across a sequence of routes, the loop detector always returns a genuine boolean, and the selected capability is always a string. Each is a property paired with an input-generation strategy.

Hypothesis will find inputs where these fail: a single-character task that hits a different branch, Unicode combining characters that collapse loop detection, whitespace stripped to empty with no downstream guard. Pushing the example count to a few hundred per property adds only a couple of seconds and catches significantly more than the default.

The Composite Picture

Failure mode	Integration	Adversarial	Concurrency	Failure cascade	Property-based
Type contract mismatch	YES	no	no	no	partial
Initialization order bug	YES	no	no	no	no
Injection attack effect	no	YES	no	no	no
Boundary input crash	partial	YES	no	no	YES
Race condition	no	no	YES	no	no
Component failure crash	no	no	no	YES	no
Invariant violation	no	partial	no	no	YES

No single dimension catches everything; the value is the combination. Run all five in CI on every push - total runtime is under 60 seconds for a routing agent of this complexity.

The Shared Test Harness

All five suites share one small harness, and the design of the harness is what makes the suites honest. Two test doubles do the work - not mocks, but real working stand-ins.

The fake registry is an in-memory list of capabilities guarded by a lock. It lists its capabilities, registers new ones idempotently, and exposes a separate fallback list. Because it is real and lock-guarded, the concurrency suite can genuinely race against it and the integration suite genuinely observes registrations.

The fake router is a configurable routing agent that exercises the full pipeline. It validates the input type and a maximum length up front, raising typed errors that the adversarial suite expects. Then it reads capabilities from the registry, scores them, runs the loop check, and appends the outcome to its history - and each of those three stages is wrapped so a failure degrades rather than crashes: a registry failure falls back to the fallback capability set, a scorer failure falls back to a random choice, and a loop-detector failure is swallowed so routing continues without protection. The history append happens under a lock. The scorer is deterministic for reproducibility - it hashes the task and indexes into the capability list - which is exactly why the integration suite has to vary its task strings to reach every capability. One deliberate escape hatch makes the concurrency tests meaningful: the router can be constructed with a no-op lock that satisfies the lock interface but provides no mutual exclusion, so a test can reproduce the unprotected race on demand and confirm the real lock actually prevents it.

The green CI badge you get after implementing this means something different. Not “the functions work in isolation” but “the system composes correctly, handles adversarial input without leaking state, is safe under concurrent load, degrades gracefully under component failure, and preserves its invariants across the full input distribution.” That is the test suite production agents need.