The race your single-threaded test will never find

Two dozen threads routed a fixed batch each. The history should have held the exact product of threads and tasks. It came up short. Two threads read the same length, both wrote to the same slot, and one entry just vanished.

The first time I counted, the number was wrong, and the wrongness is the whole story.

A routing agent serving real traffic has many requests calling it at once. My tests, like everyone's, called it one at a time, and one at a time it was perfect. I assumed the GIL would cover me, trusting the comforting myth that Python somehow makes shared state safe. It does not. The interpreter can hand the GIL to another thread between bytecode instructions, so a read-check-write sequence (read the history length, decide the slot, write the entry) is not atomic. Two threads read the same length, both write to the same index, and one of those writes silently disappears.
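
To make that interleaving concrete, here is a minimal sketch that forces it on purpose. Everything in it is hypothetical: `history`, `length`, and `route` stand in for whatever the real agent uses, and the `threading.Barrier` exists only to guarantee that both threads finish reading before either one writes. In real traffic the scheduler does this to you at random; the barrier just makes the bad ordering happen every time.

```python
import threading

# Hypothetical stand-ins for the agent's shared state (not the real agent code).
history = [None] * 4              # slot-based history
length = 0                        # number of slots currently in use
barrier = threading.Barrier(2)    # test-only: forces the unlucky interleaving


def route(task):
    """Unsafe read-check-write: read the length, pick a slot, write the entry."""
    global length
    slot = length                 # 1. read the history length
    barrier.wait()                # both threads have now read the same value
    history[slot] = task          # 2. write into the slot that was decided earlier
    length = slot + 1             # 3. publish the new length


threads = [threading.Thread(target=route, args=(f"task-{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

recorded = [entry for entry in history if entry is not None]
print(len(recorded))              # 1: two tasks were routed, one entry survived
```

Both threads see `length == 0`, both choose slot 0, and whichever writes second overwrites the first. Nothing raises and nothing is logged.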

You cannot find this by reading the code and you cannot find it with a single-threaded test, because single-threaded execution never lets the two reads interleave. The only test that catches it is a counting test, and it is almost embarrassingly simple. Spin up a couple dozen threads. Have each route a fixed batch of distinct tasks. Join them all. Then assert the history holds exactly the product of threads and tasks per thread. If it comes up even one short, you have a concurrent-write race, and the missing entries are the proof.
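
The counting test described above might look something like the following sketch. The helper name `count_routed`, the thread and batch sizes, and the `safe_route` stand-in are all assumptions for illustration; in a real suite you would pass in your own agent's routing call and its history. The starting barrier is an optional extra that releases every worker at once to increase overlap.

```python
import threading


def count_routed(route, history, n_threads=24, tasks_per_thread=500):
    """Call route() from many threads at once; return (expected, actual) counts."""
    start = threading.Barrier(n_threads)

    def worker(tid):
        start.wait()                          # release all workers together
        for i in range(tasks_per_thread):
            route(f"thread{tid}-task{i}")     # every task is distinct

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return n_threads * tasks_per_thread, len(history)


# Hypothetical stand-in for the agent under test: a lock-protected append.
history = []
lock = threading.Lock()


def safe_route(task):
    with lock:
        history.append(task)


expected, actual = count_routed(safe_route, history)
assert actual == expected, f"lost {expected - actual} entries to a write race"
```

A racy implementation does not fail this on every run, since the loss depends on timing. Larger batches, more threads, or repeated runs make a miss harder to hide.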

The fix is a lock around every shared-state write, held across the whole read-check-write sequence and not just the final assignment. The point of the test is that it tells you the instant you forget one, which you will, because the unprotected version passes every other test you own. It only fails here, under real concurrency, exactly where production lives.
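
As a rough sketch of that fix, assuming a slot-based history like the one described earlier: the `Router` class, its `capacity` argument, and its method names are hypothetical, not the actual agent. The detail that matters is that the lock covers the read of the length, the slot decision, and the write together, so no other thread can get between them.

```python
import threading


class Router:
    """Hypothetical routing agent with a lock-protected, slot-based history."""

    def __init__(self, capacity):
        self._history = [None] * capacity
        self._length = 0
        self._lock = threading.Lock()

    def route(self, task):
        # The whole read-check-write sequence runs under one lock.
        with self._lock:
            slot = self._length                   # read
            if slot >= len(self._history):        # check
                raise RuntimeError("history is full")
            self._history[slot] = task            # write
            self._length = slot + 1

    def history(self):
        # Reads take the lock too, so they never see a half-finished update.
        with self._lock:
            return self._history[:self._length]


router = Router(capacity=8 * 1000)


def worker(tid):
    for i in range(1000):
        router.route(f"thread{tid}-task{i}")


threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

entries = router.history()
assert len(entries) == 8 * 1000                   # nothing was overwritten
```

Locking only the final assignment would not help: two threads could still read the same length before either acquired the lock.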

What unsettled me was how confidently the broken version behaved everywhere else. Green unit tests. Green integration. Clean logs. The defect was invisible until I forced two dozen threads to fight over one list and then counted what survived. Correctness under load is not something you reason your way into. It is something you make the threads prove.

If your tests never run your agent concurrently, every race it has is still in there, waiting for the day real traffic arrives to interleave the two reads you assumed could never overlap.

Part of a series. Start here: A green test suite proves less than you think