Multi-Agent LLM Orchestration: OSS Audit

What I was hunting for is the stuff the demos never show. Whether a plan can quietly drift once it is running. Whether a loop detector can tell a real loop from an intentional retry. Whether the router learns from anything, or just reads a config file. Whether a panel of models can grade each other without playing favorites. None of that shows up in a fixed-horizon, single-model, no-concurrency benchmark. All of it shows up at 2am in something you shipped. So this is a field report, not a leaderboard. Three honest repos, each a different bet on the seams. Here is what each one actually does, what I would push on, and the one gap I noticed running through all three.

Before any of it: I read the repos; I did not run a controlled comparison. The mechanism claims below are what the projects document. The worries are mine, framed as worries, not measurements.

The three bets

ruflo (github.com/ruvnet/ruflo, MIT) bets on behavioral adaptation across a federation of agents. The conviction underneath it is one I share: which peer you trust should come from what the system has actually observed, not from a static config you tune once and forget. ruflo makes this concrete with a documented federation trust formula, a weighted blend of success rate, uptime, threat signal, and integrity, that continuously evaluates peers and downgrades untrusted ones with no human in the loop. It is model-agnostic, routing across several providers with failover. The bet is that trust is a measured property, not a declared one.
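To make "a weighted blend" concrete, here is a minimal sketch of what a score of that shape could look like. The four inputs come from ruflo's description; the weights, the downgrade threshold, the inversion of the threat signal, and every name in the block are my assumptions for illustration, not ruflo's documented formula or API.

```python
from dataclasses import dataclass

# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {"success": 0.4, "uptime": 0.2, "threat": 0.2, "integrity": 0.2}
DOWNGRADE_BELOW = 0.5


@dataclass
class PeerStats:
    success_rate: float  # fraction of tasks completed successfully, 0..1
    uptime: float        # fraction of time the peer was reachable, 0..1
    threat: float        # threat signal, 0 (clean) .. 1 (hostile)
    integrity: float     # fraction of integrity checks passed, 0..1


def trust_score(p: PeerStats) -> float:
    """Weighted blend of the four signals; threat is inverted so more threat means less trust."""
    return (
        WEIGHTS["success"] * p.success_rate
        + WEIGHTS["uptime"] * p.uptime
        + WEIGHTS["threat"] * (1.0 - p.threat)
        + WEIGHTS["integrity"] * p.integrity
    )


def is_trusted(p: PeerStats) -> bool:
    """A peer is downgraded as soon as its score falls below the threshold."""
    return trust_score(p) >= DOWNGRADE_BELOW
```

The point of the sketch is only that the decision is a function of observed numbers, so a peer's standing changes when the numbers do, with no config edit involved.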

KARIMO (github.com/opensesh/KARIMO, Apache-2.0) bets on structural isolation and disciplined context. It is built on a commercial agent SDK as a coding-assistant plug-in, and it runs each agent in its own git worktree with branch identity verification, so agents cannot trample each other's working state. Execution is wave-ordered: tasks inside a wave run in parallel, waves run in sequence, across three loops it calls Foundation, Decomposition, and Orchestration. It layers context for token efficiency rather than dumping everything into every prompt, and it advertises semantic loop detection as a capability beyond the base SDK. The bet is that isolation and tiered context buy you reliability at scale.
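The wave ordering is easy to picture in a few lines. This is a generic sketch of "parallel inside a wave, sequential across waves" using Python's standard thread pool; it is not KARIMO's code, and the function name and the task type are assumptions of mine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Task = Callable[[], str]  # hypothetical task shape: no arguments, returns a result string


def run_waves(waves: list[list[Task]]) -> list[list[str]]:
    """Run each wave's tasks concurrently; start the next wave only after the current one finishes."""
    results: list[list[str]] = []
    with ThreadPoolExecutor() as pool:
        for wave in waves:
            futures = [pool.submit(task) for task in wave]
            # Collecting every result blocks until the whole wave is done.
            results.append([f.result() for f in futures])
    return results
```

The barrier between waves is what lets a later wave assume everything in the earlier one has landed.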

llm-council (github.com/karpathy/llm-council, MIT) bets on adversarial peer review done honestly. A single model grading its own output is self-referential, and you cannot trust it. llm-council runs three stages: every model answers the query, then each model reviews the others and ranks them on accuracy and insight, then a designated Chairman compiles a final answer. The detail that makes it work is, in its own words, that the model identities are anonymized so a model cannot play favorites when judging outputs. It is small, deliberately scoped, and by Karpathy. If you only read one of the three, read this one.

How they compare, in plain terms

The honest comparison lives at the level of design choices, so line them up. Isolation runs from federation-level trust gating in ruflo, to hard per-agent git worktree isolation in KARIMO, to none in llm-council's single-pass review. Execution is, respectively, a trust-routed federation, a wave-ordered three-loop pipeline, and a one-shot three-stage panel. Trust scoring is ruflo's whole thesis, expressed as a documented weighted formula; KARIMO does not center it; llm-council replaces it with anonymized rank-order review. On models, ruflo is explicitly multi-provider with failover, KARIMO is built on a single vendor's SDK and routes by complexity within it, and llm-council is a multi-model panel by construction. Loop detection is something KARIMO names as a feature and the other two do not foreground. Persistence runs from a vector memory store in ruflo, to git history as the durable artifact in KARIMO, to effectively none in llm-council's single exchange. That is the map. Now the parts I would push on.

ruflo, up close

Behavioral trust as a measured quantity is the strongest idea in the repo, and reading it gave me that small jolt of recognizing a good instinct. A trust score that blends success, uptime, threat, and integrity, and that downgrades a peer the moment the numbers say so, is the right shape for a system that has to keep working while individual agents go bad. The conviction that trust should be earned from observed behavior rather than declared in config is exactly the conviction I would build on.

The worries I would carry into production are about what a learning router does on a bad day, and these are my worries, not defects I measured. A score that moves with observed success can thrash when the pool is small and the feedback is noisy. And any system that routes toward what has worked has to answer the cold-start question: if a downgraded agent gets fewer tasks, it gets fewer chances to prove it recovered, and you can slide into rich-get-richer unless something deliberately explores. I did not see that explicitly addressed, so I would want to know how the trust loop avoids starving an agent that had one bad stretch. That is the general failure mode for any router that learns from outcomes: it will get quietly betrayed by its own feedback unless you design against it. The flip side is the part I would trust: when a peer genuinely goes bad, instant no-human downgrade is the behavior you want.
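One standard way to design against starvation is to reserve a small share of traffic for exploration. The sketch below is plain epsilon-greedy selection, a technique I am swapping in to illustrate the idea; I did not see it in ruflo, and the function name and parameters are hypothetical.

```python
import random


def pick_agent(scores: dict[str, float], epsilon: float, rng: random.Random) -> str:
    """With probability epsilon pick any agent uniformly; otherwise pick the highest-scored one."""
    if rng.random() < epsilon:
        # Exploration: a downgraded agent still gets an occasional task,
        # and with it a chance to show it has recovered.
        return rng.choice(sorted(scores))
    return max(scores, key=scores.get)
```

With epsilon at zero this is pure rich-get-richer; any positive value guarantees every agent keeps receiving some work, at the cost of occasionally routing a task to a peer the scores say is worse.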

KARIMO, up close

The git-worktree-per-agent isolation is the cleanest reliability idea across all three codebases, and it registered as such the moment I understood it. Most frameworks let agents share working state and rely on prompt instructions to keep them in their lane, which fails the instant two agents reach for the same file. Giving each agent its own worktree with branch identity verification removes a whole class of shared-state races structurally, not by asking nicely. It stuck with me as a principle: structure holds where instructions do not.
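For readers who have not used worktrees, this is roughly what the pattern looks like with stock git commands driven from Python. The branch naming scheme, the directory layout, and all function names are my assumptions; KARIMO's actual conventions may differ. Only `git worktree add -b` and `git rev-parse --abbrev-ref HEAD` are real git invocations here.

```python
import subprocess
from pathlib import Path


def worktree_add_cmd(repo: Path, agent_id: str) -> list[str]:
    """Build the git command that gives one agent its own branch and working directory."""
    branch = f"agent/{agent_id}"                      # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-wt-{agent_id}"  # hypothetical layout
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def current_branch(worktree: Path) -> str:
    """Ask git which branch is checked out in the given worktree."""
    out = subprocess.run(
        ["git", "-C", str(worktree), "rev-parse", "--abbrev-ref", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def verify_branch(worktree: Path, agent_id: str) -> None:
    """Refuse to proceed if the worktree is not on the branch this agent owns."""
    expected = f"agent/{agent_id}"
    actual = current_branch(worktree)
    if actual != expected:
        raise RuntimeError(f"{worktree} is on {actual!r}, expected {expected!r}")
```

An orchestrator would run the first command once per agent with `subprocess.run(cmd, check=True)` and call `verify_branch` before every write, so that a misdirected agent fails loudly instead of editing someone else's tree.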

The context layering is the part I would read more carefully before trusting at scale. KARIMO tiers context for token efficiency, a level-of-detail approach that loads compact summaries first and full content only when needed, which is a genuinely good token-conservation strategy and not a small one. What it is not, and the repo is honest about this, is a security boundary; it is a scanning discipline, not a wall an agent physically cannot climb. So my worry is the ordinary one for any retrieval-by-relevance scheme: when the right context is filed under words that do not match the query, the efficient path can skip it, and you find out at runtime. On loop detection, the repo names semantic loop detection as a capability but does not, in what I read, document the internals, so I will not describe a mechanism it does not state. The honest open question I would ask the maintainers is whether it can tell an intentional retry, an agent re-running a step after new information arrives, from a true loop. That distinction is hard, and I could not verify how they handle it.
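To show why the retry-versus-loop distinction is at least tractable, here is the crudest version I can think of: fingerprint each step together with its inputs, and treat a repeat as a loop only when the inputs are byte-for-byte the same. This is exact matching, not semantic detection, and it is entirely my own illustration; it says nothing about how KARIMO does it. The class name, labels, and threshold are hypothetical.

```python
import hashlib
import json
from collections import Counter


class RepeatClassifier:
    """Label each step execution by whether the step and its inputs have been seen before."""

    def __init__(self, max_identical: int = 2):
        self.max_identical = max_identical  # identical runs tolerated before calling it a loop
        self.seen: Counter = Counter()      # (step, input digest) -> times observed
        self.steps: set[str] = set()        # step names observed so far

    def observe(self, step: str, inputs: dict) -> str:
        digest = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
        key = (step, digest)
        self.seen[key] += 1
        first_time_step = step not in self.steps
        self.steps.add(step)
        if first_time_step:
            return "first_run"
        if self.seen[key] == 1:
            # Same step, inputs never seen before: new information arrived.
            return "retry_with_new_inputs"
        if self.seen[key] > self.max_identical:
            return "loop"
        return "identical_repeat"
```

The hard cases are exactly the ones this misses: inputs that differ in wording but not in meaning, which is presumably where a semantic detector earns its name.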

llm-council, up close

Identity anonymization is the sharpest single idea I read all week. The repo says it plainly: the model identities are anonymized so a model cannot play favorites when judging outputs. That one move targets the bias that makes self-grading worthless, a model preferring work from its own family, and it targets it structurally, because a reviewer that does not know whose answer it is reading cannot favor a name. Ranking on accuracy and insight rather than handing out absolute scores is the right instinct alongside it, since models are calibrated differently and a rank order sidesteps the worst of that. The README does not state the exact aggregation math, so I will not name an algorithm it never claims; the principle stands on the anonymization and the ranking, and that is enough.
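The anonymization step itself is small enough to sketch. The blinding below follows the idea as the repo describes it; the label format and function names are hypothetical, and the mean-rank aggregation is purely my assumption, included only so the example is complete, since the README does not state how rankings are combined.

```python
import random
from statistics import mean


def anonymize(answers: dict[str, str], rng: random.Random) -> tuple[dict[str, str], dict[str, str]]:
    """Return (label -> answer text, label -> model name). Reviewers are shown only the first."""
    models = list(answers)
    rng.shuffle(models)  # shuffle so label order leaks nothing about the model order
    labels = [f"Response {chr(ord('A') + i)}" for i in range(len(models))]
    blinded = {label: answers[model] for label, model in zip(labels, models)}
    key = dict(zip(labels, models))
    return blinded, key


def mean_rank(rankings: list[list[str]]) -> dict[str, float]:
    """Average position (1 = best) of each label across reviewers. Assumed aggregation."""
    positions: dict[str, list[int]] = {}
    for ranking in rankings:
        for pos, label in enumerate(ranking, start=1):
            positions.setdefault(label, []).append(pos)
    return {label: mean(p) for label, p in positions.items()}
```

Only the party holding `key` can map a label back to a model, so the reviewers rank text, not names.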

The gaps are scope gaps, not bugs, and the repo does not pretend otherwise. It is a peer-review primitive, not an orchestration framework. The flow is single-pass: every model answers, the panel ranks, the Chairman compiles, and that is the run. There is no reconciliation round where reviewers see each other's verdicts and revise, and no state carried across evaluations, which are exactly the things you would add if you wanted this to be a standing judge rather than a one-shot panel. None of that is a knock. It is a small tool that does one thing well, and the one thing is the right thing.

The thing none of them claims to fix

Each project is genuinely good at one bet: ruflo at measured trust across a federation, KARIMO at structural isolation and token-disciplined context, llm-council at unbiased multi-model review. And reading all three back to back, the same hole stayed open in every one, which is the part that has stuck with me since.

None of the three, in what I read, treats the meaning of a failure as a first-class thing that gets routed on. A failure can be the model getting it wrong, or the context being stale, or the plan asking for something impossible, or a downstream tool throwing a transient error. Those want completely different responses (re-route, re-plan, escalate, retry), and the natural default in a system like this is to collapse them into one undifferentiated failure signal and react to all of it the same way. The piece I keep wanting is a structured failure taxonomy with routing by cause: a layer that reads the failure mode out of an agent's output and dispatches to a handler built for that class, which means every agent has to conform to a typed failure schema. I did not see one defined in any of the three.
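Here is a minimal sketch of the shape I mean, under my own assumptions: four causes taken from the paragraph above, one typed record every agent must emit, and a dispatch table keyed on cause. None of these names come from the three repos, and the pairing of each cause with a response is my reading of the list, not an established mapping.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class FailureCause(Enum):
    MODEL_ERROR = "model_error"          # the model got it wrong
    STALE_CONTEXT = "stale_context"      # the context was out of date
    IMPOSSIBLE_PLAN = "impossible_plan"  # the plan asked for something that cannot be done
    TRANSIENT_TOOL = "transient_tool"    # a downstream tool hiccuped


@dataclass(frozen=True)
class Failure:
    """The typed schema every agent would have to emit instead of a bare error string."""
    cause: FailureCause
    agent: str
    detail: str


# Hypothetical handlers; each returns the name of the action the orchestrator should take.
HANDLERS: dict[FailureCause, Callable[[Failure], str]] = {
    FailureCause.MODEL_ERROR: lambda f: "reroute",
    FailureCause.STALE_CONTEXT: lambda f: "replan",
    FailureCause.IMPOSSIBLE_PLAN: lambda f: "escalate",
    FailureCause.TRANSIENT_TOOL: lambda f: "retry",
}


def dispatch(failure: Failure) -> str:
    """Route on cause rather than on the mere fact of failure."""
    return HANDLERS[failure.cause](failure)
```

The hard part is not the table; it is getting an honest `cause` out of an agent in the first place, which is why the schema has to be something agents conform to rather than something inferred after the fact.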

I want to be careful here, because this is an observation about a gap, not a claim that I originated the fix. I did not. Typing your failures and routing on them is not new; it is just not something these three foreground. But it is the thing I cannot stop thinking about, because I have lived the adjacent version of it. The failure that does not announce itself as a failure is the one that costs you. In my own pipeline a fresh planner once read a stale plan and confidently re-derived work that had already been done, because nothing told it the difference between an open step and one already closed, and the cleanest defense I found was to stop trusting the carried-over note and treat the durable record as the only source of truth. That is the same shape: a signal that looks fine until you ask what it actually means. Until failure semantics are structured and routable, orchestration frameworks will keep recovering gracefully from the easy failures and badly from the hard ones. That is the part the demos will never show you, and the only part that ever kept me up.

Citations

ruflo source code: https://github.com/ruvnet/ruflo (MIT)
KARIMO source code: https://github.com/opensesh/KARIMO (Apache-2.0)
llm-council source code: https://github.com/karpathy/llm-council (MIT)
Crandall, J. W., & Goodrich, M. A. (2005). Learning to compete, compromise, and cooperate in repeated general-sum games. ICML 2005. (Background on multi-agent trust and behavioral adaptation.)
Stiennon, N. et al. (2020). Learning to summarize from human feedback. NeurIPS 2020. (Background on evaluating model outputs against human preference judgments.)