A Field Guide to Multi-Agent Orchestration in Late 2025: ruflo, KARIMO, llm-council
The Problem Is Not Solved
Every few months a new paper or framework announces that multi-agent LLM orchestration has been figured out. ReAct, then Reflexion, then AutoGen, then LangGraph, then a hundred forks of each. Within their experimental setup, the problem is solved. The problem is that the setup is always a toy. Fixed task horizon. Homogeneous model pool. No concurrent agents writing to shared state. No context windows that truncate mid-plan. No feedback loops confused about whether they already tried a variant of the current action.
Real orchestration fails in the seams. It fails when an agent receives a stale plan written by an earlier agent with different context. It fails when a loop detector decides two semantically identical steps are different because the prompt was paraphrased. It fails when the choice of which model to invoke for which subtask is baked in as a config constant rather than learned from observed performance. Most insidiously, it fails when success metrics disagree across model boundaries with no neutral arbitration layer.
Three open-source projects represent different architectural bets on these seam problems. None is a toy. All three have source code you can read, run, and criticize.
The Three Bets
ruflo (github.com/ruvnet/ruflo, MIT) bets on behavioral adaptation. The router should not be a static config file. Routing decisions - which agent gets which task, which model executes which agent - should be derived from observed signals and refined continuously. ruflo implements this with a trust-scoring layer that tracks per-agent success rates and adjusts routing weights. The other distinguishing commitment is TTL-aware caching: when a cached response ages past its TTL, the expiry is not a simple invalidation - it feeds back into the routing weights for the agent that produced the cached output. The claim is that stale outputs indicate degrading reliability in a domain, and the router should learn this.
KARIMO (github.com/opensesh/KARIMO; license listed here as Apache-2.0, not yet verified) bets on plan immutability. Once a PRD-derived execution plan is committed, it is inviolable during execution. Agents execute it or fail. This is enforced structurally: each agent runs in an isolated git worktree with read-only access to the committed plan. The architectural depth is in context tiering. KARIMO defines a three-level hierarchy: L0 is a global capability registry, L1 holds PRD-level scope constraints, and L2 is agent-local ephemeral state. Agents cannot read across tiers except through defined escalation interfaces. The loop detector uses a multi-dimensional semantic fingerprint of each step (intent embedding, output hash, agent ID, timestamp) to identify loops that are semantically equivalent but lexically disguised.
llm-council (github.com/karpathy/llm-council, MIT) bets on adversarial peer review. A single model evaluating its own output is self-referential and unreliable. The solution is not just multi-model evaluation - it’s anonymized multi-model evaluation. Each model is assigned an opaque label through a one-to-one (bijective) mapping; reviewers see only the label, not the model identity, so they cannot preferentially rate outputs from their own architecture family. Aggregation uses rank ordering rather than averaging raw scores, which sidesteps differences in rating-scale calibration across models. The repo describes this as average rank position; whether the implementation is strictly a Borda count is unverified. The codebase is roughly 800 lines of clean Python. If you read one of these three, make it this one.
Side-by-Side Feature Matrix
| Dimension | ruflo | KARIMO | llm-council |
|---|---|---|---|
| Agent isolation | Soft - shared process, role-scoped prompts | Hard - isolated git worktrees per agent | N/A - single-turn review, no persistent agents |
| Plan immutability | None - plans are dynamic | Enforced - PRD committed before execution | N/A - no planning layer |
| Loop detection | TTL-based cache expiry feedback | Multi-dimensional semantic fingerprint | N/A |
| Context tiering | Flat - global shared context | L0/L1/L2 explicit hierarchy | Single evaluation context |
| Trust/scoring | Behavioral trust scores, per-agent weights | Not implemented | Bijective anonymization + rank aggregation (strict Borda count unverified) |
| Learning mechanism | Online weight adjustment from outcomes | None - plan is static | None - stateless evaluation |
| State persistence | In-memory (no durable store) | Git worktree + commit history | None |
| License | MIT | Apache-2.0 (unverified) | MIT |
| Primary language | Python | Python | Python |
| LLM agnosticism | Yes - model abstraction layer | Partial - single-vendor-centric defaults | Yes - provider-neutral interface |
| Error recovery | Retry with degraded routing weight | Fail-fast - agent marks step failed, escalates | No recovery - single-shot |
| Metadata persistence | In-memory only, lost on restart | Git history serves as audit log | No persistence |
ruflo: Deep Dive
Behavioral trust scoring is the strongest idea here. Each agent accumulates a history of outcomes: success, partial success, timeout, hallucination (detected via a lightweight verification step). The router consults these histories when assigning the next task. This is a simplified descendant of COIN (Collective Intelligence) systems from the early 2000s, but new in the LLM orchestration context, and the implementation is readable.
The cache-TTL feedback loop is the second strong idea. When a cached response is served, its TTL countdown begins. On expiry, the framework decrements the issuing agent’s trust score for that task category. The intuition: a model whose outputs age quickly (because reviewers override them) is producing low-confidence or domain-inappropriate outputs. The signal is plausible.
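The expiry-to-trust feedback can be sketched in a few lines. This is a minimal illustration, not ruflo's code: the class names, the penalty of 0.05, the neutral starting score of 0.5, and the choice to start the TTL clock at insertion time are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TrustTable:
    """Per-agent, per-category trust in [0, 1]. Hypothetical structure."""
    initial: float = 0.5          # assumed neutral starting score
    expiry_penalty: float = 0.05  # assumed penalty per expired cache entry
    scores: dict = field(default_factory=dict)

    def get(self, agent: str, category: str) -> float:
        return self.scores.get((agent, category), self.initial)

    def record_expiry(self, agent: str, category: str) -> None:
        # A cached output aging past its TTL counts as a small negative signal.
        old = self.get(agent, category)
        self.scores[(agent, category)] = max(0.0, old - self.expiry_penalty)


@dataclass
class CacheEntry:
    agent: str
    category: str
    value: str
    expires_at: float  # absolute time; the clock starts at insertion here


class FeedbackCache:
    """Cache whose expiries feed back into the trust table."""

    def __init__(self, trust: TrustTable):
        self.trust = trust
        self.entries: dict[str, CacheEntry] = {}

    def put(self, key: str, entry: CacheEntry) -> None:
        self.entries[key] = entry

    def get(self, key: str, now: float):
        entry = self.entries.get(key)
        if entry is None:
            return None
        if now >= entry.expires_at:
            # Expiry is not just invalidation: it penalizes the issuing agent.
            del self.entries[key]
            self.trust.record_expiry(entry.agent, entry.category)
            return None
        return entry.value


trust = TrustTable()
cache = FeedbackCache(trust)
cache.put("q1", CacheEntry("agent-a", "code-review", "LGTM", expires_at=10.0))
hit = cache.get("q1", now=5.0)    # still fresh
miss = cache.get("q1", now=11.0)  # expired: agent-a is penalized in code-review
```

Keying the score on (agent, category) rather than agent alone keeps an expiry in one domain from lowering trust in another.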
ruflo has no convergence guarantees. With a small agent pool and noisy feedback, trust scores thrash. There is no exploration-exploitation balance - ruflo never intentionally routes to a lower-trust agent to gather updated evidence. Once an agent’s score drops, it gets fewer tasks and less opportunity to recover. Classic cold-start / rich-get-richer, unaddressed.
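For contrast, here is what a minimal exploration step could look like. This is an epsilon-greedy sketch, one common way to balance exploration and exploitation; it is not part of ruflo, and `pick_agent`, the agent names, and the epsilon value are hypothetical.

```python
import random


def pick_agent(trust: dict[str, float], epsilon: float, rng: random.Random) -> str:
    """Epsilon-greedy routing over per-agent trust scores (hypothetical helper)."""
    agents = sorted(trust)  # stable order so a seeded rng is reproducible
    if rng.random() < epsilon:
        # Explore: any agent, including low-trust ones, gets a chance to recover.
        return rng.choice(agents)
    # Exploit: route to the currently most trusted agent.
    return max(agents, key=lambda name: trust[name])


trust = {"agent-a": 0.9, "agent-b": 0.4, "agent-c": 0.1}
rng = random.Random(7)
picks = [pick_agent(trust, epsilon=0.2, rng=rng) for _ in range(500)]
```

With epsilon set to 0 the router is purely greedy, which is the rich-get-richer behavior described above; any positive epsilon guarantees that a demoted agent still receives occasional tasks and can generate fresh evidence.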
Prompt-as-state-machine is a design smell throughout. Agent behavior is controlled by injecting state descriptions into the system prompt rather than explicit state variables. This breaks when context approaches its limit and state gets truncated, or when the model paraphrases the state back in a way that contradicts the actual system state.
One agent configuration references a model name that does not exist on the referenced provider’s API as of the last commit - a copy-paste artifact illustrating the broader problem with prompt-as-config: mistakes are invisible until runtime, and runtime errors in async multi-agent systems are hard to trace.
Error recovery is retry-with-degraded-weight: if an agent fails, its weight is decremented and the task is reissued. This works for transient failures. For systematic failures beyond the model pool’s capability, every agent’s weight falls, no agent surfaces the root cause, and the system degrades silently without alerting the operator.
KARIMO: Deep Dive
The L0/L1/L2 context hierarchy is the cleanest idea in these three codebases. Most frameworks give all agents access to all context and rely on prompt instructions to constrain scope. This fails at scale: agents with large capability registries in context use capabilities they were not intended to use, because the model does not distinguish “I can do this” from “I should do this here.” KARIMO’s answer is structural: agents physically cannot read the L0 registry without passing through an L1 scope gate, implemented as an access wrapper. Prompt instructions lie; access wrappers do not.
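A scope gate of this kind might look like the following. It is a sketch of the general pattern only; `L1ScopeGate`, `ScopeViolation`, and the capability names are hypothetical and are not taken from KARIMO's source.

```python
from types import MappingProxyType


class ScopeViolation(Exception):
    """Raised when an agent reaches for a capability outside its scope."""


class L1ScopeGate:
    """Hypothetical gate: exposes only the L0 capabilities the L1 scope allows."""

    def __init__(self, l0_registry: dict, allowed: set):
        unknown = allowed - set(l0_registry)
        if unknown:
            # Misconfigured scope fails at construction, before any agent runs.
            raise ScopeViolation(f"scope names unknown capabilities: {sorted(unknown)}")
        # Read-only view containing only in-scope entries; the agent never
        # holds a reference to the full registry.
        self._view = MappingProxyType({name: l0_registry[name] for name in allowed})

    def visible(self) -> list:
        return sorted(self._view)

    def capability(self, name: str):
        try:
            return self._view[name]
        except KeyError:
            raise ScopeViolation(f"{name!r} is outside this agent's L1 scope") from None


registry = {"read_file": "fs.read", "run_tests": "ci.test", "deploy": "ops.deploy"}
gate = L1ScopeGate(registry, allowed={"read_file", "run_tests"})
```

The agent is handed `gate`, never `registry`, so an out-of-scope capability such as `deploy` cannot appear in its context no matter what the prompt says.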
Immutable plan commitment is sound in principle. Renegotiable plans produce two failure modes: plan drift (each renegotiation moves the goal slightly, and after five renegotiations it bears no resemblance to the PRD) and blame diffusion (when the outcome is wrong, no agent is causally responsible because the plan changed). Committing the plan as a git artifact eliminates both. The worktree isolation also eliminates an entire class of shared-mutable-state race condition.
The multi-dimensional loop fingerprint includes a semantic embedding of intent, an output hash, the agent ID, and a timestamp. A loop is flagged when a new fingerprint is within a configurable Hamming distance of a prior one across all four dimensions. This catches paraphrased loops that string-matching detectors miss.
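One way to realize such a fingerprint is sketched below. Several choices here are assumptions rather than KARIMO's implementation: the intent embedding is taken to be sign-quantized into a bit string so that Hamming distance applies, the output hash and agent ID are compared for equality, and the timestamp is compared against a time window. All names and thresholds are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StepFingerprint:
    intent_bits: int   # assumed: intent embedding quantized to bits, packed in an int
    output_hash: str   # digest of the step's output
    agent_id: str
    timestamp: float   # seconds


def output_digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hamming(a: int, b: int) -> int:
    # Number of bit positions where the two quantized embeddings differ.
    return bin(a ^ b).count("1")


def is_loop(new: StepFingerprint, history: list,
            max_intent_distance: int = 2, window_seconds: float = 600.0) -> bool:
    """Flag a loop only when every dimension agrees with a prior step."""
    for old in history:
        if (old.agent_id == new.agent_id
                and old.output_hash == new.output_hash
                and 0 <= new.timestamp - old.timestamp <= window_seconds
                and hamming(old.intent_bits, new.intent_bits) <= max_intent_distance):
            return True
    return False


first = StepFingerprint(0b10110010, output_digest("tests failed"), "agent-1", 0.0)
paraphrase = StepFingerprint(0b10110011, output_digest("tests failed"), "agent-1", 30.0)
new_result = StepFingerprint(0b10110011, output_digest("tests passed"), "agent-1", 60.0)
```

Here `paraphrase` differs from `first` by one intent bit and produces the same output, so it is flagged; `new_result` has a near-identical intent but a different output hash, so it is not.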
The gaps are structural. Using GitHub issue labels as the agent state machine exposes the execution graph to GitHub API rate limits and latency variance. When GitHub is slow, the state machine stalls. A local event bus with optional GitHub sync would be more resilient.
The loop detector has no mechanism to distinguish intentional retries from true loops. If an agent re-evaluates a step after gathering new information, it gets flagged when the new information does not sufficiently change the intent embedding. The distance threshold is a config constant with no tuning guidance.
Context pruning for large L0 registries is undocumented. The implementation passes the full L0 registry filtered by simple keyword match against the PRD - this breaks as soon as capability names diverge from PRD vocabulary, which is immediate in any natural language workflow.
llm-council: Deep Dive
Bijective label anonymization is the sharpest idea here. Each model gets a random opaque label before evaluation; reviews run against labels, not identities. The coordinator is the only component that knows the mapping and decodes labels after collection. This breaks two known failure modes: self-preference bias (models rate their own architecture family higher) and prestige bias (models defer to outputs from a “better” model). Neither operates if reviewers don’t know whose output they’re reviewing.
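The mechanism is small enough to sketch in full. The following is an illustration of the pattern, not llm-council's code; the function names, the "Response A" label format, and the model names are assumptions.

```python
import random
import string


def anonymize(outputs: dict[str, str], rng: random.Random):
    """Assign each model a distinct opaque label (sketch supports up to 26 models).

    Returns the label-keyed outputs shown to reviewers and the private
    label -> model mapping that only the coordinator keeps.
    """
    models = sorted(outputs)
    labels = [f"Response {string.ascii_uppercase[i]}" for i in range(len(models))]
    rng.shuffle(models)  # random assignment, so a label reveals nothing about identity
    label_to_model = dict(zip(labels, models))  # one-to-one by construction
    anonymized = {label: outputs[model] for label, model in label_to_model.items()}
    return anonymized, label_to_model


def decode(scores_by_label: dict, label_to_model: dict[str, str]) -> dict:
    """Coordinator-side step: map collected scores back to model identities."""
    return {label_to_model[label]: score for label, score in scores_by_label.items()}


outputs = {"model-x": "answer one", "model-y": "answer two", "model-z": "answer three"}
anonymized, mapping = anonymize(outputs, random.Random(0))
scores = decode({label: len(text) for label, text in anonymized.items()}, mapping)
```

Reviewers receive only `anonymized`; `mapping` stays with the coordinator until all reviews are in.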
Rank-ordering aggregation is the right choice. Averaging raw scores fails because models have different calibrations - a 7/10 from one model is not a 7/10 from another. Converting to rank orderings before aggregating removes those calibration differences. Whether the implementation is strictly a Borda count or a simpler average-rank method is unverified; the distinction matters for citations.
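To make the two candidate methods concrete, here is a short sketch of both. Neither function is taken from llm-council; the names and the sample ballots are illustrative.

```python
def average_rank(rankings: list[list[str]]) -> dict[str, float]:
    """Mean 1-based position of each label across reviewers (lower is better)."""
    positions: dict[str, list[int]] = {}
    for ranking in rankings:
        for position, label in enumerate(ranking, start=1):
            positions.setdefault(label, []).append(position)
    return {label: sum(p) / len(p) for label, p in positions.items()}


def borda(rankings: list[list[str]]) -> dict[str, int]:
    """Borda count: n-1 points for first place down to 0 for last (higher is better)."""
    scores: dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] = scores.get(label, 0) + (n - 1 - position)
    return scores


ballots = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
avg = average_rank(ballots)   # A: 4/3, B: 2.0, C: 8/3
points = borda(ballots)       # A: 5, B: 3, C: 1
```

When every reviewer ranks every candidate, the two methods always produce the same ordering, because each Borda score is a fixed linear function of the average rank. They can diverge when ballots are partial, since average rank divides by the number of ballots a candidate appears on while Borda simply sums points.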
The gaps are scope-driven. llm-council is a peer review primitive, not an orchestration framework. There is no state persistence and no mechanism to accumulate reviewer-calibration data across evaluations. Position bias - models favoring the first or last item in a list - is not addressed. The anonymization solves identity bias but not order bias.
The single-turn constraint is limiting. Multi-turn deliberation (a reconciliation round where reviewers see each other’s scores and revise) is better for complex judgments. llm-council does not support it.
Patterns Worth Stealing
Behavioral Trust Scoring. Track per-agent success at the task-category level, not aggregate. An agent excellent at code review but terrible at open-ended reasoning has middling aggregate trust but strong categorical trust in one domain. Weight routing by categorical trust.
Bijective Label Anonymization. Any time multiple models evaluate each other’s outputs, implement this. The coordinator assigns random opaque labels and decodes after collection. Implementation cost is a hash map and a shuffle; the benefit is elimination of identity-based rating bias. Directly portable to any multi-model scoring system.
Multi-Dimensional Loop Fingerprint. String-matching is too narrow; pure embedding-distance is too broad. KARIMO’s four-dimensional fingerprint (intent embedding, output hash, agent ID, timestamp) hits a better operating point. A loop requires similarity in both intent and output - matching on only one produces false positives.
Plan Commitment as Git Artifact. Committing the plan as a file before spawning agents gives audit trail, diff history, and a rollback point for free. Immutability is enforced not by policy but by read-only access. Policy-based immutability erodes under pressure; structural immutability does not.
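The structural idea can be illustrated without git. The sketch below is a stand-in, not KARIMO's mechanism: it uses a read-only file plus a content digest where the real pattern uses a git commit and worktrees, and `commit_plan` and `load_committed_plan` are hypothetical names.

```python
import hashlib
import stat
import tempfile
from pathlib import Path


def commit_plan(path: Path, plan_text: str) -> str:
    """Write the plan once, drop write permission, and return its digest."""
    data = plan_text.encode("utf-8")
    path.write_bytes(data)
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # read-only for everyone
    return hashlib.sha256(data).hexdigest()


def load_committed_plan(path: Path, expected_digest: str) -> str:
    """Agents read the plan through this check; a changed plan is a hard error."""
    data = path.read_bytes()
    if hashlib.sha256(data).hexdigest() != expected_digest:
        raise RuntimeError("plan changed after commit; refusing to execute")
    return data.decode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    plan_path = Path(tmp) / "plan.md"
    digest = commit_plan(plan_path, "1. build\n2. test\n")
    plan = load_committed_plan(plan_path, digest)
```

A git commit gives the same guarantee plus history and diffs; the digest check shows the minimal invariant, which is that execution refuses to proceed against a plan that differs from the committed one.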
Context Tiering by Access Gating. Define explicit tiers (global registry / project scope / agent ephemeral) and enforce them with access wrappers, not prompt instructions. Scope violations become initialization-time errors rather than runtime hallucinations.
Patterns to Skip
Prompt-as-State-Machine. Encoding agent state as natural language in the system prompt is brittle: descriptions get truncated at context limits, and models paraphrase state in ways that contradict the actual system state. Use explicit state variables alongside the prompt.
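A minimal sketch of the alternative: state lives in typed variables, and the prompt text is rendered from them on every call instead of being the source of truth. The phases, field names, and prompt wording are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PLANNING = "planning"
    EXECUTING = "executing"
    DONE = "done"


# Legal transitions are data, checked in code rather than described in prose.
ALLOWED = {
    Phase.PLANNING: {Phase.EXECUTING},
    Phase.EXECUTING: {Phase.DONE},
    Phase.DONE: set(),
}


@dataclass
class AgentState:
    phase: Phase = Phase.PLANNING
    attempts: int = 0

    def transition(self, target: Phase) -> None:
        if target not in ALLOWED[self.phase]:
            raise ValueError(f"illegal transition {self.phase.value} -> {target.value}")
        self.phase = target


def render_system_prompt(state: AgentState) -> str:
    # The prompt is a view of the state; truncation or paraphrase by the model
    # cannot change what the orchestrator believes.
    return f"Current phase: {state.phase.value}. Attempts so far: {state.attempts}."


state = AgentState()
state.transition(Phase.EXECUTING)
prompt = render_system_prompt(state)
```

If the model's reply contradicts the rendered state, the orchestrator still holds the authoritative copy and can reject the reply.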
GitHub Issues as Execution State Store. Coupling the execution graph to GitHub API availability, rate limits, and latency is fragile. Use a local event bus or proper queue. GitHub labels are fine for human-readable status, not for programmatic state transitions.
Single-Number Aggregate Trust. Aggregate success rate across all task types conflates unrelated capabilities. Track categorical trust per task type. Aggregate trust starves capable-but-specialized agents whose overall numbers look mediocre.
Why None of Them Solve It
Each project is good at one thing. ruflo at adaptive routing. KARIMO at plan integrity and context isolation. llm-council at unbiased multi-model evaluation. The gap is not execution - it is structural.
None of the three addresses heterogeneous failure semantics across agent boundaries. When an agent returns an error, what does it mean? Did the model get it wrong? Was the context stale? Did the plan ask for something impossible? Did a downstream tool return a transient error? These require different recovery strategies: re-route, re-plan, escalate, retry. All three frameworks collapse them into a single failure signal and respond uniformly.
The missing primitive is structured failure taxonomy with routing-by-cause - a classification layer that parses failure mode from the agent’s output and dispatches to a handler specific to that class. This requires a typed failure schema each agent conforms to. None of the three defines such a schema, so recovery decisions are made on unstructured signals that are frequently misinterpreted.
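As a rough sketch of what such a schema could look like: the four causes below follow the questions raised above, while the pairing of each cause with a recovery strategy, and every name in the code, is an assumption made for illustration. None of the three projects defines this.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCause(Enum):
    MODEL_ERROR = "model_error"                    # the model got it wrong
    STALE_CONTEXT = "stale_context"                # the agent acted on outdated context
    INFEASIBLE_PLAN = "infeasible_plan"            # the plan asked for the impossible
    TRANSIENT_TOOL_ERROR = "transient_tool_error"  # a downstream tool hiccuped


class Recovery(Enum):
    REROUTE = "re-route"
    REPLAN = "re-plan"
    ESCALATE = "escalate"
    RETRY = "retry"


@dataclass(frozen=True)
class Failure:
    """Typed failure record every agent would be required to emit."""
    cause: FailureCause
    agent_id: str
    detail: str


# Assumed pairing of cause to handler; a real system would tune this table.
RECOVERY_BY_CAUSE = {
    FailureCause.MODEL_ERROR: Recovery.REROUTE,
    FailureCause.STALE_CONTEXT: Recovery.REPLAN,
    FailureCause.INFEASIBLE_PLAN: Recovery.ESCALATE,
    FailureCause.TRANSIENT_TOOL_ERROR: Recovery.RETRY,
}


def dispatch(failure: Failure) -> Recovery:
    """Route by cause instead of treating every failure as the same signal."""
    return RECOVERY_BY_CAUSE[failure.cause]


action = dispatch(Failure(FailureCause.INFEASIBLE_PLAN, "agent-3", "target API removed"))
```

The hard part is not the dispatch table but getting agents to classify their own failures reliably; the schema only makes that classification explicit and checkable.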
Until failure semantics are structured and routable, orchestration frameworks will keep recovering correctly from easy failures and failing catastrophically on hard ones.
Citations
- ruflo source code: https://github.com/ruvnet/ruflo (MIT)
- KARIMO source code: https://github.com/opensesh/KARIMO (Apache-2.0, unverified)
- llm-council source code: https://github.com/karpathy/llm-council (MIT)
- Borda, J.-C. (1781). Mémoire sur les élections au scrutin. Histoire de l’Académie Royale des Sciences. (Original formulation of Borda count aggregation.)
- Crandall, J. W., & Goodrich, M. A. (2005). Learning to compete, compromise, and cooperate in repeated general-sum games. ICML 2005. (Background on multi-agent trust and COIN-style behavioral adaptation.)
- Stiennon, N. et al. (2020). Learning to summarize from human feedback. NeurIPS 2020. (Documents self-preference bias in model evaluation, motivating anonymization approaches.)