# Mixture-of-Agents: Making LLMs Collaborate Instead of Compete

Source: Dev.to

What if instead of picking the best model for your prompt, you made all models collaborate on the answer?

That's the core idea behind Mixture-of-Agents (MoA), a technique from a 2024 research paper which showed that LLMs produce better outputs when they can see and improve upon each other's responses. The paper demonstrated that even weaker models can boost the quality of stronger ones through this iterative refinement.

I implemented MoA as a production API endpoint. This post covers the architecture, the six strategies I built, the engineering decisions that weren't obvious, and the parts that surprised me.

## The Problem With "Just Pick the Best Model"

Most developers approach multi-model setups with a simple question: which model is best for this task? But the answer changes depending on the prompt, the domain, the time of day, and honestly a bit of luck.

I noticed something while building a Compare mode that runs the same prompt through multiple models simultaneously. When I looked at the side-by-side outputs, the best answer was rarely from a single model. One model would nail the structure. Another would have a better code example. A third would catch an edge case the others missed.

The insight: the best response doesn't exist yet. It's a synthesis of what each model does well.

## How MoA Works: The Two-Phase Architecture

Every MoA request follows the same skeleton: N models answer the prompt independently (source generation), then a synthesizer model combines the best parts (synthesis). Phase 1 is embarrassingly parallel, since all models run concurrently. Phase 2 is where the strategy matters.

This looks simple, but the synthesis step is where the engineering complexity lives. I didn't build just one synthesis approach.
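To make the two-phase skeleton concrete, here is a minimal, self-contained sketch of the pattern. The `call_model` stub and the model names are placeholders standing in for real provider calls; this illustrates the fan-out-then-synthesize shape, not the production code.

```python
import asyncio

# Placeholder for a real provider call; returns a tagged draft answer.
async def call_model(model: str, prompt: str) -> str:
    await asyncio.sleep(0)  # stands in for network latency
    return f"[{model}] answer to: {prompt}"

async def blend(models: list[str], synthesizer: str, prompt: str) -> str:
    # Phase 1: fan out to every source model concurrently.
    drafts = await asyncio.gather(
        *(call_model(m, prompt) for m in models), return_exceptions=True
    )
    successes = [d for d in drafts if not isinstance(d, Exception)]
    if not successes:
        raise RuntimeError("all source models failed")
    # Phase 2: a single synthesizer call sees every surviving draft.
    combined = "\n---\n".join(successes)
    return await call_model(synthesizer, "Synthesize the best parts:\n" + combined)

result = asyncio.run(blend(["model-a", "model-b", "model-c"], "model-b", "What is MoA?"))
```

Note the `return_exceptions=True`: one failed provider shouldn't sink the whole request, so failures are filtered rather than raised.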
## Six Strategies, Six Different Behaviors

Different use cases need different synthesis behaviors.

## Strategy 1: Consensus (Default)

The synthesizer gets all source responses and one instruction: combine the strongest points while resolving contradictions. This is the workhorse strategy. For most prompts, consensus produces noticeably better answers than any single model, because the synthesizer naturally picks the best explanation from one model, the best code from another, and structures the result coherently.

## Strategy 2: Council

Same input, but the synthesis output is structured differently: instead of one blended answer, the synthesizer returns JSON separating the final answer, agreement points, disagreement points, and follow-up questions. Council mode is invaluable when you need transparency about model consensus. If you're using LLMs for research or decision support, knowing where models agree versus where they disagree is often more useful than a single blended answer.

## Strategy 3: Best-Of

The synthesizer picks the single best response and enhances it with useful additions from the others. Rewriting is minimal; the focus is on augmentation. This is the fastest synthesis approach, and it works well when one model clearly dominates but the others have minor additions worth incorporating.

## Strategy 4: Chain

The synthesizer works through each response sequentially, building a comprehensive answer by incrementally incorporating each model's contribution. Chain produces the most thorough output but tends to be longer. Use it when completeness matters more than conciseness.

## Strategy 5: MoA (The Real Thing)

This is where it gets interesting. The previous strategies are all single-pass synthesis. True MoA adds refinement layers in which models iterate on each other's work: each layer's responses are injected into the next layer's prompts as reference material via a system message.

## The Engineering Decisions That Mattered

Reference budget management. You can't just dump three 4,000-token responses into the context of every model at every layer. I set a total reference budget of 12,000 characters across all references, with a 3,200-character cap per individual answer. Anything longer gets truncated. This keeps costs sane while preserving the most useful content.

Early stopping.
If a layer produces zero successful responses (all models hit rate limits or errors), the system keeps the previous layer's successes and skips straight to synthesis. This prevents total failure when one bad layer would otherwise cascade.

Layer count sweet spot. The paper tested up to 3 layers. In practice, I found that 1-2 layers give the best quality-to-cost ratio. Layer 0 to Layer 1 produces the biggest quality jump; Layer 1 to Layer 2 is a marginal improvement for double the API calls. I default to `layers: 1` and let users override it.

## Strategy 6: Self-MoA

What if you trust one model but want to hedge against its variance? Self-MoA generates multiple diverse candidates from a single model by varying the temperature and system prompt. For a request with `temperature: 0.7` and 4 samples, the candidates range from a conservative low-temperature pass to an exploratory high-temperature one. The synthesizer then combines these four perspectives into one answer. It's surprisingly effective: you get diversity without paying for multiple model providers.

## What Surprised Me

Weaker models genuinely improve stronger ones. I was skeptical, but the data backs the paper's finding. When Gemini Flash (a fast, cheap model) is included alongside GPT and Claude in MoA, the final synthesized answer is often better than a 2-model blend of just GPT + Claude. The weaker model catches things the stronger ones miss, or phrases things differently enough to trigger better synthesis.

The synthesizer model matters more than the source models. If I had to pick where to spend my budget, I'd put the best model as the synthesizer and use cheaper models as sources. The synthesis step is where quality is won or lost.

Consensus beats MoA for simple prompts. Full MoA with refinement layers is overkill for straightforward questions; the extra API calls and latency aren't worth it. I use MoA for high-value outputs (technical architecture decisions, long-form content, complex code generation) where the quality improvement justifies 3-4x the cost.

Streaming MoA is a UX challenge. In Compare mode, you can stream each model's response as it arrives.
In MoA, the user sees nothing until Phase 2 starts. I solved this by streaming status events during Phase 1 so the user knows progress is happening.

## When to Use What

Here's my decision framework after running thousands of requests through each strategy:

- Consensus: the default for most prompts
- Council: when you need transparency about agreement and disagreement
- Best-Of: when speed matters and one model usually dominates
- Chain: when completeness matters more than conciseness
- MoA: high-value outputs where the quality improvement justifies the extra calls
- Self-MoA: when you trust one model but want to hedge against its variance

All strategies cost the same from a billing perspective because the credit cost is fixed per Blend request. The real cost difference is in the underlying API calls: MoA with 2 layers and 3 models makes 9 API calls (3 per layer × 3 layers including synthesis), while Consensus makes 4 (3 source + 1 synthesis).

## Try It Yourself

If you want to experiment with these strategies, the full API is at LLMWise; a sample Blend request is shown in the curl example at the end of this post. The complete technical documentation covering all six strategies, the scoring algorithms, and the reference injection system is at llmwise.ai/llms-full.txt.

## The Bigger Picture

MoA represents a shift in how we think about LLMs. Instead of asking "which model is best?", we ask "how can models collaborate?" The answer turns out to be: surprisingly well, when you give them the right architecture.

The techniques here aren't theoretical. They're running in production, handling real requests, and consistently producing better outputs than any single model alone. The cost overhead is real, but for high-value use cases the quality improvement is worth it.

If you're running multi-model setups in production, I'd love to hear your approach. Are you blending outputs or just routing to the best model? What's working?
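For readers who would rather call the API from Python than curl, here is a sketch that builds the same Blend payload as the curl example at the end of this post. The payload fields mirror that example; the actual HTTP call is shown commented out, since it needs the third-party `requests` package and a real API key.

```python
import json

# Payload mirroring the sample Blend request from the curl example.
payload = {
    "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
    "synthesizer": "claude-sonnet-4.5",
    "strategy": "moa",
    "layers": 1,
    "messages": [
        {"role": "user", "content": "Design a rate limiter for a distributed system"}
    ],
    "stream": True,
}
body = json.dumps(payload)

# To actually send it (requires `pip install requests` and a real key):
# import requests
# resp = requests.post(
#     "https://llmwise.ai/api/v1/blend",
#     headers={"Authorization": "Bearer mm_sk_YOUR_KEY",
#              "Content-Type": "application/json"},
#     data=body,
#     stream=True,
# )
```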
## Appendix: Code and Prompts

The two-phase skeleton:

```
Phase 1: Source Generation
└── N models answer the prompt independently

Phase 2: Synthesis
└── A synthesizer model combines the best parts
```

The orchestration code:

```python
async def blend(models, synthesizer, messages, strategy):
    # Phase 1: Get source responses (concurrent)
    tasks = [call_model(m, messages) for m in models]
    source_responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Filter failures
    successes = [r for r in source_responses if not isinstance(r, Exception)]
    if len(successes) == 0:
        raise AllSourcesFailedError()

    # Phase 2: Synthesize based on strategy
    return await synthesize(synthesizer, messages, successes, strategy)
```

The consensus synthesis prompt:

```python
CONSENSUS_PROMPT = """You are a synthesis expert. You have received multiple
responses to the same question from different AI models.

Your job:
1. Identify the strongest points from each response
2. Resolve any contradictions by weighing the majority view
3. Produce one definitive answer that's better than any individual response

Do not mention that multiple models were consulted.
"""
```

The council output structure:

```json
{
  "final_answer": "The synthesized conclusion",
  "agreement_points": ["Where all models aligned"],
  "disagreement_points": ["Where they diverged + analysis"],
  "follow_up_questions": ["Areas needing exploration"]
}
```

How chain synthesis builds the answer:

```
Step 1: Start with Model A's response as base
Step 2: Read Model B's response, integrate new points
Step 3: Read Model C's response, integrate new points
Step 4: Final coherence pass
```

The MoA layer flow:

```
Layer 0: Each model answers independently
  GPT    → Response A₀
  Claude → Response B₀
  Gemini → Response C₀

Layer 1: Each model sees Layer 0's answers as "references"
  GPT    sees [B₀, C₀] → produces A₁ (improved)
  Claude sees [A₀, C₀] → produces B₁ (improved)
  Gemini sees [A₀, B₀] → produces C₁ (improved)

Layer 2: Each model sees Layer 1's answers
  GPT    sees [B₁, C₁] → produces A₂
  Claude sees [A₁, C₁] → produces B₂
  Gemini sees [A₁, B₁] → produces C₂

Final: Synthesizer combines Layer 2 outputs
```

The reference-injection prompt used at each refinement layer:

```python
REFERENCE_INJECTION = """Below are responses from other AI assistants for the
same question. Use them as references to improve your answer. Identify what's
strong, correct any errors, and expand where needed.

{references}

Now provide your improved response to the original question.
"""
```

The reference budget enforcement:

```python
MAX_TOTAL_CHARS = 12_000
MAX_PER_ANSWER = 3_200

def prepare_references(responses):
    truncated = [r[:MAX_PER_ANSWER] for r in responses]
    total = sum(len(r) for r in truncated)
    if total > MAX_TOTAL_CHARS:
        # Proportionally reduce each
        ratio = MAX_TOTAL_CHARS / total
        truncated = [r[:int(len(r) * ratio)] for r in truncated]
    return truncated
```

The layer loop with early stopping:

```python
async def run_moa_layers(models, messages, num_layers):
    prev_responses = None
    for layer in range(num_layers):
        layer_responses = await run_layer(models, messages, prev_responses)
        successes = [r for r in layer_responses if r is not None]
        if len(successes) == 0 and prev_responses:
            # Early stop: keep previous layer's results
            break
        if len(successes) > 0:
            prev_responses = successes
    return prev_responses
```

The self-MoA diversity settings:

```python
TEMPERATURE_OFFSETS = [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]

AGENT_PROMPTS = [
    "Focus on technical accuracy and precision.",
    "Prioritize practical examples and real-world applications.",
    "Emphasize clarity and make the explanation accessible.",
    "Be thorough and cover edge cases others might miss.",
    "Challenge assumptions and flag potential weaknesses.",
    "Focus on brevity and directness.",
]
```

Four self-MoA candidates for a base temperature of 0.7:

```
Candidate 1: temp 0.45, prompt "accuracy"   → conservative
Candidate 2: temp 0.70, prompt "practical"  → baseline
Candidate 3: temp 0.95, prompt "clarity"    → creative
Candidate 4: temp 1.15, prompt "edge cases" → exploratory
```

Status events streamed during Phase 1:

```json
{"event": "source", "model": "gpt-5.2", "status": "complete", "tokens": 847}
{"event": "source", "model": "claude-sonnet-4.5", "status": "complete", "tokens": 1203}
{"event": "source", "model": "gemini-3-flash", "status": "complete", "tokens": 692}
{"event": "synthesis", "status": "starting", "strategy": "consensus"}
{"event": "chunk", "content": "The key difference between..."}
```

A sample Blend request:

```bash
curl -X POST https://llmwise.ai/api/v1/blend \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
    "synthesizer": "claude-sonnet-4.5",
    "strategy": "moa",
    "layers": 1,
    "messages": [
      {"role": "user", "content": "Design a rate limiter for a distributed system"}
    ],
    "stream": true
  }'
```
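Client-side, the stream resolves to newline-delimited JSON events like the status events shown earlier. Here is a minimal sketch of consuming such a stream; the event shapes come from the post, but the `consume` helper itself is illustrative, not part of the API.

```python
import json

# Sample newline-delimited events, as streamed during a Blend request.
raw_stream = """\
{"event": "source", "model": "gpt-5.2", "status": "complete", "tokens": 847}
{"event": "source", "model": "claude-sonnet-4.5", "status": "complete", "tokens": 1203}
{"event": "source", "model": "gemini-3-flash", "status": "complete", "tokens": 692}
{"event": "synthesis", "status": "starting", "strategy": "consensus"}
{"event": "chunk", "content": "The key difference between..."}
"""

def consume(lines):
    """Accumulate synthesis chunks while reporting Phase 1 progress."""
    answer_parts = []
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event["event"] == "source":
            print(f"{event['model']} done ({event['tokens']} tokens)")
        elif event["event"] == "synthesis":
            print(f"synthesis starting ({event['strategy']})")
        elif event["event"] == "chunk":
            answer_parts.append(event["content"])
    return "".join(answer_parts)

final_text = consume(raw_stream.splitlines())
```

In a real client you would feed the same loop from the HTTP response line by line instead of a string; the per-event dispatch stays identical.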