AI Model Merging & Ensemble Strategies
Overview
After extracting meeting decisions with multiple AI models, you can merge the results to create a higher-quality consensus output. This guide covers industry-standard techniques for combining model outputs.
Why Merge Instead of Pick?
Your bronze data model now stores multiple extractions of the same decision:
SELECT source_ai_model, headline, outcome
FROM bronze_decisions
WHERE source_event_id = 192614 AND decision_id = 'D001';
Results:
gemini-1.5-flash: "Parks budget approved" | outcome: approved
gpt-4: "Council approves $2.5M parks renovation" | outcome: approved
claude-3: "Parks funding passes 7-2" | outcome: approved
Instead of picking one, merging synthesizes all three into: "Council approved $2.5M parks renovation budget with a 7-2 vote."
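Before synthesizing, it can help to see which fields the models already agree on. The following is a minimal sketch, assuming each extraction is available as a plain dict keyed by field name; the `field_agreement` helper and the sample data are illustrative and not part of the pipeline.

```python
from collections import Counter

def field_agreement(extractions: dict, field: str) -> tuple:
    """Return (most common value, share of models that reported it) for one field."""
    values = [row.get(field) for row in extractions.values()]
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

# Sample data mirroring the query results above.
extractions = {
    "gemini-1.5-flash": {"headline": "Parks budget approved", "outcome": "approved"},
    "gpt-4": {"headline": "Council approves $2.5M parks renovation", "outcome": "approved"},
    "claude-3": {"headline": "Parks funding passes 7-2", "outcome": "approved"},
}

outcome, share = field_agreement(extractions, "outcome")
print(outcome, share)  # approved 1.0
```

Fields with full agreement, such as `outcome` here, can be copied straight through. Free-text fields such as `headline`, where every model words things differently, are where synthesis adds value.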
Merging Techniques
1. Together MoA (Mixture-of-Agents)
A widely used approach to merging AI outputs. It uses a layered architecture in which multiple "Proposer" models generate candidates and an "Aggregator" model then synthesizes them into a single answer.
Repository: Together MoA
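Performance: In the MoA paper's AlpacaEval 2.0 results, an ensemble of open-source models scored 65.1%, ahead of a single GPT-4o instance at 57.5%. Gains on a structured extraction task like this one are not guaranteed and should be measured.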
How It Works
┌────────────────────────────────────────────┐
│          Input: Meeting Transcript         │
└─────────────────────┬──────────────────────┘
                      │
    ┌───────────┬─────┴─────┬───────────┐
┌───▼────┐  ┌───▼────┐  ┌───▼────┐  ┌───▼────┐
│ Gemini │  │ GPT-4  │  │ Claude │  │ Llama3 │
│ Flash  │  │        │  │ 3      │  │        │
└───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘
    │ Extr. 1   │ Extr. 2   │ Extr. 3   │ Extr. 4
    └───────────┴─────┬─────┴───────────┘
       ┌──────────────▼───────────────┐
       │  Aggregator Model (GPT-4o)   │
       │  Prompt: "Analyze all 4      │
       │  responses, correct errors,  │
       │  synthesize best answer"     │
       └──────────────┬───────────────┘
                      │
             ┌────────▼────────┐
             │ Final Synthesis │
             └─────────────────┘
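The flow in the diagram can be sketched with plain callables. This is only an illustration of one proposer layer feeding one aggregator, with stub functions standing in for real model calls; `run_moa`, `Model`, and the stubs are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# A "model" here is any function mapping a prompt to a text response.
Model = Callable[[str], str]

def run_moa(transcript: str, proposers: Dict[str, Model], aggregator: Model) -> str:
    """Run every proposer on the transcript, then let the aggregator synthesize."""
    # Proposers are independent of each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=len(proposers)) as pool:
        futures = {name: pool.submit(fn, transcript) for name, fn in proposers.items()}
        candidates = {name: fut.result() for name, fut in futures.items()}

    # The aggregator sees all candidates in a single prompt.
    numbered = "\n".join(
        f"{i}. [{name}] {text}"
        for i, (name, text) in enumerate(sorted(candidates.items()), 1)
    )
    prompt = (
        f"Analyze all {len(candidates)} responses, correct errors, "
        f"synthesize best answer:\n{numbered}"
    )
    return aggregator(prompt)

# Stub models for demonstration only; replace with real API calls.
proposers = {
    "model-a": lambda t: "Parks budget approved",
    "model-b": lambda t: "Parks funding passes 7-2",
}
echo_aggregator = lambda prompt: prompt  # returns the prompt so it can be inspected

print(run_moa("...transcript text...", proposers, echo_aggregator))
```

In the implementation that follows, the proposer step has already happened: the candidates are read from `bronze_decisions` rather than generated on the fly.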
Implementation with Bronze Data
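#!/usr/bin/env python3
"""
Mixture-of-Agents implementation for bronze decision merging.
"""
import json
import os

import psycopg2
from openai import OpenAI
import google.generativeai as genai

# Settings are read from the environment (the variable names are assumptions).
DATABASE_URL = os.environ["DATABASE_URL"]
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]
client = OpenAI()  # picks up OPENAI_API_KEY from the environment
genai.configure(api_key=GEMINI_API_KEY)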
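def get_all_extractions(event_id: int, decision_id: str) -> list:
    """Get all model extractions for a decision."""
    # Rows written back by store_synthesis ('moa-...') are excluded so that an
    # earlier synthesis is never fed in as a proposer. '%%' is a literal '%'.
    query = """
        SELECT
            source_ai_model,
            headline,
            decision_statement,
            outcome,
            primary_theme,
            ntee_code,
            arguments_for,
            arguments_against,
            vote_tally
        FROM bronze_decisions
        WHERE source_event_id = %s
          AND decision_id = %s
          AND source_ai_model NOT LIKE 'moa-%%'
        ORDER BY source_ai_model
    """
    # Open a short-lived connection for the read.
    with psycopg2.connect(DATABASE_URL) as conn:
        with conn.cursor() as cur:
            cur.execute(query, (event_id, decision_id))
            return cur.fetchall()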
def create_aggregator_prompt(extractions: list) -> str:
    """Create MoA aggregator prompt."""
    formatted_extractions = []
    for i, extraction in enumerate(extractions, 1):
        (model, headline, statement, outcome, theme, ntee, args_for, args_against, votes) = extraction
        formatted_extractions.append(f"""
### Extraction {i} (Model: {model})
**Headline:** {headline}
**Statement:** {statement}
**Outcome:** {outcome}
**Theme:** {theme} (NTEE: {ntee})
**Arguments For:** {args_for}
**Arguments Against:** {args_against}
**Vote Tally:** {votes}
""")
    prompt = f"""
You are an expert aggregator AI tasked with synthesizing multiple AI model extractions of a city council decision.
Below are {len(extractions)} different extractions of the same decision from different AI models. Each model may have different strengths and weaknesses.
{chr(10).join(formatted_extractions)}
## Your Task
Analyze all {len(extractions)} extractions and create a single, comprehensive, and accurate synthesis that:
1. **Identifies Common Ground:** What do all models agree on? (High confidence)
2. **Resolves Contradictions:** Where models disagree, use reasoning to determine the most likely accurate version
3. **Combines Strengths:** Take the best parts from each extraction
4. **Corrects Errors:** If you spot factual inconsistencies or logical errors, correct them
## Output Format
Provide your synthesis in this JSON structure:
{{
"synthesized_headline": "...",
"synthesized_statement": "...",
"consensus_outcome": "...",
"consensus_theme": "...",
"consensus_ntee_code": "...",
"high_confidence_facts": ["fact1", "fact2"],
"low_confidence_facts": ["uncertain1", "uncertain2"],
"arguments_for": [...],
"arguments_against": [...],
"vote_tally": {{}},
"reasoning": "Why you made the synthesis decisions you did"
}}
"""
    return prompt
def aggregate_with_gpt4(prompt: str) -> dict:
    """Use GPT-4o as aggregator."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert at synthesizing multiple AI outputs into a single high-quality result."},
            {"role": "user", "content": prompt}
        ],
        # JSON mode requires the prompt to mention JSON, which it does.
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

def aggregate_with_gemini(prompt: str) -> dict:
    """Use Gemini Pro as aggregator."""
    model = genai.GenerativeModel('gemini-1.5-pro')
    response = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(
            response_mime_type="application/json"
        )
    )
    return json.loads(response.text)
def moa_synthesize_decision(event_id: int, decision_id: str, aggregator: str = 'gpt-4o'):
    """
    Full MoA pipeline to synthesize decision from multiple extractions.

    Args:
        event_id: Source event ID
        decision_id: Decision ID to synthesize
        aggregator: Which model to use as aggregator ('gpt-4o' or 'gemini-pro')

    Returns:
        Synthesized decision as dict, or None if fewer than 2 extractions exist
    """
    # Step 1: Get all proposer outputs (from bronze_decisions)
    extractions = get_all_extractions(event_id, decision_id)
    if len(extractions) < 2:
        print(f"Warning: only {len(extractions)} extraction(s) found. Need 2+ for MoA.")
        # A raw row is a tuple, not a synthesis dict, so return None instead.
        return None
    print(f"Running MoA with {len(extractions)} proposer models")
    # Step 2: Create aggregator prompt
    prompt = create_aggregator_prompt(extractions)
    # Step 3: Run aggregator
    if aggregator == 'gpt-4o':
        synthesis = aggregate_with_gpt4(prompt)
    elif aggregator == 'gemini-pro':
        synthesis = aggregate_with_gemini(prompt)
    else:
        raise ValueError(f"Unknown aggregator: {aggregator}")
    print(f"MoA synthesis complete using {aggregator}")
    # Step 4: Store synthesis back to bronze (with special model name)
    store_synthesis(event_id, decision_id, synthesis, aggregator_model=aggregator)
    return synthesis
def store_synthesis(event_id: int, decision_id: str, synthesis: dict, aggregator_model: str):
    """Store MoA synthesis back to bronze_decisions."""
    query = """
        INSERT INTO bronze_decisions (
            source_event_id, source_ai_model, decision_id,
            headline, decision_statement, outcome,
            primary_theme, ntee_code,
            arguments_for, arguments_against, vote_tally
        ) VALUES (
            %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
        )
        ON CONFLICT (source_event_id, decision_id, source_ai_model)
        DO UPDATE SET
            headline = EXCLUDED.headline,
            decision_statement = EXCLUDED.decision_statement,
            outcome = EXCLUDED.outcome,
            primary_theme = EXCLUDED.primary_theme,
            ntee_code = EXCLUDED.ntee_code,
            arguments_for = EXCLUDED.arguments_for,
            arguments_against = EXCLUDED.arguments_against,
            vote_tally = EXCLUDED.vote_tally,
            extracted_at = CURRENT_TIMESTAMP
    """
    with psycopg2.connect(DATABASE_URL) as conn:
        with conn.cursor() as cur:
            cur.execute(query, (
                event_id,
                f'moa-{aggregator_model}',  # Special model name for synthesis
                decision_id,
                synthesis['synthesized_headline'],
                synthesis['synthesized_statement'],
                synthesis['consensus_outcome'],
                synthesis['consensus_theme'],
                synthesis['consensus_ntee_code'],
                json.dumps(synthesis['arguments_for']),
                json.dumps(synthesis['arguments_against']),
                json.dumps(synthesis['vote_tally'])
            ))
        conn.commit()
# Usage
if __name__ == '__main__':
    result = moa_synthesize_decision(
        event_id=192614,
        decision_id='D001',
        aggregator='gpt-4o'
    )
    # Only a synthesis dict has these keys; skip printing otherwise.
    if isinstance(result, dict):
        print("\nSynthesized Result:")
        print(f"Headline: {result['synthesized_headline']}")
        print(f"Outcome: {result['consensus_outcome']}")
        print(f"Reasoning: {result['reasoning']}")
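Aggregator output is model-generated JSON, so it may be worth checking its shape before `store_synthesis` indexes into it. The following is a minimal sketch, assuming the key names from the prompt's output format; `validate_synthesis` and `REQUIRED_KEYS` are hypothetical helpers, not part of the script above.

```python
# Keys that store_synthesis reads, with the type each one is expected to have.
REQUIRED_KEYS = {
    "synthesized_headline": str,
    "synthesized_statement": str,
    "consensus_outcome": str,
    "consensus_theme": str,
    "consensus_ntee_code": str,
    "arguments_for": list,
    "arguments_against": list,
    "vote_tally": dict,
}

def validate_synthesis(synthesis: dict) -> list:
    """Return a list of problems; an empty list means the synthesis is storable."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in synthesis:
            problems.append(f"missing key: {key}")
        elif not isinstance(synthesis[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    return problems

# An incomplete response: most keys are absent and vote_tally has the wrong type.
sample = {"synthesized_headline": "Council approved parks budget", "vote_tally": "7-2"}
for problem in validate_synthesis(sample):
    print(problem)
```

A caller could retry the aggregator, or skip the write, whenever the returned list is non-empty.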
2. Weighted Voting / Best-of-N
Instead of full synthesis, pick the "best" extraction by combining a per-model weight with a simple quality score.
def weighted_vote_decision(event_id: int, decision_id: str, weights: dict = None):
    """
    Select best decision using weighted voting.

    Args:
        weights: Model weights keyed by source_ai_model
            (e.g., {'gpt-4': 1.5, 'gemini-1.5-flash': 1.0, 'claude-3-opus': 1.3})
    """
    if weights is None:
        # Keys must match source_ai_model values exactly; other models get 1.0.
        weights = {
            'gpt-4': 1.5,
            'gemini-1.5-pro': 1.4,
            'claude-3-opus': 1.3,
            'gemini-1.5-flash': 1.0,
            'llama-3-70b': 1.0
        }
    extractions = get_all_extractions(event_id, decision_id)
    if not extractions:
        return None  # nothing to score
    scores = []
    for extraction in extractions:
        model = extraction[0]
        # Base score from model weight
        base_score = weights.get(model, 1.0)
        # Quality adjustments
        quality_score = calculate_quality_score(extraction)
        final_score = base_score * quality_score
        scores.append((final_score, extraction))
    # Return highest scoring extraction
    best_score, best_extraction = max(scores, key=lambda x: x[0])
    print(f"Best extraction: {best_extraction[0]} (score: {best_score:.2f})")
    return best_extraction
def calculate_quality_score(extraction) -> float:
    """Calculate quality score for an extraction."""
    (model, headline, statement, outcome, theme, ntee, args_for, args_against, votes) = extraction
    score = 1.0
    # Bonus for completeness
    if headline: score += 0.1
    if statement and len(statement) > 50: score += 0.1
    if outcome: score += 0.1
    if theme: score += 0.1
    if ntee: score += 0.1
    # Bonus for detail. len() counts list items if the column comes back as a
    # list, but characters if it is stored as JSON text (where "[]" has length 2).
    if args_for and len(args_for) > 2: score += 0.1
    if args_against and len(args_against) > 2: score += 0.1
    if votes: score += 0.1
    return score
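The function above selects one whole extraction. For a categorical field such as `outcome`, a weighted majority vote in the literal sense is also possible: each model casts its weight for the value it reported, and the value with the largest total wins. The following is a sketch, assuming (model, value) pairs as input; `weighted_majority` is an illustrative name and the sample votes are made up.

```python
from collections import defaultdict

def weighted_majority(votes: list, weights: dict, default_weight: float = 1.0) -> tuple:
    """Return (winning value, its share of the total weight)."""
    totals = defaultdict(float)
    for model, value in votes:
        totals[value] += weights.get(model, default_weight)
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())

# Made-up votes: two models report "approved", one reports "tabled".
votes = [
    ("gpt-4", "approved"),
    ("gemini-1.5-flash", "approved"),
    ("llama-3-70b", "tabled"),
]
weights = {"gpt-4": 1.5, "gemini-1.5-flash": 1.0, "llama-3-70b": 1.0}

winner, share = weighted_majority(votes, weights)
print(winner, round(share, 2))  # approved 0.71
```

The returned share doubles as a rough confidence signal: a value near 1.0 means the models agree, while a value near 0.5 suggests the decision deserves a closer look.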
3. SLERP & Weight Merging (Model-Level)
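To merge models at the weight level and produce a single hybrid model, use Mergekit. This works only for open-weight models that share an architecture, typically two fine-tunes of the same base model. API-only models such as Gemini and GPT-4 cannot be merged this way because their weights are not available.
Repository: Mergekit
Use Case: Create a single model that blends one fine-tune that is "Great at Policy Analysis" with another fine-tune of the same base model that is "Great at Argument Extraction".
# mergekit-config.yaml
# Model names below are placeholders for two fine-tunes of the same base model.
# Check the Mergekit docs for the exact schema supported by your installed version.
models:
  - model: your-org/base-model-finetuned-policy
  - model: your-org/base-model-finetuned-arguments
merge_method: slerp  # Spherical Linear Interpolation
base_model: your-org/base-model-finetuned-policy
parameters:
  t: 0.5  # interpolation factor: 0 = base_model, 1 = the other model
dtype: float16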
mergekit-yaml mergekit-config.yaml merged-model/ --cuda
Result: A single model that combines strengths at the neural weight level.
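For intuition, SLERP interpolates along the arc between two vectors rather than along the straight line between them. The sketch below applies the standard formula to two small vectors; it illustrates the math only and makes no claim about how Mergekit implements it internally.

```python
import math

def slerp(v0: list, v1: list, t: float) -> list:
    """Spherical linear interpolation between two vectors of equal length."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Angle between the two directions.
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if omega < 1e-6:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print([round(x, 4) for x in midpoint])  # [0.7071, 0.7071]
```

With t = 0.5 the result still has unit length, whereas a plain average of the same two vectors, [0.5, 0.5], would shrink to a length of about 0.71.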
4. Dify / Langflow (No-Code Merging)
Visual tools for building multi-model pipelines without code.
Dify Workflow:
        [Meeting Transcript]
                 |
          [Parallel Node]
           /     |     \
          /      |      \
   [Gemini]   [GPT-4]   [Claude]
          \      |      /
           \     |     /
        [Code Node: Compare]
                 |
       [LLM Node: Synthesize]
                 |
          [Final Decision]
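The Compare step can be a small Python function. The sketch below follows the `main(...) -> dict` shape that Dify code nodes use. The input variable names, and the assumption that each upstream LLM node passes its extraction as a JSON string, are hypothetical, so adjust them to your workflow.

```python
import json

def main(gemini: str, gpt4: str, claude: str) -> dict:
    """Compare three JSON extractions and report which fields disagree."""
    parsed = {
        "gemini": json.loads(gemini),
        "gpt4": json.loads(gpt4),
        "claude": json.loads(claude),
    }
    # Every field that appears in at least one extraction.
    fields = sorted(set().union(*(p.keys() for p in parsed.values())))
    disagreements = {}
    for field in fields:
        values = {name: p.get(field) for name, p in parsed.items()}
        # Compare string forms so unhashable values (lists, dicts) also work.
        if len(set(map(str, values.values()))) > 1:
            disagreements[field] = values
    # Return a JSON string so a downstream LLM node can embed it in its prompt.
    return {"disagreements": json.dumps(disagreements)}

result = main(
    '{"outcome": "approved", "vote_tally": "7-2"}',
    '{"outcome": "approved", "vote_tally": "7-2"}',
    '{"outcome": "approved", "vote_tally": "6-3"}',
)
print(result["disagreements"])
```

Passing only the disagreements to the Synthesize node keeps its prompt short and points the model at the fields that actually need a judgment call.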
5. Multi-Layer Ensembling
Combine multiple merging strategies in sequence.
import anthropic  # Anthropic SDK, needed for the Claude meta-judge

def multi_layer_ensemble(event_id: int, decision_id: str):
    """
    Layer 1: MoA synthesis with GPT-4o
    Layer 2: MoA synthesis with Gemini Pro
    Layer 3: A third model judges which of the two syntheses is better
    """
    # Layer 1: GPT-4o aggregation
    synthesis_gpt = moa_synthesize_decision(event_id, decision_id, aggregator='gpt-4o')
    # Layer 2: Gemini Pro aggregation
    synthesis_gemini = moa_synthesize_decision(event_id, decision_id, aggregator='gemini-pro')
    # Layer 3: Meta-aggregation (judge which synthesis is better)
    meta_prompt = f"""
Two different aggregator models synthesized the same decision:
Synthesis A (GPT-4o):
{json.dumps(synthesis_gpt, indent=2)}
Synthesis B (Gemini Pro):
{json.dumps(synthesis_gemini, indent=2)}
Which synthesis is more accurate, comprehensive, and well-reasoned?
Output the letter (A or B) on the first line, then explain why.
"""
    # Use a third model as meta-judge. Claude is called through the Anthropic
    # SDK (reads ANTHROPIC_API_KEY); the OpenAI client cannot serve it.
    meta_judge = anthropic.Anthropic().messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": meta_prompt}]
    )
    verdict = meta_judge.content[0].text.strip()
    # Read only the leading letter; the explanation may contain other capital A's.
    return synthesis_gpt if verdict.upper().startswith('A') else synthesis_gemini
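LLM judges can favor whichever answer appears first. One common mitigation is to ask twice with the order swapped and accept a winner only when both runs agree. The following is a sketch, assuming a hypothetical `judge` callable that takes two texts and returns 'A' for the first one shown or 'B' for the second.

```python
from typing import Callable, Optional

def judge_both_orders(judge: Callable[[str, str], str], first: str, second: str) -> Optional[str]:
    """Return 'first' or 'second' if the judge is consistent across both orderings, else None."""
    verdict_1 = judge(first, second)   # `first` is shown as A
    verdict_2 = judge(second, first)   # `second` is shown as A
    if verdict_1 == "A" and verdict_2 == "B":
        return "first"
    if verdict_1 == "B" and verdict_2 == "A":
        return "second"
    return None  # the judge followed position, not content

# Stub judges for demonstration only.
prefers_longer = lambda a, b: "A" if len(a) >= len(b) else "B"
always_a = lambda a, b: "A"

print(judge_both_orders(prefers_longer, "short", "a longer synthesis"))  # second
print(judge_both_orders(always_a, "short", "a longer synthesis"))        # None
```

When the result is `None`, a reasonable fallback is to keep either synthesis and flag the decision for review, at the cost of one extra judge call per decision.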
Merging Strategies Comparison
| Technique | Complexity | Quality | Speed | Cost | Best For |
|---|---|---|---|---|---|
| MoA | Medium | ★★★★★ | Medium | $$ | Highest quality synthesis |
| Weighted Vote | Low | ★★★ | Fast | $ | Quick consensus |
| SLERP/Mergekit | High | ★★★★ | One-time | $ (upfront) | Permanent hybrid model |
| Dify/Langflow | Low | ★★★★ | Medium | $$ | Non-coders, rapid prototyping |
| Multi-Layer | High | ★★★★★ | Slow | $$$ | Critical decisions, research |
Implementation Roadmap
Phase 1: Basic Comparison (Complete)
- Multi-model bronze schema
- compare_model_extractions.py script
- Storage of multiple extractions
Phase 2: Evaluation (In Progress)
- Implement DeepEval metrics
- Add quality scoring to bronze
- Create evaluation dashboard
Phase 3: Simple Merging
- Implement weighted voting
- Add MoA synthesis script
- Create bronze_decisions_synthesis table
Phase 4: Advanced Merging
- Multi-layer ensembling
- Fine-tune aggregator models
- Build consensus API endpoint
Example: Full MoA Pipeline
# 1. Extract with multiple models
python scripts/datasources/gemini/analyze_meeting_transcripts.py --model gemini-1.5-flash
python scripts/datasources/gemini/analyze_meeting_transcripts.py --model gpt-4
python scripts/datasources/gemini/analyze_meeting_transcripts.py --model claude-3
# 2. Load to bronze
python scripts/datasources/gemini/extract_to_bronze.py
# 3. Compare extractions
python scripts/datasources/gemini/compare_model_extractions.py --event-id 192614
# 4. Run MoA synthesis
python scripts/datasources/gemini/moa_synthesize.py --event-id 192614 --aggregator gpt-4o
# 5. Query final synthesis
psql -d open_navigator_bronze -c "
SELECT headline, decision_statement, outcome
FROM bronze_decisions
WHERE source_event_id = 192614
AND source_ai_model = 'moa-gpt-4o';
"
Resources
- Together MoA Paper
- Mergekit Documentation
- Dify Documentation
- Langflow Documentation
- Ensemble Methods in ML
Related
- AI Model Evaluation - How to evaluate individual models
- Bronze Data Model - Multi-model schema design
- Gemini Analysis Pipeline - How to run multiple models