Scale and Search Patterns: End-to-End Civic Tech Projects
This guide analyzes 6 additional civic tech projects focused on full-stack deployments, large-scale data aggregation, and public search portals. These complement our existing integration (Civic Scraper, City Scrapers, CDP, Engagic, Councilmatic) with new patterns for:
- 🤖 AI summarization (OpenTowns, MeetingBank)
- 🔍 Multi-jurisdiction search (CivicBand, LocalView)
- 🔔 Keyword alerting (OpenTowns)
- 📊 Research-grade pipelines (LocalView, MeetingBank)
- 🌍 International adaptability (OpenCouncil)
🎯 What's NEW vs. Our Existing Integration
| Pattern | Already Have | NEW from These Projects |
|---|---|---|
| Platform detection | ✅ Civic Scraper | - |
| Event schema | ✅ City Scrapers | - |
| Video ingestion | ✅ CDP | ✅ LocalView scale patterns |
| Matter tracking | ✅ Engagic | - |
| Search UX | ✅ Councilmatic | ✅ CivicBand cross-jurisdiction |
| AI Summarization | ❌ | ✅ OpenTowns, MeetingBank |
| Keyword Alerts | ❌ | ✅ OpenTowns |
| Scale (1,000+ jurisdictions) | ⚠️ Partial | ✅ CivicBand, LocalView |
| International patterns | ❌ | ✅ OpenCouncil |
📚 Project Analysis
1. Council Data Project (CDP) ⭐ Already Integrated
Status: Already documented in INTEGRATION_GUIDE.md
Key patterns we already use:
- Video transcript ingestion
- Searchable transcript storage
- Event indexing pipeline
See: docs/INTEGRATION_GUIDE.md Section 4
2. OpenTowns 🆕 AI Summarization Pioneer
GitHub: https://opentowns.org
License: Open civic-tech (check specific repo)
Focus: Small towns, AI-generated summaries, keyword alerts
🔥 What to Adopt
A. AI Summarization Pattern
# They generate readable summaries from raw transcripts/PDFs
# Pattern: transcript → summary → key decisions
from openai import OpenAI
from models.meeting_event import MeetingEvent
async def generate_meeting_summary(event: MeetingEvent, transcript: str) -> dict:
"""
OpenTowns pattern: Generate human-readable meeting summaries.
Returns:
{
'executive_summary': str, # 2-3 sentences
'key_decisions': list[str], # Bullet points
'health_policy_items': list[str], # Filtered for oral health
'next_actions': list[str] # Follow-up items
}
"""
client = OpenAI()
prompt = f"""
Summarize this local government meeting for public understanding.
Meeting: {event.title}
Date: {event.start.strftime('%B %d, %Y')}
Transcript: {transcript[:10000]} # First 10k chars
Provide:
1. Executive summary (2-3 sentences)
2. Key decisions made (bullet points)
3. Health policy items (if any)
4. Next actions/follow-ups
Focus on: What decisions were made? What happens next?
"""
response = client.chat.completions.create(
model="gpt-4o-mini", # Cost-effective for summaries
messages=[
{"role": "system", "content": "You are a civic engagement assistant helping residents understand local government."},
{"role": "user", "content": prompt}
],
temperature=0.3 # Lower for factual accuracy
)
# Parse response into structured format
summary_text = response.choices[0].message.content
return {
'executive_summary': extract_section(summary_text, 'Executive summary'),
'key_decisions': extract_bullets(summary_text, 'Key decisions'),
'health_policy_items': extract_bullets(summary_text, 'Health policy'),
'next_actions': extract_bullets(summary_text, 'Next actions'),
'raw_summary': summary_text
}
B. Keyword Alert System
# OpenTowns sends alerts when keywords appear in meetings
# Pattern: Watch list → match detection → user notification
from typing import List, Dict
import re
class KeywordAlertSystem:
"""
OpenTowns pattern: Alert users when keywords appear in meetings.
"""
# Oral health keyword categories
KEYWORD_CATEGORIES = {
'fluoridation': [
'fluoride', 'fluoridation', 'water treatment',
'community water fluoridation', 'CWF'
],
'dental_access': [
'dental', 'dentist', 'oral health', 'teeth',
'medicaid dental', 'dental clinic'
],
'public_health': [
'health department', 'public health', 'CDC',
'preventive care', 'health equity'
]
}
def detect_keywords(self, text: str) -> Dict[str, List[str]]:
"""
Find all matching keywords in text.
Returns: {'fluoridation': ['fluoride', 'CWF'], ...}
"""
text_lower = text.lower()
matches = {}
for category, keywords in self.KEYWORD_CATEGORIES.items():
found = []
for keyword in keywords:
# Word boundary matching
pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
if re.search(pattern, text_lower):
found.append(keyword)
if found:
matches[category] = found
return matches
def generate_alert(self, event: MeetingEvent, matches: Dict[str, List[str]]) -> dict:
"""
Create alert notification for users.
"""
return {
'alert_type': 'keyword_match',
'jurisdiction': f"{event.jurisdiction_name}, {event.state_code}",
'meeting_title': event.title,
'meeting_date': event.start.isoformat(),
'categories_matched': list(matches.keys()),
'keywords_found': [kw for kws in matches.values() for kw in kws],
'meeting_url': event.source,
'priority': 'high' if 'fluoridation' in matches else 'medium'
}
Implementation Priority: 🔥 HIGH - Summaries make data usable for advocates
3. LocalView 🆕 Research-Grade Scale
Website: https://www.localview.net
GitHub: https://mellonurbanism.harvard.edu/localview
License: Open-source data pipeline
Scale: Nationwide coverage, largest public dataset
🔥 What to Adopt
A. Scale Architecture Patterns
LocalView handles thousands of jurisdictions with:
- Batch processing (not real-time)
- Distributed storage (videos + transcripts)
- Quality metrics (completeness scoring)
# LocalView pattern: Process jurisdictions in batches with quality tracking
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
@dataclass
class JurisdictionQuality:
"""
LocalView pattern: Track data quality per jurisdiction.
"""
jurisdiction_name: str
state_code: str
# Completeness metrics
total_meetings_expected: int # Based on calendar
total_meetings_found: int
meetings_with_agendas: int
meetings_with_minutes: int
meetings_with_videos: int
meetings_with_transcripts: int
# Freshness
last_scraped: datetime
last_meeting_found: Optional[datetime]
scraping_frequency: str # 'daily', 'weekly', 'monthly'
# Health metrics
consecutive_failures: int
last_success: Optional[datetime]
@property
def completeness_score(self) -> float:
"""
Overall data quality score (0-100).
"""
if self.total_meetings_expected == 0:
return 0.0
found_rate = self.total_meetings_found / self.total_meetings_expected
agenda_rate = self.meetings_with_agendas / max(self.total_meetings_found, 1)
minutes_rate = self.meetings_with_minutes / max(self.total_meetings_found, 1)
# Weighted average
score = (
found_rate * 40 + # 40%: Finding meetings
agenda_rate * 30 + # 30%: Having agendas
minutes_rate * 30 # 30%: Having minutes
)
return min(score * 100, 100.0)
@property
def health_status(self) -> str:
"""
Scraper health: healthy, degraded, failed
"""
if self.consecutive_failures >= 5:
return 'failed'
elif self.consecutive_failures >= 2:
return 'degraded'
else:
return 'healthy'
B. Batch Processing Strategy
# LocalView processes in batches, not all-at-once
from pyspark.sql import SparkSession
from typing import Iterator
def process_jurisdictions_in_batches(
spark: SparkSession,
batch_size: int = 100,
priority_filter: str = 'high'
) -> Iterator[dict]:
"""
LocalView pattern: Process large numbers of jurisdictions efficiently.
Strategy:
1. Load high-priority jurisdictions first
2. Process in batches to manage memory
3. Track quality metrics per batch
4. Resume from failures
"""
# Load targets from Gold layer
targets_df = spark.read.format("delta").load("data/delta/gold/scraping_targets")
# Filter and sort
priority_targets = targets_df \
.filter(f"priority_tier = '{priority_filter}'") \
.orderBy("priority_score", ascending=False)
total_targets = priority_targets.count()
# Process in batches
for offset in range(0, total_targets, batch_size):
batch_df = priority_targets.limit(batch_size).offset(offset)
batch_results = {
'batch_number': offset // batch_size + 1,
'batch_size': batch_size,
'jurisdictions_processed': 0,
'meetings_found': 0,
'errors': []
}
for row in batch_df.collect():
try:
# Scrape jurisdiction
meetings = scrape_jurisdiction(row['url'], row['platform'])
batch_results['jurisdictions_processed'] += 1
batch_results['meetings_found'] += len(meetings)
except Exception as e:
batch_results['errors'].append({
'jurisdiction': row['jurisdiction_name'],
'error': str(e)
})
yield batch_results
Implementation Priority: 🔥 HIGH - Essential for scaling to 32,333 municipalities
4. MeetingBank 🆕 Summarization Research
Website: https://meetingbank.github.io
GitHub: Linked from site
License: Open dataset
Focus: 6 cities, high-quality summarization benchmark
🔥 What to Adopt
A. Summarization Quality Benchmarks
MeetingBank is used in academic research for summarization. They have:
- Gold-standard human summaries (for validation)
- Multiple summary lengths (short, medium, long)
- Evaluation metrics (ROUGE, BERTScore)
# MeetingBank pattern: Validate AI summaries against quality benchmarks
from typing import Dict
import numpy as np
class SummaryQualityValidator:
"""
MeetingBank pattern: Ensure AI summaries meet quality standards.
"""
# Quality thresholds from academic research
MIN_ROUGE_L = 0.25 # ROUGE-L F1 score
MIN_LENGTH_RATIO = 0.05 # Summary should be 5-20% of original
MAX_LENGTH_RATIO = 0.20
def validate_summary(self, original: str, summary: str) -> Dict[str, any]:
"""
Check if summary meets quality standards.
"""
# Length checks
orig_words = len(original.split())
summ_words = len(summary.split())
length_ratio = summ_words / orig_words if orig_words > 0 else 0
# Basic quality checks
checks = {
'length_appropriate': self.MIN_LENGTH_RATIO <= length_ratio <= self.MAX_LENGTH_RATIO,
'has_key_terms': self._check_key_terms(original, summary),
'no_repetition': self._check_repetition(summary),
'proper_structure': self._check_structure(summary),
}
return {
'passes_validation': all(checks.values()),
'checks': checks,
'length_ratio': length_ratio,
'word_count': summ_words,
'quality_score': sum(checks.values()) / len(checks)
}
def _check_key_terms(self, original: str, summary: str) -> bool:
"""
Ensure summary includes key terms from original.
"""
# Extract important terms (simplified - use TF-IDF in production)
orig_words = set(original.lower().split())
summ_words = set(summary.lower().split())
# At least 30% overlap of unique terms
overlap = len(orig_words & summ_words) / len(orig_words)
return overlap >= 0.30
def _check_repetition(self, summary: str) -> bool:
"""
Check for excessive repetition (indicates poor quality).
"""
sentences = summary.split('.')
unique_ratio = len(set(sentences)) / len(sentences) if sentences else 0
return unique_ratio >= 0.80 # At least 80% unique sentences
def _check_structure(self, summary: str) -> bool:
"""
Check for proper summary structure.
"""
# Should have multiple sentences
sentences = [s.strip() for s in summary.split('.') if s.strip()]
return len(sentences) >= 2 and len(sentences) <= 10
Implementation Priority: 🟡 MEDIUM - Important for quality, but MVP can use basic summaries
5. CivicBand 🆕 Multi-Jurisdiction Search
Website: https://civic.band
GitHub: Linked from site (Raft Foundation)
Scale: 1,000+ municipalities
Focus: Google-like search across jurisdictions
🔥 What to Adopt
A. Cross-Jurisdiction Search Architecture
CivicBand lets users search "fluoridation" and get results from all municipalities at once.
# CivicBand pattern: Federated search across jurisdictions
from elasticsearch import Elasticsearch # Or Meilisearch for open-source
from typing import List, Dict
from models.meeting_event import MeetingEvent
class CrossJurisdictionSearch:
"""
CivicBand pattern: Search meetings across all jurisdictions.
"""
def __init__(self):
# Use Meilisearch (open-source) or Elasticsearch
self.es = Elasticsearch(['http://localhost:9200'])
self.index_name = 'meeting_events'
def index_meeting(self, event: MeetingEvent):
"""
Add meeting to search index.
"""
doc = {
'id': event.id,
'title': event.title,
'description': event.description,
'jurisdiction': event.jurisdiction_name,
'state': event.state_code,
'date': event.start.isoformat(),
'full_text': self._build_searchable_text(event),
'agenda_url': next((link.href for link in event.links if 'agenda' in link.title.lower()), None),
'oral_health_relevant': event.oral_health_relevant,
'keywords': event.keywords_found
}
self.es.index(index=self.index_name, id=event.id, document=doc)
def search(
self,
query: str,
states: List[str] = None,
date_range: tuple = None,
oral_health_only: bool = False
) -> List[Dict]:
"""
Search across all jurisdictions.
Example:
search("fluoridation", states=['AL', 'GA'], oral_health_only=True)
"""
must_clauses = [
{"multi_match": {
"query": query,
"fields": ["title^3", "description^2", "full_text"], # Boost title matches
"type": "best_fields"
}}
]
# Filter by state
if states:
must_clauses.append({"terms": {"state": states}})
# Filter by date range
if date_range:
must_clauses.append({
"range": {"date": {"gte": date_range[0], "lte": date_range[1]}}
})
# Filter oral health only
if oral_health_only:
must_clauses.append({"term": {"oral_health_relevant": True}})
search_query = {
"query": {"bool": {"must": must_clauses}},
"size": 100,
"highlight": {
"fields": {
"title": {},
"description": {},
"full_text": {"fragment_size": 150}
}
},
"sort": [
{"_score": "desc"},
{"date": "desc"}
]
}
results = self.es.search(index=self.index_name, body=search_query)
return [{
'jurisdiction': hit['_source']['jurisdiction'],
'state': hit['_source']['state'],
'title': hit['_source']['title'],
'date': hit['_source']['date'],
'snippet': hit.get('highlight', {}).get('full_text', [''])[0],
'url': hit['_source']['agenda_url'],
'relevance_score': hit['_score']
} for hit in results['hits']['hits']]
def _build_searchable_text(self, event: MeetingEvent) -> str:
"""
Combine all text fields for indexing.
"""
parts = [
event.title or '',
event.description or '',
' '.join(event.keywords_found),
' '.join(link.title for link in event.links)
]
return ' '.join(parts)
B. Jurisdiction Faceting
# CivicBand shows result counts by jurisdiction
def get_search_facets(query: str) -> Dict[str, int]:
"""
Show how many results per jurisdiction.
Example output:
{
'Birmingham, AL': 12,
'Atlanta, GA': 8,
'Montgomery, AL': 5
}
"""
search_query = {
"query": {"multi_match": {"query": query, "fields": ["title", "full_text"]}},
"size": 0, # We only want aggregations
"aggs": {
"by_jurisdiction": {
"terms": {
"field": "jurisdiction.keyword",
"size": 50 # Top 50 jurisdictions
},
"aggs": {
"by_state": {
"terms": {"field": "state.keyword"}
}
}
}
}
}
results = self.es.search(index=self.index_name, body=search_query)
facets = {}
for bucket in results['aggregations']['by_jurisdiction']['buckets']:
jurisdiction = bucket['key']
count = bucket['doc_count']
state = bucket['by_state']['buckets'][0]['key']
facets[f"{jurisdiction}, {state}"] = count
return facets
Implementation Priority: 🟡 MEDIUM - Valuable for end-users, but scraping comes first
6. OpenCouncil 🆕 International Adaptability
Website: https://opencouncil.gr
GitHub: https://github.com/schemalabz/opencouncil
License: Open-source
Focus: Greek councils, but adaptable to U.S.
🔥 What to Adopt
A. Internationalization Patterns
OpenCouncil works in Greece (different government structure). This teaches us:
- Flexible schema (not hardcoded to U.S. structures)
- Configurable jurisdiction types (councils, boards, commissions)
- Multi-language support (not needed now, but good architecture)
# OpenCouncil pattern: Flexible jurisdiction configuration
from enum import Enum
from dataclasses import dataclass
from typing import List, Optional
class GovernmentLevel(Enum):
"""
OpenCouncil pattern: Support multiple government structures.
"""
MUNICIPAL = "municipal" # City/town councils
COUNTY = "county" # County boards
TOWNSHIP = "township" # Township boards
SCHOOL_DISTRICT = "school" # School boards
SPECIAL_DISTRICT = "special" # Water, fire, etc.
STATE = "state" # State agencies (future)
@dataclass
class JurisdictionConfig:
"""
OpenCouncil pattern: Configure each jurisdiction's unique structure.
"""
jurisdiction_name: str
government_level: GovernmentLevel
# Meeting schedule
typical_meeting_frequency: str # 'weekly', 'biweekly', 'monthly'
typical_meeting_days: List[str] # ['Monday', 'Thursday']
typical_meeting_time: str # '18:00'
# Website structure
calendar_url: Optional[str]
agenda_url_pattern: Optional[str] # Template: "https://example.gov/agenda-{date}"
minutes_url_pattern: Optional[str]
# Legislative bodies
bodies: List[str] # ['City Council', 'Planning Commission', 'Board of Health']
# Custom fields
metadata: dict # For jurisdiction-specific data
# Example: Configure Birmingham, AL
BIRMINGHAM_CONFIG = JurisdictionConfig(
jurisdiction_name="Birmingham",
government_level=GovernmentLevel.MUNICIPAL,
typical_meeting_frequency='biweekly',
typical_meeting_days=['Tuesday'],
typical_meeting_time='18:00',
calendar_url="https://birminghamal.gov/council/meetings",
bodies=['City Council', 'Board of Health', 'Planning Commission'],
metadata={'population': 200733, 'oral_health_priority': 'high'}
)
Implementation Priority: 🟢 LOW - Good architecture, but not urgent
🎯 Implementation Roadmap
Phase 1: AI Summarization (OpenTowns pattern) 🔥
Priority: HIGH
Timeline: 1-2 weeks
Depends on: Existing OpenAI integration
# TODO: Implement in extraction/summarizer.py
- [ ] Generate executive summaries from meeting transcripts
- [ ] Extract key decisions as bullet points
- [ ] Identify health policy items
- [ ] Add quality validation (MeetingBank patterns)
Phase 2: Keyword Alerts (OpenTowns pattern) 🔥
Priority: HIGH
Timeline: 1 week
Depends on: Meeting data ingestion
# TODO: Implement in alerts/keyword_monitor.py
- [ ] Define oral health keyword categories
- [ ] Pattern matching with word boundaries
- [ ] Generate alerts for users
- [ ] Email/webhook notification system
Phase 3: Scale Architecture (LocalView pattern) 🔥
Priority: HIGH
Timeline: 2 weeks
Depends on: Platform scrapers
# TODO: Implement in discovery/batch_processor.py
- [ ] Quality metrics per jurisdiction
- [ ] Batch processing (100 at a time)
- [ ] Failure tracking and retry
- [ ] Completeness scoring
Phase 4: Multi-Jurisdiction Search (CivicBand pattern) 🟡
Priority: MEDIUM
Timeline: 2-3 weeks
Depends on: Significant meeting data
# TODO: Implement in search/federated_search.py
- [ ] Set up Elasticsearch or Meilisearch
- [ ] Index all meetings
- [ ] Cross-jurisdiction search API
- [ ] Jurisdiction faceting
Phase 5: Quality Validation (MeetingBank pattern) 🟡
Priority: MEDIUM
Timeline: 1 week
Depends on: AI summarization
# TODO: Implement in extraction/quality_validator.py
- [ ] Summary length validation
- [ ] Key term extraction
- [ ] Repetition detection
- [ ] Structure checking
Phase 6: Flexible Config (OpenCouncil pattern) 🟢
Priority: LOW
Timeline: 1 week
Depends on: None
# TODO: Implement in config/jurisdiction_configs.py
- [ ] Per-jurisdiction configuration
- [ ] Meeting schedule patterns
- [ ] Legislative body tracking
📊 Comparison with Existing Integration
| Capability | Original 5 Projects | New 6 Projects | Status |
|---|---|---|---|
| Platform detection | ✅ Civic Scraper | - | Complete |
| Event schema | ✅ City Scrapers | - | Complete |
| Video ingestion | ✅ CDP | ✅ LocalView (scale) | Need scale patterns |
| Matter tracking | ✅ Engagic | - | Complete |
| Person/vote tracking | ✅ Councilmatic | - | Roadmapped |
| AI Summarization | ❌ | ✅ OpenTowns, MeetingBank | TODO: High priority |
| Keyword Alerts | ❌ | ✅ OpenTowns | TODO: High priority |
| Cross-jurisdiction search | ⚠️ Basic | ✅ CivicBand | TODO: Medium priority |
| Quality metrics | ❌ | ✅ LocalView, MeetingBank | TODO: Medium priority |
| Batch processing | ⚠️ Basic | ✅ LocalView | TODO: High priority |
💻 Quick Start: Integrate Summarization
Here's how to add OpenTowns-style summarization right now:
# File: extraction/summarizer.py
from openai import OpenAI
from models.meeting_event import MeetingEvent
from config.settings import settings
client = OpenAI(api_key=settings.openai_api_key)
def summarize_meeting(event: MeetingEvent, full_text: str) -> dict:
"""
Generate OpenTowns-style summary with oral health focus.
"""
prompt = f"""
You are summarizing a local government meeting for public health advocates.
Meeting: {event.title}
Jurisdiction: {event.jurisdiction_name}, {event.state_code}
Date: {event.start.strftime('%B %d, %Y')}
Full text (first 8000 chars):
{full_text[:8000]}
Provide:
1. Executive Summary (2-3 sentences)
2. Key Decisions (bullet list)
3. Oral Health Items (if any - fluoridation, dental access, etc.)
4. Next Actions (follow-ups, future meetings)
Focus on: What was decided? What's happening next?
"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You summarize local government meetings for public understanding."},
{"role": "user", "content": prompt}
],
temperature=0.3
)
return {
'summary': response.choices[0].message.content,
'model': 'gpt-4o-mini',
'tokens_used': response.usage.total_tokens
}
# Usage:
# summary = summarize_meeting(event, full_transcript)
# event.description = summary['summary']
🎬 Next Steps
- Implement AI summarization (OpenTowns pattern) → Makes data usable
- Add keyword alerts (OpenTowns pattern) → Engage advocates
- Add batch processing (LocalView pattern) → Scale to 1,000+ jurisdictions
- Build search interface (CivicBand pattern) → User discovery
- Add quality metrics (LocalView + MeetingBank) → Monitor data health
📖 References
- OpenTowns: https://opentowns.org
- LocalView: https://www.localview.net
- MeetingBank: https://meetingbank.github.io
- CivicBand: https://civic.band
- OpenCouncil: https://github.com/schemalabz/opencouncil
- Council Data Project: https://councildataproject.org (see INTEGRATION_GUIDE.md)
📝 License & Attribution
All patterns documented here are derived from open-source projects:
- OpenTowns: Open civic-tech project
- LocalView: Open-source (Harvard Mellon Urbanism)
- MeetingBank: Open dataset
- CivicBand: Open-source (Raft Foundation)
- OpenCouncil: Open-source (MIT)
- CDP: MIT License
When using code patterns, maintain attribution per each project's license.