# Integration Guide: Reusing Open-Source Municipal Scraping Logic

## Overview

This guide shows how to integrate proven patterns from established open-source projects into the Oral Health Policy Pulse scraping pipeline.

## Current State

✅ You already have:

- Census Gazetteer data with 85,302 jurisdictions (names + FIPS codes)
- GSA .gov domain matching
- 76 discovered URLs ready for scraping
- Legistar platform references in codebase
- Base ScraperAgent class in agents/scraper.py
## 1. Civic Scraper Integration

Repository: biglocalnews/civic-scraper
License: Apache 2.0 (✅ Compatible)

### What to Adopt:

#### A. Platform Detection Logic

```python
# They have excellent platform detection
# Location: civic_scraper/platforms/__init__.py
from typing import Optional

PLATFORMS = {
    'legistar': LegistarScraper,
    'granicus': GranicusScraper,
    'calagenda': CalAgendaScraper,
    'civicplus': CivicPlusScraper
}

def detect_platform(url: str) -> Optional[str]:
    """Auto-detect which platform a URL uses."""
    if 'legistar.com' in url or '/Legistar/' in url:
        return 'legistar'
    elif 'granicus.com' in url or '/Mediasite/' in url:
        return 'granicus'
    # ... more patterns
```
Your Action: Add discovery/platform_detector.py using their patterns
#### B. Document Downloader with Retry Logic

```python
# civic_scraper/download.py has robust downloading
# Features:
# - Exponential backoff
# - Content-type validation
# - Duplicate detection via hash
# - Progress tracking
import asyncio

import httpx

async def download_document(url: str, session: httpx.AsyncClient) -> bytes:
    """Download with retries and validation."""
    for attempt in range(3):
        try:
            response = await session.get(url, timeout=30.0)
            response.raise_for_status()
            # Validate it's actually a document
            content_type = response.headers.get('content-type', '')
            if 'pdf' in content_type or 'html' in content_type:
                return response.content
            # Unexpected content type: fail this attempt rather than
            # falling through and silently returning None
            raise ValueError(f"Unexpected content type: {content_type}")
        except Exception:
            if attempt == 2:
                raise
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s
```
Your Action: Enhance agents/scraper.py with their retry patterns
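One feature from that list worth porting immediately is duplicate detection via hash. Here is a minimal in-memory sketch of the technique (the _seen_hashes set and is_duplicate helper are illustrative names, not civic-scraper's actual implementation; a production version would persist digests alongside your Bronze layer):

```python
import hashlib

# Illustrative in-memory index; persist digests for real runs.
_seen_hashes: set = set()

def is_duplicate(content: bytes) -> bool:
    """Return True if this exact document body was already downloaded."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False
```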
## 2. City Scrapers Integration

Repository: city-scrapers/city-scrapers
License: MIT (✅ Compatible)

### What to Adopt:

#### A. Standardized Event Schema

```python
# They normalize all meeting data to a common format
# city_scrapers/core/models.py
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional

@dataclass
class Event:
    title: str
    description: str
    classification: str  # "Board", "Commission", "Council"
    start: datetime
    end: Optional[datetime]
    all_day: bool
    location: Dict[str, Any]
    links: List[Dict[str, str]]  # [{"title": "Agenda", "href": "..."}]
    source: str

# Classification types they use:
CLASSIFICATIONS = [
    "Board",
    "Commission",
    "Committee",
    "Council",
    "Town Hall",
    "Public Hearing",
]
```
Your Action: Create models/meeting_event.py with this schema for your Silver layer
#### B. Scraper Testing Framework

```python
# They have excellent test patterns
# tests/test_scrapers.py
def test_scraper():
    """Test with frozen HTML responses."""
    scraper = CityScraper()
    # Use saved HTML files to avoid live requests during testing
    with open('tests/fixtures/sample_calendar.html') as f:
        results = scraper.parse(f.read())
    assert len(results) > 0
    assert results[0].title
    assert results[0].source
```
Your Action: Add tests/fixtures/ directory with sample HTML from different platforms
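To seed tests/fixtures/, a one-off capture script keeps the frozen HTML honest. A minimal sketch, assuming httpx is installed (the FIXTURE_URLS entries are placeholders; replace them with pages from your 76 discovered URLs):

```python
import pathlib

import httpx

# Placeholder URLs; swap in real pages from your discovered targets.
FIXTURE_URLS = {
    "sample_legistar.html": "https://example.legistar.com/Calendar.aspx",
    "sample_civicplus.html": "https://example.civicplus.com/AgendaCenter/",
}

def capture_fixtures(out_dir: str = "tests/fixtures") -> None:
    """Save live HTML once so tests never hit the network."""
    target = pathlib.Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    for filename, url in FIXTURE_URLS.items():
        response = httpx.get(url, timeout=30.0, follow_redirects=True)
        response.raise_for_status()
        (target / filename).write_text(response.text)

if __name__ == "__main__":
    capture_fixtures()
```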
## 3. Council Data Project (CDP) Integration

Repository: CouncilDataProject/cdp-scrapers
License: MIT (✅ Compatible)

### What to Adopt:

#### A. Generic Ingestion Pipeline

```python
# CDP has a beautiful generic scraper pipeline
# cdp_scrapers/scraper_utils.py
from dataclasses import dataclass
from datetime import datetime
from typing import Any, List, Optional

@dataclass
class Session:
    video_uri: Optional[str]
    session_datetime: datetime
    session_index: int
    caption_uri: Optional[str]

@dataclass
class MinutesItem:
    # Minimal stub so this example is self-contained;
    # CDP's real model carries more fields.
    name: str

@dataclass
class EventMinutesItem:
    name: str
    minutes_item: MinutesItem

@dataclass
class IngestionModel:
    """Standard format for ingested data."""
    sessions: List[Session]  # Individual meetings

def reduced_list(items: List[Any], key_attr: str) -> List[Any]:
    """Deduplicate items by a key attribute."""
    seen = set()
    result = []
    for item in items:
        key = getattr(item, key_attr)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result
```
Your Action: Create models/ingestion.py based on their schemas
#### B. Video Transcript Integration (Future)

```python
# CDP processes meeting videos into searchable transcripts
# This is advanced but incredibly valuable
# They use:
# - AWS Transcribe / Google Speech-to-Text
# - Sentence indexing with timestamps
# - Speaker diarization (who said what)
# You could add this in a later phase (see Phase 4) after document scraping works
```
Your Action: Document in docs/ROADMAP.md for future implementation
## 4. Engagic Integration

Repository: Engagic/engagic
License: Check repo (likely AGPL)

### What to Adopt:

#### A. "Matter" Tracking Across Meetings

```python
# Engagic tracks individual legislative items across meetings
# This is PERFECT for oral health policy tracking
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Matter:
    matter_id: str
    matter_number: str  # "Bill 2024-001"
    title: str
    type: str  # "Ordinance", "Resolution", "Motion"
    first_introduced: datetime
    status: str  # "Introduced", "Committee", "Passed", "Failed"
    votes: List["Vote"]  # Vote model shown in the Councilmatic section below
    related_documents: List[str]

# Track how a fluoridation ordinance evolves:
# Meeting 1: Introduced (just mentioned in minutes)
# Meeting 2: Committee review (document link added)
# Meeting 3: Public hearing (comments recorded)
# Meeting 4: Final vote (result captured)
```
Your Action: Create models/matter.py for tracking policy evolution
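To make the evolution shown in the comments above concrete, the scraper can fold each sighting of a matter into a chronological status history. A minimal sketch of one way to do it (MatterHistory and merge_sighting are illustrative names, not Engagic's API):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Tuple

@dataclass
class MatterHistory:
    """Illustrative: one matter plus every (date, status) we have seen."""
    matter_number: str
    title: str
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

def merge_sighting(
    histories: Dict[str, MatterHistory],
    matter_number: str,
    title: str,
    seen_at: datetime,
    status: str,
) -> None:
    """Record one appearance of a matter in one meeting's documents."""
    history = histories.setdefault(
        matter_number, MatterHistory(matter_number=matter_number, title=title)
    )
    history.timeline.append((seen_at, status))
    history.timeline.sort(key=lambda entry: entry[0])  # keep chronological
```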
#### B. LLM-Powered Document Parsing

```python
# Engagic uses LLMs to extract structure from "blob" PDFs
# You already have OpenAI configured!
import json
from typing import List

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def extract_agenda_items(pdf_text: str) -> List[dict]:
    """Use GPT to extract structured items from unstructured text."""
    # Returns raw dicts; map them into your AgendaItem model downstream.
    prompt = """
    Extract agenda items from this meeting minutes text.
    For each item, identify:
    - Item number
    - Title
    - Description
    - Any votes or decisions
    - Keywords related to health, dental, fluoride, water, public health
    Return a JSON object with an "items" array.
    """
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You extract structured data from government documents"},
            {"role": "user", "content": f"{prompt}\n\n{pdf_text}"}
        ],
        response_format={"type": "json_object"}  # requires an object, not a bare array
    )
    return json.loads(response.choices[0].message.content)["items"]
```
Your Action: Add extraction/llm_parser.py using your existing OpenAI setup
## 5. Councilmatic Integration

Repository: datamade/councilmatic-starter-template
License: MIT (✅ Compatible)

### What to Adopt:

#### A. Person/Organization Tracking

```python
# Councilmatic tracks who voted on what
# Useful for understanding power dynamics around oral health policy
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Person:
    name: str
    role: str  # "Council Member", "Mayor", "Commissioner"
    district: Optional[str]
    party: Optional[str]

@dataclass
class Vote:
    motion: str
    option: str  # "yes", "no", "abstain"
    person: Person
    date: datetime
```
Your Action: Add to models/governance.py
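With Person and Vote in place, tallying outcomes per motion takes only a few lines. A minimal sketch (tally_votes is an illustrative helper, not part of Councilmatic; handy for flagging close fluoridation votes):

```python
from collections import Counter
from typing import Dict, List

def tally_votes(votes: List[Vote]) -> Dict[str, Counter]:
    """Count yes/no/abstain per motion."""
    tallies: Dict[str, Counter] = {}
    for vote in votes:
        tallies.setdefault(vote.motion, Counter())[vote.option] += 1
    return tallies
```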
#### B. Search Interface Patterns

```python
# They have excellent search UX
# filters.py shows what users want:
from datetime import date
from typing import List, Optional

from fastapi import FastAPI, Query

SEARCH_FILTERS = [
    "date_range",
    "topic",          # ["health", "water", "budget"]
    "organization",   # Which board/commission
    "document_type",  # ["agenda", "minutes", "transcript"]
    "status",         # ["pending", "passed", "failed"]
]

# Your FastAPI endpoints could mirror this
app = FastAPI()

@app.get("/api/search")
async def search_documents(
    query: str,
    topics: List[str] = Query(default=["oral_health", "fluoridation"]),
    date_from: Optional[date] = None,
    date_to: Optional[date] = None,
    state: Optional[str] = None,
):
    """Search scraped documents with filters."""
    ...  # Query your Delta Lake Gold layer
```

Your Action: Add to api/routes/search.py (create it if it doesn't exist)
## Implementation Priorities

### Phase 1: Foundation (Week 1)

- Platform Detection - Add discovery/platform_detector.py from Civic Scraper patterns
- Standardized Schema - Create models/meeting_event.py from City Scrapers
- Enhanced Downloader - Improve agents/scraper.py retry logic

### Phase 2: Scraping (Weeks 2-3)

- Legistar Scraper - Implement full Legistar support using Civic Scraper patterns
- Generic HTML Parser - Use BeautifulSoup patterns from City Scrapers
- PDF Extraction - Add PyPDF2/pdfplumber support (see the sketch below)
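For the PDF extraction item, pdfplumber's page-level API is enough to get agenda text out. A minimal sketch (extract_pdf_text is an illustrative helper name; image-only pages return no text and would need OCR instead):

```python
import pdfplumber

def extract_pdf_text(path: str) -> str:
    """Pull plain text out of an agenda/minutes PDF, page by page."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()  # returns None for image-only pages
            if text:
                pages.append(text)
    return "\n\n".join(pages)
```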
### Phase 3: Intelligence (Week 4)

- LLM Parser - Add extraction/llm_parser.py from Engagic patterns
- Matter Tracking - Create models/matter.py for policy evolution
- Keyword Detection - Detect oral health, fluoridation, and dental policy language (see the sketch below)
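Keyword detection can start as a plain substring scan that feeds the keywords_found and confidence_score fields on MeetingEvent. A minimal sketch (the keyword list and the 3-hit saturation point are illustrative choices to tune against real agendas):

```python
from typing import List, Tuple

# Illustrative starter list; tune against real agendas.
ORAL_HEALTH_KEYWORDS = [
    "fluorid",  # matches fluoride / fluoridation / fluoridated
    "dental",
    "oral health",
    "community water",
    "water treatment",
    "dentist",
]

def find_keywords(text: str) -> Tuple[List[str], float]:
    """Return matched keywords and a crude 0-1 relevance score."""
    lowered = text.lower()
    hits = [kw for kw in ORAL_HEALTH_KEYWORDS if kw in lowered]
    score = min(1.0, len(hits) / 3)  # saturates at 3 distinct keywords
    return hits, score
```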
### Phase 4: Scale (Week 5+)

- Test All 76 URLs - Run the full scraper on discovered targets
- Expand to All Municipalities - Process all 32,333 jurisdictions
- Video Transcripts - CDP-style video processing (future)
## Code Snippets to Add Now

### 1. Platform Detector

File: discovery/platform_detector.py

```python
"""
Platform detection for municipal websites.
Based on patterns from biglocalnews/civic-scraper.
"""
from typing import Optional

PLATFORM_PATTERNS = {
    'legistar': [
        'legistar.com',
        '/Legistar/',
        '/LegislationDetail.aspx',
        '/Calendar.aspx',
    ],
    'granicus': [
        'granicus.com',
        '/Mediasite/',
        '/ViewPublisher.php',
    ],
    'municode': [
        'municode.com',
        '/meeting_minutes',
    ],
    'civicplus': [
        'civicplus.com',
        '/AgendaCenter/',
        '/DocumentCenter/',
    ],
}

def detect_platform(url: str) -> Optional[str]:
    """
    Detect which platform a municipality website uses.

    Args:
        url: Municipality website URL

    Returns:
        Platform name or None if unknown
    """
    url_lower = url.lower()
    for platform, patterns in PLATFORM_PATTERNS.items():
        if any(pattern.lower() in url_lower for pattern in patterns):
            return platform
    return None

def get_scraper_class(platform: str):
    """Get the appropriate scraper class for a platform."""
    from scrapers.legistar import LegistarScraper
    from scrapers.granicus import GranicusScraper
    from scrapers.generic import GenericScraper

    scrapers = {
        'legistar': LegistarScraper,
        'granicus': GranicusScraper,
    }
    return scrapers.get(platform, GenericScraper)
```
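A quick smoke test of the detector, with illustrative URLs:

```python
from discovery.platform_detector import detect_platform

assert detect_platform("https://chicago.legistar.com/Calendar.aspx") == "legistar"
assert detect_platform("https://somecity.gov/AgendaCenter/") == "civicplus"
assert detect_platform("https://somecity.gov/contact") is None
```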
### 2. Meeting Event Model

File: models/meeting_event.py

```python
"""
Standardized meeting event model.
Based on City Scrapers schema.
"""
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, List, Dict, Any

@dataclass
class Location:
    name: str
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None

@dataclass
class Link:
    title: str  # "Agenda", "Minutes", "Video"
    href: str
    content_type: Optional[str] = None  # "application/pdf", "text/html"

@dataclass
class MeetingEvent:
    """
    Normalized representation of a government meeting.
    Compatible with City Scrapers format.
    """
    # Core identification
    id: str  # Hash of source_url + start_time
    title: str
    description: str
    classification: str  # "Board", "Commission", "Council", "Committee"

    # Temporal
    start: datetime

    # Spatial (no default, so it must precede the defaulted fields below)
    location: Location

    # Temporal (optional)
    end: Optional[datetime] = None
    all_day: bool = False

    # Content
    links: List[Link] = field(default_factory=list)
    source: str = ""  # Original URL

    # Metadata
    jurisdiction_name: str = ""
    state_code: str = ""
    fips_code: Optional[str] = None
    scraped_at: datetime = field(default_factory=datetime.utcnow)

    # Health policy relevance (your special sauce!)
    oral_health_relevant: bool = False
    keywords_found: List[str] = field(default_factory=list)
    confidence_score: float = 0.0

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for Delta Lake storage."""
        return {
            'id': self.id,
            'title': self.title,
            'description': self.description,
            'classification': self.classification,
            'start': self.start.isoformat(),
            'end': self.end.isoformat() if self.end else None,
            'all_day': self.all_day,
            'location_name': self.location.name,
            'location_address': self.location.address,
            'links': [{'title': link.title, 'href': link.href} for link in self.links],
            'source': self.source,
            'jurisdiction_name': self.jurisdiction_name,
            'state_code': self.state_code,
            'fips_code': self.fips_code,
            'scraped_at': self.scraped_at.isoformat(),
            'oral_health_relevant': self.oral_health_relevant,
            'keywords_found': self.keywords_found,
            'confidence_score': self.confidence_score,
        }
```
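The id field above is documented as a hash of source_url + start_time; a minimal sketch of that helper (make_event_id is an illustrative name):

```python
import hashlib
from datetime import datetime

def make_event_id(source_url: str, start: datetime) -> str:
    """Stable id so re-scrapes upsert rather than duplicate."""
    raw = f"{source_url}|{start.isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```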
### 3. Enhanced Discovery Pipeline

Add to: discovery/discovery_pipeline.py

```python
async def discover_platform_capabilities(self):
    """
    For each discovered URL, detect which platform it uses.
    This prepares optimal scraping strategies.
    """
    from discovery.platform_detector import detect_platform

    logger.info("Detecting platforms for discovered URLs...")

    silver_path = f"{settings.delta_lake_path}/silver/discovered_urls"
    urls_df = self.spark.read.format("delta").load(silver_path)

    enriched_urls = []
    for row in urls_df.collect():  # collect() instead of take(count())
        row_dict = row.asDict()
        url = row_dict['url']

        # Detect platform
        platform = detect_platform(url)
        row_dict['platform'] = platform if platform else 'generic'
        row_dict['scraper_ready'] = platform is not None
        enriched_urls.append(row_dict)

    # Write back to the Silver layer with platform info
    from pyspark.sql import Row
    enriched_df = self.spark.createDataFrame([Row(**u) for u in enriched_urls])
    enriched_df.write.format("delta").mode("overwrite").save(silver_path)

    logger.success(f"Platform detection complete - {len(enriched_urls)} URLs analyzed")
    return enriched_urls
```
## Next Steps

- Review Licenses - Most of the projects above use permissive licenses (MIT/Apache 2.0), but double-check each one; Engagic is likely AGPL
- Clone Repos Locally - Study their code structure:

  ```bash
  cd /tmp
  git clone https://github.com/biglocalnews/civic-scraper
  git clone https://github.com/city-scrapers/city-scrapers
  ```

- Add Attribution - In your README.md, credit these projects
- Start with Platform Detector - Implement discovery/platform_detector.py first
- Test with Your 76 URLs - Run platform detection on your discovered URLs
## Resources
- Civic Scraper Docs: https://github.com/biglocalnews/civic-scraper/wiki
- City Scrapers Tutorial: https://cityscrapers.org/docs/development/
- CDP Architecture: https://councildataproject.org/
- Legistar API Docs: https://webapi.legistar.com/Home/Examples
## Questions to Consider
- Do you want video transcript support? (CDP pattern, requires AWS/GCP credits)
- How important is real-time tracking? (vs batch processing)
- Will you expose a public API? (Councilmatic patterns useful here)
- Need to track voting records? (Councilmatic person/vote models)
Let me know which phase you want to implement first!