Skip to main content

Integration Guide: Reusing Open-Source Municipal Scraping Logic

Overview​

This guide shows how to integrate proven patterns from established open-source projects into the Oral Health Policy Pulse scraping pipeline.

Current State​

✅ You already have:

  • Census Gazetteer data with 85,302 jurisdictions (names + FIPS codes)
  • GSA .gov domain matching
  • 76 discovered URLs ready for scraping
  • Legistar platform references in codebase
  • Base ScraperAgent class in agents/scraper.py

1. Civic Scraper Integration​

Repository: biglocalnews/civic-scraper License: Apache 2.0 (✅ Compatible)

What to Adopt:​

A. Platform Detection Logic​

# They have excellent platform detection
# Location: civic_scraper/platforms/__init__.py

PLATFORMS = {
'legistar': LegistarScraper,
'granicus': GranicusScraper,
'calagenda': CalAgendaScraper,
'civicplus': CivicPlusScraper
}

def detect_platform(url: str) -> Optional[str]:
"""Auto-detect which platform a URL uses"""
if 'legistar.com' in url or '/Legistar/' in url:
return 'legistar'
elif 'granicus.com' in url or '/Mediasite/' in url:
return 'granicus'
# ... more patterns

Your Action: Add discovery/platform_detector.py using their patterns

B. Document Downloader with Retry Logic​

# civic_scraper/download.py has robust downloading
# Features:
# - Exponential backoff
# - Content-type validation
# - Duplicate detection via hash
# - Progress tracking

async def download_document(url: str, session: httpx.AsyncClient) -> bytes:
"""Download with retries and validation"""
for attempt in range(3):
try:
response = await session.get(url, timeout=30.0)
response.raise_for_status()

# Validate it's actually a document
content_type = response.headers.get('content-type', '')
if 'pdf' in content_type or 'html' in content_type:
return response.content
except Exception as e:
if attempt == 2:
raise
await asyncio.sleep(2 ** attempt)

Your Action: Enhance agents/scraper.py with their retry patterns


2. City Scrapers Integration​

Repository: city-scrapers/city-scrapers License: MIT (✅ Compatible)

What to Adopt:​

A. Standardized Event Schema​

# They normalize all meeting data to a common format
# city_scrapers/core/models.py

@dataclass
class Event:
title: str
description: str
classification: str # "Board", "Commission", "Council"
start: datetime
end: Optional[datetime]
all_day: bool
location: Dict[str, Any]
links: List[Dict[str, str]] # [{"title": "Agenda", "href": "..."}]
source: str

# Classification types they use:
CLASSIFICATIONS = [
"Board",
"Commission",
"Committee",
"Council",
"Town Hall",
"Public Hearing"
]

Your Action: Create models/meeting_event.py with this schema for your Silver layer

B. Scraper Testing Framework​

# They have excellent test patterns
# tests/test_scrapers.py

def test_scraper():
"""Test with frozen HTML responses"""
scraper = CityScraper()

# Use saved HTML files to avoid live requests during testing
with open('tests/fixtures/sample_calendar.html') as f:
results = scraper.parse(f.read())

assert len(results) > 0
assert results[0].title
assert results[0].source

Your Action: Add tests/fixtures/ directory with sample HTML from different platforms


3. Council Data Project (CDP) Integration​

Repository: CouncilDataProject/cdp-scrapers License: MIT (✅ Compatible)

What to Adopt:​

A. Generic Ingestion Pipeline​

# CDP has a beautiful generic scraper pipeline
# cdp_scrapers/scraper_utils.py

class IngestionModel:
"""Standard format for ingested data"""
sessions: List[Session] # Individual meetings

@dataclass
class Session:
video_uri: Optional[str]
session_datetime: datetime
session_index: int
caption_uri: Optional[str]

@dataclass
class EventMinutesItem:
name: str
minutes_item: MinutesItem


def reduced_list(items: List[Any], key_attr: str) -> List[Any]:
"""Deduplicate items by a key attribute"""
seen = set()
result = []
for item in items:
key = getattr(item, key_attr)
if key not in seen:
seen.add(key)
result.append(item)
return result

Your Action: Create models/ingestion.py based on their schemas

B. Video Transcript Integration (Future)​

# CDP processes meeting videos into searchable transcripts
# This is advanced but incredibly valuable

# They use:
# - AWS Transcribe / Google Speech-to-Text
# - Sentence indexing with timestamps
# - Speaker diarization (who said what)

# You could add this in Phase 2 after document scraping works

Your Action: Document in docs/ROADMAP.md for future implementation


4. Engagic Integration​

Repository: Engagic/engagic License: Check repo (likely AGPL)

What to Adopt:​

A. "Matter" Tracking Across Meetings​

# Engagic tracks individual legislative items across meetings
# This is PERFECT for oral health policy tracking

@dataclass
class Matter:
matter_id: str
matter_number: str # "Bill 2024-001"
title: str
type: str # "Ordinance", "Resolution", "Motion"
first_introduced: datetime
status: str # "Introduced", "Committee", "Passed", "Failed"
votes: List[Vote]
related_documents: List[str]

# Track how a fluoridation ordinance evolves:
# Meeting 1: Introduced (just mentioned in minutes)
# Meeting 2: Committee review (document link added)
# Meeting 3: Public hearing (comments recorded)
# Meeting 4: Final vote (result captured)

Your Action: Create models/matter.py for tracking policy evolution

B. LLM-Powered Document Parsing​

# Engagic uses LLMs to extract structure from "blob" PDFs
# You already have OpenAI configured!

async def extract_agenda_items(pdf_text: str) -> List[AgendaItem]:
"""Use GPT to extract structured items from unstructured text"""
prompt = """
Extract agenda items from this meeting minutes text.
For each item, identify:
- Item number
- Title
- Description
- Any votes or decisions
- Keywords related to health, dental, fluoride, water, public health

Return JSON array.
"""

response = await openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You extract structured data from government documents"},
{"role": "user", "content": f"{prompt}\n\n{pdf_text}"}
],
response_format={"type": "json_object"}
)

return json.loads(response.choices[0].message.content)

Your Action: Add extraction/llm_parser.py using your existing OpenAI setup


5. Councilmatic Integration​

Repository: datamade/councilmatic-starter-template License: MIT (✅ Compatible)

What to Adopt:​

A. Person/Organization Tracking​

# Councilmatic tracks who voted on what
# Useful for understanding power dynamics around oral health policy

@dataclass
class Person:
name: str
role: str # "Council Member", "Mayor", "Commissioner"
district: Optional[str]
party: Optional[str]

@dataclass
class Vote:
motion: str
option: str # "yes", "no", "abstain"
person: Person
date: datetime

Your Action: Add to models/governance.py

B. Search Interface Patterns​

# They have excellent search UX
# filters.py shows what users want:

SEARCH_FILTERS = [
"date_range",
"topic", # ["health", "water", "budget"]
"organization", # Which board/commission
"document_type", # ["agenda", "minutes", "transcript"]
"status", # ["pending", "passed", "failed"]
]

# Your FastAPI endpoints could mirror this
@app.get("/api/search")
async def search_documents(
query: str,
topics: List[str] = Query(default=["oral_health", "fluoridation"]),
date_from: Optional[date] = None,
date_to: Optional[date] = None,
state: Optional[str] = None
):
"""Search scraped documents with filters"""
# Query your Delta Lake Gold layer

Your Action: Add to api/routes/search.py (create if doesn't exist)


Implementation Priorities​

Phase 1: Foundation (Week 1)​

  • Platform Detection - Add discovery/platform_detector.py from Civic Scraper patterns
  • Standardized Schema - Create models/meeting_event.py from City Scrapers
  • Enhanced Downloader - Improve agents/scraper.py retry logic

Phase 2: Scraping (Week 2-3)​

  • Legistar Scraper - Implement full Legistar support using Civic Scraper patterns
  • Generic HTML Parser - Use BeautifulSoup patterns from City Scrapers
  • PDF Extraction - Add PyPDF2/pdfplumber support

Phase 3: Intelligence (Week 4)​

  • LLM Parser - Add extraction/llm_parser.py from Engagic patterns
  • Matter Tracking - Create models/matter.py for policy evolution
  • Keyword Detection - Oral health, fluoridation, dental policy detection

Phase 4: Scale (Week 5+)​

  • Test All 76 URLs - Run full scraper on discovered targets
  • Expand to All Municipalities - Process all 32,333 jurisdictions
  • Video Transcripts - CDP-style video processing (future)

Code Snippets to Add Now​

1. Platform Detector​

File: discovery/platform_detector.py

"""
Platform detection for municipal websites.
Based on patterns from biglocalnews/civic-scraper.
"""
from typing import Optional
from urllib.parse import urlparse

PLATFORM_PATTERNS = {
'legistar': [
'legistar.com',
'/Legistar/',
'/LegislationDetail.aspx',
'/Calendar.aspx'
],
'granicus': [
'granicus.com',
'/Mediasite/',
'/ViewPublisher.php'
],
'municode': [
'municode.com',
'/meeting_minutes'
],
'civicplus': [
'civicplus.com',
'/AgendaCenter/',
'/DocumentCenter/'
]
}

def detect_platform(url: str) -> Optional[str]:
"""
Detect which platform a municipality website uses.

Args:
url: Municipality website URL

Returns:
Platform name or None if unknown
"""
url_lower = url.lower()

for platform, patterns in PLATFORM_PATTERNS.items():
if any(pattern.lower() in url_lower for pattern in patterns):
return platform

return None


def get_scraper_class(platform: str):
"""Get appropriate scraper class for platform"""
from scrapers.legistar import LegistarScraper
from scrapers.granicus import GranicusScraper
from scrapers.generic import GenericScraper

scrapers = {
'legistar': LegistarScraper,
'granicus': GranicusScraper
}

return scrapers.get(platform, GenericScraper)

2. Meeting Event Model​

File: models/meeting_event.py

"""
Standardized meeting event model.
Based on City Scrapers schema.
"""
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, List, Dict, Any

@dataclass
class Location:
name: str
address: Optional[str] = None
city: Optional[str] = None
state: Optional[str] = None

@dataclass
class Link:
title: str # "Agenda", "Minutes", "Video"
href: str
content_type: Optional[str] = None # "application/pdf", "text/html"

@dataclass
class MeetingEvent:
"""
Normalized representation of a government meeting.
Compatible with City Scrapers format.
"""
# Core identification
id: str # Hash of source_url + start_time
title: str
description: str
classification: str # "Board", "Commission", "Council", "Committee"

# Temporal
start: datetime
end: Optional[datetime] = None
all_day: bool = False

# Spatial
location: Location

# Content
links: List[Link] = field(default_factory=list)
source: str = "" # Original URL

# Metadata
jurisdiction_name: str = ""
state_code: str = ""
fips_code: Optional[str] = None
scraped_at: datetime = field(default_factory=datetime.utcnow)

# Health policy relevance (your special sauce!)
oral_health_relevant: bool = False
keywords_found: List[str] = field(default_factory=list)
confidence_score: float = 0.0

def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary for Delta Lake storage"""
return {
'id': self.id,
'title': self.title,
'description': self.description,
'classification': self.classification,
'start': self.start.isoformat(),
'end': self.end.isoformat() if self.end else None,
'all_day': self.all_day,
'location_name': self.location.name,
'location_address': self.location.address,
'links': [{'title': l.title, 'href': l.href} for l in self.links],
'source': self.source,
'jurisdiction_name': self.jurisdiction_name,
'state_code': self.state_code,
'fips_code': self.fips_code,
'scraped_at': self.scraped_at.isoformat(),
'oral_health_relevant': self.oral_health_relevant,
'keywords_found': self.keywords_found,
'confidence_score': self.confidence_score
}

3. Enhanced Discovery Pipeline​

Add to: discovery/discovery_pipeline.py

async def discover_platform_capabilities(self):
"""
For each discovered URL, detect which platform it uses.
This prepares optimal scraping strategies.
"""
from discovery.platform_detector import detect_platform

logger.info("Detecting platforms for discovered URLs...")

silver_path = f"{settings.delta_lake_path}/silver/discovered_urls"
urls_df = self.spark.read.format("delta").load(silver_path)

enriched_urls = []
for row in urls_df.take(urls_df.count()):
row_dict = row.asDict()
url = row_dict['url']

# Detect platform
platform = detect_platform(url)
row_dict['platform'] = platform if platform else 'generic'
row_dict['scraper_ready'] = platform is not None

enriched_urls.append(row_dict)

# Write back to Silver layer with platform info
from pyspark.sql import Row
enriched_df = self.spark.createDataFrame([Row(**u) for u in enriched_urls])
enriched_df.write.format("delta").mode("overwrite").save(silver_path)

logger.success(f"Platform detection complete - {len(enriched_urls)} URLs analyzed")

return enriched_urls

Next Steps​

  1. Review Licenses - All mentioned projects use permissive licenses (MIT/Apache 2.0), but double-check
  2. Clone Repos Locally - Study their code structure:
    cd /tmp
    git clone https://github.com/biglocalnews/civic-scraper
    git clone https://github.com/city-scrapers/city-scrapers
  3. Add Attribution - In your README.md, credit these projects
  4. Start with Platform Detector - Implement discovery/platform_detector.py first
  5. Test with Your 76 URLs - Run platform detection on your discovered URLs

Resources​


Questions to Consider​

  1. Do you want video transcript support? (CDP pattern, requires AWS/GCP credits)
  2. How important is real-time tracking? (vs batch processing)
  3. Will you expose a public API? (Councilmatic patterns useful here)
  4. Need to track voting records? (Councilmatic person/vote models)

Let me know which phase you want to implement first!