Jurisdiction Discovery - Quick Start Guide
No External APIs Required! 🎉
This discovery system uses pattern-based matching and public datasets only. No search API keys needed!
Quick Start
1. Install Dependencies
All required packages are in requirements.txt:
pip install -r requirements.txt
Key packages:
- `httpx` - HTTP client for URL verification
- `beautifulsoup4` - HTML parsing for web crawling
- `pyspark` - data processing
- `delta-spark` - Delta Lake storage
2. Initialize Delta Lake
python main.py init
3. Run Discovery
# Test with 100 jurisdictions
python main.py discover-jurisdictions --limit 100
# Single state
python main.py discover-jurisdictions --state CA
# Full discovery (~30k jurisdictions, 12-18 hours)
python main.py discover-jurisdictions
4. View Results
python main.py discovery-stats
Expected output:
📊 Jurisdiction Discovery Statistics
Bronze Layer (Raw Data):
Total jurisdictions: 90,735
- county: 3,143
- municipality: 19,495
- school_district: 13,051
Silver Layer (Discovered URLs):
Total discoveries: 87
Homepages found: 78 (89.7%)
Minutes URLs found: 65 (74.7%)
Avg confidence: 0.82
Gold Layer (Scraping Targets):
Total targets: 65
High priority: 42
5. Start Scraping
python main.py scrape-batch --source discovered --limit 50
How It Works
Strategy 1: GSA Domain Matching
The system directly matches jurisdiction names to the GSA .gov registry:
"Sacramento County" → normalized: "sacramento"
GSA lookup → "sacramento.gov" ✓
Confidence: 1.0
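The GSA matching step above can be sketched as a normalization pass plus a set lookup. This is a minimal illustration; `normalize_name` and `gsa_lookup` are hypothetical names, not the actual module API.

```python
# Sketch of Strategy 1: normalize a jurisdiction name and check it
# against a preloaded set of GSA .gov registry domains.
import re

def normalize_name(jurisdiction):
    """Lowercase, drop suffixes like 'County', strip non-alphanumerics."""
    name = jurisdiction.lower()
    name = re.sub(r"\b(county|city|town|village|borough)\b", "", name)
    return re.sub(r"[^a-z0-9]", "", name)

def gsa_lookup(jurisdiction, gsa_domains):
    """Return the matching .gov domain if registered, else None."""
    candidate = normalize_name(jurisdiction) + ".gov"
    return candidate if candidate in gsa_domains else None

registry = {"sacramento.gov", "fresno.gov"}
print(gsa_lookup("Sacramento County", registry))  # sacramento.gov
```

A direct registry hit is why this strategy gets confidence 1.0: the domain is known to be a registered government site.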
Strategy 2: URL Pattern Generation
Common government URL patterns are tested:
Counties:
- `co.{name}.{state}.us`
- `{name}county.gov`
Cities:
- `www.{name}.gov`
- `cityof{name}.gov`
School Districts:
- `{name}.k12.{state}.us`
- `{name}schools.org`
Example:
"Fresno" (municipality, CA)
Test: https://www.fresno.gov → ✓ Found
Confidence: 0.9
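A sketch of how the pattern tables above expand into candidate URLs. Confidence values are illustrative, and the real agent verifies each candidate with an HTTP request before accepting it; `candidate_urls` is a hypothetical helper.

```python
# Sketch of Strategy 2: expand a jurisdiction into candidate URLs
# using per-type pattern tables.
def candidate_urls(name, state, jtype):
    slug = name.lower().replace(" ", "")
    st = state.lower()
    patterns = {
        "county": [
            (f"https://www.co.{slug}.{st}.us", 0.80),
            (f"https://www.{slug}county.gov", 0.85),
        ],
        "municipality": [
            (f"https://www.{slug}.gov", 0.90),
            (f"https://www.cityof{slug}.gov", 0.85),
        ],
        "school_district": [
            (f"https://www.{slug}.k12.{st}.us", 0.80),
            (f"https://www.{slug}schools.org", 0.75),
        ],
    }
    return patterns.get(jtype, [])

print(candidate_urls("Fresno", "CA", "municipality")[0])
# ('https://www.fresno.gov', 0.9)
```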
Strategy 3: Web Crawling
Once a homepage is found:
- Crawl for "minutes", "agendas" links
- Detect CMS platforms (Granicus, CivicClerk, etc.)
- Boost confidence for .gov domains
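The crawl step can be sketched as a link scan plus a confidence bump. The production crawler uses `beautifulsoup4`; this sketch uses the stdlib parser to stay dependency-free, and `find_minutes_links`/`confidence_boost` are illustrative names.

```python
# Sketch of Strategy 3: scan homepage HTML for likely minutes/agenda
# links and boost confidence for .gov domains.
from html.parser import HTMLParser
from urllib.parse import urljoin

KEYWORDS = ("minutes", "agenda")

class MinutesLinkFinder(HTMLParser):
    """Collect hrefs that contain a meeting-related keyword."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if any(k in href.lower() for k in KEYWORDS):
            self.links.append(urljoin(self.base_url, href))

def find_minutes_links(base_url, html):
    finder = MinutesLinkFinder(base_url)
    finder.feed(html)
    return finder.links

def confidence_boost(url, base):
    """Add a small bonus for .gov domains, capped at 1.0."""
    return min(1.0, base + 0.1) if ".gov" in url else base

html = '<a href="/government/meeting-minutes">Meeting Minutes</a>'
print(find_minutes_links("https://www.fresno.gov", html))
```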
Performance
Expected Results
- Counties: 85-95% discovery rate
- Cities > 10k: 75-90% discovery rate
- School Districts: 70-85% discovery rate
- Processing Time: ~3-5 min per 100 jurisdictions
- Total Cost: $0 (no API fees!)
Optimization
Parallel Processing:
# Process multiple states in parallel
for state in CA TX NY FL PA; do
python main.py discover-jurisdictions --state $state &
done
wait
Databricks Notebook: For production runs, use the Databricks notebook:
- Upload `notebooks/Jurisdiction_Discovery.py`
- Create cluster (2-4 workers)
- Run with Spark parallel processing
Troubleshooting
Low Discovery Rate
Check if URL patterns need adjustment for specific regions:
# In discovery/url_discovery_agent.py
# Add regional patterns, e.g.:
if state == "MA":  # Massachusetts has unique patterns
    patterns.extend([
        (f"https://www.{name_slug}.ma.us", 0.85),
    ])
Memory Errors
Process in smaller batches:
# By state
python main.py discover-jurisdictions --state CA
# Or by type
python main.py discover-jurisdictions --type county
Census Download Fails
Census data is cached for 7 days by default. To download manually:
- Download from: https://www.census.gov/programs-surveys/gus.html
- Place in `data/cache/census/`
- Rerun discovery
Next Steps
- Test Discovery: run with `--limit 100`
- Review Results: check `discovery-stats`
- Full Run: remove the limit for production
- Start Scraping: Use discovered URLs
- Schedule Re-Discovery: Monthly updates
Cost
Total: $0 🎉
- No API fees
- Uses free public datasets
- Only local/cloud compute costs
Compared to the legacy search-API approach:
- Google Search API: $150
- Bing Search API: $90
- Pattern Matching: $0
Ready to discover 90,000+ government websites with zero external dependencies! 🚀
Jurisdiction Discovery System - Setup Guide
Quick Start
1. Configure Search APIs
The discovery system requires search API keys to find government websites. You can use either Google Custom Search or Bing Search (or both for redundancy).
Option A: Google Custom Search API
1. Enable the API
- Visit Google Cloud Console
- Create a new project or select existing
- Enable "Custom Search API"
2. Create an API Key
- Go to "Credentials" → "Create Credentials" → "API Key"
- Copy your API key
3. Create a Search Engine
- Visit Google Custom Search
- Click "Add" to create new search engine
- Set "Sites to search" to `*.gov` (to focus on government sites)
- Copy your "Search Engine ID"
4. Add to `.env`

GOOGLE_SEARCH_API_KEY=your_google_api_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
Pricing: First 100 queries/day free, then $5 per 1,000 queries
Option B: Bing Search API
1. Create an Azure Account
- Visit Azure Portal
- Create account (free tier available)
2. Create a Bing Search Resource
- Click "Create a resource" → Search for "Bing Search v7"
- Select pricing tier (F1 free tier: 1k queries/month)
- Create resource
3. Get the API Key
- Go to your Bing Search resource
- Click "Keys and Endpoint"
- Copy one of the keys
4. Add to `.env`

BING_SEARCH_API_KEY=your_bing_api_key
Pricing: Free tier: 1,000 queries/month; Paid: $3 per 1,000 queries
2. Install Dependencies
All required packages are already in requirements.txt:
pip install -r requirements.txt
Key packages for discovery:
- `httpx==0.27.0` - async HTTP client
- `beautifulsoup4==4.12.2` - HTML parsing
- `pyspark==3.5.0` - data processing
- `delta-spark==3.0.0` - Delta Lake
3. Initialize Delta Lake
python main.py init
This creates the necessary Delta Lake tables.
4. Run Discovery Pipeline
Test Run (100 jurisdictions)
python main.py discover-jurisdictions --limit 100
Expected output:
📊 Bronze Layer Complete:
Total records: 90,735
Counties: 3,143
Municipalities: 19,495
...
📊 URL Discovery Complete:
Attempted: 100
Successful: 87
Homepages found: 87
Minutes URLs found: 65
Avg confidence: 0.72
📊 Gold Layer Complete:
Scraping targets created: 65
High priority (>150): 42
...
✅ Discovery Complete!
State-Specific Discovery
python main.py discover-jurisdictions --state CA
Full Production Run
# Discovers all ~30,000 high-priority jurisdictions
# Takes 4-6 hours with parallel processing
python main.py discover-jurisdictions
5. View Statistics
python main.py discovery-stats
Output:
📊 Jurisdiction Discovery Statistics
Bronze Layer (Raw Data):
Total jurisdictions: 90,735
- county: 3,143
- municipality: 19,495
- school_district: 13,051
- special_district: 38,542
- township: 16,504
Silver Layer (Discovered URLs):
Total discoveries: 27,483
Homepages found: 24,125 (87.8%)
Minutes URLs found: 18,562 (67.5%)
Avg confidence: 0.74
Gold Layer (Scraping Targets):
Total targets: 18,562
High priority: 12,340
- pending: 18,562
6. Start Scraping
# Scrape high-priority targets
python main.py scrape-batch --source discovered --limit 50 --priority 150
# Or scrape all pending targets (use with caution!)
python main.py scrape-batch --source discovered --limit 1000
Using Databricks Notebook
For production deployment on Databricks:
1. Upload Notebook

databricks workspace import notebooks/Jurisdiction_Discovery.py \
  -l PYTHON \
  -f SOURCE \
  /Users/your-email@company.com/Jurisdiction_Discovery

2. Configure Secrets

# Create secret scope
databricks secrets create-scope oral-health-app
# Add API keys
databricks secrets put-secret oral-health-app google-search-api-key
databricks secrets put-secret oral-health-app google-search-engine-id
databricks secrets put-secret oral-health-app bing-search-api-key
3. Create Cluster
- Runtime: 14.3 LTS or higher
- Node type: Standard_DS3_v2 (or similar)
- Workers: 2-4 (for parallel processing)
- Libraries: all from `requirements.txt`
4. Run Notebook
- Open notebook in Databricks workspace
- Attach to cluster
- Run all cells
Cost Estimation
API Costs
For discovering 30,000 jurisdictions:
| Provider | Free Tier | Paid Cost | Total Cost |
|---|---|---|---|
| Google | 100/day (3,000/month) | $5/1k | ~$135 |
| Bing | 1,000/month | $3/1k | ~$87 |
| Both | 4,000 free | rest on Bing | ~$78 |
Recommendation: Use both APIs to maximize free tier usage.
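One way to implement that recommendation is to spend both free tiers before paying. This is a sketch using the quota numbers from the table above; `plan_providers` is a hypothetical helper, not part of the actual codebase.

```python
# Sketch of splitting queries across providers to spend free quota first.
def plan_providers(num_queries, google_free_left, bing_free_left):
    """Assign each query to 'google', 'bing', or 'paid' overflow."""
    plan = []
    for _ in range(num_queries):
        if google_free_left > 0:
            plan.append("google")
            google_free_left -= 1
        elif bing_free_left > 0:
            plan.append("bing")
            bing_free_left -= 1
        else:
            plan.append("paid")  # overflow billed at Bing's $3/1k rate
    return plan

print(plan_providers(5, 2, 2))  # ['google', 'google', 'bing', 'bing', 'paid']
```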
Compute Costs
Local Development:
- Free (uses local resources)
- ~4-6 hours for full discovery
Databricks:
- Cluster: ~$2-4/hour
- Total: ~$8-24 for full discovery
- Can use spot instances to reduce cost
Re-discovery Schedule
- Monthly: Catch URL changes and new jurisdictions
- Cost: ~$10-20/month (many URLs cached)
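One way to automate the monthly run is a cron entry; the repo path and log file below are placeholders for your environment.

```shell
# m h dom mon dow  command
# Re-run discovery at 02:00 on the 1st of each month; append output to a log.
0 2 1 * * cd /path/to/repo && python main.py discover-jurisdictions >> logs/rediscovery.log 2>&1
```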
Troubleshooting
Low Discovery Rate
Problem: Only finding 30-40% of URLs
Solutions:
- Check API keys are correct
- Verify API quotas not exceeded
- Review failed discoveries:
from pyspark.sql.functions import col

silver_df = spark.read.format("delta").load("silver/discovered_urls")
failed = silver_df.filter(col("homepage_url").isNull())
failed.show(20, truncate=False)
Memory Errors
Problem: Out of memory during discovery
Solutions:
1. Process by state:

for state in CA TX NY FL PA OH IL MI NC GA; do
  python main.py discover-jurisdictions --state $state
done

2. Increase Spark memory:

spark = SparkSession.builder \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()

3. Use a Databricks cluster (more memory available)
API Rate Limits
Problem: Hitting rate limits too quickly
Solutions:
1. Reduce the batch size in `url_discovery_agent.py`:

batch_size = 5  # instead of 10

2. Add delays between batches:

await asyncio.sleep(1)  # after each batch

3. Use both Google and Bing to distribute the load
Census Data Download Fails
Problem: Census Bureau site unreachable
Solutions:
1. Use cached data (automatically cached for 7 days)

2. Download manually:

# Download files manually from the Census Bureau
# Place in data/cache/census/

3. Check the Census Bureau site status: https://www.census.gov/programs-surveys/gus.html
Monitoring Progress
Check Discovery Status
-- In Databricks SQL or Spark
SELECT
state,
COUNT(*) as total,
COUNT(homepage_url) as found,
ROUND(COUNT(homepage_url) * 100.0 / COUNT(*), 1) as success_rate
FROM silver.discovered_urls
GROUP BY state
ORDER BY success_rate DESC;
Track Scraping Progress
SELECT
scraping_status,
COUNT(*) as count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM gold.scraping_targets), 1) as pct
FROM gold.scraping_targets
GROUP BY scraping_status;
Next Steps
Once discovery is complete:
1. Review High-Priority Targets
- Check for false positives
- Validate CMS platform detection
2. Start Scraping
- Begin with top 100 high-priority sites
- Monitor document quality
- Adjust priority scores as needed
3. Schedule Automation
- Set up monthly re-discovery job
- Monitor for new jurisdictions
- Track URL changes
4. Integration
- Connect to existing scraper agents
- Feed documents to classification pipeline
- Generate advocacy opportunities
Support
For issues or questions:
- GitHub Issues: github.com/getcommunityone/open-navigator-for-engagement/issues
- Documentation: JURISDICTION_DISCOVERY.md
Ready to discover 90,000+ government websites! 🦷✨