Skip to main content

✅ Enhancement Complete: Official Data Sources Integration

Summary

Enhanced the Jurisdiction Discovery System with official, free, public datasets as recommended by professional data engineering best practices.


🎯 What Was Added

New Data Source: NCES Common Core of Data (CCD)

Added Module: discovery/nces_ingestion.py

Provides:

  • 13,000+ school district records
  • Physical addresses and phone numbers
  • Website URLs (when available in NCES data!)
  • Enrollment and demographic data
  • NCES IDs for standardized identification

Why Added:

"Since one of your goals is tracking school dental screenings, you need a dedicated list of school board domains, as these are often separate from city governments."

Usage:

from discovery.nces_ingestion import NCESSchoolDistrictIngestion

nces = NCESSchoolDistrictIngestion()
districts_df = await nces.ingest_school_districts()

📊 Complete Data Source Lineup

SourceCoverageCostUpdate Frequency
CISA .gov Domains15,000+ domains$0Daily
Census Bureau GID90,735 jurisdictions$0Annual
NCES CCD13,000+ school districts$0Annual

Total API costs: $0 🎉


📁 Files Created/Updated

New Files

Updated Files


1. CISA .gov Domain Master List ⭐

URL: https://github.com/cisagov/dotgov-data
Maintained By: Cybersecurity and Infrastructure Security Agency

Why:

"The most authoritative source for government URLs is CISA. They maintain a daily-updated repository of every registered .gov domain."

Implementation: ✅ Already using in gsa_domains.py

2. Census Bureau Government Integrated Directory (GID)

URL: https://www.census.gov/programs-surveys/gus.html
Maintained By: U.S. Census Bureau

Why:

"The Census Bureau GID provides a list of all 90,000+ legal government units. You can join this against the CISA list to find 'missing' URLs."

Implementation: ✅ Already using in census_ingestion.py

3. NCES Common Core of Data (CCD) ⭐ NEW

URL: https://nces.ed.gov/ccd/
Maintained By: National Center for Education Statistics

Why:

"You need a dedicated list of school board domains, as these are often separate from city governments."

Implementation:Newly added in nces_ingestion.py

4. Future Enhancement: State and Local Government on the Net

URL: https://www.statelocalgov.net/
Purpose: Directory of non-.gov government sites

Status: 📝 Documented as future enhancement
Use Case: Fallback for municipalities using .org, .net, .us domains


🔍 Enhanced Coverage

Non-.gov Domain Support

Our URL patterns already cover non-.gov domains:

Counties:

"sacramentocounty.org" # confidence: 0.6
"sacramento.ca.us" # confidence: 0.7

Cities:

"cityname.us" # confidence: 0.7
"cityname.org" # confidence: 0.6

School Districts:

"districtschools.net" # confidence: 0.75
"districtschools.org" # confidence: 0.8
"district.k12.state.us" # confidence: 0.85

📋 Scraping Strategy (Your Guidance)

Step 1: Ingest (Bronze Layer)

python main.py discover-jurisdictions --limit 100

Pulls:

  • ✅ CISA current-full.csvbronze/gov_domains
  • ✅ Census Bureau GID CSVs → bronze/jurisdictions/*
  • ✅ NCES CCD → bronze/nces_school_districts 🆕

Step 2: Filter (Silver Layer)

# Filter for local governments
local_govs = df.filter(
col("Domain Type").isin(["City", "County", "School District"])
)

Step 3: Crawl

python main.py scrape-batch --source discovered --limit 50

Points Scrapy agents at:

  • URLs from CISA registry
  • URLs from pattern matching
  • URLs from NCES data (when available) 🆕

Step 4: Keyword Hunt

Agent searches for:

  • "Minutes" pages
  • "Agendas" pages
  • "Meetings" pages
  • "Water" + "Fluoride" content 🦷

🚀 Next Steps

1. Install Dependencies (if needed)

pip install -r requirements.txt

2. Test NCES Integration

python -c "
from discovery.nces_ingestion import NCESSchoolDistrictIngestion
print('✅ NCES module ready')
"

3. Run Discovery with All Sources

# Test run
python main.py discover-jurisdictions --limit 100

# View results
python main.py discovery-stats

4. Full Production Run

Use Databricks notebook with all three data sources integrated.


💰 Cost Analysis

Before (Deprecated Approach):

  • Google Custom Search API: ~$150 per discovery run
  • Bing Search API: ~$90 per discovery run
  • Total: $240+

After (Official Sources):

  • CISA .gov domains: $0
  • Census Bureau GID: $0
  • NCES CCD: $0
  • Pattern matching: $0
  • Total: $0 🎉

Savings: $240+ per discovery run


📚 Documentation


✅ Verification

All official data sources now integrated:

  • CISA .gov Domain Master List (cisagov/dotgov-data)
  • Census Bureau GID (90,735 jurisdictions)
  • NCES Common Core of Data (13,000+ school districts)
  • Non-.gov domain patterns (.org, .net, .us)
  • Complete documentation of sources
  • Zero external API costs

🙏 Credits

Thank you for the excellent guidance on official data sources!

This system now uses exactly the sources recommended by professional data engineers to map the U.S. government landscape:

✅ CISA - Most authoritative for .gov domains
✅ Census Bureau - Complete government unit list
✅ NCES - Dedicated school district data
✅ Pattern Matching - Vendor-neutral URL discovery

The "Finder & Fixer" is now powered entirely by official, free, public datasets! 🦷✨


Ready to discover 90,000+ government websites using authoritative sources with $0 in API costs! 🚀