Data Sources Overview

This document covers the official, free, public datasets used by Open Navigator.

📚 Full Citations & Academic References

For complete citations, licenses, and attribution for all data sources, see:

👉 Citations & Data Sources — Includes BibTeX citations, license information, coverage details, and links to original sources.

📊 Data Scale & Coverage

Open Navigator provides comprehensive coverage across the United States:

| Data Type | Count | Coverage |
| --- | --- | --- |
| Government Jurisdictions | 90,000+ | All U.S. local governments |
| Counties | 3,144 | 100% of U.S. counties |
| Municipalities | 19,500+ | Cities, towns, villages |
| Townships | 16,500+ | County subdivisions |
| School Districts | 13,000+ | Complete NCES coverage |
| Nonprofit Organizations | 3,000,000+ | All IRS-registered 501(c) orgs |
| Official .gov Domains | 15,000+ | CISA validated domains |
| States | 50 | All U.S. states + DC |
| Meeting Video Sources | 1,000+ | Cities with full transcripts |

Key Insight: All data sources are 100% free and public - no subscriptions, no API fees, no paywalls.


📂 Data Source Categories

Open Navigator integrates data from six main categories:

  1. Government Jurisdictions - Cities, counties, school districts (this page)
  2. Nonprofit Organizations - IRS Form 990s, charity ratings, transparency data
  3. Ballot Measures & Elections - Propositions, referendums, election results
  4. Public Opinion & Surveys - Scientifically validated survey questions, polling data
  5. Fact-Checking & Verification - Google Fact Check API, FactCheck.org, PolitiFact claim verification
  6. Open Source Projects - Civic tech repositories, community tools, digital public goods

🏛️ Government Jurisdiction Data

1. CISA .gov Domain Master List ⭐ Most Authoritative

Source: Cybersecurity and Infrastructure Security Agency (CISA)
URL: https://github.com/cisagov/dotgov-data
File: current-full.csv (updated daily!)

What It Contains:

  • 15,000+ registered .gov domains
  • Domain Type: City, County, State, Tribal, School District
  • Organization names and locations
  • Security contacts and registration dates

Why We Use It:

"The most authoritative source for government URLs is CISA. They maintain a daily-updated repository of every registered .gov domain."

How We Use It:

# Direct download from GitHub
import asyncio
from discovery.gsa_domains import GSADomainList

gsa = GSADomainList()
# download_domain_list() is a coroutine, so run it with asyncio.run()
domains_df = asyncio.run(gsa.download_domain_list())

Lakehouse Strategy:

  1. Ingest to Bronze Layer (bronze/gov_domains)
  2. Filter by Domain Type for targeted scraping (City, County)
  3. Use for exact matching (confidence: 0.95-1.0)
  4. Use for fuzzy matching with 75%+ similarity
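
Steps 3 and 4 can be sketched as follows. This is a minimal illustration built on Python's standard `difflib`, not the project's actual matcher: the function names, the normalization rule, and the use of the similarity ratio as the fuzzy confidence score are assumptions, and the sample registry is made up. Only the exact-match confidence band and the 75% threshold come from the strategy above.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.75  # "75%+ similarity" from the strategy above


def normalize(name: str) -> str:
    """Lowercase and keep only letters and digits (hypothetical normalization)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_domain(jurisdiction: str, org_names: dict):
    """Return (domain, confidence, kind) for the best match, or None.

    org_names maps a .gov domain to the organization name registered for it.
    """
    target = normalize(jurisdiction)
    best = None
    for domain, org in org_names.items():
        candidate = normalize(org)
        if candidate == target:
            # Exact match: the top of the 0.95-1.0 confidence band.
            return domain, 1.0, "exact"
        ratio = SequenceMatcher(None, target, candidate).ratio()
        if ratio >= FUZZY_THRESHOLD and (best is None or ratio > best[1]):
            best = (domain, ratio, "fuzzy")
    return best


# Made-up sample registry for demonstration only.
registry = {
    "springfield.example.gov": "City of Springfield",
    "cookcounty.example.gov": "Cook County",
}
print(match_domain("City of Springfield", registry))
print(match_domain("Cook Cnty", registry))
```

Returning `None` below the threshold keeps weak matches out of the results, so those jurisdictions can fall through to pattern-based discovery instead.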

2. U.S. Census Bureau - Government Integrated Directory (GID)

Source: U.S. Census Bureau, Government Statistics
URL: https://www.census.gov/programs-surveys/gus.html
Dataset: 2022 Census of Governments

What It Contains:

  • 90,735 total government units
    • 3,143 counties
    • 19,495 municipalities (cities/towns)
    • 16,504 townships
    • 13,051 school districts
    • 38,542 special districts
  • FIPS codes (standardized IDs)
  • Population data
  • Geographic hierarchy (state, county, place)

Why We Use It:

"The Census Bureau GID provides a list of all 90,000+ legal government units. You can join this against the CISA list to find 'missing' URLs that your agent needs to hunt for."

How We Use It:

import asyncio
from discovery.census_ingestion import CensusGovernmentIngestion

census = CensusGovernmentIngestion()
# ingest_all_jurisdictions() is a coroutine, so run it with asyncio.run()
dfs = asyncio.run(census.ingest_all_jurisdictions())

Lakehouse Strategy:

  1. Ingest to Bronze Layer (bronze/jurisdictions/{type})
  2. Create unified view with all jurisdiction types
  3. Join with CISA to identify missing URLs
  4. Prioritize by population for scraping
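
The join in step 3 amounts to an anti-join: keep the Census units that have no counterpart in the CISA list, then order them by population as in step 4. Below is a minimal sketch in plain Python. The field names (`name`, `state`, `population`, `domain`) and the name-plus-state join key are illustrative assumptions rather than the actual Bronze schema, and the sample records are made up.

```python
def find_missing_urls(census_units, cisa_domains):
    """Return Census units with no CISA domain, largest population first."""
    # Build a lookup of (name, state) pairs that already have a .gov domain.
    covered = {
        (row["name"].strip().lower(), row["state"].upper())
        for row in cisa_domains
    }
    missing = [
        unit for unit in census_units
        if (unit["name"].strip().lower(), unit["state"].upper()) not in covered
    ]
    # Prioritize by population so the largest jurisdictions are searched first.
    return sorted(missing, key=lambda unit: unit["population"], reverse=True)


# Made-up sample records for demonstration only.
census_units = [
    {"name": "Springfield", "state": "IL", "population": 114000},
    {"name": "Smallville", "state": "KS", "population": 900},
    {"name": "Riverton", "state": "WY", "population": 10700},
]
cisa_domains = [
    {"name": "Springfield", "state": "il", "domain": "springfield.example.gov"},
]

for unit in find_missing_urls(census_units, cisa_domains):
    print(unit["name"], unit["state"], unit["population"])
```

In a Spark pipeline the same idea would typically be expressed as a left anti join on the normalized key.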

3. NCES Common Core of Data (CCD)

Source: National Center for Education Statistics (NCES)
URL: https://nces.ed.gov/ccd/
Dataset: Local Education Agency (LEA) Universe Survey

What It Contains:

  • 13,000+ school districts
  • Official district names and NCES IDs
  • Physical addresses and phone numbers
  • Website URLs (when available)
  • Enrollment and demographic data
  • District type (Regular, Charter, etc.)

Why We Use It:

"Since one of your goals is tracking school dental screenings, you need a dedicated list of school board domains, as these are often separate from city governments."

How We Use It:

import asyncio
from discovery.nces_ingestion import NCESSchoolDistrictIngestion

nces = NCESSchoolDistrictIngestion()
# ingest_school_districts() is a coroutine, so run it with asyncio.run()
districts_df = asyncio.run(nces.ingest_school_districts())

Lakehouse Strategy:

  1. Ingest to Bronze Layer (bronze/nces_school_districts)
  2. Extract provided URLs (many NCES records include website field!)
  3. Use district names to generate URL patterns for missing sites
  4. Common pattern: {district}.k12.{state}.us
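
Steps 3 and 4 could look like the sketch below. The slug rules (dropping words such as "school" and "district", stripping punctuation) and the function name `k12_candidates` are assumptions for illustration. Real district hostnames vary widely, so every candidate still has to be verified before it is used.

```python
import re

# Words that rarely appear in district hostnames (an assumption for this sketch).
STOP_WORDS = {"school", "schools", "district", "public", "unified", "independent"}


def k12_candidates(district_name: str, state_abbr: str) -> list:
    """Build candidate hostnames following the {district}.k12.{state}.us pattern."""
    words = re.findall(r"[a-z0-9]+", district_name.lower())
    # Fall back to every word if the name consists only of stop words.
    core = [w for w in words if w not in STOP_WORDS] or words
    if not core:
        return []
    state = state_abbr.lower()
    slugs = ["".join(core), "-".join(core)]
    # dict.fromkeys removes duplicates while keeping the original order.
    return [f"{slug}.k12.{state}.us" for slug in dict.fromkeys(slugs)]


print(k12_candidates("Maple Valley Public Schools", "MI"))
print(k12_candidates("Ames School District", "IA"))
```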

📋 Summary Table: Where to Pull the Lists

| Jurisdiction Type | Primary Free Source | Format | Coverage |
| --- | --- | --- | --- |
| All Official .gov | CISA dotgov-data | CSV / GitHub | 15,000+ domains |
| School Districts | NCES CCD Data | CSV | 13,000+ districts |
| Counties/Cities | Census Bureau GID | CSV | 22,638 jurisdictions |
| Townships | Census Bureau GID | CSV | 16,504 townships |
| Special Districts | Census Bureau GID | CSV | 38,542 districts |
| State Legislatures | LegiScan API | JSON / API | 50 states |

🔍 Scraping Strategy (Based on Your Guidance)

Step 1: Ingest

python main.py init # Initialize Delta Lake
python main.py discover-jurisdictions --limit 100 # Test run

Pulls:

  • ✅ current-full.csv from CISA → Bronze layer
  • ✅ Census GID CSVs → Bronze layer
  • ✅ NCES CCD data → Bronze layer

Step 2: Filter

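from pyspark.sql.functions import col

# Assumes `spark` is an active SparkSession configured for Delta Lake.
# Read the Bronze table that feeds the Silver layer
df = spark.read.format("delta").load("bronze/gov_domains")

# Filter for local governments
local_govs = df.filter(
    col("Domain Type").isin(["City", "County", "School District"])
)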

Result: ~8,000-10,000 high-priority targets

Step 3: Crawl

python main.py scrape-batch --source discovered --limit 50

Points Scrapy agents at discovered URLs:

  • Homepage URLs from CISA + pattern matching
  • Verified with HTTP HEAD/GET requests
  • Prioritized by population and domain type
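
The verification step can be sketched with the standard library alone: send a HEAD request first and fall back to GET when the server rejects HEAD. The function names, the 10-second timeout, and the set of status codes that trigger the fallback are assumptions; the project's Scrapy agents may handle this differently. The demonstration at the end uses a stand-in fetcher so it runs without network access.

```python
import urllib.error
import urllib.request


def http_status(method: str, url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for one request, or 0 on network failure."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, and similar


def is_live(url: str, fetch=http_status) -> bool:
    """Check a URL with HEAD, retrying with GET if HEAD is not supported."""
    status = fetch("HEAD", url)
    if status in (403, 405, 501):  # some servers reject HEAD outright
        status = fetch("GET", url)
    return 200 <= status < 400


# Offline demonstration with a stand-in fetcher that rejects HEAD.
def fake_fetch(method, url):
    return 405 if method == "HEAD" else 200


print(is_live("https://example.gov", fetch=fake_fetch))
```

Passing the fetcher as a parameter keeps the HEAD-then-GET logic testable without touching the network; calling `is_live(url)` with no second argument performs real requests.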

Step 4: Keyword Hunt

Agent searches for:

  • "Minutes" pages
  • "Agendas" pages
  • "Meetings" pages
  • "Water" + "Fluoride" content
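
One way to implement the keyword hunt is to scan each page's links and text for the terms above. The sketch below uses the standard-library `html.parser`. The class name, the decision to match singular forms, and the rule that a page is flagged only when both "water" and "fluoride" appear are assumptions made for illustration, and the sample page is made up.

```python
from html.parser import HTMLParser

LINK_KEYWORDS = ("minutes", "agendas", "meetings")


class KeywordHunter(HTMLParser):
    """Collect links whose text or href mentions a meeting keyword."""

    def __init__(self):
        super().__init__()
        self.hits = []      # (keyword, href) pairs
        self.text = []      # every text node, for the topic check
        self._href = None   # href of the <a> tag currently open
        self._label = []    # text seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._label = []

    def handle_data(self, data):
        self.text.append(data)
        if self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            haystack = (" ".join(self._label) + " " + self._href).lower()
            for keyword in LINK_KEYWORDS:
                # Match the singular too, e.g. "Agenda" for "agendas".
                if keyword.rstrip("s") in haystack:
                    self.hits.append((keyword, self._href))
            self._href = None

    def mentions_fluoride(self) -> bool:
        body = " ".join(self.text).lower()
        return "water" in body and "fluoride" in body


# Made-up sample page for demonstration only.
page = """
<a href="/council/agendas">Council Agendas</a>
<a href="/docs/2024-minutes.pdf">January Minutes</a>
<p>The board discussed water fluoride levels.</p>
"""
hunter = KeywordHunter()
hunter.feed(page)
print(hunter.hits)
print(hunter.mentions_fluoride())
```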

CMS Detection:

  • Granicus
  • CivicClerk
  • Municode
  • Legistar
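
CMS detection can be sketched as a lookup of vendor hostnames in a page's URL and HTML. The hostname fragments below are assumptions based on the vendors' public domain names, the function name `detect_cms` is hypothetical, and the sample URLs are made up. A production detector would likely need more signals than this.

```python
# Hostname fragments assumed to indicate each platform (illustrative only).
CMS_SIGNATURES = {
    "Granicus": ("granicus.com",),
    "CivicClerk": ("civicclerk.com",),
    "Municode": ("municode.com",),
    "Legistar": ("legistar.com",),
}


def detect_cms(url: str, html: str = "") -> list:
    """Return the names of platforms whose signature appears in url or html."""
    haystack = f"{url}\n{html}".lower()
    return [
        name
        for name, fragments in CMS_SIGNATURES.items()
        if any(fragment in haystack for fragment in fragments)
    ]


# Made-up sample inputs for demonstration only.
print(detect_cms("https://example.legistar.com/Calendar.aspx"))
print(detect_cms(
    "https://www.example.gov/meetings",
    '<iframe src="https://example.granicus.com/ViewPublisher.php"></iframe>',
))
```

Knowing the platform lets the crawler pick a platform-specific extraction routine instead of relying on generic link discovery.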

🚀 Non-.gov Coverage

Many smaller municipalities use non-.gov domains:

  • .org (e.g., cityofsomewhere.org)
  • .us (e.g., somewhere.ca.us)
  • .net (e.g., districtschools.net)

Our URL patterns cover these:

# Pattern generation includes:
patterns = [
    "https://cityname.gov",  # Primary
    "https://cityname.us",   # Alternative
    "https://cityname.org",  # Non-profit
    "https://cityname.net",  # Legacy
]


💰 Cost: $0

All data sources are free and publicly available:

| Source | Cost | Update Frequency |
| --- | --- | --- |
| CISA dotgov-data | $0 | Daily |
| Census Bureau GID | $0 | Annual |
| NCES CCD | $0 | Annual |
| Pattern Matching | $0 | On-demand |

Total API costs: $0 🎉

Compared with the deprecated search-API approach:

  • Google Custom Search API: $5 per 1,000 queries = ~$150
  • Bing Search API: $7 per 1,000 queries = ~$90

Savings: $240+ per discovery run


✅ Credits

System Architecture: Medallion Architecture (Bronze → Silver → Gold)
Data Engineering Pattern: Delta Lake + PySpark
Sustainable Approach: No deprecated search APIs
Guidance Source: Professional data engineering best practices

Thank you for the excellent guidance on official data sources! 🙏

This system now uses the exact sources recommended by data engineers to map the U.S. government landscape. 🦷✨