Data Sources Overview
This document covers the official, free, public datasets used by Open Navigator.
For complete citations, licenses, and attribution for all data sources, see:
👉 Citations & Data Sources — Includes BibTeX citations, license information, coverage details, and links to original sources.
📊 Data Scale & Coverage
Open Navigator provides comprehensive coverage across the United States:
| Data Type | Count | Coverage |
|---|---|---|
| Government Jurisdictions | 90,000+ | All U.S. local governments |
| Counties | 3,144 | 100% of U.S. counties |
| Municipalities | 19,500+ | Cities, towns, villages |
| Townships | 16,500+ | Township governments (Census of Governments) |
| School Districts | 13,000+ | Complete NCES coverage |
| Nonprofit Organizations | 3,000,000+ | All IRS-registered 501(c) orgs |
| Official .gov Domains | 15,000+ | CISA validated domains |
| States | 50 + DC | All U.S. states plus the District of Columbia |
| Meeting Video Sources | 1,000+ | Cities with full transcripts |
Key Insight: All data sources are 100% free and public - no subscriptions, no API fees, no paywalls.
📂 Data Source Categories
Open Navigator integrates data from six main categories:
- Government Jurisdictions - Cities, counties, school districts (this page)
- Nonprofit Organizations - IRS Form 990s, charity ratings, transparency data
- Ballot Measures & Elections - Propositions, referendums, election results
- Public Opinion & Surveys - Scientifically validated survey questions, polling data
- Fact-Checking & Verification - Google Fact Check API, FactCheck.org, PolitiFact claim verification
- Open Source Projects - Civic tech repositories, community tools, digital public goods
🏛️ Government Jurisdiction Data
1. CISA .gov Domain Master List ⭐ Most Authoritative
Source: Cybersecurity and Infrastructure Security Agency (CISA)
URL: https://github.com/cisagov/dotgov-data
File: current-full.csv (updated daily!)
What It Contains:
- 15,000+ registered .gov domains
- Domain Type: City, County, State, Tribal, School District
- Organization names and locations
- Security contacts and registration dates
Why We Use It:
"The most authoritative source for government URLs is CISA. They maintain a daily-updated repository of every registered .gov domain."
How We Use It:
# Direct download from GitHub
# Run inside an async function (top-level await is shown for brevity)
from discovery.gsa_domains import GSADomainList
gsa = GSADomainList()
domains_df = await gsa.download_domain_list()
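For readers without the project checkout, the download step can be approximated with the standard library alone. This is a minimal sketch, not the `GSADomainList` implementation: the raw-file URL, the helper names `fetch_domain_csv` and `parse_domain_csv`, and the `Domain Name` / `Domain Type` header spellings are assumptions (the headers follow this page, and the published CSV may use different casing).

```python
import csv
import io
import urllib.request

# Assumed raw-file location of current-full.csv in the cisagov/dotgov-data repo.
CISA_CSV_URL = (
    "https://raw.githubusercontent.com/cisagov/dotgov-data/main/current-full.csv"
)


def fetch_domain_csv(url: str = CISA_CSV_URL, timeout: float = 30.0) -> str:
    """Download the CSV and return its text (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8-sig")


def parse_domain_csv(text: str) -> list:
    """Parse CSV text into one dict per domain, keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


# Offline demonstration with two made-up rows (not real registry data).
sample = (
    "Domain Name,Domain Type,City,State\n"
    "EXAMPLECITY.GOV,City,Exampleville,CA\n"
    "EXAMPLECOUNTY.GOV,County,Exampleville,CA\n"
)
rows = parse_domain_csv(sample)
print([r["Domain Name"] for r in rows if r["Domain Type"] == "City"])
```

Calling `parse_domain_csv(fetch_domain_csv())` would process the live file, provided the assumed URL is still valid.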
Lakehouse Strategy:
- Ingest to Bronze Layer (`bronze/gov_domains`)
- Filter by `Domain Type` for targeted scraping (City, County)
- Use for exact matching (confidence: 0.95-1.0)
- Use for fuzzy matching with 75%+ similarity
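The exact and fuzzy matching steps could look roughly like the sketch below. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the project's actual matcher is not shown on this page, so the function names and the choice to return a flat 1.0 for exact matches are assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional

FUZZY_THRESHOLD = 0.75  # the "75%+ similarity" cutoff from the list above


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def match_confidence(jurisdiction: str, organization: str) -> Optional[float]:
    """Return 1.0 for an exact match, the similarity ratio for a fuzzy
    match at or above the threshold, and None otherwise."""
    a, b = normalize(jurisdiction), normalize(organization)
    if a == b:
        return 1.0
    ratio = SequenceMatcher(None, a, b).ratio()
    return ratio if ratio >= FUZZY_THRESHOLD else None


print(match_confidence("City of Springfield", "city of  springfield"))  # 1.0
print(match_confidence("Springfield Township", "Springfield Twp"))
print(match_confidence("Springfield", "Zebra"))  # None
```

Returning `None` below the threshold keeps weak candidates out of the match table, so they can fall through to URL pattern generation.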
2. U.S. Census Bureau - Government Integrated Directory (GID)
Source: U.S. Census Bureau, Government Statistics
URL: https://www.census.gov/programs-surveys/gus.html
Dataset: 2022 Census of Governments
What It Contains:
- 90,735 total government units
- 3,143 counties
- 19,495 municipalities (cities/towns)
- 16,504 townships
- 13,051 school districts
- 38,542 special districts
- FIPS codes (standardized IDs)
- Population data
- Geographic hierarchy (state, county, place)
Why We Use It:
"The Census Bureau GID provides a list of all 90,000+ legal government units. You can join this against the CISA list to find 'missing' URLs that your agent needs to hunt for."
How We Use It:
# Run inside an async function (top-level await is shown for brevity)
from discovery.census_ingestion import CensusGovernmentIngestion
census = CensusGovernmentIngestion()
dfs = await census.ingest_all_jurisdictions()
Lakehouse Strategy:
- Ingest to Bronze Layer (`bronze/jurisdictions/{{type}}`)
- Create unified view with all jurisdiction types
- Join with CISA to identify missing URLs
- Prioritize by population for scraping
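The "join with CISA, then prioritize by population" idea amounts to an anti-join followed by a sort. The sketch below shows it on plain dictionaries; the field names (`name`, `state`, `population`) and the sample rows are made up for illustration and are not the real Census or CISA schemas.

```python
def join_key(name: str, state: str) -> tuple:
    """Normalize name and state so formatting differences do not block a match."""
    return (" ".join(name.lower().split()), state.upper())


def missing_urls(census_rows: list, cisa_rows: list) -> list:
    """Return Census units with no CISA record, largest population first."""
    known = {join_key(r["name"], r["state"]) for r in cisa_rows}
    missing = [r for r in census_rows if join_key(r["name"], r["state"]) not in known]
    return sorted(missing, key=lambda r: r["population"], reverse=True)


# Made-up rows for illustration only.
census = [
    {"name": "Alpha City", "state": "CA", "population": 12000},
    {"name": "Beta Town", "state": "CA", "population": 85000},
    {"name": "Gamma Village", "state": "NV", "population": 900},
]
cisa = [{"name": "alpha  city", "state": "ca"}]

todo = missing_urls(census, cisa)
print([r["name"] for r in todo])  # ['Beta Town', 'Gamma Village']
```

In the lakehouse itself the same logic would more likely be a left anti-join between the two Bronze tables, ordered by the population column.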
3. NCES Common Core of Data (CCD)
Source: National Center for Education Statistics (NCES)
URL: https://nces.ed.gov/ccd/
Dataset: Local Education Agency (LEA) Universe Survey
What It Contains:
- 13,000+ school districts
- Official district names and NCES IDs
- Physical addresses and phone numbers
- Website URLs (when available)
- Enrollment and demographic data
- District type (Regular, Charter, etc.)
Why We Use It:
"Since one of your goals is tracking school dental screenings, you need a dedicated list of school board domains, as these are often separate from city governments."
How We Use It:
# Run inside an async function (top-level await is shown for brevity)
from discovery.nces_ingestion import NCESSchoolDistrictIngestion
nces = NCESSchoolDistrictIngestion()
districts_df = await nces.ingest_school_districts()
Lakehouse Strategy:
- Ingest to Bronze Layer (`bronze/nces_school_districts`)
- Extract provided URLs (many NCES records include website field!)
- Use district names to generate URL patterns for missing sites
- Common pattern: `{{district}}.k12.{{state}}.us`
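One way to turn a district name into the common `k12` pattern is sketched below. The slug rule (strip everything except lowercase letters and digits) is an assumption, as are the helper names; real district sites often abbreviate their names, so each generated URL is only a candidate that still needs verification.

```python
import re


def slugify(name: str) -> str:
    """Keep only lowercase letters and digits from the name."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def k12_candidate(district: str, state: str) -> str:
    """Build a candidate URL following the district.k12.state.us pattern."""
    return f"https://{slugify(district)}.k12.{state.lower()}.us"


print(k12_candidate("Example Unified", "CA"))  # https://exampleunified.k12.ca.us
```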
📋 Summary Table: Where to Pull the Lists
| Jurisdiction Type | Primary Free Source | Format | Coverage |
|---|---|---|---|
| All Official .gov | CISA dotgov-data | CSV / GitHub | 15,000+ domains |
| School Districts | NCES CCD Data | CSV | 13,000+ districts |
| Counties/Cities | Census Bureau GID | CSV | 22,638 jurisdictions |
| Townships | Census Bureau GID | CSV | 16,504 townships |
| Special Districts | Census Bureau GID | CSV | 38,542 districts |
| State Legislatures | LegiScan API | JSON / API | 50 states |
🔍 Scraping Strategy (Based on Your Guidance)
Step 1: Ingest
python main.py init # Initialize Delta Lake
python main.py discover-jurisdictions --limit 100 # Test run
Pulls:
- ✅ `current-full.csv` from CISA → Bronze layer
- ✅ Census GID CSVs → Bronze layer
- ✅ NCES CCD data → Bronze layer
Step 2: Filter
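from pyspark.sql.functions import col
# Read the Bronze table (assumes an active SparkSession named `spark`)
df = spark.read.format("delta").load("bronze/gov_domains")
# Filter for local governments; this subset feeds the Silver layer table
local_govs = df.filter(
    col("Domain Type").isin(["City", "County", "School District"])
)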
Result: ~8,000-10,000 high-priority targets
Step 3: Crawl
python main.py scrape-batch --source discovered --limit 50
Points Scrapy agents at discovered URLs:
- Homepage URLs from CISA + pattern matching
- Verified with HTTP HEAD/GET requests
- Prioritized by population and domain type
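The HEAD/GET verification step might be implemented along these lines. This is a sketch using `urllib` from the standard library, not the project's Scrapy code; `http_status`, `is_live`, and the injectable `status_fn` parameter are hypothetical names introduced so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

StatusFn = Callable[[str, str], Optional[int]]


def http_status(url: str, method: str, timeout: float = 10.0) -> Optional[int]:
    """Send one request and return the HTTP status, or None if the request fails."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered with a 4xx/5xx status
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, malformed URL


def is_live(url: str, status_fn: StatusFn = http_status) -> bool:
    """Try HEAD first; fall back to GET because some servers reject HEAD."""
    status = status_fn(url, "HEAD")
    if status is None or status >= 400:
        status = status_fn(url, "GET")
    return status is not None and status < 400


# Offline demonstration with a stub standing in for the network.
def stub(url: str, method: str) -> Optional[int]:
    return 405 if method == "HEAD" else 200


print(is_live("https://example.gov", status_fn=stub))  # True
```

HEAD is cheaper because no body is transferred, which matters when checking thousands of candidate URLs; the GET fallback covers servers that answer HEAD with an error.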
Step 4: Keyword Hunt
Agent searches for:
- "Minutes" pages
- "Agendas" pages
- "Meetings" pages
- "Water" + "Fluoride" content
CMS Detection:
- Granicus
- CivicClerk
- Municode
- Legistar
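A simple version of the keyword hunt and CMS detection can be expressed as substring checks over a link's URL and anchor text. The sketch below is an assumption about how such a check might work, not the agent's actual logic: the singular keyword stems, the lowercase CMS markers, and the `classify_link` helper are all illustrative.

```python
# Singular stems so that "agenda" also matches "agendas", and so on.
MEETING_KEYWORDS = ("minute", "agenda", "meeting")
TOPIC_KEYWORDS = ("water", "fluoride")

# Assumed markers: each vendor name appearing somewhere in the URL or text.
CMS_MARKERS = {
    "Granicus": "granicus",
    "CivicClerk": "civicclerk",
    "Municode": "municode",
    "Legistar": "legistar",
}


def classify_link(url: str, link_text: str) -> dict:
    """Report which meeting terms, topic terms, and CMS marker a link contains."""
    haystack = f"{url} {link_text}".lower()
    return {
        "meeting_terms": [k for k in MEETING_KEYWORDS if k in haystack],
        # "Water" + "Fluoride" is treated as requiring both words.
        "topic_match": all(k in haystack for k in TOPIC_KEYWORDS),
        "cms": next(
            (name for name, marker in CMS_MARKERS.items() if marker in haystack),
            None,
        ),
    }


result = classify_link("https://example.legistar.com/Calendar.aspx", "Meeting Agendas")
print(result)
```

Substring matching is deliberately crude; a production crawler would probably also inspect page markup, since hosted CMS platforms are often embedded behind a city's own domain.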
🚀 Non-.gov Coverage
Many smaller municipalities use non-.gov domains:
- `.org` (e.g., `cityofsomewhere.org`)
- `.us` (e.g., `somewhere.ca.us`)
- `.net` (e.g., `districtschools.net`)
Our URL patterns cover these:
# Pattern generation includes:
patterns = [
"https://cityname.gov", # Primary
"https://cityname.us", # Alternative
"https://cityname.org", # Non-profit
"https://cityname.net", # Legacy
]
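To show how a pattern list like this could be filled in from a real name, here is a small sketch. The `candidate_urls` helper and its slug rule are assumptions; the extra `cityof...` and state-level `.us` forms are taken from the example domains earlier in this section rather than from the project's generator.

```python
import re
from typing import Optional

TLDS = ("gov", "us", "org", "net")  # same order as the pattern list above


def candidate_urls(city: str, state: Optional[str] = None) -> list:
    """Generate candidate homepage URLs for a city name."""
    slug = re.sub(r"[^a-z0-9]", "", city.lower())
    urls = [f"https://{slug}.{tld}" for tld in TLDS]
    urls.append(f"https://cityof{slug}.org")
    if state:
        urls.append(f"https://{slug}.{state.lower()}.us")
    return urls


for url in candidate_urls("Somewhere", "CA"):
    print(url)
```

Every candidate still has to pass the HTTP verification step before it is treated as a discovered URL.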
Future Enhancement:
- State and Local Government on the Net
- Could scrape this directory as fallback for missing URLs
- Manually curated list of non-.gov government sites
💰 Cost: $0
All data sources are free and publicly available:
| Source | Cost | Update Frequency |
|---|---|---|
| CISA dotgov-data | $0 | Daily |
| Census Bureau GID | $0 | Annual |
| NCES CCD | $0 | Annual |
| Pattern Matching | $0 | On-demand |
Total API costs: $0 🎉
Compare to deprecated approach:
- Google Custom Search API: $5/1000 queries = ~$150
- Bing Search API: $7/1000 queries = ~$90
Savings: $240+ per discovery run ✅
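As a quick check on the comparison, the query volume implied by each figure can be backed out from the quoted per-1,000-query prices. The function name is illustrative, and the result suggests the two totals assume different query counts, so they are best read as rough estimates.

```python
def implied_queries(total_cost: float, price_per_1000: float) -> float:
    """Number of queries that would produce total_cost at the given price."""
    return total_cost / price_per_1000 * 1000


google_queries = implied_queries(150, 5)
bing_queries = implied_queries(90, 7)

print(google_queries)        # 30000.0
print(round(bing_queries))   # 12857
print(150 + 90)              # 240, the quoted savings per discovery run
```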
📚 References
Government Jurisdiction Data:
- CISA .gov Domains: https://github.com/cisagov/dotgov-data
- Census Bureau GID: https://www.census.gov/programs-surveys/gus.html
- NCES CCD: https://nces.ed.gov/ccd/
- State/Local Gov Directory: https://www.statelocalgov.net/
- LegiScan API: https://legiscan.com/legiscan
Nonprofit Data:
- See Nonprofit Data Sources for ProPublica, Charity Navigator, Candid/GuideStar, and GiveWell
Open Source Projects:
- See Open Source Repositories for civic tech projects, GitHub data, and community tools
✅ Credits
System Architecture: Medallion Architecture (Bronze → Silver → Gold)
Data Engineering Pattern: Delta Lake + PySpark
Sustainable Approach: No deprecated search APIs
Guidance Source: Professional data engineering best practices
This system now uses the exact sources recommended by data engineers to map the U.S. government landscape. 🦷✨