Census Bureau Data URL Fix

Problem

The original Census Bureau data URLs returned 404 errors because the Census Bureau changed how its data files are organized.

Solution

Updated URLs (2022 Census of Governments)

The Census Bureau publishes data as ZIP files containing Excel spreadsheets, not direct CSV files.

New URLs:

Required Dependencies

To process Excel files from the Census Bureau, install openpyxl:

pip install openpyxl

How It Works

  1. Downloads the ZIP file from the Census Bureau
  2. Extracts the Excel file (.xlsx) from the ZIP
  3. Converts the Excel file to CSV using pandas
  4. Caches the CSV locally for 7 days
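Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names extract_first_xlsx and xlsx_to_csv are hypothetical, and the sketch assumes each ZIP contains at least one .xlsx file and that pandas is installed alongside openpyxl.

```python
import io
import zipfile
from pathlib import Path


def extract_first_xlsx(zip_bytes: bytes, dest_dir: Path) -> Path:
    """Extract the first .xlsx member of a ZIP archive into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xlsx"):
                # Write under the member's base name only, ignoring any
                # directory components stored inside the archive.
                target = dest_dir / Path(name).name
                target.write_bytes(archive.read(name))
                return target
    raise FileNotFoundError("no .xlsx file found in the archive")


def xlsx_to_csv(xlsx_path: Path, csv_path: Path) -> Path:
    """Convert the first worksheet of an Excel file to CSV."""
    import pandas as pd  # imported lazily; requires pandas and openpyxl

    frame = pd.read_excel(xlsx_path, engine="openpyxl")
    frame.to_csv(csv_path, index=False)
    return csv_path
```

The ZIP bytes for step 1 could come from urllib.request.urlopen(url).read(), with url set to one of the updated Census Bureau URLs.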

Installation

source venv/bin/activate
pip install pyspark delta-spark openpyxl

Usage

python main.py discover-jurisdictions --limit 10

The system will:

  • Download Census ZIP files automatically
  • Extract and convert Excel → CSV
  • Cache for 7 days to avoid re-downloading
  • Process jurisdiction data into Delta Lake
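The 7-day cache check might look something like the sketch below. It assumes freshness is judged by the file's modification time; the real system may instead rely on the date stamp in the cached filename. The name is_cache_fresh is hypothetical.

```python
import time
from pathlib import Path

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, matching the cache window above


def is_cache_fresh(path: Path, ttl_seconds: int = CACHE_TTL_SECONDS) -> bool:
    """Return True if path exists and was modified within ttl_seconds."""
    if not path.is_file():
        return False
    age_seconds = time.time() - path.stat().st_mtime
    return age_seconds < ttl_seconds
```

A caller would download and convert again only when is_cache_fresh returns False for the expected CSV.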

Data Source Reference

Official Page: https://www.census.gov/data/tables/2022/econ/gus/2022-governments.html

Available Tables:

  • Table 2: Local Governments by Type and State
  • Table 5: County Governments by Population-Size Group
  • Table 6: Subcounty General-Purpose Governments
  • Table 8: Special District Governments by Function
  • Table 9: Public School Systems by Type

Update Frequency: The Census of Governments runs every 5 years (2017, 2022, 2027, ...)

Next Update: 2027 Census of Governments


Troubleshooting

Missing openpyxl

ModuleNotFoundError: No module named 'openpyxl'

Fix: pip install openpyxl
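To catch this before a long run, a small pre-flight check can report which modules are missing. This is an optional sketch using only the standard library; the helper name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, missing_modules(["openpyxl", "pandas"]) returns an empty list when both packages are installed in the active environment.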

ZIP Extraction Fails

Check the available disk space for the data/cache/census/ directory.
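A quick diagnostic along these lines can help separate a disk-space problem from a corrupt or incomplete download. It is a sketch using only the standard library; diagnose_zip is a hypothetical helper, and the path of the downloaded ZIP is something you supply yourself.

```python
import shutil
import zipfile
from pathlib import Path


def diagnose_zip(zip_path: Path, cache_dir: Path) -> dict:
    """Report free disk space and whether zip_path is a readable ZIP."""
    # disk_usage needs an existing path, so walk up to the nearest
    # existing ancestor if the cache directory has not been created yet.
    probe = cache_dir.resolve()
    while not probe.exists():
        probe = probe.parent
    report = {
        "free_bytes": shutil.disk_usage(probe).free,
        "is_zip": zipfile.is_zipfile(zip_path),
        "first_bad_member": None,
    }
    if report["is_zip"]:
        with zipfile.ZipFile(zip_path) as archive:
            # testzip() returns the first corrupt member name, or None.
            report["first_bad_member"] = archive.testzip()
    return report
```

If is_zip is False, the download was probably truncated or returned an error page instead of an archive.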

Still Getting 404

The Census Bureau may have moved files. Check: https://www.census.gov/programs-surveys/gus/data/datasets.html


Alternative: Manual Download

If automated download fails:

  1. Visit: https://www.census.gov/data/tables/2022/econ/gus/2022-governments.html
  2. Download ZIP files manually
  3. Extract the Excel files and convert them to CSV
  4. Place the CSV files in data/cache/census/ as:
    • counties_20260421.csv
    • municipalities_20260421.csv
    • etc.

The system will use cached files automatically.
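Before relying on manually placed files, it may help to confirm that their names match the pattern shown above. The sketch below assumes the numeric suffix is a YYYYMMDD date stamp, which the examples suggest but the source does not state; parse_cache_name is a hypothetical helper.

```python
import re
from datetime import datetime

# Assumed pattern: <dataset>_<YYYYMMDD>.csv, inferred from the examples above.
CACHE_NAME = re.compile(r"^(?P<dataset>.+)_(?P<stamp>\d{8})\.csv$")


def parse_cache_name(filename: str):
    """Return (dataset, date) for a cache filename, or None if it does not match."""
    match = CACHE_NAME.match(filename)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match["stamp"], "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None
    return match["dataset"], stamp
```

A filename that returns None here likely needs renaming before the system will recognize it.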