State-Split Data Files (Deprecated)
:::warning Deprecated This approach of splitting files into separate state files is deprecated.
Use Partitioned Datasets instead for:
- Same efficiency as separate files
- Ability to query across states
- Better analytics tool support
- Simpler data management :::
All gold parquet files with state information were previously split into state-specific files. This has been replaced by partitioned datasets which offer the same benefits with better queryability.
What Changed
Instead of downloading one massive file with all states:
- ❌
nonprofits_organizations.parquet(72 MB, 1.9M records)
You can now download just the state(s) you need:
- ✅
nonprofits_organizations_AL.parquet(Alabama only, ~1 MB) - ✅
nonprofits_organizations_CA.parquet(California only, ~8 MB) - ✅
nonprofits_organizations_TX.parquet(Texas only, ~6 MB)
Benefits
- Smaller Downloads: Only download the data you need
- Faster Queries: Load and analyze state-specific data faster
- Better Organization: Easier to manage and share state-level datasets
- HuggingFace Friendly: Avoids file size limits, enables state-specific repos
File Structure
State-split files are located in data/gold/by_state/:
data/gold/by_state/
├── nonprofits_organizations_AL.parquet
├── nonprofits_organizations_AK.parquet
├── nonprofits_locations_AL.parquet
├── jurisdictions_cities_AL.parquet
├── jurisdictions_counties_AL.parquet
├── jurisdictions_school_districts_AL.parquet
└── ... (388 total files)
Files That Were Split
Nonprofit Data (62 states/territories each)
nonprofits_organizations_*.parquet- Organization detailsnonprofits_locations_*.parquet- Geographic locations
Jurisdiction Data (52 states each)
jurisdictions_cities_*.parquet- Cities and municipalitiesjurisdictions_counties_*.parquet- Countiesjurisdictions_school_districts_*.parquet- School districtsjurisdictions_townships_*.parquet- Townships
Other Data (56 states each)
domains_gsa_domains_*.parquet- Government domains
Usage
Load Alabama Nonprofits
import pandas as pd
# Load only Alabama data
df = pd.read_parquet('data/gold/by_state/nonprofits_organizations_AL.parquet')
print(f"Alabama nonprofits: {len(df):,}")
Load Multiple States
import pandas as pd
from pathlib import Path
# Load all southeastern states
states = ['AL', 'GA', 'FL', 'MS', 'TN', 'SC', 'NC']
dfs = []
for state in states:
path = f'data/gold/by_state/nonprofits_organizations_{state}.parquet'
df = pd.read_parquet(path)
dfs.append(df)
# Combine into one DataFrame
southeast = pd.concat(dfs, ignore_index=True)
print(f"Southeast nonprofits: {len(southeast):,}")
Recreate Full Dataset
import pandas as pd
from pathlib import Path
# Load all nonprofit organization files
files = Path('data/gold/by_state').glob('nonprofits_organizations_*.parquet')
dfs = [pd.read_parquet(f) for f in files]
# Combine
full_dataset = pd.concat(dfs, ignore_index=True)
print(f"All nonprofits: {len(full_dataset):,}")
Managing State Splits
Create/Update State Splits
# Split all files by state
python scripts/split_gold_by_state.py --all
# Split specific file
python scripts/split_gold_by_state.py --file nonprofits_organizations.parquet
# Dry run (see what would happen)
python scripts/split_gold_by_state.py --all --dry-run
# View statistics
python scripts/split_gold_by_state.py --stats
Upload to HuggingFace
Upload state-specific datasets to HuggingFace for public access:
# Upload all states
python scripts/upload_state_splits_to_hf.py --all
# Upload Alabama only
python scripts/upload_state_splits_to_hf.py --state AL
# Upload multiple states
python scripts/upload_state_splits_to_hf.py --states AL AK AZ CA
# Dry run
python scripts/upload_state_splits_to_hf.py --all --dry-run
This creates state-specific repos on HuggingFace:
CommunityOne/one-data-AL- All Alabama dataCommunityOne/one-data-CA- All California dataCommunityOne/one-data-TX- All Texas data
Statistics
Total State-Split Files: 388 files
Total Size: 172 MB
States/Territories: 62 (all US states, DC, territories, military addresses)
File Breakdown:
- 62 nonprofit organization files
- 62 nonprofit location files
- 56 government domain files
- 52 jurisdiction city files
- 52 jurisdiction county files
- 52 jurisdiction school district files
- 52 jurisdiction township files
Notes
- Original monolithic files are still in
data/gold/for backward compatibility - State-split files use standard 2-letter state codes (AL, AK, AZ, etc.)
- Includes US territories: PR, VI, GU, AS, MP
- Includes military addresses: AA, AE, AP
- Some files have fewer states if no data exists for that state