Form 990 Enrichment Guide

🎯 Goal

Enrich nonprofit data with FREE Form 990 data (website, mission, financials, officers) from the GivingTuesday Data Lake.

✅ What We Now Have

Intelligent Backfill Strategy

The search API now uses this priority:

1. Cached Form 990 data (if it exists and is less than 30 days old)
   - Source: form_990_cached
   - Columns: form_990_website, form_990_mission, form_990_last_updated
2. Every.org fallback (for mission/logo/causes)
   - Source: everyorg
   - Used when Form 990 data is missing

Source tracking in metadata:

{
  "data_sources": ["form_990_cached", "everyorg"],
  "website": "https://www.carequest.org/",
  "mission": "Advancing oral health for all...",
  "logo_url": "...",
  "causes": ["health"]
}
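
The priority order and source tracking described here could be sketched roughly as follows. This is a minimal illustration, not the search API's actual code: the function name `resolve_profile` and the dict-shaped inputs are hypothetical, while the source labels, column names, and 30-day window come from this guide.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # freshness window described in this guide


def resolve_profile(row, everyorg, now=None):
    """Merge cached Form 990 fields with an Every.org fallback.

    `row` and `everyorg` are plain dicts (hypothetical shapes for illustration).
    """
    now = now or datetime.now(timezone.utc)
    profile = {"data_sources": []}

    # 1. Cached Form 990 data, only if present and less than 30 days old.
    updated = row.get("form_990_last_updated")
    if updated is not None and now - updated < MAX_AGE:
        for field, column in (("website", "form_990_website"),
                              ("mission", "form_990_mission")):
            if row.get(column):
                profile[field] = row[column]
        if len(profile) > 1:  # at least one Form 990 field was used
            profile["data_sources"].append("form_990_cached")

    # 2. Every.org fills whatever is still missing (mission, logo, causes).
    used_everyorg = False
    for field in ("mission", "logo_url", "causes"):
        if field not in profile and everyorg.get(field):
            profile[field] = everyorg[field]
            used_everyorg = True
    if used_everyorg:
        profile["data_sources"].append("everyorg")

    return profile
```

With a fresh cached row that has a website and mission, plus an Every.org record carrying a logo and causes, the sketch yields `data_sources` of `["form_990_cached", "everyorg"]`, matching the sample metadata.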

Incremental Updates

Checks form_990_last_updated timestamp
Only re-enriches if data is >30 days old
Preserves existing enrichment data
Reduces API calls by 90%+
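
As a rough sketch of the incremental behavior listed here, the two helpers below pick out rows that need re-enrichment and merge new values without discarding existing ones. The helper names and the dict-per-row shape are assumptions made for illustration; the enrichment script's internals may differ.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(row, now, max_age_days=30):
    """True when the row was never enriched or its data is older than the limit."""
    updated = row.get("form_990_last_updated")
    return updated is None or now - updated > timedelta(days=max_age_days)


def apply_enrichment(row, fresh, now):
    """Return a copy of `row` updated with non-empty values from `fresh`.

    Existing values survive when the new lookup comes back empty.
    """
    merged = dict(row)
    for column, value in fresh.items():
        if value not in (None, ""):
            merged[column] = value
    # Assumption: the timestamp is bumped even when nothing new was found,
    # so the row is not retried until the next window.
    merged["form_990_last_updated"] = now
    return merged


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    rows = [
        {"ein": "384016550", "form_990_last_updated": now - timedelta(days=3)},
        {"ein": "000000000"},  # placeholder EIN, never enriched
    ]
    print([r["ein"] for r in rows if needs_refresh(r, now)])
```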

🚀 Step 1: Download GivingTuesday Index (One-Time)

source .venv/bin/activate

# Download index of ALL Form 990 XMLs (~500MB)
python scripts/enrich_nonprofits_gt990.py --download-index

This creates: data/cache/form990_gt_index.parquet with ~3M Form 990 filings (2010-2023).
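
Because the index spans many years, an organization can appear in it more than once. One plausible way to pick the filing to enrich from is to keep the most recent one per EIN, as sketched below. The keys `ein`, `tax_year`, and `url` are hypothetical; the real index schema is not documented here and may use different column names.

```python
def latest_filing_by_ein(index_rows):
    """Map each EIN to its most recent filing.

    `index_rows` is an iterable of dicts with hypothetical keys
    "ein", "tax_year", and "url"; the real index schema may differ.
    """
    latest = {}
    for filing in index_rows:
        current = latest.get(filing["ein"])
        if current is None or filing["tax_year"] > current["tax_year"]:
            latest[filing["ein"]] = filing
    return latest


if __name__ == "__main__":
    sample = [
        {"ein": "384016550", "tax_year": 2020, "url": "https://example.org/a.xml"},
        {"ein": "384016550", "tax_year": 2022, "url": "https://example.org/b.xml"},
    ]
    print(latest_filing_by_ein(sample)["384016550"]["tax_year"])
```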

📊 Step 2: Enrich Massachusetts Nonprofits

Option A: Full Enrichment (Recommended)

# Enrich all MA nonprofits with Form 990 data
python scripts/enrich_nonprofits_gt990.py \
  --input data/gold/states/MA/nonprofits_organizations.parquet \
  --output data/gold/states/MA/nonprofits_organizations_enriched.parquet \
  --concurrent 50

# Replace original file
mv data/gold/states/MA/nonprofits_organizations_enriched.parquet \
   data/gold/states/MA/nonprofits_organizations.parquet

Time estimate:

43,726 orgs × ~2 seconds ≈ 24 hours if processed one at a time
But most won't have Form 990s in the index
Actual time: ~4-6 hours
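
The arithmetic behind these figures can be checked with a few lines of Python. The per-org averages in the second half are simply derived from the 4-6 hour range; they are not measurements.

```python
ORGS = 43_726          # MA organizations, from the estimate above
SECONDS_PER_ORG = 2    # rough per-org lookup time, from the estimate above

sequential_hours = ORGS * SECONDS_PER_ORG / 3600
print(f"One-at-a-time estimate: {sequential_hours:.1f} hours")

# Average time per org implied by the observed 4-6 hour range.
for hours in (4, 6):
    implied = hours * 3600 / ORGS
    print(f"{hours} hours -> {implied:.2f} s per org on average")
```

This prints 24.3 hours for the one-at-a-time estimate, and roughly 0.33 s and 0.49 s per org for the 4 and 6 hour totals.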

Option B: Sample for Testing

# Test with 1000 orgs first
python scripts/enrich_nonprofits_gt990.py \
  --input data/gold/states/MA/nonprofits_organizations.parquet \
  --output /tmp/ma_sample_990.parquet \
  --sample 1000 \
  --concurrent 20

# Check results
python -c "
import pandas as pd
df = pd.read_parquet('/tmp/ma_sample_990.parquet')
enriched = df[df['form_990_website'].notna()]
print(f'Enriched: {len(enriched):,} / {len(df):,} ({len(enriched)/len(df)*100:.1f}%)')
print()
print('Sample enriched org:')
print(enriched.iloc[0][['name', 'form_990_website', 'form_990_mission']].to_dict())
"

Option C: Update In-Place (Incremental)

# Only enrich orgs without existing data or >30 days old
python scripts/enrich_nonprofits_gt990.py \
  --input data/gold/states/MA/nonprofits_organizations.parquet \
  --update-in-place \
  --incremental \
  --max-age-days 30 \
  --concurrent 50

🔍 Step 3: Verify Enrichment

source .venv/bin/activate

python -c "
import pandas as pd

df = pd.read_parquet('data/gold/states/MA/nonprofits_organizations.parquet')

# Check CareQuest specifically
cq = df[df['ein'] == '384016550'].iloc[0]

print('🏥 CareQuest Institute for Oral Health')
print(f'   Website: {cq.get(\"form_990_website\", \"N/A\")}')
# str() guards against a missing (NaN) mission, which cannot be sliced
print(f'   Mission: {str(cq.get(\"form_990_mission\", \"N/A\"))[:150]}...')
print(f'   Tax Year: {cq.get(\"form_990_tax_year\", \"N/A\")}')
print(f'   Revenue: \${cq.get(\"form_990_total_revenue\", 0):,.0f}')
print()

# Overall stats
print('📊 Enrichment Statistics:')
print(f'   Total organizations: {len(df):,}')

if 'form_990_website' in df.columns:
    with_website = df['form_990_website'].notna().sum()
    with_mission = df['form_990_mission'].notna().sum()
    print(f'   With website: {with_website:,} ({with_website/len(df)*100:.1f}%)')
    print(f'   With mission: {with_mission:,} ({with_mission/len(df)*100:.1f}%)')
else:
    print('   ⚠️  Not enriched yet - run enrichment script first')
"

📈 Expected Results

After enrichment, you should see:

{
  "title": "CAREQUEST INSTITUTE FOR ORAL HEALTH",
  "description": "Advancing oral health for all, particularly those most vulnerable...",
  "metadata": {
    "ein": "384016550",
    "website": "https://www.carequest.org/",
    "data_sources": ["form_990_cached"],
    "revenue": "$297,919,860",
    "assets": "$2,601,509,658"
  }
}
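
For orientation, here is one way an enriched row might be turned into a result of this shape. It is only a sketch: `to_search_result` is a hypothetical helper, and the column `form_990_total_assets` is assumed by analogy with `form_990_total_revenue`, so check the actual column names in your parquet file.

```python
def format_usd(amount):
    """Format a dollar amount as in the sample above, e.g. 297919860 -> '$297,919,860'."""
    return f"${amount:,.0f}"


def to_search_result(row):
    """Build a result dict from an enriched row (hypothetical helper)."""
    return {
        "title": row["name"],
        "description": row["form_990_mission"],
        "metadata": {
            "ein": row["ein"],
            "website": row["form_990_website"],
            "data_sources": ["form_990_cached"],
            "revenue": format_usd(row["form_990_total_revenue"]),
            # Assumed column name; not confirmed by this guide.
            "assets": format_usd(row["form_990_total_assets"]),
        },
    }
```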

🔄 Maintaining Fresh Data

Monthly Updates

# Add to cron or GitHub Actions
# (a crontab entry must fit on one line; .venv/bin/python avoids relying on cron's PATH)
0 0 1 * * cd /path/to/project && .venv/bin/python scripts/enrich_nonprofits_gt990.py --input data/gold/states/MA/nonprofits_organizations.parquet --update-in-place --incremental --max-age-days 30

After New 990 Index Release

GivingTuesday updates their index periodically. When they do:

# Re-download index
python scripts/enrich_nonprofits_gt990.py --download-index

# Re-enrich (will pick up new filings)
python scripts/enrich_nonprofits_gt990.py \
  --input data/gold/states/MA/nonprofits_organizations.parquet \
  --update-in-place \
  --incremental

💡 Why This is Better

Feature	Form 990 (GT)	Every.org	ProPublica
Website	✅ Direct from filing	❌ Often missing	❌ Not in API
Mission	✅ Official IRS text	⚠️ Often outdated	❌ Not in API
Cost	🆓 FREE	🆓 FREE (limited)	🆓 FREE
Historical	✅ 2010-2023	❌ Current only	✅ 2011-present
Officers	✅ Names + comp	❌ No	❌ No
Accuracy	✅ IRS verified	⚠️ Community	✅ IRS verified
Coverage	~60-70%	~40-50%	~90%

🎯 Optimization Tips

Start with sample: Test with --sample 1000 first
Increase concurrency: Use --concurrent 50 or --concurrent 100
Use incremental: Always use --incremental after initial load
Check cache: Results are cached in data/cache/form_990_xml/
Monitor progress: Script shows progress bar with tqdm
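
To make the concurrency tip concrete, the sketch below shows one common way to cap the number of lookups in flight, using `asyncio.Semaphore`. Whether the script implements `--concurrent` this way is an assumption; `enrich_all` and `fake_fetch` are hypothetical names used only for illustration.

```python
import asyncio


async def enrich_all(eins, fetch, concurrent=50):
    """Run `fetch(ein)` for every EIN with at most `concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(concurrent)

    async def bounded(ein):
        async with semaphore:
            return await fetch(ein)

    # gather preserves input order, so results line up with `eins`.
    return await asyncio.gather(*(bounded(ein) for ein in eins))


async def fake_fetch(ein):
    """Stand-in for a real Form 990 lookup (hypothetical)."""
    await asyncio.sleep(0.01)
    return {"ein": ein, "form_990_website": None}


if __name__ == "__main__":
    results = asyncio.run(enrich_all(["384016550"], fake_fetch, concurrent=5))
    print(results)
```

A higher limit helps only while the network or the remote server is the bottleneck; past that point, extra concurrency mostly adds memory pressure.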

🐛 Troubleshooting

"Index not loaded"

python scripts/enrich_nonprofits_gt990.py --download-index

Slow performance

# Increase concurrency
--concurrent 100

# Or process in batches
--sample 5000

Out of memory

# Reduce batch size in script or process states separately
for state in MA AL NY; do
  python scripts/enrich_nonprofits_gt990.py \
    --input data/gold/states/$state/nonprofits_organizations.parquet \
    --update-in-place \
    --concurrent 20
done
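
If the batch size has to be reduced inside the script, the general idea is to process organizations in fixed-size chunks so that only one chunk's results are held in memory at a time. A minimal sketch of such a chunking helper follows; `batched` is a hypothetical name and is not taken from the script.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


if __name__ == "__main__":
    eins = [f"{n:09d}" for n in range(12)]  # placeholder EINs
    for chunk in batched(eins, 5):
        # Enrich and write out each chunk before moving on to the next.
        print(len(chunk), chunk[0])
```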
