Form 990 XML Data (GivingTuesday Data Lake)
Extract detailed financial data from IRS Form 990 XML filings using GivingTuesday's 990 Data Infrastructure.
🎯 Overview
Current data: IRS EO-BMF CSV files (basic info - name, EIN, address, NTEE code)
Enhancement: Form 990 XML filings from GivingTuesday Data Lake (detailed financials - revenue, expenses, programs, grants)
What is the GivingTuesday 990 Data Lake?
The 990 Data Infrastructure is a collaborative data lake of clean, standardized 990 data in XML format maintained by GivingTuesday. This is the rawest form of 990 data in GivingTuesday's infrastructure.
Data Lake Structure:
- Bucket:
gt990datalake-rawdata(AWS S3, us-east-1 Virginia) - Access: Public, no AWS credentials required (
--no-sign-request) - E-filed 990s:
EfileData/XmlFiles/(individual XML returns) - Indices:
Indices/990xmls/(CSV files listing all available 990s)
Console Access: https://us-east-1.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata
What's the Difference?
| Data Source | Type | Records | Data Richness | Access Method | Best For |
|---|---|---|---|---|---|
| EO-BMF CSV ✅ Currently using | Basic registry | 1.9M+ | ⭐ Low | Direct download | Initial org list |
| Google BigQuery ⚡ Recommended | SQL queries | 5M+ | ⭐⭐⭐⭐⭐ High | SQL (serverless) | Bulk mission/website extraction |
| GivingTuesday Data Lake 🚀 Advanced | XML files | 5.4M+ | ⭐⭐⭐⭐⭐ Very High | S3 download | Detailed parsing, custom fields |
📊 What Additional Data You Can Get
From Form 990 XML:
- Financials: Total revenue, program revenue, contributions, grants, investment income
- Expenses: Total expenses, program expenses, administrative, fundraising
- Assets: Total assets, liabilities, net assets
- Programs: Program service descriptions, accomplishments, expenses per program
- Governance: Board members, officer compensation, key employees
- Grants: Grants awarded, grant recipients
- Mission: Detailed mission statement and program descriptions
- Activities: Legislative activities, political expenditures, lobbying
Example: Instead of just knowing "Alabama Oral Health Foundation exists," you get:
- Revenue: $2.5M
- Program expenses: $1.8M
- Grants awarded: $500K to 10 community health centers
- Mission: "Improve oral health access in underserved communities"
- Officers: CEO Sarah Johnson ($150K salary)
- Website: https://alabamaoralhealth.org
⚡ Google BigQuery (Recommended for Bulk Queries)
Fastest way to enrich 1M+ organizations with missions and websites!
Why BigQuery?
Google Cloud hosts the complete IRS 990 dataset in BigQuery - a serverless SQL database that lets you query 5 million Form 990s in seconds without downloading any files.
Key advantages:
- ✅ No downloads: Query directly in the cloud
- ✅ Blazing fast: Bulk queries complete in <30 seconds
- ✅ Free tier: First 1 TB/month is free (enough for most research)
- ✅ SQL interface: Easy to extract specific fields
- ✅ No infrastructure: Serverless, nothing to manage
Cost: Form 990 text fields are small - you can query all 1.9M nonprofits for ~$0 using the free tier.
Quick Start
1. Set up Google Cloud (one-time)
# Install Google Cloud SDK
# Visit: https://cloud.google.com/sdk/docs/install
# Authenticate
gcloud auth login
# Set project
gcloud config set project YOUR_PROJECT_ID
2. Extract mission statements & websites for all Alabama health orgs
-- Query in BigQuery Console or via bq CLI
SELECT
ein,
organization_name,
website_address_txt AS website,
activity_or_mission_desc AS mission,
total_revenue_current_year AS revenue,
total_expenses_current_year AS expenses,
tax_period
FROM `bigquery-public-data.irs_990.irs_990_2023`
WHERE state = 'AL'
AND ntee_code LIKE 'E%'
AND activity_or_mission_desc IS NOT NULL
ORDER BY total_revenue_current_year DESC
LIMIT 10000;
3. Run query from Python
from google.cloud import bigquery
import pandas as pd
# Initialize BigQuery client
client = bigquery.Client()
# Query for Alabama + Michigan health nonprofits with missions
query = """
SELECT
ein,
organization_name,
website_address_txt AS website,
activity_or_mission_desc AS mission,
total_revenue_current_year AS revenue,
total_expenses_current_year AS expenses,
total_assets_eoy AS assets,
state,
tax_period
FROM `bigquery-public-data.irs_990.irs_990_2023`
WHERE state IN ('AL', 'MI')
AND ntee_code LIKE 'E%'
AND total_revenue_current_year > 0
AND activity_or_mission_desc IS NOT NULL
"""
# Execute query and load to DataFrame
df = client.query(query).to_dataframe()
print(f"Retrieved {len(df):,} organizations with missions")
# Clean XML tags from mission text
import re
df['mission_clean'] = df['mission'].str.replace(r'<[^>]+>', '', regex=True).str.strip()
# Save locally
df.to_parquet('data/gold/nonprofits_990_bigquery.parquet')
4. Merge with existing nonprofit data
import pandas as pd
# Load your existing nonprofit data (from IRS EO-BMF)
orgs = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
print(f"Existing orgs: {len(orgs):,}")
# Load BigQuery results with missions & websites
bq_data = pd.read_parquet('data/gold/nonprofits_990_bigquery.parquet')
print(f"BigQuery results: {len(bq_data):,}")
# Merge on EIN
enriched = orgs.merge(
bq_data[['ein', 'mission_clean', 'website', 'revenue', 'expenses', 'assets']],
on='ein',
how='left',
suffixes=('', '_990')
)
# Fill missing data from 990 fields
if 'mission' not in enriched.columns:
enriched['mission'] = enriched['mission_clean']
if 'website' not in enriched.columns:
enriched['website'] = enriched['website_990']
# Show enrichment stats
missions_added = enriched['mission'].notna().sum()
websites_added = enriched['website'].notna().sum()
print(f"✅ Missions: {missions_added:,} ({100*missions_added/len(enriched):.1f}%)")
print(f"✅ Websites: {websites_added:,} ({100*websites_added/len(enriched):.1f}%)")
# Save enriched dataset
enriched.to_parquet('data/gold/nonprofits_enriched_bigquery.parquet')
print(f"💾 Saved {len(enriched):,} enriched organizations")
Expected results:
- 30-50% of orgs will have missions (larger orgs file Form 990)
- 20-40% will have websites listed
- 100% will have EIN matching for revenue/expense data
BigQuery Table Structure
The IRS 990 dataset is organized into multiple tables matching Form 990 schedules:
Master Index Table
bigquery-public-data.irs_990.irs_990_index
- Links all returns together
- Fields:
ein,organization_name,tax_period,return_id(foreign key)
Main Return Tables (by year)
bigquery-public-data.irs_990.irs_990_YYYY - Full Form 990
- Mission:
activity_or_mission_desc⭐ - Website:
website_address_txt⭐ - Financials:
total_revenue_current_year,total_expenses_current_year - Assets:
total_assets_eoy,total_liabilities_eoy - State/NTEE:
state,ntee_code
bigquery-public-data.irs_990.irs_990_ez_YYYY - Form 990-EZ (smaller orgs)
- Mission:
mission_description⭐ - Website:
website_address_txt⭐ - Financials:
total_revenue,total_expenses
bigquery-public-data.irs_990.irs_990_pf_YYYY - Form 990-PF (Private Foundations)
- Grants: Largest grants awarded (for grantmakers)
- Financials: Foundation-specific fields
Schedule Tables (Detailed Information)
bigquery-public-data.irs_990.irs_990_schedule_a_YYYY
- Public charity status and public support calculations
bigquery-public-data.irs_990.irs_990_schedule_j_YYYY
- Executive compensation (CEO, CFO, board member salaries) 💰
bigquery-public-data.irs_990.irs_990_schedule_r_YYYY
- Related organizations and transactions
Complete Field Mapping
| Data Point | Table Name | Field Name | Notes |
|---|---|---|---|
| Organization Name | irs_990_index | organization_name | Master list |
| EIN (Primary Key) | All tables | ein | 9-digit ID |
| Mission (990-EZ) | irs_990_ez_YYYY | mission_description | Smaller orgs |
| Mission (Full 990) | irs_990_YYYY | activity_or_mission_desc | Larger orgs |
| Website URL | irs_990_YYYY, irs_990_ez_YYYY | website_address_txt | Both forms |
| Total Revenue | irs_990_YYYY | total_revenue_current_year | Annual revenue |
| Total Expenses | irs_990_YYYY | total_expenses_current_year | Annual expenses |
| Program Expenses | irs_990_YYYY | program_service_revenue | Program revenue |
| Assets | irs_990_YYYY | total_assets_eoy | End of year |
| Liabilities | irs_990_YYYY | total_liabilities_eoy | End of year |
| Executive Salaries | irs_990_schedule_j_YYYY | Compensation fields | CEO, CFO pay |
| Grants Paid | irs_990_pf_YYYY | Grant fields | For foundations |
| Tax Period | All tables | tax_period | YYYYMMDD format |
| State | irs_990_YYYY | state | 2-letter code |
Query Examples for All Key Fields
Extract mission, website, AND revenue from both 990 and 990-EZ:
-- Combine Full 990 and 990-EZ for complete coverage
WITH full_990 AS (
SELECT
ein,
activity_or_mission_desc AS mission,
website_address_txt AS website,
total_revenue_current_year AS revenue,
total_expenses_current_year AS expenses,
'990' AS form_type
FROM `bigquery-public-data.irs_990.irs_990_2023`
WHERE state IN ('AL', 'MI')
AND activity_or_mission_desc IS NOT NULL
),
ez_990 AS (
SELECT
ein,
mission_description AS mission,
website_address_txt AS website,
total_revenue AS revenue,
total_expenses AS expenses,
'990-EZ' AS form_type
FROM `bigquery-public-data.irs_990.irs_990_ez_2023`
WHERE state IN ('AL', 'MI')
AND mission_description IS NOT NULL
)
SELECT * FROM full_990
UNION ALL
SELECT * FROM ez_990
ORDER BY revenue DESC;
Add executive compensation:
-- Get mission + CEO salary
SELECT
f.ein,
f.activity_or_mission_desc AS mission,
f.website_address_txt AS website,
f.total_revenue_current_year AS revenue,
j.compensation_amount AS ceo_compensation
FROM `bigquery-public-data.irs_990.irs_990_2023` f
LEFT JOIN `bigquery-public-data.irs_990.irs_990_schedule_j_2023` j
ON f.ein = j.ein
WHERE f.state = 'AL'
AND j.title_txt LIKE '%CEO%' OR j.title_txt LIKE '%President%'
ORDER BY revenue DESC
LIMIT 100;
Data Cleaning Tips
The catch: Some fields have messy XML tags embedded (like <MissionDesc>). Clean with regex:
import re
import pandas as pd
# Clean XML tags from missions
df['mission_clean'] = df['mission'].str.replace(r'<[^>]+>', '', regex=True)
# Trim whitespace
df['mission_clean'] = df['mission_clean'].str.strip()
# Remove common artifacts
df['mission_clean'] = df['mission_clean'].str.replace(r'\s+', ' ', regex=True)
Cost Estimation
Free tier: 1 TB of queries per month (resets monthly)
Typical query costs:
- Extract missions for 1M orgs: ~50 GB scanned = FREE
- Extract all fields for 1M orgs: ~200 GB scanned = FREE
- Full table scan of all years: ~2 TB = $10 (one-time cost)
Tip: Use WHERE clauses to filter by state/NTEE to reduce data scanned.
🚀 GivingTuesday Data Lake (For Advanced Parsing)
Option 1: Via AWS Console (Free Account)
- Visit aws.amazon.com and create a free AWS account (requires CC for validation, but no charges for accessing this data)
- Log in to AWS Console
- Open the data lake: https://s3.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata/?region=us-east-1&tab=objects
Option 2: Via Command Line (Recommended for Automation)
Prerequisites: Install AWS CLI
# List main bucket contents
aws s3 ls gt990datalake-rawdata --no-sign-request
# List indices (CSV files listing all 990s)
aws s3 ls gt990datalake-rawdata/Indices/990xmls/ --no-sign-request
# Download the latest index
aws s3 cp \
s3://gt990datalake-rawdata/Indices/990xmls/index_all_years_efiledata_xmls_created_on_2023-10-29.csv \
data/cache/form990_index.csv \
--no-sign-request
# Download a specific 990 XML
aws s3 cp \
s3://gt990datalake-rawdata/EfileData/XmlFiles/[OBJECT_ID]_public.xml \
data/cache/form_990_xml/ \
--no-sign-request
Option 3: Automated Python Integration
Use our enrichment script that automates downloading and parsing:
import boto3
from botocore import UNSIGNED
from botocore.config import Config
import pandas as pd
# Configure S3 client for GivingTuesday Data Lake (no credentials needed)
s3 = boto3.client(
's3',
region_name='us-east-1',
config=Config(signature_version=UNSIGNED)
)
# Download index to find available 990s
index_response = s3.get_object(
Bucket='gt990datalake-rawdata',
Key='Indices/990xmls/index_all_years_efiledata_xmls_created_on_2023-10-29.csv'
)
index_df = pd.read_csv(index_response['Body'])
print(f"Available 990s: {len(index_df):,}")
print(f"Columns: {index_df.columns.tolist()}")
# Find 990s for a specific EIN
ein = "123456789"
org_filings = index_df[index_df['EIN'] == ein]
# Download latest filing
if len(org_filings) > 0:
latest = org_filings.iloc[0]
xml_key = f"EfileData/XmlFiles/{latest['OBJECT_ID']}_public.xml"
xml_obj = s3.get_object(Bucket='gt990datalake-rawdata', Key=xml_key)
xml_content = xml_obj['Body'].read()
# Parse with xmltodict (simplified approach)
import xmltodict
doc = xmltodict.parse(xml_content)
# Extract fields from doc['Return']['ReturnData']['IRS990']
🤖 Automated Enrichment Script
We provide scripts/enrich_nonprofits_gt990.py - a complete automated solution.
Quick Start
Step 1: Download Index (One-Time Setup)
# Install dependencies
pip install boto3 xmltodict pandas pyarrow tqdm loguru
# Download GivingTuesday Data Lake index (~200MB CSV, 1M+ records)
python scripts/enrich_nonprofits_gt990.py --download-index
This creates data/cache/form990_gt_index.parquet for fast EIN→OBJECT_ID lookups.
Step 2: Enrich Your Data
# Enrich all Tuscaloosa nonprofits
python scripts/enrich_nonprofits_gt990.py \
--input data/gold/nonprofits_tuscaloosa.parquet \
--output data/gold/nonprofits_tuscaloosa_form990.parquet \
--concurrent 20
# Enrich Alabama + Michigan health orgs
python scripts/enrich_nonprofits_gt990.py \
--input data/gold/nonprofits_organizations.parquet \
--output data/gold/nonprofits_990_enriched.parquet \
--states AL MI \
--ntee E \
--concurrent 50
# Test with sample
python scripts/enrich_nonprofits_gt990.py \
--input data/gold/nonprofits_organizations.parquet \
--output /tmp/test_990.parquet \
--sample 100
Features
✅ Index-based lookup: Uses OBJECT_ID from GivingTuesday index (no filename guessing)
✅ Async/parallel: Process 20-50 organizations concurrently
✅ Smart caching: JSON cache prevents re-downloading same 990s
✅ Automatic retries: Handles S3 errors gracefully
✅ Progress tracking: tqdm progress bar with ETA
✅ Comprehensive logging: Detailed logs with statistics
Performance
- Speed: ~2-3 sec/org (download + parse)
- Concurrent=20: ~450 orgs/hour
- Concurrent=50: ~1,100 orgs/hour
- 1,000 orgs @ 50% success: ~15-20 minutes
Enriched Fields
The script adds these columns to your DataFrame:
form_990_status # 'found', 'not_found', or 'parse_error'
form_990_tax_year # e.g., 202312
form_990_filing_type # 990, 990EZ, or 990PF
form_990_total_revenue # Total revenue
form_990_total_expenses # Total expenses
form_990_net_income # Revenue - Expenses
form_990_contributions # Donations and grants received
form_990_program_revenue # Revenue from programs/services
form_990_investment_income # Investment income
form_990_program_expenses # Program service expenses
form_990_admin_expenses # Administrative expenses
form_990_fundraising_expenses # Fundraising expenses
form_990_total_assets # Total assets
form_990_total_liabilities # Total liabilities
form_990_net_assets # Assets - Liabilities
form_990_grants_paid # Grants awarded to others
form_990_mission # Mission statement
form_990_last_updated # Timestamp of enrichment
# Filter to your state
alabama_filings = filings_index[filings_index['State'] == 'AL']
print(f"Alabama filings: {len(alabama_filings):,}")
# Sample columns
# EIN, OrganizationName, State, URL, SubmittedOn, TaxPeriod
🔗 Integration with Current Data
Enrich Existing Nonprofits
import pandas as pd
from form990_parser import Form990Parser
import boto3
# Load your current nonprofit data
orgs = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
# Filter to Alabama health organizations (NTEE code E)
health_orgs = orgs[
(orgs['state'] == 'AL') &
(orgs['ntee_code'].str.startswith('E', na=False))
]
print(f"Alabama health nonprofits: {len(health_orgs):,}")
# Enrich with Form 990 data
s3 = boto3.client('s3', region_name='us-east-1')
parser = Form990Parser()
enriched = []
for idx, org in health_orgs.iterrows():
ein = org['ein']
# Try to find most recent 990
try:
# Construct likely S3 key (simplified - actual naming varies)
key = f"{ein}_202312_990.xml"
xml_obj = s3.get_object(Bucket='irs-form-990', Key=key)
xml_content = xml_obj['Body'].read()
filing_data = parser.parse_xml(xml_content)
# Merge with org data
org_enriched = org.to_dict()
org_enriched.update(filing_data)
enriched.append(org_enriched)
except Exception as e:
# No 990 found for this org
continue
enriched_df = pd.DataFrame(enriched)
enriched_df.to_parquet('data/gold/nonprofits_alabama_health_990.parquet')
print(f"Enriched {len(enriched_df):,} organizations with Form 990 data")
💾 Data Schema
Form 990 Parser Output
{
# Basic Info
'ein': '123456789',
'organization_name': 'Alabama Oral Health Foundation',
'tax_year': 2023,
'tax_period': '202312',
# Financials
'total_revenue': 2500000,
'total_expenses': 2100000,
'net_income': 400000,
'total_assets': 5000000,
'total_liabilities': 500000,
# Revenue Breakdown
'contributions_grants': 1200000,
'program_service_revenue': 800000,
'investment_income': 300000,
'other_revenue': 200000,
# Expense Breakdown
'program_expenses': 1800000,
'administrative_expenses': 200000,
'fundraising_expenses': 100000,
# Programs
'program_service_descriptions': [
{
'description': 'Community dental clinics',
'expenses': 1000000,
'grants': 200000
},
{
'description': 'School fluoride programs',
'expenses': 500000,
'grants': 100000
}
],
# Governance
'officers': [
{
'name': 'Sarah Johnson',
'title': 'CEO',
'compensation': 150000
},
{
'name': 'John Smith',
'title': 'CFO',
'compensation': 120000
}
],
# Mission
'mission_statement': 'Improve oral health access in underserved communities...',
'program_accomplishments': 'Served 10,000 patients in 2023...'
}
📈 Performance Considerations
Data Volume
- Form 990 XMLs: ~300,000 new filings per year
- Average XML size: 500KB - 5MB
- Total storage: ~500GB for all historical 990s (2011-present)
Processing Speed
Sequential (current approach):
- Download + parse: ~2-5 seconds per 990
- 300,000 filings × 3 sec = 250 hours 😱
Async parallel (recommended):
- 50 concurrent workers
- 300,000 filings × 3 sec / 50 = 5 hours ⚡
Smart Strategies
-
Filter first: Only download 990s for organizations you care about
# Only health orgs in your stateshealth_eins = orgs[(orgs['state'].isin(['AL', 'MI'])) &(orgs['ntee_code'].str.startswith('E'))]['ein'].tolist()# Result: ~50,000 instead of 300,000 = 1 hour -
Use index files: Download the index first, filter, then fetch XMLs
# Get indexindex = pd.read_json('https://s3.amazonaws.com/irs-form-990/index_2023.json')# Filter to your EINsrelevant = index[index['EIN'].isin(health_eins)]# Only download thesefor url in relevant['URL']:# download and parse -
Cache aggressively: 990s don't change after filing
cache_dir = Path('data/cache/form_990_xml')cache_file = cache_dir / f"{ein}_{tax_year}.parquet"if cache_file.exists():return pd.read_parquet(cache_file)else:# download, parse, cache
🎯 Use Cases
1. Financial Health Analysis
# Which nonprofits are most financially stable?
df['efficiency_ratio'] = df['program_expenses'] / df['total_expenses']
df['reserve_months'] = df['net_assets'] / (df['total_expenses'] / 12)
efficient = df[df['efficiency_ratio'] > 0.75] # >75% on programs
print(f"Efficient organizations: {len(efficient):,}")
2. Grant Research
# Who's giving grants in oral health?
grantmakers = df[
(df['grants_paid'] > 0) &
(df['ntee_code'].str.startswith('E'))
]
print(f"Oral health grantmakers: {len(grantmakers):,}")
print(f"Total grants: ${grantmakers['grants_paid'].sum():,.0f}")
3. Program Discovery
# Find organizations running specific programs
fluoride_programs = df[
df['program_service_descriptions'].str.contains('fluoride', case=False, na=False)
]
print(f"Orgs with fluoride programs: {len(fluoride_programs):,}")
🚀 Next Steps
Quick Test
# Install dependencies
pip install form-990-xml-parser boto3
# Test with single organization
python -c "
from form990_parser import Form990Parser
import boto3
s3 = boto3.client('s3', region_name='us-east-1')
ein = '631307851' # Delta Dental of Alabama (example)
try:
# Try to fetch 2023 filing
key = f'{ein}_202312_990.xml'
obj = s3.get_object(Bucket='irs-form-990', Key=key)
parser = Form990Parser()
data = parser.parse_xml(obj['Body'].read())
print(f'Organization: {data.get(\"organization_name\")}')
print(f'Revenue: \${data.get(\"total_revenue\", 0):,.0f}')
print(f'Assets: \${data.get(\"total_assets\", 0):,.0f}')
except Exception as e:
print(f'Error: {e}')
"
Full Integration
Create a new enrichment pipeline:
# Create new script
python scripts/enrich_nonprofits_form990.py \
--input data/gold/nonprofits_organizations.parquet \
--output data/gold/nonprofits_organizations_990.parquet \
--states AL MI \
--ntee E \
--concurrent 50
📚 Resources
- Giving Tuesday GitHub: https://github.com/Giving-Tuesday
- Form 990 XML Parser: https://github.com/Giving-Tuesday/form-990-xml-parser
- Form 990 XML Mapper: https://github.com/Giving-Tuesday/form-990-xml-mapper
- IRS 990 AWS Bucket: https://registry.opendata.aws/irs990/
- IRS Index Files: https://s3.amazonaws.com/irs-form-990/index_YYYY.json
- 990 Schema Documentation: https://www.irs.gov/e-file-providers/current-valid-xml-schemas-and-business-rules
❓ FAQ
Q: Are we currently downloading XML?
A: No. Currently using EO-BMF CSV files (basic data). Form 990 XML would be an enhancement for detailed financials.
Q: Can we use Giving Tuesday libraries?
A: Yes! They're open source and designed exactly for this purpose. Would provide much richer data.
Q: How much data is it?
A:
- All 990s (2011-present): ~500GB
- Alabama only: ~5GB
- Alabama health orgs: ~500MB
- Very manageable!
Q: What's the license?
A: Public domain (U.S. Government data) + Giving Tuesday tools are open source
Q: Integration effort?
A: Low - can reuse existing async enrichment patterns. Estimated: 1-2 days for initial integration.