Form 990 XML Data (GivingTuesday Data Lake)

Extract detailed financial data from IRS Form 990 XML filings using GivingTuesday's 990 Data Infrastructure.

🎯 Overview

Current data: IRS EO-BMF CSV files (basic info - name, EIN, address, NTEE code)
Enhancement: Form 990 XML filings from GivingTuesday Data Lake (detailed financials - revenue, expenses, programs, grants)

What is the GivingTuesday 990 Data Lake?

The 990 Data Infrastructure is a collaborative data lake of clean, standardized 990 data in XML format maintained by GivingTuesday. This is the rawest form of 990 data in GivingTuesday's infrastructure.

Data Lake Structure:

Bucket: gt990datalake-rawdata (AWS S3, us-east-1 Virginia)
Access: Public, no AWS credentials required (--no-sign-request)
E-filed 990s: EfileData/XmlFiles/ (individual XML returns)
Indices: Indices/990xmls/ (CSV files listing all available 990s)

Console Access: https://us-east-1.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata

What's the Difference?

Data Source	Type	Records	Data Richness	Access Method	Best For
EO-BMF CSV ✅ Currently using	Basic registry	1.9M+	⭐ Low	Direct download	Initial org list
Google BigQuery ⚡ Recommended	SQL queries	5M+	⭐⭐⭐⭐⭐ High	SQL (serverless)	Bulk mission/website extraction
GivingTuesday Data Lake 🚀 Advanced	XML files	5.4M+	⭐⭐⭐⭐⭐ Very High	S3 download	Detailed parsing, custom fields

📊 What Additional Data You Can Get

From Form 990 XML:

Financials: Total revenue, program revenue, contributions, grants, investment income
Expenses: Total expenses, program expenses, administrative, fundraising
Assets: Total assets, liabilities, net assets
Programs: Program service descriptions, accomplishments, expenses per program
Governance: Board members, officer compensation, key employees
Grants: Grants awarded, grant recipients
Mission: Detailed mission statement and program descriptions
Activities: Legislative activities, political expenditures, lobbying

Example: Instead of just knowing "Alabama Oral Health Foundation exists," you get:

Revenue: $2.5M
Program expenses: $1.8M
Grants awarded: $500K to 10 community health centers
Mission: "Improve oral health access in underserved communities"
Officers: CEO Sarah Johnson ($150K salary)
Website: https://alabamaoralhealth.org

⚡ Google BigQuery (Recommended for Bulk Queries)

Fastest way to enrich 1M+ organizations with missions and websites!

Why BigQuery?

Google Cloud hosts the complete IRS 990 dataset in BigQuery - a serverless SQL database that lets you query 5 million Form 990s in seconds without downloading any files.

Key advantages:

✅ No downloads: Query directly in the cloud
✅ Blazing fast: Bulk queries complete in <30 seconds
✅ Free tier: First 1 TB/month is free (enough for most research)
✅ SQL interface: Easy to extract specific fields
✅ No infrastructure: Serverless, nothing to manage

Cost: Form 990 text fields are small - you can query all 1.9M nonprofits for ~$0 using the free tier.

Quick Start

1. Set up Google Cloud (one-time)

# Install Google Cloud SDK
# Visit: https://cloud.google.com/sdk/docs/install

# Authenticate
gcloud auth login

# Set project
gcloud config set project YOUR_PROJECT_ID

2. Extract mission statements & websites for all Alabama health orgs

-- Query in BigQuery Console or via bq CLI
SELECT 
  ein,
  organization_name,
  website_address_txt AS website,
  activity_or_mission_desc AS mission,
  total_revenue_current_year AS revenue,
  total_expenses_current_year AS expenses,
  tax_period
FROM `bigquery-public-data.irs_990.irs_990_2023`
WHERE state = 'AL'
  AND ntee_code LIKE 'E%'
  AND activity_or_mission_desc IS NOT NULL
ORDER BY total_revenue_current_year DESC
LIMIT 10000;

3. Run query from Python

from google.cloud import bigquery
import pandas as pd

# Initialize BigQuery client
client = bigquery.Client()

# Query for Alabama + Michigan health nonprofits with missions
query = """
SELECT 
  ein,
  organization_name,
  website_address_txt AS website,
  activity_or_mission_desc AS mission,
  total_revenue_current_year AS revenue,
  total_expenses_current_year AS expenses,
  total_assets_eoy AS assets,
  state,
  tax_period
FROM `bigquery-public-data.irs_990.irs_990_2023`
WHERE state IN ('AL', 'MI')
  AND ntee_code LIKE 'E%'
  AND total_revenue_current_year > 0
  AND activity_or_mission_desc IS NOT NULL
"""

# Execute query and load to DataFrame
df = client.query(query).to_dataframe()
print(f"Retrieved {len(df):,} organizations with missions")

# Clean XML tags from mission text
import re
df['mission_clean'] = df['mission'].str.replace(r'<[^>]+>', '', regex=True).str.strip()

# Save locally
df.to_parquet('data/gold/nonprofits_990_bigquery.parquet')

4. Merge with existing nonprofit data

import pandas as pd

# Load your existing nonprofit data (from IRS EO-BMF)
orgs = pd.read_parquet('data/gold/nonprofits_organizations.parquet')
print(f"Existing orgs: {len(orgs):,}")

# Load BigQuery results with missions & websites
bq_data = pd.read_parquet('data/gold/nonprofits_990_bigquery.parquet')
print(f"BigQuery results: {len(bq_data):,}")

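# Merge on EIN (both frames must store EIN the same way, e.g. as 9-digit strings)
enriched = orgs.merge(
    bq_data[['ein', 'mission_clean', 'website', 'revenue', 'expenses', 'assets']],
    on='ein',
    how='left',
    suffixes=('', '_990')
)

# Fill missing data from 990 fields.
# Columns that exist on both sides get the '_990' suffix on the BigQuery side.
if 'mission' in enriched.columns:
    enriched['mission'] = enriched['mission'].fillna(enriched['mission_clean'])
else:
    enriched['mission'] = enriched['mission_clean']
if 'website_990' in enriched.columns:
    enriched['website'] = enriched['website'].fillna(enriched['website_990'])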

# Show enrichment stats
missions_added = enriched['mission'].notna().sum()
websites_added = enriched['website'].notna().sum()
print(f"✅ Missions: {missions_added:,} ({100*missions_added/len(enriched):.1f}%)")
print(f"✅ Websites: {websites_added:,} ({100*websites_added/len(enriched):.1f}%)")

# Save enriched dataset
enriched.to_parquet('data/gold/nonprofits_enriched_bigquery.parquet')
print(f"💾 Saved {len(enriched):,} enriched organizations")

Expected results:

30-50% of orgs will have missions (larger orgs file Form 990)
20-40% will have websites listed
Every org that matches on EIN will also have revenue/expense data (the Python query above keeps only filings with positive revenue)
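The merge on EIN only matches rows when both sides store the EIN identically. CSV columns are often read as integers, which drops leading zeros, so normalizing first may improve the match rate. A minimal sketch; `normalize_ein` is a hypothetical helper, not part of any library mentioned in this guide:

```python
import math
import re

def normalize_ein(value):
    """Return an EIN as a 9-character zero-padded digit string, or None if unusable."""
    if value is None:
        return None
    # Floats such as 12345678.0 appear when a CSV column contains missing values
    if isinstance(value, float):
        if not math.isfinite(value):
            return None
        value = int(value)
    digits = re.sub(r"\D", "", str(value))  # drop hyphens, spaces, etc.
    if not digits or len(digits) > 9:
        return None
    return digits.zfill(9)

print(normalize_ein(12345678))      # 012345678
print(normalize_ein("63-1307851"))  # 631307851
```

Applying it to both frames before merging (for example `orgs['ein'] = orgs['ein'].map(normalize_ein)`) keeps the join key consistent.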

BigQuery Table Structure

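The IRS 990 dataset is organized into multiple tables matching Form 990 schedules. Available years and exact column names vary between tables, so check the dataset schema in the BigQuery console before running the queries in this guide.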

Master Index Table

bigquery-public-data.irs_990.irs_990_index

Links all returns together
Fields: ein, organization_name, tax_period, return_id (foreign key)

Main Return Tables (by year)

bigquery-public-data.irs_990.irs_990_YYYY - Full Form 990

Mission: activity_or_mission_desc ⭐
Website: website_address_txt ⭐
Financials: total_revenue_current_year, total_expenses_current_year
Assets: total_assets_eoy, total_liabilities_eoy
State/NTEE: state, ntee_code

bigquery-public-data.irs_990.irs_990_ez_YYYY - Form 990-EZ (smaller orgs)

Mission: mission_description ⭐
Website: website_address_txt ⭐
Financials: total_revenue, total_expenses

bigquery-public-data.irs_990.irs_990_pf_YYYY - Form 990-PF (Private Foundations)

Grants: Largest grants awarded (for grantmakers)
Financials: Foundation-specific fields

Schedule Tables (Detailed Information)

bigquery-public-data.irs_990.irs_990_schedule_a_YYYY

Public charity status and public support calculations

bigquery-public-data.irs_990.irs_990_schedule_j_YYYY

Executive compensation (CEO, CFO, board member salaries) 💰

bigquery-public-data.irs_990.irs_990_schedule_r_YYYY

Related organizations and transactions

Complete Field Mapping

Data Point	Table Name	Field Name	Notes
Organization Name	`irs_990_index`	`organization_name`	Master list
EIN (Primary Key)	All tables	`ein`	9-digit ID
Mission (990-EZ)	`irs_990_ez_YYYY`	`mission_description`	Smaller orgs
Mission (Full 990)	`irs_990_YYYY`	`activity_or_mission_desc`	Larger orgs
Website URL	`irs_990_YYYY`, `irs_990_ez_YYYY`	`website_address_txt`	Both forms
Total Revenue	`irs_990_YYYY`	`total_revenue_current_year`	Annual revenue
Total Expenses	`irs_990_YYYY`	`total_expenses_current_year`	Annual expenses
Program Service Revenue	`irs_990_YYYY`	`program_service_revenue`	Program revenue
Assets	`irs_990_YYYY`	`total_assets_eoy`	End of year
Liabilities	`irs_990_YYYY`	`total_liabilities_eoy`	End of year
Executive Salaries	`irs_990_schedule_j_YYYY`	Compensation fields	CEO, CFO pay
Grants Paid	`irs_990_pf_YYYY`	Grant fields	For foundations
Tax Period	All tables	`tax_period`	YYYYMM format (e.g. 202312)
State	`irs_990_YYYY`	`state`	2-letter code
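Tax periods appear elsewhere in this guide as values like 202312, so splitting them into year and month is a common first step. A small sketch that accepts either a 6-digit or an 8-digit value; `parse_tax_period` is a hypothetical helper:

```python
def parse_tax_period(value):
    """Split a tax period such as 202312 or '20231231' into (year, month)."""
    text = str(value).strip()
    if len(text) not in (6, 8) or not text.isdigit():
        raise ValueError(f"unexpected tax period: {value!r}")
    year, month = int(text[:4]), int(text[4:6])
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in tax period: {value!r}")
    return year, month

print(parse_tax_period("202312"))  # (2023, 12)
print(parse_tax_period(20230630))  # (2023, 6)
```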

Query Examples for All Key Fields

Extract mission, website, AND revenue from both 990 and 990-EZ:

-- Combine Full 990 and 990-EZ for complete coverage
WITH full_990 AS (
  SELECT 
    ein,
    activity_or_mission_desc AS mission,
    website_address_txt AS website,
    total_revenue_current_year AS revenue,
    total_expenses_current_year AS expenses,
    '990' AS form_type
  FROM `bigquery-public-data.irs_990.irs_990_2023`
  WHERE state IN ('AL', 'MI')
    AND activity_or_mission_desc IS NOT NULL
),
ez_990 AS (
  SELECT 
    ein,
    mission_description AS mission,
    website_address_txt AS website,
    total_revenue AS revenue,
    total_expenses AS expenses,
    '990-EZ' AS form_type
  FROM `bigquery-public-data.irs_990.irs_990_ez_2023`
  WHERE state IN ('AL', 'MI')
    AND mission_description IS NOT NULL
)
SELECT * FROM full_990
UNION ALL
SELECT * FROM ez_990
ORDER BY revenue DESC;

Add executive compensation:

-- Get mission + CEO salary
SELECT 
  f.ein,
  f.activity_or_mission_desc AS mission,
  f.website_address_txt AS website,
  f.total_revenue_current_year AS revenue,
  j.compensation_amount AS ceo_compensation
FROM `bigquery-public-data.irs_990.irs_990_2023` f
LEFT JOIN `bigquery-public-data.irs_990.irs_990_schedule_j_2023` j
  ON f.ein = j.ein
WHERE f.state = 'AL'
  AND (j.title_txt LIKE '%CEO%' OR j.title_txt LIKE '%President%')
ORDER BY revenue DESC
LIMIT 100;

Data Cleaning Tips

The catch: Some fields have messy XML tags embedded (like <MissionDesc>). Clean with regex:

import re
import pandas as pd

# Clean XML tags from missions
df['mission_clean'] = df['mission'].str.replace(r'<[^>]+>', '', regex=True)

# Trim whitespace
df['mission_clean'] = df['mission_clean'].str.strip()

# Collapse runs of whitespace (newlines, tabs, double spaces) into a single space
df['mission_clean'] = df['mission_clean'].str.replace(r'\s+', ' ', regex=True)
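The same cleaning can be written as a plain function for use outside pandas. This sketch differs from the snippet above in two ways, both of which are assumptions about what is useful: tags are replaced with a space so adjacent words do not fuse, and HTML/XML entities such as `&amp;` are decoded. `clean_mission` is a hypothetical name.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")

def clean_mission(text):
    """Strip embedded tags, decode entities, and normalize whitespace."""
    if text is None:
        return None
    text = TAG_RE.sub(" ", text)   # replace tags with a space
    text = html.unescape(text)     # decode entities such as &amp;
    return SPACE_RE.sub(" ", text).strip()

print(clean_mission("<MissionDesc> Improve  oral health &amp; access </MissionDesc>"))
# Improve oral health & access
```

It can be applied to a column with `df['mission'].map(clean_mission)`.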

Cost Estimation

Free tier: 1 TB of queries per month (resets monthly)

Typical query costs:

Extract missions for 1M orgs: ~50 GB scanned = FREE
Extract all fields for 1M orgs: ~200 GB scanned = FREE
Full table scan of all years: ~2 TB = about $10 at $5/TB on-demand pricing (one-time cost; check current BigQuery rates)

Tip: Use WHERE clauses to filter by state/NTEE to reduce data scanned.
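The arithmetic behind these estimates is simple enough to script. In the sketch below the price per TB is a required argument because rates change; the $5 used in the example is inferred from this guide's own figure of $10 for 2 TB (which corresponds to no free allowance remaining) and should be checked against current pricing. `estimate_query_cost` is a hypothetical helper.

```python
def estimate_query_cost(tb_scanned, price_per_tb, free_tb_remaining=1.0):
    """Estimate on-demand query cost in USD after subtracting the free allowance."""
    billable_tb = max(0.0, tb_scanned - free_tb_remaining)
    return billable_tb * price_per_tb

PRICE = 5.0  # assumed USD per TB, inferred from the figures in this guide

print(estimate_query_cost(0.05, PRICE))                        # 0.0
print(estimate_query_cost(0.2, PRICE))                         # 0.0
print(estimate_query_cost(2.0, PRICE))                         # 5.0
print(estimate_query_cost(2.0, PRICE, free_tb_remaining=0.0))  # 10.0
```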

🚀 GivingTuesday Data Lake (For Advanced Parsing)

Option 1: Via AWS Console (Free Account)

Visit aws.amazon.com and create a free AWS account (requires a credit card for validation, but there are no charges for accessing this data)
Log in to AWS Console
Open the data lake: https://s3.console.aws.amazon.com/s3/buckets/gt990datalake-rawdata/?region=us-east-1&tab=objects

Option 2: Via Command Line (Recommended for Automation)

Prerequisites: Install AWS CLI

# List main bucket contents
aws s3 ls gt990datalake-rawdata --no-sign-request

# List indices (CSV files listing all 990s)
aws s3 ls gt990datalake-rawdata/Indices/990xmls/ --no-sign-request

# Download the latest index
aws s3 cp \
  s3://gt990datalake-rawdata/Indices/990xmls/index_all_years_efiledata_xmls_created_on_2023-10-29.csv \
  data/cache/form990_index.csv \
  --no-sign-request

# Download a specific 990 XML
aws s3 cp \
  s3://gt990datalake-rawdata/EfileData/XmlFiles/[OBJECT_ID]_public.xml \
  data/cache/form_990_xml/ \
  --no-sign-request
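The object key for a filing follows directly from its OBJECT_ID, so it can be built in code rather than typed by hand. The sketch below also builds a virtual-hosted-style HTTPS URL; that anonymous HTTPS reads work for this bucket is an assumption based on its public status, and `xml_key` / `xml_url` are hypothetical helper names.

```python
from urllib.parse import quote

BUCKET = "gt990datalake-rawdata"
XML_PREFIX = "EfileData/XmlFiles/"

def xml_key(object_id):
    """S3 key of the XML return for a given OBJECT_ID from the index."""
    return f"{XML_PREFIX}{object_id}_public.xml"

def xml_url(object_id):
    """HTTPS URL for the same object (assumes anonymous reads are allowed)."""
    return f"https://{BUCKET}.s3.amazonaws.com/{quote(xml_key(object_id))}"

# 'EXAMPLE_OBJECT_ID' is a placeholder, not a real filing
print(xml_key("EXAMPLE_OBJECT_ID"))
print(xml_url("EXAMPLE_OBJECT_ID"))
```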

Option 3: Automated Python Integration

The same index-based lookup can be done directly from Python with boto3:

import boto3
from botocore import UNSIGNED
from botocore.config import Config
import pandas as pd

# Configure S3 client for GivingTuesday Data Lake (no credentials needed)
s3 = boto3.client(
    's3',
    region_name='us-east-1',
    config=Config(signature_version=UNSIGNED)
)

# Download index to find available 990s
index_response = s3.get_object(
    Bucket='gt990datalake-rawdata',
    Key='Indices/990xmls/index_all_years_efiledata_xmls_created_on_2023-10-29.csv'
)
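# Read EIN as text so leading zeros survive and the string comparison below works
index_df = pd.read_csv(index_response['Body'], dtype={'EIN': str})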

print(f"Available 990s: {len(index_df):,}")
print(f"Columns: {index_df.columns.tolist()}")

# Find 990s for a specific EIN
ein = "123456789"
org_filings = index_df[index_df['EIN'] == ein]

# Download one filing (the first match; sort by tax period first if you need the latest)
if len(org_filings) > 0:
    latest = org_filings.iloc[0]
    xml_key = f"EfileData/XmlFiles/{latest['OBJECT_ID']}_public.xml"
    
    xml_obj = s3.get_object(Bucket='gt990datalake-rawdata', Key=xml_key)
    xml_content = xml_obj['Body'].read()
    
    # Parse with xmltodict (simplified approach)
    import xmltodict
    doc = xmltodict.parse(xml_content)
    # Extract fields from doc['Return']['ReturnData']['IRS990']
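Pulling individual fields out of a return can also be done with the standard library. The sketch below runs on a small hand-written sample rather than a real filing. The namespace and element names (`ActivityOrMissionDesc`, `CYTotalRevenueAmt`, and so on) are assumptions that match recent Form 990 schema versions; older returns use different names, so treat the paths as a starting point.

```python
import xml.etree.ElementTree as ET

NS = {"irs": "http://www.irs.gov/efile"}

# Hand-written sample, not a real return
SAMPLE_XML = b"""<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <Filer><EIN>123456789</EIN></Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990>
      <ActivityOrMissionDesc>Improve oral health access</ActivityOrMissionDesc>
      <WebsiteAddressTxt>https://example.org</WebsiteAddressTxt>
      <CYTotalRevenueAmt>2500000</CYTotalRevenueAmt>
      <CYTotalExpensesAmt>2100000</CYTotalExpensesAmt>
    </IRS990>
  </ReturnData>
</Return>"""

# Output name -> path relative to the <Return> root (assumed element names)
FIELDS = {
    "ein": "irs:ReturnHeader/irs:Filer/irs:EIN",
    "mission": "irs:ReturnData/irs:IRS990/irs:ActivityOrMissionDesc",
    "website": "irs:ReturnData/irs:IRS990/irs:WebsiteAddressTxt",
    "total_revenue": "irs:ReturnData/irs:IRS990/irs:CYTotalRevenueAmt",
    "total_expenses": "irs:ReturnData/irs:IRS990/irs:CYTotalExpensesAmt",
}
NUMERIC = {"total_revenue", "total_expenses"}

def extract_fields(xml_bytes):
    """Return a dict of selected fields; missing elements come back as None."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for name, path in FIELDS.items():
        text = root.findtext(path, default=None, namespaces=NS)
        if text is not None and name in NUMERIC:
            text = int(text)
        result[name] = text
    return result

print(extract_fields(SAMPLE_XML))
```

With a real download, `extract_fields(xml_content)` would take the bytes read from S3 in the snippet above.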

🤖 Automated Enrichment Script

We provide scripts/enrich_nonprofits_gt990.py - a complete automated solution.

Quick Start

Step 1: Download Index (One-Time Setup)

# Install dependencies
pip install boto3 xmltodict pandas pyarrow tqdm loguru

# Download GivingTuesday Data Lake index (~200MB CSV, 1M+ records)
python scripts/enrich_nonprofits_gt990.py --download-index

This creates data/cache/form990_gt_index.parquet for fast EIN→OBJECT_ID lookups.

Step 2: Enrich Your Data

# Enrich all Tuscaloosa nonprofits
python scripts/enrich_nonprofits_gt990.py \
    --input data/gold/nonprofits_tuscaloosa.parquet \
    --output data/gold/nonprofits_tuscaloosa_form990.parquet \
    --concurrent 20

# Enrich Alabama + Michigan health orgs
python scripts/enrich_nonprofits_gt990.py \
    --input data/gold/nonprofits_organizations.parquet \
    --output data/gold/nonprofits_990_enriched.parquet \
    --states AL MI \
    --ntee E \
    --concurrent 50

# Test with sample
python scripts/enrich_nonprofits_gt990.py \
    --input data/gold/nonprofits_organizations.parquet \
    --output /tmp/test_990.parquet \
    --sample 100

Features

✅ Index-based lookup: Uses OBJECT_ID from GivingTuesday index (no filename guessing)
✅ Async/parallel: Process 20-50 organizations concurrently
✅ Smart caching: JSON cache prevents re-downloading same 990s
✅ Automatic retries: Handles S3 errors gracefully
✅ Progress tracking: tqdm progress bar with ETA
✅ Comprehensive logging: Detailed logs with statistics
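The script's internals are not shown in this guide, but the combination of bounded concurrency and retries listed above is a common pattern. A generic sketch using only `asyncio`, with a fake fetch function standing in for the S3 download; all names here are hypothetical and the real script may work differently:

```python
import asyncio

async def fetch_with_retry(fetch, key, retries=3, base_delay=0.01):
    """Call fetch(key), retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return await fetch(key)
        except Exception:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)

async def enrich_all(keys, fetch, concurrent=20):
    """Fetch each distinct key once, with at most `concurrent` fetches in flight."""
    semaphore = asyncio.Semaphore(concurrent)

    async def worker(key):
        async with semaphore:
            try:
                return key, await fetch_with_retry(fetch, key)
            except Exception:
                return key, None  # recorded as not found

    unique_keys = list(dict.fromkeys(keys))  # de-duplicate, keep order
    pairs = await asyncio.gather(*(worker(k) for k in unique_keys))
    return dict(pairs)

# Fake download: 'flaky' fails once, 'missing' always fails
attempts = {}

async def fake_fetch(key):
    attempts[key] = attempts.get(key, 0) + 1
    await asyncio.sleep(0)
    if key == "flaky" and attempts[key] < 2:
        raise ConnectionError("transient error")
    if key == "missing":
        raise KeyError(key)
    return f"<xml for {key}>"

results = asyncio.run(enrich_all(["a", "flaky", "missing", "a"], fake_fetch, concurrent=2))
print(results)
```

The semaphore plays the role of the `--concurrent` setting: raising it allows more downloads in flight at once.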

Performance

Speed: ~2-3 sec/org (download + parse)
Concurrent=20: ~450 orgs/hour
Concurrent=50: ~1,100 orgs/hour
1,000 orgs @ 50% success: ~15-20 minutes

Enriched Fields

The script adds these columns to your DataFrame:

form_990_status             # 'found', 'not_found', or 'parse_error'
form_990_tax_year          # e.g., 202312
form_990_filing_type       # 990, 990EZ, or 990PF
form_990_total_revenue     # Total revenue
form_990_total_expenses    # Total expenses
form_990_net_income        # Revenue - Expenses
form_990_contributions     # Donations and grants received
form_990_program_revenue   # Revenue from programs/services
form_990_investment_income # Investment income
form_990_program_expenses  # Program service expenses
form_990_admin_expenses    # Administrative expenses
form_990_fundraising_expenses  # Fundraising expenses
form_990_total_assets      # Total assets
form_990_total_liabilities # Total liabilities
form_990_net_assets        # Assets - Liabilities
form_990_grants_paid       # Grants awarded to others
form_990_mission           # Mission statement
form_990_last_updated      # Timestamp of enrichment

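# Filter the index to your state
# (filings_index is the index DataFrame, e.g. loaded from data/cache/form990_gt_index.parquet)
alabama_filings = filings_index[filings_index['State'] == 'AL']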
print(f"Alabama filings: {len(alabama_filings):,}")

# Sample columns
# EIN, OrganizationName, State, URL, SubmittedOn, TaxPeriod

🔗 Integration with Current Data

Enrich Existing Nonprofits

import boto3
import pandas as pd
from botocore import UNSIGNED
from botocore.config import Config
from form990_parser import Form990Parser

# Load your current nonprofit data
orgs = pd.read_parquet('data/gold/nonprofits_organizations.parquet')

# Filter to Alabama health organizations (NTEE code E)
health_orgs = orgs[
    (orgs['state'] == 'AL') & 
    (orgs['ntee_code'].str.startswith('E', na=False))
]

print(f"Alabama health nonprofits: {len(health_orgs):,}")

# Index downloaded earlier; maps EIN to OBJECT_ID (column names as used above)
index_df = pd.read_csv('data/cache/form990_index.csv', dtype={'EIN': str})
object_ids = index_df.drop_duplicates('EIN').set_index('EIN')['OBJECT_ID']

# Anonymous S3 client for the public GivingTuesday bucket
s3 = boto3.client('s3', region_name='us-east-1', config=Config(signature_version=UNSIGNED))
parser = Form990Parser()

enriched = []
for idx, org in health_orgs.iterrows():
    ein = str(org['ein']).zfill(9)
    if ein not in object_ids.index:
        continue  # No e-filed 990 listed for this org
    
    try:
        # Build the S3 key from the index instead of guessing file names
        key = f"EfileData/XmlFiles/{object_ids[ein]}_public.xml"
        
        xml_obj = s3.get_object(Bucket='gt990datalake-rawdata', Key=key)
        xml_content = xml_obj['Body'].read()
        
        filing_data = parser.parse_xml(xml_content)
        
        # Merge with org data
        org_enriched = org.to_dict()
        org_enriched.update(filing_data)
        enriched.append(org_enriched)
        
    except Exception as e:
        # Download or parse failed; report it and move on
        print(f"Skipping {ein}: {e}")
        continue

enriched_df = pd.DataFrame(enriched)
enriched_df.to_parquet('data/gold/nonprofits_alabama_health_990.parquet')

print(f"Enriched {len(enriched_df):,} organizations with Form 990 data")

💾 Data Schema

Form 990 Parser Output

{
    # Basic Info
    'ein': '123456789',
    'organization_name': 'Alabama Oral Health Foundation',
    'tax_year': 2023,
    'tax_period': '202312',
    
    # Financials
    'total_revenue': 2500000,
    'total_expenses': 2100000,
    'net_income': 400000,
    'total_assets': 5000000,
    'total_liabilities': 500000,
    
    # Revenue Breakdown
    'contributions_grants': 1200000,
    'program_service_revenue': 800000,
    'investment_income': 300000,
    'other_revenue': 200000,
    
    # Expense Breakdown
    'program_expenses': 1800000,
    'administrative_expenses': 200000,
    'fundraising_expenses': 100000,
    
    # Programs
    'program_service_descriptions': [
        {
            'description': 'Community dental clinics',
            'expenses': 1000000,
            'grants': 200000
        },
        {
            'description': 'School fluoride programs',
            'expenses': 500000,
            'grants': 100000
        }
    ],
    
    # Governance
    'officers': [
        {
            'name': 'Sarah Johnson',
            'title': 'CEO',
            'compensation': 150000
        },
        {
            'name': 'John Smith',
            'title': 'CFO',
            'compensation': 120000
        }
    ],
    
    # Mission
    'mission_statement': 'Improve oral health access in underserved communities...',
    'program_accomplishments': 'Served 10,000 patients in 2023...'
}

📈 Performance Considerations

Data Volume

Form 990 XMLs: ~300,000 new filings per year
Average XML size: 500KB - 5MB
Total storage: ~500GB for all historical 990s (2011-present)

Processing Speed

Sequential (current approach):

Download + parse: ~2-5 seconds per 990
300,000 filings × 3 sec = 250 hours 😱

Async parallel (recommended):

50 concurrent workers
300,000 filings × 3 sec / 50 = 5 hours ⚡ (assuming ideal scaling; real throughput will be lower)
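The two estimates above come from the same formula, which is easy to reuse for other batch sizes. A minimal sketch that assumes work divides evenly across workers with no overhead; `estimated_hours` is a hypothetical helper:

```python
def estimated_hours(filings, seconds_per_filing, workers=1):
    """Ideal wall-clock hours to process `filings` with evenly loaded workers."""
    return filings * seconds_per_filing / workers / 3600

print(estimated_hours(300_000, 3))              # 250.0
print(estimated_hours(300_000, 3, workers=50))  # 5.0
```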

Smart Strategies

Filter first: Only download 990s for organizations you care about

# Only health orgs in your states
health_eins = orgs[
    (orgs['state'].isin(['AL', 'MI'])) &
    (orgs['ntee_code'].str.startswith('E', na=False))
]['ein'].tolist()

# Result: ~50,000 instead of 300,000 = 1 hour

Use index files: Download the index first, filter, then fetch XMLs

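# Get index (the GivingTuesday index CSV downloaded earlier)
index = pd.read_csv('data/cache/form990_index.csv', dtype={'EIN': str})

# Filter to your EINs (health_eins must be strings to match)
relevant = index[index['EIN'].isin(health_eins)]

# Only download these
for object_id in relevant['OBJECT_ID']:
    key = f"EfileData/XmlFiles/{object_id}_public.xml"
    # download and parse the object at this key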

Cache aggressively: 990s don't change after filing

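from pathlib import Path
import pandas as pd

cache_dir = Path('data/cache/form_990_xml')

def load_cached_filing(ein, tax_year):
    cache_file = cache_dir / f"{ein}_{tax_year}.parquet"
    if cache_file.exists():
        return pd.read_parquet(cache_file)
    # Cache miss: download, parse, write the result to cache_file, then return it
    return None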

🎯 Use Cases

1. Financial Health Analysis

# Which nonprofits are most financially stable?
df['efficiency_ratio'] = df['program_expenses'] / df['total_expenses']
df['reserve_months'] = df['net_assets'] / (df['total_expenses'] / 12)

efficient = df[df['efficiency_ratio'] > 0.75]  # >75% on programs
print(f"Efficient organizations: {len(efficient):,}")
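As a worked example, the same two ratios can be computed for the sample figures in the Data Schema section (program expenses $1.8M, total expenses $2.1M, assets $5M, liabilities $0.5M). The sketch also guards against zero expenses, which the DataFrame version does not; `financial_ratios` is a hypothetical helper.

```python
def financial_ratios(program_expenses, total_expenses, total_assets, total_liabilities):
    """Program efficiency and months of reserves; None when expenses are zero."""
    if not total_expenses:
        return {"efficiency_ratio": None, "reserve_months": None}
    net_assets = total_assets - total_liabilities
    return {
        "efficiency_ratio": program_expenses / total_expenses,
        "reserve_months": net_assets / (total_expenses / 12),
    }

ratios = financial_ratios(1_800_000, 2_100_000, 5_000_000, 500_000)
print({name: round(value, 2) for name, value in ratios.items()})
# {'efficiency_ratio': 0.86, 'reserve_months': 25.71}
```

An efficiency ratio of 0.86 clears the 75% threshold used above, and net assets of $4.5M cover roughly 26 months of spending.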

2. Grant Research

# Who's giving grants in health (NTEE code E)?
grantmakers = df[
    (df['grants_paid'] > 0) &
    (df['ntee_code'].str.startswith('E', na=False))
]

print(f"Health grantmakers: {len(grantmakers):,}")
print(f"Total grants: ${grantmakers['grants_paid'].sum():,.0f}")

3. Program Discovery

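# Find organizations running specific programs.
# program_service_descriptions is a list of dicts (see Data Schema), so check each description.
def mentions(programs, term):
    if programs is None or isinstance(programs, float):  # missing value
        return False
    return any(term in str(p.get('description', '')).lower() for p in programs)

fluoride_programs = df[df['program_service_descriptions'].apply(mentions, term='fluoride')]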

print(f"Orgs with fluoride programs: {len(fluoride_programs):,}")

🚀 Next Steps

Quick Test

# Install dependencies
pip install form-990-xml-parser boto3

# Test with single organization
python -c "
import boto3
from botocore import UNSIGNED
from botocore.config import Config
from form990_parser import Form990Parser

# Anonymous client for the public GivingTuesday bucket
s3 = boto3.client('s3', region_name='us-east-1', config=Config(signature_version=UNSIGNED))
ein = '631307851'  # Delta Dental of Alabama (example)
object_id = '[OBJECT_ID]'  # OBJECT_ID for this EIN, looked up in the index

try:
    # Fetch the filing by its index OBJECT_ID
    key = f'EfileData/XmlFiles/{object_id}_public.xml'
    obj = s3.get_object(Bucket='gt990datalake-rawdata', Key=key)
    
    parser = Form990Parser()
    data = parser.parse_xml(obj['Body'].read())
    
    print(f'Organization: {data.get(\"organization_name\")}')
    print(f'Revenue: \${data.get(\"total_revenue\", 0):,.0f}')
    print(f'Assets: \${data.get(\"total_assets\", 0):,.0f}')
except Exception as e:
    print(f'Error: {e}')
"

Full Integration

Run the enrichment pipeline on the full dataset:

# Run the enrichment script
python scripts/enrich_nonprofits_gt990.py \
    --input data/gold/nonprofits_organizations.parquet \
    --output data/gold/nonprofits_organizations_990.parquet \
    --states AL MI \
    --ntee E \
    --concurrent 50

📚 Resources

Giving Tuesday GitHub: https://github.com/Giving-Tuesday
Form 990 XML Parser: https://github.com/Giving-Tuesday/form-990-xml-parser
Form 990 XML Mapper: https://github.com/Giving-Tuesday/form-990-xml-mapper
IRS 990 AWS Bucket (discontinued by the IRS at the end of 2021): https://registry.opendata.aws/irs990/
IRS Index Files (same discontinued bucket): https://s3.amazonaws.com/irs-form-990/index_YYYY.json
990 Schema Documentation: https://www.irs.gov/e-file-providers/current-valid-xml-schemas-and-business-rules

❓ FAQ

Q: Are we currently downloading XML?

A: No. We currently use EO-BMF CSV files (basic data). Form 990 XML would be an enhancement that adds detailed financials.

Q: Can we use Giving Tuesday libraries?

A: Yes. They are open source and designed for exactly this purpose, and they would provide much richer data.

Q: How much data is it?

A: Very manageable:

All 990s (2011-present): ~500GB
Alabama only: ~5GB
Alabama health orgs: ~500MB

Q: What's the license?

A: Public domain (U.S. Government data) + Giving Tuesday tools are open source

Q: Integration effort?

A: Low - can reuse existing async enrichment patterns. Estimated: 1-2 days for initial integration.
