IRS Bulk Data Integration

Access ALL 1.9M+ U.S. nonprofits using the IRS Exempt Organizations Business Master File (EO-BMF).

🎯 Why Use IRS Bulk Data?

| Feature | ProPublica API | IRS EO-BMF |
|---|---|---|
| Coverage | 25 results per request | 1,952,238 total |
| Alabama nonprofits | 25 | 26,148 |
| Pagination | ❌ Not available | ✅ Complete dataset |
| Speed | Slow (25 at a time) | ✅ Fast (bulk download) |
| Cost | Free | Free |
| Update frequency | Real-time | Monthly |
| Data source | IRS Form 990 | IRS official registry |

Result: for Alabama alone, the IRS file returns roughly 1,000x more records than a single ProPublica request (26,148 vs. 25)! 🚀


📊 Data Source

IRS Exempt Organizations Business Master File (EO-BMF)

Regional Files

The IRS provides 4 regional CSV files for faster download:

  1. Region 1 (Northeast): CT, ME, MA, NH, NJ, NY, RI, VT — 277,214 orgs
  2. Region 2 (Mid-Atlantic & Great Lakes): DE, DC, IL, IN, IA, KY, MD, MI, MN, NE, NC, ND, OH, PA, SC, SD, VA, WV, WI — 717,691 orgs
  3. Region 3 (Gulf Coast & Pacific): AL, AK, AZ, AR, CA, CO, FL, GA, HI, ID, KS, LA, MS, MO, MT, NV, NM, OK, OR, TX, TN, UT, WA, WY — 952,412 orgs
  4. Region 4 (All other): International, Puerto Rico — 4,921 orgs

🚀 Quick Start

Download All 1.9M+ Nonprofits

# Download ALL U.S. nonprofits (4 regional files)
python scripts/create_all_gold_tables.py \
--nonprofits-only \
--use-irs \
--download-all-irs

# Creates 4 gold tables:
# - nonprofits_organizations.parquet (1.9M+ records)
# - nonprofits_financials.parquet
# - nonprofits_programs.parquet
# - nonprofits_locations.parquet

Download time: ~30 seconds on the first run; later runs load from the cache in about 1 second.


Download Specific States

# Download Alabama nonprofits only
python scripts/create_all_gold_tables.py \
--nonprofits-only \
--states AL \
--use-irs

# Result: 26,148 Alabama nonprofits

# Download multiple states
python scripts/create_all_gold_tables.py \
--nonprofits-only \
--states AL GA FL MS TN \
--use-irs

# Result: 100,000+ nonprofits from 5 states

Filter by NTEE Code

# Get only health organizations (NTEE E) from Alabama
python scripts/create_all_gold_tables.py \
--nonprofits-only \
--states AL \
--ntee-codes E \
--use-irs

# Result: 509 health nonprofits in Alabama

# Get health + human services from all states
python scripts/create_all_gold_tables.py \
--nonprofits-only \
--ntee-codes E P \
--use-irs \
--download-all-irs

# Result: 400,000+ health & human service orgs nationwide
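
The commands above pass several values to one flag (`--states AL GA FL MS TN`). As a rough illustration of how that style of flag behaves, here is a minimal `argparse` sketch that mirrors the flag names used on this page. It is an assumption about how such a CLI could be wired, not the actual parser in `scripts/create_all_gold_tables.py`; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Create gold tables")
    parser.add_argument("--nonprofits-only", action="store_true")
    parser.add_argument("--use-irs", action="store_true")
    parser.add_argument("--download-all-irs", action="store_true")
    # nargs="+" consumes every value up to the next flag, so
    # "--states AL GA FL" yields ["AL", "GA", "FL"].
    parser.add_argument("--states", nargs="+", default=[])
    parser.add_argument("--ntee-codes", nargs="+", default=[])
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--nonprofits-only", "--states", "AL", "GA", "--ntee-codes", "E", "--use-irs"]
    )
    print(args.states, args.ntee_codes, args.use_irs, args.download_all_irs)
    # ['AL', 'GA'] ['E'] True False
```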

💻 Python API Usage

Example 1: Download All Regions

from discovery.irs_bmf_ingestion import IRSBMFIngestion

irs = IRSBMFIngestion()

# Download all 1.9M+ nonprofits (4 regional files)
df = irs.download_all_regions()

print(f"Downloaded {len(df):,} nonprofits")
# Output: Downloaded 1,952,238 nonprofits

# Data is automatically cached to: data/cache/irs_bmf/all_regions_combined.parquet
# Future runs will load from cache (instant!)

Example 2: Download Specific State

from discovery.irs_bmf_ingestion import IRSBMFIngestion

irs = IRSBMFIngestion()

# Download Alabama
df_alabama = irs.download_state_file("AL")
print(f"Alabama: {len(df_alabama):,} nonprofits")
# Output: Alabama: 26,148 nonprofits

# Download California
df_california = irs.download_state_file("CA")
print(f"California: {len(df_california):,} nonprofits")
# Output: California: ~200,000 nonprofits

Example 3: Filter by NTEE Code

from discovery.irs_bmf_ingestion import IRSBMFIngestion

irs = IRSBMFIngestion()

# Download all regions
df_all = irs.download_all_regions()

# Filter to health organizations (NTEE E)
df_health = irs.filter_by_ntee(df_all, ["E"])
print(f"Health organizations: {len(df_health):,}")
# Output: Health organizations: ~80,000

# Filter to multiple NTEE codes
df_community = irs.filter_by_ntee(df_all, ["E", "P", "K", "L", "S", "W"])
print(f"Community service orgs: {len(df_community):,}")
# Output: Community service orgs: ~600,000
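
Passing a one-letter code such as "E" and getting back E30, E32, and so on suggests the filter matches by prefix. The sketch below shows one way to do that with pandas. `filter_by_ntee_prefix` is a hypothetical stand-in and may not match what `IRSBMFIngestion.filter_by_ntee` actually does internally; the sample rows are made up.

```python
import pandas as pd


def filter_by_ntee_prefix(df: pd.DataFrame, codes: list[str]) -> pd.DataFrame:
    """Keep rows whose ntee_cd starts with any of the given codes.

    Assumption: a one-letter code such as "E" is meant to match every
    subcode (E30, E32, ...), so matching is by prefix, not equality.
    """
    prefixes = tuple(code.upper() for code in codes)
    # Missing NTEE codes become "" so they never match a prefix.
    ntee = df["ntee_cd"].fillna("").astype(str).str.upper()
    mask = ntee.map(lambda value: value.startswith(prefixes)).astype(bool)
    return df[mask]


if __name__ == "__main__":
    sample = pd.DataFrame(
        {
            "name": ["Clinic", "Food Bank", "Unclassified", "Shelter"],
            "ntee_cd": ["E30", "K31", None, "L41"],
        }
    )
    print(filter_by_ntee_prefix(sample, ["E", "L"])["name"].tolist())
    # ['Clinic', 'Shelter']
```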

Example 4: Combine State + NTEE Filtering

from discovery.irs_bmf_ingestion import IRSBMFIngestion

irs = IRSBMFIngestion()

# Download Alabama
df = irs.download_state_file("AL")

# Filter to health orgs
health = irs.filter_by_ntee(df, ["E"])

# Convert to ProPublica format
standardized = irs.standardize_to_propublica_format(health)

# Save to gold table
standardized.to_parquet("data/gold/alabama_health_nonprofits.parquet")

📋 Data Schema

IRS EO-BMF Columns

The IRS provides 28 columns per organization:

| Column | Description | Example |
|---|---|---|
| ein | Employer Identification Number | 630123456 |
| name | Organization name | Good Samaritan Health Clinic |
| street | Street address | 123 Main St |
| city | City | Birmingham |
| state | 2-letter state code | AL |
| zip | ZIP code | 35203 |
| ntee_cd | NTEE classification code | E30 (Ambulatory Health) |
| subsection | 501(c) subsection | 03 = 501(c)(3) |
| asset_amt | Asset amount | 4467751 |
| income_amt | Income amount | 2500000 |
| revenue_amt | Revenue amount (Form 990) | 2500000 |
| ruling | Year and month of ruling letter (YYYYMM) | 200501 (Jan 2005) |
| deductibility | Deductibility status code | 1 = Deductible |
| foundation | Foundation code | 15 = Public charity |
| activity | Activity codes | 000 |
| organization | Organization code | 1 = Corporation |
| status | Exempt org status code | 1 = Unconditional |
| ... | 11 more columns | ... |

Full data dictionary: https://www.irs.gov/pub/foia/ig/tege/eo-info.pdf
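
Several of these columns are coded values rather than display-ready text. The helpers below are a minimal sketch of decoding three of them, based only on the examples in the table above. The function names are hypothetical, and the subsection mapping assumes that codes 02 through 29 follow the "03 = 501(c)(3)" pattern; any other value is returned unchanged.

```python
from datetime import date
from typing import Optional


def normalize_ein(raw) -> str:
    """EINs have 9 digits; restore leading zeros lost if the column was read as int."""
    return str(raw).strip().zfill(9)


def parse_ruling(raw) -> Optional[date]:
    """Turn a YYYYMM ruling value such as 200501 into date(2005, 1, 1)."""
    text = str(raw).strip()
    if len(text) != 6 or not text.isdigit():
        return None
    year, month = int(text[:4]), int(text[4:])
    if year < 1 or not 1 <= month <= 12:
        return None  # covers placeholders such as 000000
    return date(year, month, 1)


def subsection_label(raw) -> str:
    """Map a subsection code such as "03" to "501(c)(3)" (assumed range 02-29)."""
    code = int(raw)
    return f"501(c)({code})" if 2 <= code <= 29 else str(raw)


if __name__ == "__main__":
    print(normalize_ein(12345678))   # 012345678
    print(parse_ruling("200501"))    # 2005-01-01
    print(subsection_label("03"))    # 501(c)(3)
```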


🔗 Integration with Existing Pipeline

The IRS ingestion module plugs into our existing ProPublica-based pipeline through the use_irs_data and download_all_irs options:

from pipeline.create_nonprofits_gold_tables import NonprofitGoldTableCreator

# Create pipeline with IRS support
creator = NonprofitGoldTableCreator()

# Option 1: Use IRS for specific states
creator.create_all_gold_tables(
    states=["AL", "GA", "FL"],
    use_irs_data=True,  # ← Use IRS instead of ProPublica
)

# Option 2: Download ALL nonprofits
creator.create_all_gold_tables(
    use_irs_data=True,
    download_all_irs=True,  # ← Get all 1.9M+ orgs
)

# Option 3: Filter by NTEE codes
creator.create_all_gold_tables(
    states=["AL"],
    ntee_codes=["E", "P"],  # Health + Human Services
    use_irs_data=True,
)

Standardization

IRS data is automatically converted to ProPublica-compatible format:

# IRS columns → ProPublica schema
{
    'ein': df.get('ein'),
    'name': df.get('name'),
    'city': df.get('city'),
    'state': df.get('state'),
    'ntee_code': df.get('ntee_cd'),
    'asset_amount': df.get('asset_amt'),
    'income_amount': df.get('income_amt'),
    'street_address': df.get('street'),
    'zip_code': df.get('zip'),
    'data_source': 'IRS_EO_BMF',  # Track source
}

This allows you to:

  • ✅ Mix IRS + ProPublica data
  • ✅ Use same gold table schema
  • ✅ Switch between sources without changing downstream code
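
To make the column mapping concrete, here is a minimal, runnable sketch of the same renaming applied to a pandas DataFrame. `standardize` and `COLUMN_MAP` are hypothetical stand-ins for illustration; the real `standardize_to_propublica_format` may handle types, defaults, and extra columns differently.

```python
import pandas as pd

# Mapping taken from the dictionary above: IRS column -> ProPublica-style column.
COLUMN_MAP = {
    "ein": "ein",
    "name": "name",
    "city": "city",
    "state": "state",
    "ntee_cd": "ntee_code",
    "asset_amt": "asset_amount",
    "income_amt": "income_amount",
    "street": "street_address",
    "zip": "zip_code",
}


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for standardize_to_propublica_format()."""
    out = pd.DataFrame(index=df.index)
    for irs_col, target_col in COLUMN_MAP.items():
        # Missing source columns become all-null rather than raising KeyError.
        out[target_col] = df[irs_col] if irs_col in df.columns else pd.NA
    out["data_source"] = "IRS_EO_BMF"  # track where each row came from
    return out


if __name__ == "__main__":
    raw = pd.DataFrame(
        {"ein": ["630123456"], "name": ["Good Samaritan Health Clinic"], "ntee_cd": ["E30"]}
    )
    print(standardize(raw).loc[0, ["ein", "ntee_code", "data_source"]].tolist())
    # ['630123456', 'E30', 'IRS_EO_BMF']
```

Because every output frame has the same columns in the same order, frames from both sources can be stacked with `pd.concat` and told apart later by `data_source`.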

🎓 NTEE Codes Reference

Common NTEE codes for community services:

| Code | Category | Example Organizations |
|---|---|---|
| E | Health | Hospitals, clinics, mental health |
| E30 | Ambulatory Health Center | Community health centers |
| E32 | School-Based Health Care | School clinics |
| E60 | Health Support Services | Medical equipment, patient support |
| E70 | Public Health Program | Disease prevention, health education |
| P | Human Services | Food banks, shelters, counseling |
| P20 | Human Service Organizations | Multi-service agencies |
| K | Food, Agriculture | Food pantries, nutrition programs |
| L | Housing, Shelter | Homeless shelters, affordable housing |
| S | Community Improvement | Community development, civic groups |
| W | Public Affairs | Advocacy, civil rights, voting |

Full NTEE taxonomy: https://nccs.urban.org/project/national-taxonomy-exempt-entities-ntee-codes


📈 Performance Benchmarks

Tested on standard cloud VM (4 vCPU, 16 GB RAM):

| Operation | Time | Records | File Size |
|---|---|---|---|
| Download Region 1 | ~4 sec | 277,214 | 25 MB |
| Download Region 2 | ~3 sec | 717,691 | 60 MB |
| Download Region 3 | ~5 sec | 952,412 | 80 MB |
| Download Region 4 | ~1 sec | 4,921 | 1 MB |
| Download ALL 4 regions | ~30 sec | 1,952,238 | 170 MB |
| Load from cache (parquet) | ~1 sec | 1,952,238 | 120 MB |
| Filter by NTEE (health) | ~2 sec | ~80,000 | 6 MB |
| Create 4 gold tables (AL) | ~6 sec | 26,148 | 4 MB |
| Create 4 gold tables (ALL) | ~5 min | 1,952,238 | 250 MB |

Recommendation: Download all regions on the first run, then filter locally. Once the combined file is cached (~1 sec to load), filtering it is much faster than downloading individual states one at a time.


🆚 When to Use IRS vs ProPublica

Use IRS EO-BMF When:

✅ You need comprehensive coverage (all nonprofits in a state)
✅ You're doing bulk analysis (e.g., "all health orgs in Southeast")
✅ You need offline access to data
✅ You want faster performance (bulk downloads)
✅ You're building a complete nonprofit directory

Use ProPublica API When:

✅ You need real-time updates (IRS is monthly)
✅ You want detailed Form 990 financial breakdowns
✅ You need executive compensation data
✅ You want mission statements (IRS doesn't have these)
✅ You're searching for a specific organization by name

Best Practice: Use Both!

  1. IRS for bulk discovery and coverage
  2. ProPublica for enrichment with detailed financials

from discovery.irs_bmf_ingestion import IRSBMFIngestion
# NonprofitDiscovery is the existing ProPublica client; import it from its module in your checkout

# 1. Download all Alabama orgs from IRS
irs = IRSBMFIngestion()
df_all = irs.download_state_file("AL")  # 26,148 orgs

# 2. Enrich the first 100 rows with ProPublica details
propublica = NonprofitDiscovery()
for ein in df_all.head(100)['ein']:
    # details contains mission, programs, detailed financials
    details = propublica.get_propublica_org_details(ein)

🔧 Troubleshooting

Download Fails with Timeout Error

# Increase timeout
irs = IRSBMFIngestion()
# Edit download_regional_file() timeout parameter (default: 300 seconds)
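
Timeouts on large downloads are often transient, so retrying with a growing delay is an alternative to raising the timeout. The wrapper below is a minimal sketch using only the standard library. `with_retries` is a hypothetical helper, not part of `IRSBMFIngestion`, and it assumes the download raises an `OSError` subclass (which includes the built-in `TimeoutError` and `ConnectionError`) when it fails.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); on failure wait base_delay * 2**n seconds, then try again."""
    for attempt in range(attempts):
        try:
            return action()
        except OSError:  # includes TimeoutError and ConnectionError
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")


# Usage with the ingestion class from this page (not executed here):
#   df = with_retries(lambda: irs.download_all_regions())
```

The `sleep` parameter exists only so the delay can be swapped out in tests; in normal use leave it at the default.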

Out of Memory Error

# Process states individually instead of all regions
from discovery.irs_bmf_ingestion import IRSBMFIngestion

irs = IRSBMFIngestion()
for state in ["AL", "GA", "FL"]:
    df = irs.download_state_file(state)
    # Process each state separately
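
If even a single state is too large for memory, another option is to stream a raw CSV row by row and keep only the rows you need. This is a minimal standard-library sketch. `stream_matching_rows` is a hypothetical helper, it assumes the raw IRS headers are upper-case versions of the column names in the schema above, and the second sample row is made up.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO


def stream_matching_rows(
    handle: TextIO, states: Iterable[str], ntee_prefixes: Iterable[str] = ()
) -> Iterator[dict]:
    """Yield matching rows one at a time so the whole file never sits in memory."""
    wanted_states = {s.upper() for s in states}
    prefixes = tuple(p.upper() for p in ntee_prefixes)
    for row in csv.DictReader(handle):
        # Normalise header case and replace missing values with "".
        row = {key.lower(): (value or "") for key, value in row.items() if key}
        if row.get("state", "").upper() not in wanted_states:
            continue
        if prefixes and not row.get("ntee_cd", "").upper().startswith(prefixes):
            continue
        yield row


if __name__ == "__main__":
    sample = io.StringIO(
        "EIN,NAME,STATE,NTEE_CD\n"
        "630123456,Good Samaritan Health Clinic,AL,E30\n"
        "581234567,Peach Food Bank,GA,K31\n"
    )
    for row in stream_matching_rows(sample, ["AL"], ["E"]):
        print(row["ein"], row["name"])
    # 630123456 Good Samaritan Health Clinic
```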

Need Fresh Data

# Force refresh (bypass cache)
df = irs.download_all_regions(force_refresh=True)
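
Since the IRS publishes the file monthly, one way to decide when to refresh is to check the age of the cached file. The helper below is a minimal sketch: `cache_is_stale` is a hypothetical name, the 30-day threshold is an assumption standing in for "monthly", and the cache path in the usage comment is the one mentioned earlier on this page.

```python
import os
import tempfile
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def cache_is_stale(path: str, max_age_days: float = 30, now: Optional[float] = None) -> bool:
    """Return True if the cache file is missing or older than max_age_days."""
    if not os.path.exists(path):
        return True
    current = time.time() if now is None else now
    age_days = (current - os.path.getmtime(path)) / SECONDS_PER_DAY
    return age_days > max_age_days


if __name__ == "__main__":
    # Demo with a throwaway file standing in for the real cache.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        demo_path = tmp.name
    print(cache_is_stale(demo_path))  # False: the file was just created
    os.remove(demo_path)
    print(cache_is_stale(demo_path))  # True: the file no longer exists

# Usage with the ingestion class from this page (not executed here):
#   cache = "data/cache/irs_bmf/all_regions_combined.parquet"
#   df = irs.download_all_regions(force_refresh=cache_is_stale(cache))
```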


🎯 Citation

When using IRS EO-BMF data in publications:

@misc{irs_eobmf_2026,
title = {Exempt Organizations Business Master File Extract (EO-BMF)},
author = {{Internal Revenue Service}},
year = {2026},
month = {April},
url = {https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf},
note = {Accessed: 2026-04-27. Record count: 1,952,238 organizations.}
}

✨ Key Takeaways

🎯 IRS EO-BMF provides ALL 1.9M+ U.S. nonprofits
Roughly 1,000x more records than a single ProPublica API request (26,148 vs. 25 for Alabama)
💾 Downloads in ~30 seconds, cached for instant future access
🔄 Seamlessly integrates with existing pipeline
📊 Updated monthly by the IRS
🆓 Completely free, public domain data

Start using it today!

python scripts/create_all_gold_tables.py --nonprofits-only --use-irs --download-all-irs