Skip to main content

🚀 HuggingFace Dataset Integration - Quick Start Guide

📋 Overview

You now have 3 new files to push your 1.9M+ nonprofit datasets to HuggingFace and query them from React:

  1. scripts/upload_nonprofits_to_hf.py - Upload script
  2. frontend/src/utils/huggingface.ts - TypeScript API client
  3. frontend/src/pages/NonprofitsHF.tsx - Example React page
  4. website/docs/guides/huggingface-datasets.md - Complete documentation

⚡ Quick Start (5 Steps)

Step 1: Get HuggingFace Token

  1. Visit: https://huggingface.co/settings/tokens
  2. Click "New token"
  3. Name it: oral-health-upload
  4. Permission: Write
  5. Copy the token (starts with hf_...)

Step 2: Set Environment Variable

# Add to .env file
echo 'HUGGINGFACE_TOKEN=hf_YOUR_TOKEN_HERE' >> .env

# Or export for current session
export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN_HERE"

Step 3: Install Dependencies

# Python dependencies
pip install huggingface_hub datasets pyarrow

# Already installed in your project
# datasets and huggingface-hub

Step 4: Upload Datasets

cd /home/developer/projects/open-navigator

# Upload all 4 nonprofit tables
python scripts/upload_nonprofits_to_hf.py --all

# Output:
# ✅ Logged in to Hugging Face
# ✅ Repository ready: https://huggingface.co/datasets/CommunityOne/one-nonprofits
# 📤 Uploading organizations from data/gold/nonprofits_organizations.parquet
# Rows: 1,952,238
# Columns: 28
# Size: 156.43 MB
# ✅ Uploaded organizations: 1,952,238 records
# ... (uploads financials, programs, locations)
# 🎉 All uploads complete!

What gets uploaded:

  • nonprofits_organizations.parquet → 1.9M+ orgs (split: "organizations")
  • nonprofits_financials.parquet → Financial data (split: "financials")
  • nonprofits_programs.parquet → Programs (split: "programs")
  • nonprofits_locations.parquet → Locations (split: "locations")

Step 5: Test the Dataset

# Test with curl (no auth required for public datasets!)
curl "https://datasets-server.huggingface.co/rows?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&offset=0&length=10" | jq .

# Search for "dental"
curl "https://datasets-server.huggingface.co/search?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&query=dental" | jq .

Expected response:

{
"features": [...],
"rows": [
{
"row_idx": 0,
"row": {
"ein": "630123456",
"name": "ALABAMA DENTAL ASSOCIATION",
"city": "MONTGOMERY",
"state": "AL",
"ntee_code": "E12",
...
}
}
],
"num_rows_total": 1952238,
"num_rows_per_page": 100
}

🌐 Using in React

Option A: Replace Current Nonprofits Page

# Backup current page
mv frontend/src/pages/Nonprofits.tsx frontend/src/pages/Nonprofits.backup.tsx

# Use HuggingFace version
mv frontend/src/pages/NonprofitsHF.tsx frontend/src/pages/Nonprofits.tsx

Option B: Add New Route

Edit frontend/src/App.tsx:

import NonprofitsHF from './pages/NonprofitsHF'

// Add route
<Route path="/nonprofits-hf" element={<NonprofitsHF />} />

Test Locally

cd frontend
npm run dev

# Visit: http://localhost:5173/nonprofits
# or: http://localhost:5173/nonprofits-hf

🔍 Query Examples

Python

from datasets import load_dataset
import pandas as pd

# Load dataset
dataset = load_dataset("CommunityOne/one-nonprofits")

# Get organizations table
orgs = dataset["organizations"]

# Convert to pandas
df = pd.DataFrame(orgs)

# Filter by state
alabama = df[df['state'] == 'AL']
print(f"Alabama nonprofits: {len(alabama):,}")
# Output: Alabama nonprofits: 26,148

# Filter by NTEE (E = Health)
health = df[df['ntee_code'].str.startswith('E', na=False)]
print(f"Health organizations: {len(health):,}")
# Output: Health organizations: 80,000+

# Search for "dental"
dental = df[df['name'].str.contains('dental', case=False, na=False)]
print(f"Dental organizations: {len(dental):,}")

JavaScript/TypeScript

import { searchNonprofits } from '../utils/huggingface'

// Search for dental orgs in California
const results = await searchNonprofits({
dataset: "CommunityOne/one-nonprofits",
query: "dental",
state: "CA",
nteeCode: "E",
limit: 100
})

console.log(`Found ${results.length} dental orgs in California`)

REST API (curl)

# Get first 100 organizations
curl "https://datasets-server.huggingface.co/rows?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&offset=0&length=100"

# Search for "dental"
curl "https://datasets-server.huggingface.co/search?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&query=dental"

# Get dataset size
curl "https://datasets-server.huggingface.co/size?dataset=CommunityOne/one-nonprofits&config=default&split=organizations"

📊 What's in the Dataset?

organizations (main table)

  • Records: 1,952,238
  • Fields: ein, name, sort_name, city, state, zip_code, street_address, ntee_code, subsection_code, foundation_code, tax_exempt_status, deductibility_status, ruling_date, organization_code, activity_codes, group_exemption, affiliation_code, data_source

financials

  • Records: 1,952,238
  • Fields: ein, asset_amount, income_amount, revenue_amount, tax_period

programs

  • Records: 1,952,238
  • Fields: ein, activity_codes, group_exemption, affiliation_code

locations

  • Records: 1,952,238
  • Fields: ein, street_address, city, state, zip_code

🎯 Key Features

✅ FREE

  • Unlimited storage (public datasets)
  • No authentication required for reading
  • Free bandwidth and API calls

✅ FAST

  • CDN-backed by HuggingFace
  • Automatic caching
  • Pagination built-in (100 rows max per request)

✅ SEARCHABLE

  • Full-text search included
  • Filter by columns (state, NTEE code, etc.)
  • REST API - works from any language

✅ SCALABLE

  • 1.9M+ records available instantly
  • No database setup required
  • Global availability

🛠️ Customization

Change Dataset Name

Edit scripts/upload_nonprofits_to_hf.py:

# Line 84
self.repo_name = repo_name or "YOUR_USERNAME/YOUR_DATASET_NAME"

Then upload:

python scripts/upload_nonprofits_to_hf.py --all --repo "your-username/nonprofits"

Update React Components

Edit frontend/src/pages/NonprofitsHF.tsx:

// Line 115
const DATASET_NAME = "your-username/nonprofits"

📚 Documentation

Full Guide

HuggingFace Docs

IRS Data Source


🔧 Troubleshooting

Error: "Hugging Face token required"

Solution:

export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN"
# Or add to .env file

Error: "File not found: nonprofits_organizations.parquet"

Solution: Generate the gold tables first:

python scripts/create_all_gold_tables.py --nonprofits-only --use-irs --download-all-irs

Error: "Repository does not exist"

Solution: Change the repo name or create it manually:

  1. Visit: https://huggingface.co/new-dataset
  2. Name: one-nonprofits
  3. License: CC0-1.0 (Public Domain)
  4. Click "Create"

Dataset shows 0 rows

Solution: Wait 5-10 minutes after upload for HuggingFace to process the dataset. Then refresh the viewer.


🎉 Next Steps

  1. Upload datasets: python scripts/upload_nonprofits_to_hf.py --all
  2. Test API: Visit https://huggingface.co/datasets/CommunityOne/one-nonprofits
  3. Update React app: Use NonprofitsHF.tsx example
  4. Add features:
    • Map visualization with locations table
    • Financial charts with financials table
    • Advanced filters (subsection_code, foundation_code)
    • Autocomplete search
    • Export to CSV

📧 Support

  • Documentation: website/docs/guides/huggingface-datasets.md
  • HuggingFace Support: https://discuss.huggingface.co
  • IRS EO-BMF Guide: website/docs/data-sources/irs-bulk-data.md

Happy querying! 🚀