🚀 HuggingFace Dataset Integration - Quick Start Guide
📋 Overview
You now have 3 new files to push your 1.9M+ nonprofit datasets to HuggingFace and query them from React:
scripts/upload_nonprofits_to_hf.py- Upload scriptfrontend/src/utils/huggingface.ts- TypeScript API clientfrontend/src/pages/NonprofitsHF.tsx- Example React pagewebsite/docs/guides/huggingface-datasets.md- Complete documentation
⚡ Quick Start (5 Steps)
Step 1: Get HuggingFace Token
- Visit: https://huggingface.co/settings/tokens
- Click "New token"
- Name it:
oral-health-upload - Permission: Write
- Copy the token (starts with
hf_...)
Step 2: Set Environment Variable
# Add to .env file
echo 'HUGGINGFACE_TOKEN=hf_YOUR_TOKEN_HERE' >> .env
# Or export for current session
export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN_HERE"
Step 3: Install Dependencies
# Python dependencies
pip install huggingface_hub datasets pyarrow
# Already installed in your project
# datasets and huggingface-hub
Step 4: Upload Datasets
cd /home/developer/projects/open-navigator
# Upload all 4 nonprofit tables
python scripts/upload_nonprofits_to_hf.py --all
# Output:
# ✅ Logged in to Hugging Face
# ✅ Repository ready: https://huggingface.co/datasets/CommunityOne/one-nonprofits
# 📤 Uploading organizations from data/gold/nonprofits_organizations.parquet
# Rows: 1,952,238
# Columns: 28
# Size: 156.43 MB
# ✅ Uploaded organizations: 1,952,238 records
# ... (uploads financials, programs, locations)
# 🎉 All uploads complete!
What gets uploaded:
nonprofits_organizations.parquet→ 1.9M+ orgs (split: "organizations")nonprofits_financials.parquet→ Financial data (split: "financials")nonprofits_programs.parquet→ Programs (split: "programs")nonprofits_locations.parquet→ Locations (split: "locations")
Step 5: Test the Dataset
# Test with curl (no auth required for public datasets!)
curl "https://datasets-server.huggingface.co/rows?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&offset=0&length=10" | jq .
# Search for "dental"
curl "https://datasets-server.huggingface.co/search?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&query=dental" | jq .
Expected response:
{
"features": [...],
"rows": [
{
"row_idx": 0,
"row": {
"ein": "630123456",
"name": "ALABAMA DENTAL ASSOCIATION",
"city": "MONTGOMERY",
"state": "AL",
"ntee_code": "E12",
...
}
}
],
"num_rows_total": 1952238,
"num_rows_per_page": 100
}
🌐 Using in React
Option A: Replace Current Nonprofits Page
# Backup current page
mv frontend/src/pages/Nonprofits.tsx frontend/src/pages/Nonprofits.backup.tsx
# Use HuggingFace version
mv frontend/src/pages/NonprofitsHF.tsx frontend/src/pages/Nonprofits.tsx
Option B: Add New Route
Edit frontend/src/App.tsx:
import NonprofitsHF from './pages/NonprofitsHF'
// Add route
<Route path="/nonprofits-hf" element={<NonprofitsHF />} />
Test Locally
cd frontend
npm run dev
# Visit: http://localhost:5173/nonprofits
# or: http://localhost:5173/nonprofits-hf
🔍 Query Examples
Python
from datasets import load_dataset
import pandas as pd
# Load dataset
dataset = load_dataset("CommunityOne/one-nonprofits")
# Get organizations table
orgs = dataset["organizations"]
# Convert to pandas
df = pd.DataFrame(orgs)
# Filter by state
alabama = df[df['state'] == 'AL']
print(f"Alabama nonprofits: {len(alabama):,}")
# Output: Alabama nonprofits: 26,148
# Filter by NTEE (E = Health)
health = df[df['ntee_code'].str.startswith('E', na=False)]
print(f"Health organizations: {len(health):,}")
# Output: Health organizations: 80,000+
# Search for "dental"
dental = df[df['name'].str.contains('dental', case=False, na=False)]
print(f"Dental organizations: {len(dental):,}")
JavaScript/TypeScript
import { searchNonprofits } from '../utils/huggingface'
// Search for dental orgs in California
const results = await searchNonprofits({
dataset: "CommunityOne/one-nonprofits",
query: "dental",
state: "CA",
nteeCode: "E",
limit: 100
})
console.log(`Found ${results.length} dental orgs in California`)
REST API (curl)
# Get first 100 organizations
curl "https://datasets-server.huggingface.co/rows?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&offset=0&length=100"
# Search for "dental"
curl "https://datasets-server.huggingface.co/search?dataset=CommunityOne/one-nonprofits&config=default&split=organizations&query=dental"
# Get dataset size
curl "https://datasets-server.huggingface.co/size?dataset=CommunityOne/one-nonprofits&config=default&split=organizations"
📊 What's in the Dataset?
organizations (main table)
- Records: 1,952,238
- Fields: ein, name, sort_name, city, state, zip_code, street_address, ntee_code, subsection_code, foundation_code, tax_exempt_status, deductibility_status, ruling_date, organization_code, activity_codes, group_exemption, affiliation_code, data_source
financials
- Records: 1,952,238
- Fields: ein, asset_amount, income_amount, revenue_amount, tax_period
programs
- Records: 1,952,238
- Fields: ein, activity_codes, group_exemption, affiliation_code
locations
- Records: 1,952,238
- Fields: ein, street_address, city, state, zip_code
🎯 Key Features
✅ FREE
- Unlimited storage (public datasets)
- No authentication required for reading
- Free bandwidth and API calls
✅ FAST
- CDN-backed by HuggingFace
- Automatic caching
- Pagination built-in (100 rows max per request)
✅ SEARCHABLE
- Full-text search included
- Filter by columns (state, NTEE code, etc.)
- REST API - works from any language
✅ SCALABLE
- 1.9M+ records available instantly
- No database setup required
- Global availability
🛠️ Customization
Change Dataset Name
Edit scripts/upload_nonprofits_to_hf.py:
# Line 84
self.repo_name = repo_name or "YOUR_USERNAME/YOUR_DATASET_NAME"
Then upload:
python scripts/upload_nonprofits_to_hf.py --all --repo "your-username/nonprofits"
Update React Components
Edit frontend/src/pages/NonprofitsHF.tsx:
// Line 115
const DATASET_NAME = "your-username/nonprofits"
📚 Documentation
Full Guide
- Location:
website/docs/guides/huggingface-datasets.md - URL: http://localhost:3000/docs/guides/huggingface-datasets
HuggingFace Docs
- Datasets: https://huggingface.co/docs/datasets
- API: https://huggingface.co/docs/datasets-server
- Hub: https://huggingface.co/docs/hub
IRS Data Source
- EO-BMF: https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf
- Search Tool: https://www.irs.gov/charities-non-profits/tax-exempt-organization-search
🔧 Troubleshooting
Error: "Hugging Face token required"
Solution:
export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN"
# Or add to .env file
Error: "File not found: nonprofits_organizations.parquet"
Solution: Generate the gold tables first:
python scripts/create_all_gold_tables.py --nonprofits-only --use-irs --download-all-irs
Error: "Repository does not exist"
Solution: Change the repo name or create it manually:
- Visit: https://huggingface.co/new-dataset
- Name:
one-nonprofits - License: CC0-1.0 (Public Domain)
- Click "Create"
Dataset shows 0 rows
Solution: Wait 5-10 minutes after upload for HuggingFace to process the dataset. Then refresh the viewer.
🎉 Next Steps
- Upload datasets:
python scripts/upload_nonprofits_to_hf.py --all - Test API: Visit https://huggingface.co/datasets/CommunityOne/one-nonprofits
- Update React app: Use
NonprofitsHF.tsxexample - Add features:
- Map visualization with locations table
- Financial charts with financials table
- Advanced filters (subsection_code, foundation_code)
- Autocomplete search
- Export to CSV
📧 Support
- Documentation:
website/docs/guides/huggingface-datasets.md - HuggingFace Support: https://discuss.huggingface.co
- IRS EO-BMF Guide:
website/docs/data-sources/irs-bulk-data.md
Happy querying! 🚀