✅ HuggingFace Dataset Sharing Added!
What's New
You can now publish your jurisdiction discovery datasets to HuggingFace Hub for public sharing and collaboration!
🎯 New Capabilities
1. HuggingFace Publisher Module
- File: pipeline/huggingface_publisher.py
- Publishes datasets to HuggingFace Hub
- Supports all discovery data layers (Bronze/Silver/Gold)
2. CLI Command
python main.py publish-to-hf --dataset all
3. Five Publishable Datasets
- census-gid - Census Bureau GID (90,735 jurisdictions)
- gov-domains - CISA .gov domains (15,000+)
- nces-schools - NCES school districts (13,000+)
- discovered-urls - Discovered URLs with metadata
- scraping-targets - Prioritized scraping targets
📦 Files Added/Updated
New Files
- ✅ pipeline/huggingface_publisher.py - HuggingFace publisher (~400 lines)
- ✅ docs/HUGGINGFACE_PUBLISHING.md - Complete publishing guide
Updated Files
- ✅ requirements.txt - Added datasets>=2.16.0 and huggingface-hub>=0.20.0
- ✅ config/settings.py - Added huggingface_token, hf_organization, hf_dataset_prefix
- ✅ .env.example - Added HuggingFace configuration
- ✅ main.py - Added publish-to-hf CLI command
- ✅ README.md - Added HuggingFace publishing section
🚀 Quick Start
1. Get HuggingFace Token
Visit: https://huggingface.co/settings/tokens
Create a Write token
2. Configure
Add to .env:
HUGGINGFACE_TOKEN=hf_your_write_token_here
HF_ORGANIZATION=CommunityOne
HF_DATASET_PREFIX=oral-health-policy-pulse
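As an illustration of how these three values might be read at runtime, here is a minimal sketch using only the standard library. It is not the project's config/settings.py: the HFSettings class and load_hf_settings function are hypothetical names, and the sketch assumes the .env values have already been exported into the process environment (for example by a loader such as python-dotenv).

```python
import os
from dataclasses import dataclass

# Placeholder value shown in the .env example; treated as "not configured".
PLACEHOLDER_TOKEN = "hf_your_write_token_here"


@dataclass(frozen=True)
class HFSettings:
    """Hypothetical container for the three values set in .env."""
    token: str
    organization: str
    dataset_prefix: str


def load_hf_settings(env=None):
    """Read the HuggingFace settings from an environment mapping.

    Raises ValueError when the token is missing or still the placeholder.
    """
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    if not token or token == PLACEHOLDER_TOKEN:
        raise ValueError("HUGGINGFACE_TOKEN is not set to a real token")
    return HFSettings(
        token=token,
        organization=env.get("HF_ORGANIZATION", "").strip(),
        dataset_prefix=env.get("HF_DATASET_PREFIX", "").strip(),
    )
```

Failing early on a missing or placeholder token avoids discovering the problem only when the upload is rejected.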
3. Install Dependencies
pip install "datasets>=2.16.0" "huggingface-hub>=0.20.0"
4. Publish
# Publish all datasets
python main.py publish-to-hf --dataset all
# Or publish individually
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset discovered-urls
📊 What Gets Published
Dataset URLs
Your datasets will be available at:
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-census-gid
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-gov-domains
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-nces-schools
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-discovered-urls
- https://huggingface.co/datasets/CommunityOne/oral-health-policy-pulse-scraping-targets
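The URLs above appear to follow the pattern organization/prefix-name, built from HF_ORGANIZATION and HF_DATASET_PREFIX. The sketch below reproduces that pattern; dataset_repo_id and dataset_url are hypothetical helper names, not functions from the publisher module.

```python
# Dataset names as listed in this document.
DATASET_NAMES = [
    "census-gid",
    "gov-domains",
    "nces-schools",
    "discovered-urls",
    "scraping-targets",
]


def dataset_repo_id(organization, prefix, name):
    """Combine organization, prefix, and dataset name into a Hub repo ID."""
    return f"{organization}/{prefix}-{name}"


def dataset_url(repo_id):
    """Return the public dataset page URL for a repo ID."""
    return f"https://huggingface.co/datasets/{repo_id}"


if __name__ == "__main__":
    for name in DATASET_NAMES:
        repo_id = dataset_repo_id("CommunityOne", "oral-health-policy-pulse", name)
        print(dataset_url(repo_id))
```

Changing HF_ORGANIZATION or HF_DATASET_PREFIX in .env would change every URL in the list accordingly, assuming the publisher follows this pattern.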
Public Access
Anyone can load your datasets:
from datasets import load_dataset
# Load census data
census = load_dataset("CommunityOne/oral-health-policy-pulse-census-gid")
# Load discovered URLs
urls = load_dataset("CommunityOne/oral-health-policy-pulse-discovered-urls")
# Access specific split
counties = census["counties"]
print(f"Total counties: {len(counties)}")
💡 Use Cases
For Researchers
# Analyze jurisdiction coverage
from datasets import load_dataset
census = load_dataset("CommunityOne/oral-health-policy-pulse-census-gid")
# Convert the municipalities split to a pandas DataFrame
df = census["municipalities"].to_pandas()
# Total municipal population by state, largest first
population_by_state = df.groupby("state_name")["population"].sum().sort_values(ascending=False)
print(population_by_state.head())
For Civic Hackers
# Get all county .gov domains
from datasets import load_dataset
domains = load_dataset("CommunityOne/oral-health-policy-pulse-gov-domains")
counties = domains.filter(lambda x: x['Domain Type'] == 'County')
For Data Scientists
# High-confidence discovered URLs (confidence_score above 0.8)
from datasets import load_dataset
urls = load_dataset("CommunityOne/oral-health-policy-pulse-discovered-urls")
high_conf = urls.filter(lambda x: x['confidence_score'] > 0.8)
🔄 Update Workflow
After Each Discovery Run
# Run discovery
python main.py discover-jurisdictions
# Publish updated datasets
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset scraping-targets
Monthly Source Data Updates
# Re-ingest source data
python main.py discover-jurisdictions
# Publish refreshed datasets
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset gov-domains
python main.py publish-to-hf --dataset nces-schools
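Both workflows above run discovery first and then publish a fixed set of datasets, so they could be scripted. The following is a rough sketch, assuming it runs from the repository root next to main.py; build_commands and run_all are hypothetical helpers, and dry_run defaults to True so nothing is executed unless requested.

```python
import subprocess
import sys

# Dataset groups taken from the two workflows described above.
DISCOVERY_OUTPUTS = ["discovered-urls", "scraping-targets"]
SOURCE_DATASETS = ["census", "gov-domains", "nces-schools"]


def build_commands(datasets, python=sys.executable):
    """Return the discovery command followed by one publish command per dataset."""
    commands = [[python, "main.py", "discover-jurisdictions"]]
    for name in datasets:
        commands.append([python, "main.py", "publish-to-hf", "--dataset", name])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(build_commands(DISCOVERY_OUTPUTS))
```

Stopping at the first failure means a broken discovery run does not publish stale or partial data.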
🎯 CLI Options
# Publish all datasets
python main.py publish-to-hf --dataset all
# Publish specific dataset
python main.py publish-to-hf --dataset census
python main.py publish-to-hf --dataset gov-domains
python main.py publish-to-hf --dataset nces-schools
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset scraping-targets
# Make datasets private
python main.py publish-to-hf --dataset all --private
# Sample census data (faster for testing)
python main.py publish-to-hf --dataset census --sample
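For readers who want to see how these options fit together, here is one way the command-line surface could be modeled with argparse. This is only a sketch: the real main.py may use a different CLI library, and whether --dataset has a default is not stated here, so this version simply requires it.

```python
import argparse

# Accepted values for --dataset, as listed in the options above.
DATASET_CHOICES = [
    "all",
    "census",
    "gov-domains",
    "nces-schools",
    "discovered-urls",
    "scraping-targets",
]


def build_parser():
    """Build a parser that mirrors the publish-to-hf options shown above."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    publish = subparsers.add_parser(
        "publish-to-hf", help="Publish datasets to HuggingFace Hub"
    )
    publish.add_argument("--dataset", choices=DATASET_CHOICES, required=True)
    publish.add_argument(
        "--private", action="store_true", help="Create private datasets"
    )
    publish.add_argument(
        "--sample", action="store_true", help="Publish a sample (faster for testing)"
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["publish-to-hf", "--dataset", "census", "--sample"]
    )
    print(args.command, args.dataset, args.private, args.sample)
```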
🔒 Privacy & Security
What's Safe to Publish
✅ Public Data:
- Census Bureau GID (already public)
- CISA .gov domains (already public)
- NCES school districts (already public)
- Discovered government URLs (public websites)
- Scraping targets (public information)
⚠️ Use --private for:
- Scraped meeting minutes content
- Internal analysis results
- Custom annotations
❌ Never Publish:
- Personal information (PII)
- API keys or tokens
- Internal comments/notes
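A quick scan before publishing can catch some of the items above. The sketch below is a rough heuristic, not part of the publisher module: find_sensitive_fields is a hypothetical helper, the token pattern (hf_ followed by at least 20 alphanumeric characters) is an assumption about token shape, and the email pattern only catches one common kind of PII.

```python
import re

# Assumption: tokens look like "hf_" followed by 20 or more alphanumerics.
TOKEN_PATTERN = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")
# Simple email matcher; it will not catch every form of personal information.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_sensitive_fields(record):
    """Return (field, kind) pairs for string fields that look sensitive."""
    hits = []
    for key, value in record.items():
        if not isinstance(value, str):
            continue
        if TOKEN_PATTERN.search(value):
            hits.append((key, "token"))
        if EMAIL_PATTERN.search(value):
            hits.append((key, "email"))
    return hits


if __name__ == "__main__":
    sample = {
        "url": "https://example.gov",
        "note": "contact jane@example.org",
    }
    print(find_sensitive_fields(sample))
```

A clean result from a check like this does not prove a dataset is safe to publish; it only flags the most obvious leaks for manual review.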
Token Security
- Store token in .env file (gitignored)
- Use write token (not fine-grained)
- Revoke token if compromised
📚 Documentation
Complete guide: HUGGINGFACE_PUBLISHING.md
Covers:
- Detailed setup instructions
- Dataset structure and schemas
- Programmatic publishing in Python
- Loading datasets in Python/R
- Collaboration features
- Troubleshooting
🌍 Community Impact
By publishing your datasets, you enable:
- 📊 Reproducible research on government accessibility
- 🤝 Cross-project collaboration
- 🔍 Discovery of missing government websites
- 📈 Tracking government digital infrastructure over time
- 🎓 Educational use for civic tech training
Your jurisdiction discovery data helps the entire civic tech community! 🙏
✅ Benefits
| Feature | Before | After |
|---|---|---|
| Data Storage | Local only | Local + HuggingFace Hub |
| Data Sharing | Manual export | One-command publish |
| Collaboration | Email/Dropbox | Public datasets with versioning |
| Discovery | None | Searchable on HuggingFace |
| Access | Your team only | Anyone worldwide |
| Versioning | Manual | Automatic Git-style tracking |
Ready to share your jurisdiction discovery data with the world! 🌍🦷✨