
🚀 QUICK START: FREE STORAGE WITH HUGGING FACE

TL;DR: Store unlimited data for FREE on Hugging Face!

⚠️ IMPORTANT: Use Parquet format, NOT individual PDFs! See file limits guide


⚡ 3-MINUTE SETUP

1. Create Hugging Face Account (1 minute)

# Go to https://huggingface.co/join
# Sign up (FREE)
# Verify email

2. Get API Token (1 minute)

# Go to https://huggingface.co/settings/tokens
# Click "New token"
# Name it "oral-health-upload"
# Token Type: Write (required for publishing datasets)
# Repository permissions: All repositories
# Copy the token (hf_xxxxxxxxxxxx)

⚠️ Important: Token Permissions

  • Write access is required to publish datasets
  • Read access only allows downloading public datasets
  • For this project: use a Write token so you can publish your scraped data

3. Install & Login (1 minute)

pip install huggingface_hub datasets

# Set your token
export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN_HERE"
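
A quick way to confirm the token works before uploading (a minimal check using huggingface_hub's whoami; assumes the environment variable set above):

import os
from huggingface_hub import HfApi

# Ask the Hub who this token belongs to; fails fast if the token is invalid
api = HfApi(token=os.environ["HUGGINGFACE_TOKEN"])
print(f"Logged in as: {api.whoami()['name']}")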

⚠️ CRITICAL: FILE LIMITS

Hugging Face Limits:

  • Files per folder: <10,000
  • Total files per repo: <100,000
  • For large datasets: Use Parquet or WebDataset format

Your Scale:

  • 22,000 jurisdictions × 1,000 docs = 22 MILLION files ❌

Solution:

  • Extract text from PDFs
  • Store in Parquet format
  • Result: 50 files instead of 22 million ✅
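
As a sketch of what "text in Parquet" looks like (hypothetical records; the real pipeline lives in scripts/upload_to_huggingface.py):

import pandas as pd

# One row per document: extracted text plus a pointer back to the source
records = [
    {"jurisdiction": "Tuscaloosa, AL",
     "source_url": "https://example.gov/agenda1.pdf",
     "text": "Call to order..."},
    # ...thousands more rows, still just one file on disk
]
pd.DataFrame(records).to_parquet("meetings_processed.parquet", index=False)  # needs pyarrow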

See detailed guide: HUGGINGFACE_FILE_LIMITS.md


📤 UPLOAD YOUR DATA

For discovery data:

# Go to your project
cd /home/developer/projects/open-navigator

# Activate environment
source venv/bin/activate

# Upload discovery results
python scripts/upload_to_huggingface.py \
    --repo "YOUR_USERNAME/oral-health-policy-data" \
    --discovery

# View your dataset
# https://huggingface.co/datasets/YOUR_USERNAME/oral-health-policy-data

For meeting PDFs (extract text first!):

# DON'T upload individual PDFs!
# Instead, extract text and save as Parquet

# 1. Create a file with PDF URLs (one per line)
cat > pdf_urls.txt << EOF
https://tuscaloosaal.suiteonemedia.com/agenda1.pdf
https://tuscaloosaal.suiteonemedia.com/agenda2.pdf
...
EOF

# 2. Process PDFs to Parquet (extracts text, deletes PDFs)
python scripts/upload_to_huggingface.py \
    --repo "YOUR_USERNAME/oral-health-policy-data" \
    --process-pdfs pdf_urls.txt
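
The extraction step is roughly this shape (a sketch using pypdf; the script's actual internals may differ):

import requests
from io import BytesIO
from pypdf import PdfReader

# Download one PDF into memory, pull out its text, never touch disk
resp = requests.get("https://example.gov/agenda1.pdf", timeout=30)
reader = PdfReader(BytesIO(resp.content))
text = "\n".join(page.extract_text() or "" for page in reader.pages)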

# 3. Upload the Parquet file (1 file, not thousands!)
python scripts/upload_to_huggingface.py \
    --repo "YOUR_USERNAME/oral-health-policy-data" \
    --meetings meetings_processed.parquet

Or upload manually with the datasets library:

from datasets import Dataset
from huggingface_hub import login
import pandas as pd

# Login
login(token="hf_YOUR_TOKEN")

# Load your data
df = pd.read_csv('data/bronze/discovered_sources/discovery_summary_final.csv')

# Convert to dataset
dataset = Dataset.from_pandas(df)

# Upload to Hugging Face (FREE!)
dataset.push_to_hub("YOUR_USERNAME/oral-health-policy-data", split="discovery")

print("✅ Data uploaded! View at:")
print("https://huggingface.co/datasets/YOUR_USERNAME/oral-health-policy-data")

💰 COST BREAKDOWN

What You Get                          Cost
Unlimited storage (public datasets)   FREE
Unlimited downloads                   FREE
Built-in viewer                       FREE
Version control                       FREE
Search & filtering                    FREE
API access                            FREE
TOTAL                                 $0/month

📊 STORAGE COMPARISON

Bad Approach (Expensive)

❌ Download all videos: 250 TB = $5,000/month
❌ Store all PDFs: 30 TB = $600/month
❌ Total: $5,600/month 💸

Good Approach (FREE)

✅ Store discovery data: 1 GB = FREE
✅ Store extracted text: 25 GB = FREE
✅ Store oral health subset: 5 GB = FREE
✅ Total: $0/month 🎉

Savings: $5,600/month → $0/month


🎯 WHAT TO UPLOAD

✅ Upload These:

  1. Discovery Results (~1 GB)

    • Jurisdiction websites
    • YouTube channels
    • Meeting platforms
    • Social media links
  2. Meeting Metadata (~2 GB)

    • Meeting dates/titles
    • Agenda item lists
    • Source URLs
  3. Extracted Text (~25 GB)

    • Text from PDFs
    • Meeting transcripts
    • Filtered for oral health (see the filtering sketch after the next list)

❌ Don't Upload These:

  1. Videos - Link to YouTube instead
  2. Full PDFs - Store text + URL to original
  3. Website HTML - Just store the data you extracted
  4. Duplicates - Filter first
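
A minimal version of the oral-health filter mentioned above (hypothetical keyword list; tune it for your corpus):

# df is the DataFrame of extracted text from the Parquet step
KEYWORDS = ("oral health", "dental", "dentist", "fluoride", "teeth")

def mentions_oral_health(text: str) -> bool:
    # True if the document mentions any oral-health keyword
    lowered = text.lower()
    return any(kw in lowered for kw in KEYWORDS)

oral_health_df = df[df["text"].apply(mentions_oral_health)]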

📝 EXAMPLE WORKFLOW

Step 1: Run Discovery

# Discover all Alabama jurisdictions
python discovery/comprehensive_discovery_pipeline.py --state AL

# Output: data/bronze/discovered_sources/discovery_summary_AL.csv (~50 KB)
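
A quick sanity check of the output before uploading (path from the comment above):

import pandas as pd

# Confirm the discovery CSV parses and count the rows
df = pd.read_csv("data/bronze/discovered_sources/discovery_summary_AL.csv")
print(f"Discovered {len(df)} Alabama jurisdictions")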

Step 2: Upload to Hugging Face

# Upload discovery results
python scripts/upload_to_huggingface.py \
    --repo "YOUR_USERNAME/oral-health-policy-data" \
    --discovery

Step 3: Free Up Local Space

# Optional: Delete local files (data is safely in cloud)
rm -rf data/bronze/discovered_sources/*.csv

# You can always download from Hugging Face later!

Step 4: Share & Analyze

# Anyone can now use your data (including you!)
from datasets import load_dataset

data = load_dataset("YOUR_USERNAME/oral-health-policy-data", split="discovery")
alabama = data.filter(lambda x: x['state'] == 'AL')

print(f"Alabama jurisdictions: {len(alabama)}")

🔄 CONTINUOUS WORKFLOW

Keep Local Storage Low (~100 MB)

# Process one jurisdiction at a time
import os

# Process one jurisdiction at a time (download_agenda, extract_text,
# and upload_to_hf are placeholders for your pipeline functions)
for jurisdiction in all_jurisdictions:
    # 1. Download PDF (~2 MB)
    pdf = download_agenda(jurisdiction)

    # 2. Extract text (~50 KB)
    text = extract_text(pdf)

    # 3. Upload to Hugging Face
    upload_to_hf(text)

    # 4. Delete the local file
    os.remove(pdf)

# Local storage: Never exceeds 100 MB! ✅
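
If you'd rather not call the upload step yourself, huggingface_hub also ships a CommitScheduler that pushes a local folder to the Hub on a timer (a sketch; adjust the folder and interval to taste):

from huggingface_hub import CommitScheduler

# Every 10 minutes, commit whatever has accumulated in ./staging
scheduler = CommitScheduler(
    repo_id="YOUR_USERNAME/oral-health-policy-data",
    repo_type="dataset",
    folder_path="staging",
    every=10,  # minutes
)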

📚 HUGGING FACE BASICS

Load Your Data Anywhere

from datasets import load_dataset

# Load on your laptop
data = load_dataset("YOUR_USERNAME/oral-health-policy-data")

# Or in Google Colab (FREE GPU)
# Or on a friend's computer
# Or 5 years from now

# Your data is always available, forever, for FREE!

Search & Filter

# Find cities with YouTube channels
with_youtube = data.filter(lambda x: x['youtube_channels'] > 0)

# Find high-quality sources
high_quality = data.filter(lambda x: x['completeness'] > 0.8)

# Find specific state
indiana = data.filter(lambda x: x['state'] == 'IN')

Download Subset

# Only download what you need (save bandwidth)
oral_health_only = load_dataset(
    "YOUR_USERNAME/oral-health-policy-data",
    split="oral_health",  # Only the filtered subset
)

# Maybe only 5 GB instead of 50 GB!
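
To skip the download entirely, datasets can also stream records lazily over the network:

from datasets import load_dataset

# Nothing is downloaded up front; rows arrive as you iterate
stream = load_dataset(
    "YOUR_USERNAME/oral-health-policy-data",
    split="discovery",
    streaming=True,
)
for row in stream.take(5):
    print(row)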

✅ BENEFITS

1. FREE Unlimited Storage

  • No storage limits for public datasets
  • No bandwidth limits
  • No time limits

2. Accessible Anywhere

  • Download from any computer
  • Share with collaborators
  • Use in Google Colab

3. Version Control

  • Git-based system
  • Track all changes
  • Revert if needed
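
Because the repo is git-backed, you can pin a load to an exact version (revision accepts a branch, tag, or commit hash):

from datasets import load_dataset

# Reproducible loads: pin to a specific commit instead of "latest"
data = load_dataset(
    "YOUR_USERNAME/oral-health-policy-data",
    revision="main",  # or a tag / commit hash
)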

4. Discovery

  • Your dataset appears in Hugging Face search
  • Other researchers can use it
  • Builds your portfolio

5. Integration

  • Works with PyTorch, TensorFlow
  • Built-in data viewer
  • API access
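
For the PyTorch integration, one call makes the dataset return tensors on access (requires torch installed; "tf" works the same way for TensorFlow):

from datasets import load_dataset

data = load_dataset("YOUR_USERNAME/oral-health-policy-data", split="discovery")
torch_data = data.with_format("torch")  # columns come back as torch tensors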

🎓 LEARN MORE

Official Docs

  • Hugging Face Datasets: https://huggingface.co/docs/datasets
  • Hugging Face Hub: https://huggingface.co/docs/huggingface_hub


🆘 TROUBLESHOOTING

"Authentication failed"

# Make sure token is set
echo $HUGGINGFACE_TOKEN

# If empty, set it
export HUGGINGFACE_TOKEN="hf_YOUR_TOKEN"

# Or login interactively
huggingface-cli login

"Permission denied"

# Make sure repo name includes your username
# ✅ Correct: "myusername/oral-health-policy-data"
# ❌ Wrong: "oral-health-policy-data"
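
If the repo doesn't exist yet, create it explicitly with the namespaced name (uses your Write token):

from huggingface_hub import create_repo

# repo_type="dataset" is required; exist_ok avoids an error on re-runs
create_repo("myusername/oral-health-policy-data", repo_type="dataset", exist_ok=True)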

"Dataset too large"

# Don't upload raw files!
# Upload processed/filtered data only

# ❌ Bad: Upload 50 GB of PDFs
# ✅ Good: Upload 5 GB of extracted text

🎯 NEXT STEPS

  1. ✅ Create Hugging Face account
  2. ✅ Get API token
  3. ✅ Run discovery for your state
  4. ✅ Upload to Hugging Face
  5. ✅ Delete local files to free space
  6. ✅ Scale to all 22,000+ jurisdictions!

Your data is safe in the cloud, FREE, forever! 🎉


💡 PRO TIP

Make your dataset public (not private):

  • ✅ FREE unlimited storage
  • ✅ Helps research community
  • ✅ Builds your portfolio
  • ✅ Appears in search results

Private datasets are limited to 100 GB and don't help anyone!

Public = Win-Win-Win 🏆