⚠️ HUGGING FACE FILE LIMITS & SOLUTIONS
IMPORTANT: Don't upload individual PDFs! Use structured formats instead.
🚨 THE PROBLEM
Hugging Face Limits:
Files per folder: < 10,000 recommended
Total files per repo: < 100,000 recommended
Large-scale handling: Use WebDataset or Parquet, NOT individual files
Your Scale:
22,000 jurisdictions × 1,000 documents each = 22 MILLION files
❌ This would BREAK Hugging Face limits!
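The arithmetic behind that warning can be checked in a few lines. This is a minimal sketch that uses the limits and scale figures quoted above; the 1,000 documents per jurisdiction is this guide's rough average, not a measured value.

```python
# Recommended Hub limits quoted above, and the project scale from this guide.
MAX_FILES_PER_FOLDER = 10_000
MAX_FILES_PER_REPO = 100_000

jurisdictions = 22_000
docs_per_jurisdiction = 1_000  # rough average assumed in this guide

total_files = jurisdictions * docs_per_jurisdiction
over_repo_limit = total_files / MAX_FILES_PER_REPO

print(f"Total files: {total_files:,}")                    # 22,000,000
print(f"Repo limit exceeded by: {over_repo_limit:.0f}x")  # 220x

# One folder per jurisdiction would also put 22,000 entries in a single parent folder.
too_many_folders = jurisdictions > MAX_FILES_PER_FOLDER
print(f"Top-level folders over per-folder limit: {too_many_folders}")  # True
```

Even with one folder per jurisdiction, both the per-repo and the per-folder recommendations are exceeded.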
✅ THE SOLUTION: PARQUET FORMAT
Instead of uploading 22 million PDFs, store extracted data in Parquet files.
Why Parquet?
- ✅ Efficient - Columnar storage, highly compressed
- ✅ Scalable - Handle millions of rows in single file
- ✅ Fast - Optimized for filtering and querying
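- ✅ Native - `push_to_hub` uploads Parquet, and the Hub's dataset viewer reads it directly
- ✅ Small - storing extracted text instead of PDFs cuts size by orders of magnitude (about 1,000x in the estimates below)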
Size Comparison:
❌ Bad: 22 million PDF files (30 TB)
- Exceeds 100k file limit by 220x
- Slow to upload/download
- Impossible to manage
✅ Good: 220 Parquet files (25 GB compressed)
- 1 file per jurisdiction type per state
- Fast to query
- Easy to manage
- Within all limits
📊 RECOMMENDED STRUCTURE
Option 1: Parquet Files (RECOMMENDED)
Store all text content in Parquet tables:
import pandas as pd
from datasets import Dataset

# Instead of storing individual PDFs, collect one row (dict) per meeting.
# `all_jurisdictions` is assumed to be an iterable of objects with a `.meetings` list.
meetings_data = []
for jurisdiction in all_jurisdictions:
    for meeting in jurisdiction.meetings:
        # Example values shown; in practice, fill these from `jurisdiction` and `meeting`
        meetings_data.append({
            'jurisdiction_name': 'Tuscaloosa',
            'state': 'AL',
            'meeting_date': '2025-03-15',
            'meeting_title': 'City Council Regular Meeting',
            'agenda_text': 'extracted text from PDF...',  # ← TEXT, not PDF bytes
            'minutes_text': 'extracted minutes...',
            'video_url': 'https://youtube.com/watch?v=...',  # ← LINK, not video
            'source_url': 'https://tuscaloosaal.suiteonemedia.com/agenda.pdf',
            'keywords_found': ['fluoride', 'dental'],
            'is_oral_health_related': True
        })

# Convert to DataFrame
df = pd.DataFrame(meetings_data)

# Save as Parquet (snappy: fast, moderate compression)
df.to_parquet('meetings_all.parquet', compression='snappy')

# Upload to Hugging Face
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/oral-health-policy-data", split="meetings")
File structure on Hugging Face:
your-dataset/
├── discovery.parquet # 1 file, ~1 GB (22k jurisdictions)
├── meetings.parquet # 1 file, ~10 GB (500k meetings)
├── oral_health.parquet # 1 file, ~2 GB (50k relevant docs)
└── README.md
Total: 3 files, 13 GB ✅ (vs 22 million files, 30 TB ❌)
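The `keywords_found` and `is_oral_health_related` columns in the example rows have to be computed from the extracted text somehow. One possible approach is a whole-word, case-insensitive keyword match, sketched below. The keyword list and the function name `tag_oral_health` are illustrative assumptions, not part of the original pipeline.

```python
import re

# Hypothetical keyword list; replace with the project's real vocabulary.
ORAL_HEALTH_KEYWORDS = ["fluoride", "fluoridation", "dental", "oral health"]

# One compiled pattern, whole-word and case-insensitive.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ORAL_HEALTH_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def tag_oral_health(text: str) -> dict:
    """Return the two keyword columns for one document's extracted text."""
    found = sorted({m.group(1).lower() for m in _PATTERN.finditer(text)})
    return {"keywords_found": found, "is_oral_health_related": bool(found)}


row = tag_oral_health("Council voted on water Fluoride levels and a dental clinic grant.")
print(row)
```

The returned dict can be merged into each meeting row before the DataFrame is built, so the oral health subset is a simple filter on `is_oral_health_related`.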
🎯 CORRECT WORKFLOW
❌ WRONG: Download & Upload PDFs
# DON'T DO THIS!
for jurisdiction in all_jurisdictions:
    for meeting in get_meetings(jurisdiction):
        # Download PDF
        pdf_bytes = download_pdf(meeting.pdf_url)
        # Upload to Hugging Face
        upload_file(pdf_bytes, f"pdfs/{jurisdiction.name}/{meeting.id}.pdf")
# ❌ Results in 22 million files!
✅ CORRECT: Extract & Store Text in Parquet
# DO THIS!
import io

import pandas as pd
from datasets import Dataset
from PyPDF2 import PdfReader  # PyPDF2's successor package is `pypdf` (from pypdf import PdfReader)

all_meetings = []
for jurisdiction in all_jurisdictions:
    for meeting in get_meetings(jurisdiction):
        # Download PDF temporarily (held in memory only)
        pdf_bytes = download_pdf(meeting.pdf_url)

        # Extract text (don't store PDF!)
        pdf_reader = PdfReader(io.BytesIO(pdf_bytes))
        # `or ""` guards against pages with no extractable text (e.g. scanned images)
        text = "\n".join(page.extract_text() or "" for page in pdf_reader.pages)

        # Store metadata + text (not PDF bytes)
        all_meetings.append({
            'id': f"{jurisdiction.name}_{meeting.date}_{meeting.id}",
            'jurisdiction': jurisdiction.name,
            'state': jurisdiction.state,
            'date': meeting.date,
            'title': meeting.title,
            'text': text,  # ← Extracted text
            'source_pdf_url': meeting.pdf_url,  # ← Link to original
            'file_size_kb': len(pdf_bytes) // 1024,
            'page_count': len(pdf_reader.pages)
        })

        # Drop the reference to the PDF bytes so the memory can be reclaimed
        del pdf_bytes

# Save all to single Parquet file
df = pd.DataFrame(all_meetings)
df.to_parquet('all_meetings.parquet', compression='snappy')

# Upload Parquet instead of 22 million PDFs
# (push_to_hub splits large datasets into several Parquet shards)
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("username/oral-health-meetings")
Result:
- ✅ 1 file (not 22 million)
- ✅ 10 GB (not 30 TB)
- ✅ Fast queries
- ✅ Easy downloads
📦 PARTITIONED PARQUET (For Very Large Datasets)
If you have more than 50 GB of data, partition by state:
import pandas as pd
from pathlib import Path
from datasets import Dataset, DatasetDict

# Process state by state
for state in all_states:
    state_meetings = []
    for jurisdiction in get_jurisdictions(state):
        # Extract meetings for this jurisdiction
        meetings = process_jurisdiction(jurisdiction)
        state_meetings.extend(meetings)
    # Save one Parquet per state
    df = pd.DataFrame(state_meetings)
    df.to_parquet(f'meetings_{state}.parquet')

# Upload to Hugging Face with state-based splits
dataset_dict = {}
for state_file in Path('.').glob('meetings_*.parquet'):
    state = state_file.stem.split('_')[1]
    df = pd.read_parquet(state_file)
    dataset_dict[state] = Dataset.from_pandas(df)

# Upload all states
datasets = DatasetDict(dataset_dict)
datasets.push_to_hub("username/oral-health-meetings")
File structure:
your-dataset/
├── data/
│   ├── AL-00000-of-00001.parquet # Alabama meetings
│   ├── CA-00000-of-00001.parquet # California meetings
│   ├── TX-00000-of-00001.parquet # Texas meetings
│   ...
└── README.md
Total: about 50 files (one per state; `push_to_hub` may split a large state into several shards) ✅
Load specific state:
from datasets import load_dataset
# Load only the Alabama split
al_data = load_dataset("username/oral-health-meetings", split="AL")
🗜️ COMPRESSION COMPARISON
Parquet Compression:
# Same data, different compression (sizes are illustrative)
df.to_parquet('meetings_snappy.parquet', compression='snappy') # Fast, good compression
# Size: 8 GB
df.to_parquet('meetings_gzip.parquet', compression='gzip') # Slower, better compression
# Size: 5 GB
df.to_parquet('meetings_brotli.parquet', compression='brotli') # Slowest, best compression
# Size: 3 GB
Recommendation: Use snappy (default) - good balance of speed and size.
🔢 SIZE ESTIMATES
Real Numbers for 22,000 Jurisdictions:
| Data Type | Storage Method | Files | Size |
|---|---|---|---|
| PDFs (raw) | Individual files | 22M | 30 TB ❌ |
| PDFs (text) | Parquet | 50 | 25 GB ✅ |
| Oral health subset | Parquet | 1 | 5 GB ✅ |
| Discovery results | Parquet | 1 | 1 GB ✅ |
Total storage needed: ~30 GB (not 30 TB!) ✅
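As a sanity check on the table, the per-document averages implied by these estimates can be worked out directly. The sketch below uses only the figures above, with decimal units (1 GB = 10^9 bytes) as an assumption.

```python
TB = 10**12
GB = 10**9
documents = 22_000_000

raw_pdf_bytes = 30 * TB
parquet_bytes = {
    "meeting text": 25 * GB,
    "oral health subset": 5 * GB,
    "discovery results": 1 * GB,
}

total_parquet = sum(parquet_bytes.values())
avg_pdf_mb = raw_pdf_bytes / documents / 10**6
avg_text_kb = parquet_bytes["meeting text"] / documents / 10**3
reduction = raw_pdf_bytes / total_parquet

print(f"Parquet total: {total_parquet / GB:.0f} GB")     # 31 GB
print(f"Average PDF: {avg_pdf_mb:.2f} MB")               # 1.36 MB
print(f"Average compressed text: {avg_text_kb:.2f} KB")  # 1.14 KB
print(f"Reduction: {reduction:.0f}x")                    # 968x
```

Roughly 1.1 KB of compressed text per document is small for a multi-page agenda, so the 25 GB figure is best treated as a rough estimate to re-measure on real data.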
💡 ALTERNATIVE: WebDataset Format
For image-heavy or binary data, use WebDataset .tar files:
import json
import webdataset as wds

# Create sharded tar files (up to 10,000 samples per shard)
sink = wds.ShardWriter("meetings-%06d.tar", maxcount=10000)
for jurisdiction in all_jurisdictions:
    for meeting in jurisdiction.meetings:
        # Extract text from PDF (`extract_text` is a project helper, not part of webdataset)
        text = extract_text(meeting.pdf_url)
        sink.write({
            # Keys should not contain '.', which separates the key from the extension
            "__key__": f"{jurisdiction.name}_{meeting.id}",
            "txt": text.encode('utf-8'),
            "json": json.dumps(meeting.metadata).encode('utf-8')
        })
sink.close()
# Results in:
# meetings-000000.tar (10k documents)
# meetings-000001.tar (10k documents)
# ...
# meetings-002199.tar (last 10k documents)
# Total: ~2,200 tar files ✅ (under 10k file limit per folder)
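The shard count follows from the document total and `maxcount`. Here is a short worked check, assuming exactly 22 million documents and shard numbering that starts at 0:

```python
import math

total_documents = 22_000_000
maxcount = 10_000  # samples per shard, as passed to ShardWriter above

num_shards = math.ceil(total_documents / maxcount)
# Shards are numbered from 0, so the last index is num_shards - 1.
last_shard = "meetings-%06d.tar" % (num_shards - 1)

print(num_shards)  # 2200
print(last_shard)  # meetings-002199.tar
```

At 2,200 shards the output stays well under the 10,000 files-per-folder recommendation, even in a single directory.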
🎯 RECOMMENDED APPROACH
For Your Project:
1. Store Metadata + Text in Parquet (Primary)
# Structure your data
meetings_df = pd.DataFrame({
    'id': [...],
    'jurisdiction': [...],
    'state': [...],
    'date': [...],
    'title': [...],
    'agenda_text': [...],  # Extracted text
    'minutes_text': [...],  # Extracted text
    'source_url': [...],  # Link to original PDF
    'video_url': [...],  # Link to YouTube
    'oral_health_keywords': [...]
})
# Save as Parquet
meetings_df.to_parquet('meetings.parquet', compression='snappy')
# Upload to Hugging Face (1 file, ~10 GB)
dataset = Dataset.from_pandas(meetings_df)
dataset.push_to_hub("username/oral-health-meetings")
2. Partition by State (If >50 GB)
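from datasets import Dataset, DatasetDict

# One Parquet per state
for state in all_states:
    state_df = meetings_df[meetings_df['state'] == state]
    state_df.to_parquet(f'meetings_{state}.parquet', index=False)
# Upload with one split per state
dataset_dict = DatasetDict({
    state: Dataset.from_pandas(pd.read_parquet(f'meetings_{state}.parquet'))
    for state in all_states
})
dataset_dict.push_to_hub("username/oral-health-meetings")
# Total: 50 files (one per state) ✅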
3. Never Upload Individual PDFs
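# ❌ NEVER do this
for pdf in all_pdfs:
    upload_file(pdf)  # Results in millions of files
# ✅ ALWAYS do this
rows = []  # DataFrame.append was removed in pandas 2.0, so collect dicts in a list
for pdf in all_pdfs:
    text = extract_text(pdf)
    rows.append({'text': text, 'source_url': pdf.url})  # `pdf.url` is a placeholder for the stored link
pd.DataFrame(rows).to_parquet('data.parquet')  # One file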
📚 UPDATED UPLOAD SCRIPT
#!/usr/bin/env python3
"""
Correctly upload large-scale data to Hugging Face using Parquet format.
"""
import io
from pathlib import Path

import pandas as pd
from datasets import Dataset
from huggingface_hub import login
from PyPDF2 import PdfReader

# NOTE: `all_jurisdictions`, `download_pdf`, and `extract_date` are project-specific
# and must be defined or imported before running this script.


def process_and_upload_correct_way():
    """Process jurisdictions and upload as Parquet (not individual files)."""
    all_meetings = []

    # Process all jurisdictions
    for jurisdiction in all_jurisdictions:
        print(f"Processing {jurisdiction.name}...")
        for agenda_url in jurisdiction.agenda_urls:
            # Download PDF temporarily
            pdf_bytes = download_pdf(agenda_url)

            # Extract text; `or ""` covers pages with no extractable text
            pdf_reader = PdfReader(io.BytesIO(pdf_bytes))
            text = "\n".join(page.extract_text() or "" for page in pdf_reader.pages)

            # Store metadata + text (NOT PDF bytes)
            all_meetings.append({
                'jurisdiction': jurisdiction.name,
                'state': jurisdiction.state,
                'date': extract_date(text),
                'text': text,
                'source_url': agenda_url,
                'page_count': len(pdf_reader.pages)
            })

            # Delete PDF immediately to keep memory use low
            del pdf_bytes

    # Convert to DataFrame
    df = pd.DataFrame(all_meetings)

    # Save as Parquet (compressed)
    df.to_parquet('all_meetings.parquet', compression='snappy')
    print(f"Total meetings: {len(df)}")
    print(f"File size: {Path('all_meetings.parquet').stat().st_size / 1e9:.2f} GB")

    # Upload to Hugging Face as Parquet instead of millions of PDFs
    dataset = Dataset.from_pandas(df)
    dataset.push_to_hub("username/oral-health-meetings")
    print("✅ Uploaded all meetings as Parquet!")


if __name__ == "__main__":
    login()  # prompts for a Hugging Face access token
    process_and_upload_correct_way()
✅ SUMMARY
Do This:
- ✅ Extract text from PDFs (don't store PDF bytes)
- ✅ Store in Parquet format (1-50 files total)
- ✅ Link to original sources (not duplicate content)
- ✅ Compress with snappy
- ✅ Partition by state if >50 GB
Don't Do This:
- ❌ Upload individual PDFs (millions of files)
- ❌ Store video files (link to YouTube)
- ❌ Duplicate raw content
- ❌ Exceed 100k file limit
- ❌ Use uncompressed formats
Result:
- 22 million files → 50 files ✅
- 30 TB → 30 GB ✅
- Slow uploads → Fast uploads ✅
- Hard to manage → Easy to manage ✅
- Expensive → FREE ✅
You can store ALL 22,000 jurisdictions in ~50 Parquet files totaling 30 GB!