✅ Confirmed: HuggingFace Datasets That WILL Help
Quick Answer: YES! MeetingBank is a new, immediately useful find; LocalView and CDP are already covered; CivicBand has limited access.
| Dataset | Status | Usefulness | Priority |
|---|---|---|---|
| MeetingBank | ✅ READY TO USE | 🔥 VERY HIGH | USE IMMEDIATELY |
| LocalView | ✅ Already covered | HIGH | Download from Harvard |
| Council Data Project | ✅ Already covered | HIGH | Already integrated |
| CivicBand | ⚠️ Limited access | MEDIUM | Scrape municipality list |
1. MeetingBank 🔥 (NEW! USE THIS!)
What It Is:
A benchmark dataset of city council meetings from 6 major U.S. cities, built specifically for meeting summarization research
URLs:
- HuggingFace (text): https://huggingface.co/datasets/huuuyeah/meetingbank
- HuggingFace (audio): https://huggingface.co/datasets/huuuyeah/MeetingBank_Audio
- Zenodo (all files): https://zenodo.org/record/7989108
- Archive.org (videos):
What You Get:
✅ 1,366 city council meetings from 6 cities:
- Alameda, CA
- Boston, MA
- Denver, CO
- King County, WA
- Long Beach, CA
- Seattle, WA
✅ 3,579 hours of video
✅ Full transcripts (average 28,000 tokens per meeting)
✅ PDF meeting minutes & agendas
✅ Human-written summaries (ground truth for evaluation)
✅ Machine-generated summaries (from 6 different systems)
✅ 6,892 segment-level summarization instances for training
Why This Is PERFECT for Your Project:
1. Immediate prototyping: Download from HuggingFace in 5 minutes

   from datasets import load_dataset

   meetingbank = load_dataset("huuuyeah/meetingbank")
   for instance in meetingbank['train']:
       print(instance['id'])
       print(instance['summary'])
       print(instance['transcript'])

2. Quality validation: Compare your AI summarization against human-written summaries
3. URL discovery: Each meeting has source URLs to city websites
4. Benchmark your oral health keyword detection: Test against 1,366 real transcripts
5. Training data: If you want to fine-tune models for oral health policy
Paper:
"MeetingBank: A Benchmark Dataset for Meeting Summarization"
ACL 2023 (Association for Computational Linguistics)
https://arxiv.org/abs/2305.17529
🎯 ACTION PLAN:
# 1. Install HuggingFace datasets
pip install datasets
# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'Loaded {len(meetingbank[\"train\"])} training instances')
"
# 3. Create discovery/meetingbank_ingestion.py
# - Parse meetings
# - Extract URLs
# - Load to Bronze layer
# - Run keyword detection on transcripts
# - Evaluate against human summaries
Expected ROI:
- Time: 2 hours to integrate
- Value: 1,366 meetings with transcripts + summaries + URLs
- Quality: Academic benchmark (peer-reviewed, ACL published)
- Coverage: 6 major cities (all large, high-value for advocacy)
2. LocalView ✅ (Already Covered)
Status: Already identified in previous investigation
Location: Harvard Dataverse (doi:10.7910/DVN/NJTBEM)
Coverage: 1,000-10,000 jurisdictions
Action: Download from Harvard (already documented)
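If you want to script the pull rather than clicking through Dataverse, here is a minimal sketch using Dataverse's standard Data Access API (the bulk-zip endpoint is documented Dataverse behavior; very large datasets can exceed the server's zip limit, in which case download files individually):

# Sketch: download the LocalView bundle from Harvard Dataverse
# via the documented Data Access API (dataset-as-zip endpoint).
import requests

DOI = "doi:10.7910/DVN/NJTBEM"
url = "https://dataverse.harvard.edu/api/access/dataset/:persistentId/"

with requests.get(url, params={"persistentId": DOI}, stream=True, timeout=60) as r:
    r.raise_for_status()
    with open("localview.zip", "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MB chunks
            f.write(chunk)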
3. Council Data Project ✅ (Already Covered)
Status: Already integrated in external_url_datasets.py
Coverage: 20+ cities with full pipelines
Action: Already coded, just run the script
4. CivicBand ⚠️ (Limited Usefulness)
What It Is:
"Largest public collection of civic meeting and election finance data"
Website: https://civic.band/
What Exists:
✅ 1,031 municipalities tracked
✅ Millions of pages scraped (meeting minutes, agendas)
✅ Search interface available
✅ Publicly browsable
The Problem:
❌ "Dataset access is via their platform; raw dumps require coordination"
- Can't directly download bulk URL list
- Would need to contact founder (Philip James: hello@civic.band)
- Or scrape the municipality list from their website
What You CAN Get:
The list of 1,031 municipalities is publicly visible on their site. You could:
- Scrape the municipality list (city names + states)
- Match against your Census data to get FIPS codes
- Use as verification (these 1,031 are confirmed to have meeting data)
Limited Value Because:
- Can't get direct URLs (need to coordinate with founder)
- Already have larger coverage from LocalView (1,000-10,000 jurisdictions)
- Already have premium coverage from CDP (20 cities)
- CivicBand's main value is their content (scraped minutes), not URLs
Possible Action:
# Scrape CivicBand's municipality list.
# Note: the selector below is an assumption; inspect the live page
# to find the element that actually lists municipalities.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://civic.band/", timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

municipality_names = [el.get_text(strip=True) for el in soup.select("a")]
# Next: match names against Census data and use as a validation list
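For the Census step, here is a hedged sketch of the matching logic (the file name census_places.csv and the column names place_name, state_abbr, and place_fips are placeholders; adjust to whatever place/FIPS extract you already use):

# Sketch: match scraped municipality names against a Census places file.
import pandas as pd

scraped = pd.DataFrame({"name": ["Alameda, CA", "Boston, MA"]})
scraped[["city", "state"]] = scraped["name"].str.split(", ", expand=True)

census = pd.read_csv("census_places.csv")  # placeholder path
matched = scraped.merge(
    census,
    left_on=["city", "state"],
    right_on=["place_name", "state_abbr"],  # placeholder column names
    how="left",
)
print(matched[["city", "state", "place_fips"]])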
Estimated value: MEDIUM (validation only, not bulk URLs)
📊 Revised Priority Ranking
IMMEDIATE (Do This Week):
- 🔥 Download MeetingBank (2 hours)
  - HuggingFace dataset ready to use
  - 1,366 meetings with transcripts, summaries, URLs
  - Perfect for prototyping and evaluation
HIGH PRIORITY (Do This Month):
- ✅ Download LocalView (1 day)
  - Harvard Dataverse
  - 1,000-10,000 jurisdictions
- ✅ Run CDP integration (2 hours)
  - Already coded
  - 20 premium cities
MEDIUM PRIORITY (Optional):
- ⚠️ Scrape CivicBand list (4 hours)
  - 1,031 municipality names
  - Use for validation
  - Or contact founder for bulk access
🎯 Updated Integration Code
Add MeetingBank to your pipeline:
# discovery/meetingbank_ingestion.py
from datasets import load_dataset
from pyspark.sql import SparkSession
from loguru import logger

# Assumes your project exposes a `settings` object with `delta_lake_path`;
# adjust this import to match your codebase.
from config import settings

def load_meetingbank_to_bronze(spark: SparkSession) -> dict:
    """
    Load MeetingBank dataset to Bronze layer.

    MeetingBank contains 1,366 city council meetings from 6 major cities
    with full transcripts, summaries, and source URLs.
    """
    logger.info("Loading MeetingBank dataset from HuggingFace")

    # Download from HuggingFace (cached locally after the first call)
    meetingbank = load_dataset("huuuyeah/meetingbank")

    meetings = []
    for split in ['train', 'validation', 'test']:
        for instance in meetingbank[split]:
            meetings.append({
                "meeting_id": instance['id'],
                "jurisdiction_name": instance.get('city', 'Unknown'),
                "state_code": instance.get('state', 'Unknown'),
                "transcript": instance['transcript'],
                "summary_human": instance['summary'],
                "source_url": instance.get('url', ''),
                "date": instance.get('date', ''),
                "has_transcript": True,
                "has_summary": True,
                "has_url": bool(instance.get('url')),
                "transcript_length": len(instance['transcript']),
                "source": "meetingbank"
            })

    # Convert to a Spark DataFrame
    df = spark.createDataFrame(meetings)

    # Write to Bronze layer as a Delta table
    output_path = f"{settings.delta_lake_path}/bronze/meetingbank_meetings"
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .save(output_path)

    logger.info(f"✅ Loaded {len(meetings)} meetings from MeetingBank")

    return {
        "total_meetings": len(meetings),
        "cities": 6,
        "source": "meetingbank"
    }
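To actually run the loader you need a Delta-enabled Spark session. A minimal sketch using the standard delta-io quickstart configs (assumes delta-spark is installed via pip and that the function above lives in discovery/meetingbank_ingestion.py):

# Run the loader with a Delta-enabled Spark session
from pyspark.sql import SparkSession
from discovery.meetingbank_ingestion import load_meetingbank_to_bronze

spark = (
    SparkSession.builder
    .appName("meetingbank-ingestion")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

stats = load_meetingbank_to_bronze(spark)
print(stats)  # {'total_meetings': ..., 'cities': 6, 'source': 'meetingbank'}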
Test your keyword detection:
# Test keyword detection on MeetingBank transcripts
from datasets import load_dataset
from alerts.keyword_monitor import KeywordAlertSystem

meetingbank = load_dataset("huuuyeah/meetingbank")
alert_system = KeywordAlertSystem()

# Test on the first 10 meetings. Slicing a HuggingFace Dataset returns
# a dict of columns, so use .select() to iterate over rows.
for instance in meetingbank['train'].select(range(10)):
    matches = alert_system._find_keywords_in_text(
        instance['transcript'],
        alert_system.KEYWORD_CATEGORIES
    )
    if matches:
        print(f"Meeting {instance['id']}: {len(matches)} oral health keywords found")
        for match in matches[:3]:  # Show first 3
            print(f"  - {match.keyword} ({match.category})")
Evaluate your AI summarization:
# Compare your summaries against human-written ground truth
from extraction.summarizer import MeetingSummarizer
from datasets import load_dataset

summarizer = MeetingSummarizer()
meetingbank = load_dataset("huuuyeah/meetingbank")

# Use .select() for row-wise iteration (slicing returns columns, not rows)
for instance in meetingbank['test'].select(range(10)):
    # Generate your summary
    your_summary = summarizer.summarize(
        event=None,  # Create MeetingEvent from instance
        full_text=instance['transcript'],
        focus_on_health=False
    )

    # Compare against the human summary
    human_summary = instance['summary']
    print(f"Meeting: {instance['id']}")
    print(f"Your summary: {your_summary.executive_summary}")
    print(f"Human summary: {human_summary}")
    print(f"Quality: {your_summary.confidence_score}")
    print()
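Eyeballing summaries is a start; for a number you can track over time, ROUGE is the standard summarization metric. A minimal sketch using HuggingFace's evaluate package (requires pip install evaluate rouge_score; the example strings are placeholders):

# Score generated summaries against MeetingBank's human references with ROUGE
import evaluate

rouge = evaluate.load("rouge")
predictions = ["...your generated summary..."]    # e.g., your_summary.executive_summary
references = ["...the human-written summary..."]  # e.g., instance['summary']
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum F-measures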
📈 Expected Outcomes
Before MeetingBank:
- 76 URLs discovered (15% match rate)
- No evaluation benchmark
- No ground truth for summarization
After MeetingBank:
- +1,366 meetings with transcripts
- +6 major cities with verified URLs
- Academic benchmark for evaluation
- Human summaries for quality validation
- Total meetings: 1,366 ready to analyze immediately
🚀 Final Recommendation
DO THIS FIRST (2 hours):
# 1. Install HuggingFace datasets
pip install datasets
# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'✅ Downloaded {len(meetingbank[\"train\"])} meetings')
"
# 3. Create integration script
# See code example above
# 4. Test your keyword detection
# See test code above
# 5. Evaluate your summarization
# See evaluation code above
Expected Result:
- Immediate access to 1,366 meetings
- 6 major cities for prototyping
- Academic quality benchmark
- Peer-reviewed provenance: published at ACL 2023, a top NLP conference
Summary Table
| Dataset | Available? | Download Time | Coverage | Usefulness |
|---|---|---|---|---|
| MeetingBank | ✅ YES (HuggingFace) | 5 minutes | 1,366 meetings | 🔥 VERY HIGH |
| LocalView | ✅ YES (Harvard) | 1 day | 1,000-10,000 jurisdictions | 🔥 VERY HIGH |
| CDP | ✅ YES (already coded) | 2 hours | 20 cities | 🔥 HIGH |
| CivicBand | ⚠️ PARTIAL (needs coordination) | 4 hours | 1,031 municipality names | 🟡 MEDIUM |
Bottom line: MeetingBank is the fastest win! Download it today and start testing your summarization and keyword detection on real city council meeting transcripts.