
✅ Confirmed: HuggingFace Datasets That WILL Help

Quick Answer: YES, 2 of 4 will help significantly!

| Dataset | Status | Usefulness | Priority |
| --- | --- | --- | --- |
| MeetingBank | READY TO USE | 🔥 VERY HIGH | USE IMMEDIATELY |
| LocalView | ✅ Already covered | HIGH | Download from Harvard |
| Council Data Project | ✅ Already covered | HIGH | Already integrated |
| CivicBand | ⚠️ Limited access | MEDIUM | Scrape municipality list |

1. MeetingBank 🔥 (NEW! USE THIS!)

What It Is:

A benchmark dataset from 6 major U.S. cities specifically designed for meeting summarization

URLs:

  • HuggingFace dataset: https://huggingface.co/datasets/huuuyeah/meetingbank

What You Get:

  • 1,366 city council meetings from 6 cities:
    • Alameda, CA
    • Boston, MA
    • Denver, CO
    • King County, WA
    • Long Beach, CA
    • Seattle, WA
  • 3,579 hours of video
  • Full transcripts (average 28,000 tokens per meeting)
  • PDF meeting minutes & agendas
  • Human-written summaries (ground truth for evaluation)
  • Machine-generated summaries (from 6 different systems)
  • 6,892 segment-level summarization instances for training

Why This Is PERFECT for Your Project:

  1. Immediate prototyping: Download from HuggingFace in 5 minutes

    from datasets import load_dataset
    meetingbank = load_dataset("huuuyeah/meetingbank")

    for instance in meetingbank['train']:
        print(instance['id'])
        print(instance['summary'])
        print(instance['transcript'])
  2. Quality validation: Compare your AI summarization against human-written summaries

  3. URL discovery: Each meeting has source URLs to city websites

  4. Benchmark your oral health keyword detection: Test against 1,366 real transcripts

  5. Training data: If you want to fine-tune models for oral health policy
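
If you go the fine-tuning route from item 5, a small first step is exporting the (transcript, summary) pairs as instruction-style JSONL. A minimal sketch; the output filename and prompt wording are placeholders, not part of MeetingBank:

# Export MeetingBank (transcript, summary) pairs as JSONL for later fine-tuning.
# The output path and prompt template are illustrative placeholders.
import json
from datasets import load_dataset

meetingbank = load_dataset("huuuyeah/meetingbank")

with open("meetingbank_pairs.jsonl", "w", encoding="utf-8") as f:
    for instance in meetingbank["train"]:
        record = {
            "prompt": "Summarize this city council meeting transcript:\n\n" + instance["transcript"],
            "completion": instance["summary"],
        }
        f.write(json.dumps(record) + "\n")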

Paper:

"MeetingBank: A Benchmark Dataset for Meeting Summarization"
ACL 2023 (Association for Computational Linguistics)
https://arxiv.org/abs/2305.17529

🎯 ACTION PLAN:

# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'Loaded {len(meetingbank[\"train\"])} training instances')
"

# 3. Create discovery/meetingbank_ingestion.py
# - Parse meetings
# - Extract URLs
# - Load to Bronze layer
# - Run keyword detection on transcripts
# - Evaluate against human summaries

Expected ROI:

  • Time: 2 hours to integrate
  • Value: 1,366 meetings with transcripts + summaries + URLs
  • Quality: Academic benchmark (peer-reviewed, ACL published)
  • Coverage: 6 major cities (all large, high-value for advocacy)

2. LocalView ✅ (Already Covered)

Status: Already identified in previous investigation
Location: Harvard Dataverse (doi:10.7910/DVN/NJTBEM)
Coverage: 1,000-10,000 jurisdictions
Action: Download from Harvard (already documented)
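
If you prefer to script the Harvard Dataverse download rather than click through the UI, the Dataverse native API can enumerate the dataset's files by DOI. A rough sketch; the endpoint and response field names follow the standard Dataverse format, and the DOI is copied from above, so verify both against the live dataset page:

# List the files in the LocalView dataset on Harvard Dataverse via the native API.
import requests

DATASET_DOI = "doi:10.7910/DVN/NJTBEM"

resp = requests.get(
    "https://dataverse.harvard.edu/api/datasets/:persistentId/",
    params={"persistentId": DATASET_DOI},
    timeout=30,
)
resp.raise_for_status()

# Each entry exposes a file id that can then be fetched from /api/access/datafile/<id>
for entry in resp.json()["data"]["latestVersion"]["files"]:
    datafile = entry["dataFile"]
    print(datafile["id"], datafile["filename"])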


3. Council Data Project ✅ (Already Covered)

Status: Already integrated in external_url_datasets.py
Coverage: 20+ cities with full pipelines
Action: Already coded, just run the script


4. CivicBand ⚠️ (Limited Usefulness)

What It Is:

"Largest public collection of civic meeting and election finance data"
Website: https://civic.band/

What Exists:

✅ 1,031 municipalities tracked
✅ Millions of pages scraped (meeting minutes, agendas)
✅ Search interface available
✅ Publicly browsable

The Problem:

"Dataset access is via their platform; raw dumps require coordination"

  • Can't directly download bulk URL list
  • Would need to contact founder (Philip James: hello@civic.band)
  • Or scrape the municipality list from their website

What You CAN Get:

The list of 1,031 municipalities is publicly visible on their site. You could:

  1. Scrape the municipality list (city names + states)
  2. Match against your Census data to get FIPS codes
  3. Use as verification (these 1,031 are confirmed to have meeting data)

Limited Value Because:

  • Can't get direct URLs (need to coordinate with founder)
  • Already have larger coverage from LocalView (1,000-10,000 jurisdictions)
  • Already have premium coverage from CDP (20 cities)
  • CivicBand's main value is their content (scraped minutes), not URLs

Possible Action:

# Scrape CivicBand's municipality list
# (the selector below is a guess; inspect the live page and adjust it)
import requests
from bs4 import BeautifulSoup

response = requests.get("https://civic.band/")
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Collect link text as candidate municipality names, then match against
# Census data and use the result as a validation list
municipalities = [a.get_text(strip=True) for a in soup.find_all('a') if a.get_text(strip=True)]
print(f"Found {len(municipalities)} candidate municipality entries")
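
Matching the scraped names against Census data (step 2 above) can then be a straightforward pandas join. A sketch, assuming a local Census place extract census_places.csv with hypothetical columns place_name and place_fips; municipalities is the list from the scrape above:

# Match scraped CivicBand municipality names to Census place FIPS codes.
# census_places.csv and its column names are placeholders for your own Census extract.
import pandas as pd

civicband = pd.DataFrame({"place_name": municipalities})  # names from the scrape above
census = pd.read_csv("census_places.csv")  # assumed columns: place_name, place_fips

# Normalize names before joining to reduce trivial mismatches
civicband["join_key"] = civicband["place_name"].str.lower().str.strip()
census["join_key"] = census["place_name"].str.lower().str.strip()

validated = civicband.merge(census, on="join_key", how="inner", suffixes=("_civicband", "_census"))
print(f"{len(validated)} of {len(civicband)} CivicBand municipalities matched to a FIPS code")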

Estimated value: MEDIUM (validation only, not bulk URLs)


📊 Revised Priority Ranking

IMMEDIATE (Do This Week):

  1. 🔥 Download MeetingBank (2 hours)
    • HuggingFace dataset ready to use
    • 1,366 meetings with transcripts, summaries, URLs
    • Perfect for prototyping and evaluation

HIGH PRIORITY (Do This Month):

  1. Download LocalView (1 day)

    • Harvard Dataverse
    • 1,000-10,000 jurisdictions
  2. Run CDP integration (2 hours)

    • Already coded
    • 20 premium cities

MEDIUM PRIORITY (Optional):

  1. ⚠️ Scrape CivicBand list (4 hours)
    • 1,031 municipality names
    • Use for validation
    • Or contact founder for bulk access

🎯 Updated Integration Code

Add MeetingBank to your pipeline:

# discovery/meetingbank_ingestion.py

from datasets import load_dataset
from pyspark.sql import SparkSession
from loguru import logger

# NOTE: `settings` is assumed to be this project's config object exposing delta_lake_path;
# adjust the import path to match the codebase.
from config import settings


def load_meetingbank_to_bronze(spark: SparkSession) -> dict:
    """
    Load the MeetingBank dataset into the Bronze layer.

    MeetingBank contains 1,366 city council meetings from 6 major cities
    with full transcripts, summaries, and source URLs.
    """
    logger.info("Loading MeetingBank dataset from HuggingFace")

    # Download from HuggingFace
    meetingbank = load_dataset("huuuyeah/meetingbank")

    meetings = []

    for split in ['train', 'validation', 'test']:
        for instance in meetingbank[split]:
            meetings.append({
                "meeting_id": instance['id'],
                "jurisdiction_name": instance.get('city', 'Unknown'),
                "state_code": instance.get('state', 'Unknown'),
                "transcript": instance['transcript'],
                "summary_human": instance['summary'],
                "source_url": instance.get('url', ''),
                "date": instance.get('date', ''),
                "has_transcript": True,
                "has_summary": True,
                "has_url": bool(instance.get('url')),
                "transcript_length": len(instance['transcript']),
                "source": "meetingbank"
            })

    # Convert to a Spark DataFrame
    df = spark.createDataFrame(meetings)

    # Write to the Bronze layer as a Delta table
    output_path = f"{settings.delta_lake_path}/bronze/meetingbank_meetings"
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .save(output_path)

    logger.info(f"✅ Loaded {len(meetings)} meetings from MeetingBank")

    return {
        "total_meetings": len(meetings),
        "cities": 6,
        "source": "meetingbank"
    }
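
A minimal driver for the function above might look like this; the session setup is the standard delta-spark configuration pattern, and the app name is arbitrary:

# Run the MeetingBank ingestion end to end (standard delta-spark session configuration).
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

from discovery.meetingbank_ingestion import load_meetingbank_to_bronze

builder = (
    SparkSession.builder.appName("meetingbank_ingestion")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

stats = load_meetingbank_to_bronze(spark)
print(stats)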

Test your keyword detection:

# Test keyword detection on MeetingBank transcripts
from datasets import load_dataset
from alerts.keyword_monitor import KeywordAlertSystem

meetingbank = load_dataset("huuuyeah/meetingbank")
alert_system = KeywordAlertSystem()

# Test on the first 10 meetings (select() keeps each row as a dict instead of
# the column-oriented view that plain slicing returns)
for instance in meetingbank['train'].select(range(10)):
    matches = alert_system._find_keywords_in_text(
        instance['transcript'],
        alert_system.KEYWORD_CATEGORIES
    )

    if matches:
        print(f"Meeting {instance['id']}: {len(matches)} oral health keywords found")
        for match in matches[:3]:  # Show first 3
            print(f"  - {match.keyword} ({match.category})")

Evaluate your AI summarization:

# Compare your summaries against human-written ground truth
from extraction.summarizer import MeetingSummarizer
from datasets import load_dataset

summarizer = MeetingSummarizer()
meetingbank = load_dataset("huuuyeah/meetingbank")

# select() yields row dicts; plain slicing would return a dict of columns
for instance in meetingbank['test'].select(range(10)):
    # Generate your summary
    your_summary = summarizer.summarize(
        event=None,  # Create MeetingEvent from instance
        full_text=instance['transcript'],
        focus_on_health=False
    )

    # Compare against human summary
    human_summary = instance['summary']

    print(f"Meeting: {instance['id']}")
    print(f"Your summary: {your_summary.executive_summary}")
    print(f"Human summary: {human_summary}")
    print(f"Quality: {your_summary.confidence_score}")
    print()
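
To turn that side-by-side printout into a number, ROUGE against the human summaries is the standard summarization metric. A minimal sketch using the rouge-score package, where your_summaries and human_summaries are hypothetical lists collected in the loop above:

# Score generated summaries against MeetingBank's human-written ground truth with ROUGE.
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

# your_summaries / human_summaries are hypothetical lists gathered from the evaluation loop above
scores = [scorer.score(human, generated) for human, generated in zip(human_summaries, your_summaries)]

avg_rouge_l = sum(s["rougeL"].fmeasure for s in scores) / len(scores)
print(f"Average ROUGE-L F1 across {len(scores)} meetings: {avg_rouge_l:.3f}")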

📈 Expected Outcomes

Before MeetingBank:

  • 76 URLs discovered (15% match rate)
  • No evaluation benchmark
  • No ground truth for summarization

After MeetingBank:

  • +1,366 meetings with transcripts
  • +6 major cities with verified URLs
  • Academic benchmark for evaluation
  • Human summaries for quality validation
  • Total meetings: 1,366 ready to analyze immediately

🚀 Final Recommendation

DO THIS FIRST (2 hours):

# 1. Install HuggingFace datasets
pip install datasets

# 2. Download MeetingBank
python -c "
from datasets import load_dataset
meetingbank = load_dataset('huuuyeah/meetingbank')
print(f'✅ Downloaded {len(meetingbank[\"train\"])} meetings')
"

# 3. Create integration script
# See code example above

# 4. Test your keyword detection
# See test code above

# 5. Evaluate your summarization
# See evaluation code above

Expected Result:

  • Immediate access to 1,366 meetings
  • 6 major cities for prototyping
  • Academic quality benchmark
  • Proven ROI: published at a top NLP conference (ACL 2023)

Summary Table

| Dataset | Available? | Download Time | Meetings | Usefulness |
| --- | --- | --- | --- | --- |
| MeetingBank | YES (HuggingFace) | 5 minutes | 1,366 | 🔥 VERY HIGH |
| LocalView | ✅ YES (Harvard) | 1 day | 1,000-10,000 | 🔥 VERY HIGH |
| CDP | ✅ YES (already coded) | 2 hours | 20 cities | 🔥 HIGH |
| CivicBand | ⚠️ PARTIAL (need coordination) | 4 hours | 1,031 (list) | 🟡 MEDIUM |

Bottom line: MeetingBank is the fastest win! Download it today and start testing your summarization and keyword detection on real city council meeting transcripts.