Skip to main content

Unified Contacts & Meetings Management

Purpose: Extract contact information (elected officials, speakers) from 153K meeting transcripts and build relationships between contacts and meetings.

🗂️ Data Model

Three Tables

  1. meetings_transcripts.parquet (2.8 GB)

    • 153,452 meeting transcripts
    • Columns: meeting_id, jurisdiction, date, transcript_text, etc.
    • Source: Scraped from city/county government websites
  2. contacts_local_officials.parquet

    • Unique officials aggregated from all meetings
    • Columns: name, title, jurisdiction, meetings_count, first_seen, last_updated
    • Deduplicated by (name, jurisdiction)
  3. contacts_meeting_attendance.parquet (Junction Table)

    • Many-to-many relationship: meetings ↔ contacts
    • Columns: meeting_id, name, title, jurisdiction, source, recorded_at
    • Enables queries like "Which officials attended meeting X?" and "Which meetings did official Y attend?"

Relationship

meetings_transcripts (1) ──< (many) contacts_meeting_attendance (many) >── (1) contacts_local_officials
│ │ │
meeting_id meeting_id, name name

🚀 Quick Start

Check Current State

python scripts/manage_contacts.py stats

Output:

📅 MEETINGS:
Total: 153,452
Jurisdictions: 1

👥 CONTACTS (Local Officials):
Total: 186
Avg meetings per official: 1.4

By Title:
Council Member: 119
Mayor: 42
Commissioner: 25

📋 MEETING ATTENDANCE (Relationships):
Total records: 262
Unique meetings: 183
Unique contacts: 186
Avg attendees per meeting: 1.4

Extract Contacts (Incremental)

# Test on 5,000 meetings
python scripts/manage_contacts.py extract --batch-size 1000 --limit 5000

# Process next 10,000
python scripts/manage_contacts.py extract --batch-size 1000 --limit 15000

# Process all 153K (takes ~6 hours)
python scripts/manage_contacts.py extract --batch-size 10000

Performance: ~2 minutes per 5,000 meetings = ~60 minutes for 153K meetings

Full Refresh

# Delete existing and re-extract from scratch
python scripts/manage_contacts.py refresh-all --confirm

📊 Extraction Method

NLP Patterns

The extraction uses 3 regex patterns to find official names:

1. Roll Call (Most Reliable)

"Jerry Schultz here, Ted Nelson here, Stephanie Briggs present"

Pattern: ([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,2})\s+(?:here|present|aye)

2. Title Mentions

"Mayor Smith called the meeting to order"
"Councilmember Jones seconded the motion"

Pattern: (Mayor|Councilmember|Commissioner)\s+([A-Z][a-z]+...)

3. Speaker Labels

John Doe: Thank you Mr. Mayor
Jane Smith: I move to approve

Pattern: ^([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,2}):\s+

Name Validation

Filters out false positives:

  • ❌ "Thank You" (contains common words: thank, you, good, etc.)
  • ❌ "Vice Chair" (contains title words: chair, mayor, council, etc.)
  • ❌ "City Council" (contains government words)
  • ✅ "Stephanie Briggs" (2-4 words, capitalized, no false positive words)
  • ✅ "Jerry Wayne Wright" (valid 3-word name)

🔄 Processing Strategy

Incremental Batches

Process meetings in batches to avoid memory issues:

# Phase 1: Test (5K meetings, 2 minutes)
python scripts/manage_contacts.py extract --limit 5000

# Phase 2: Small batch (50K meetings, 20 minutes)
python scripts/manage_contacts.py extract --limit 50000

# Phase 3: All meetings (153K, ~60 minutes)
python scripts/manage_contacts.py extract

Why Batches?

  • Meetings file: 2.8 GB (too big to load all at once)
  • Memory efficiency: Load 10K meetings at a time
  • Resumable: Can stop and restart without losing progress (merges with existing)

Auto-Merge

The extraction automatically merges with existing data:

  • Contacts: Updates meetings_count for existing contacts
  • Attendance: Deduplicates by (meeting_id, name)

📈 Expected Results

Based on 5,000 meeting sample:

  • Coverage: ~3.7% of meetings have extractable officials (183/5000)
  • Extraction rate: 186 unique contacts from 5,000 meetings
  • Avg per meeting: 1.4 officials per meeting (where found)

Projection for 153K Meetings

153,452 meetings × 3.7% coverage = ~5,677 meetings with extractables
186 contacts per 5K meetings = ~5,700 unique contacts total
262 attendance records per 5K = ~8,000 attendance records total

Note: Coverage improves over time as NLP patterns improve.

🗃️ File Structure

data/gold/
├── meetings_transcripts.parquet # 2.8 GB - Source data
├── contacts_local_officials.parquet # < 1 MB - Aggregated contacts
└── contacts_meeting_attendance.parquet # < 1 MB - Junction table

📚 Use Cases

1. Find Officials in a Specific Jurisdiction

import pandas as pd

contacts = pd.read_parquet('data/gold/contacts_local_officials.parquet')
tuscaloosa = contacts[contacts['jurisdiction'].str.contains('Tuscaloosa', na=False)]

print(f"Found {len(tuscaloosa)} officials in Tuscaloosa")

2. Find All Meetings an Official Attended

attendance = pd.read_parquet('data/gold/contacts_meeting_attendance.parquet')
stephanie_meetings = attendance[attendance['name'] == 'Stephanie Briggs']

print(f"Stephanie Briggs attended {len(stephanie_meetings)} meetings")

3. Find All Officials at a Specific Meeting

meeting_id = 'some-meeting-id'
officials = attendance[attendance['meeting_id'] == meeting_id]

print(f"Meeting had {len(officials)} officials:")
for _, row in officials.iterrows():
print(f" - {row['name']} ({row['title']})")

4. Most Active Officials

contacts = pd.read_parquet('data/gold/contacts_local_officials.parquet')
top_10 = contacts.nlargest(10, 'meetings_count')

print("Top 10 Most Active Officials:")
for _, row in top_10.iterrows():
print(f" {row['name']} ({row['title']}): {row['meetings_count']} meetings")

🔧 Advanced Options

Custom Batch Size

# Larger batches = faster but more memory
python scripts/manage_contacts.py extract --batch-size 20000

# Smaller batches = slower but safer
python scripts/manage_contacts.py extract --batch-size 5000

Limit Processing

# Process only first 100K meetings
python scripts/manage_contacts.py extract --limit 100000

🐛 Troubleshooting

"No meetings file found"

The source data file is missing:

# Check if file exists
ls -lh data/gold/national/meetings_transcripts.parquet

# If missing, regenerate from pipeline
python scripts/create_all_gold_tables.py --meetings-only

"Out of memory"

Reduce batch size:

python scripts/manage_contacts.py extract --batch-size 5000

"Too many false positives"

The name validation in _is_valid_name() can be tuned. Edit:

false_positive_words = {
'thank', 'you', 'good', 'evening', ... # Add more words here
}

"Duplicate contacts"

Contacts are deduplicated by (name, jurisdiction). If you see duplicates with different jurisdictions, that's expected (same person in different cities).

To merge manually:

import pandas as pd

contacts = pd.read_parquet('data/gold/contacts_local_officials.parquet')

# Group by name only (ignoring jurisdiction)
merged = contacts.groupby('name').agg({
'meetings_count': 'sum',
'title': 'first',
'jurisdiction': lambda x: ', '.join(x.unique())
}).reset_index()

merged.to_parquet('data/gold/contacts_local_officials.parquet', index=False)

📊 Data Quality

Accuracy

  • High confidence: Roll call patterns (95%+ accurate)
  • Medium confidence: Title mentions (80%+ accurate)
  • Lower confidence: Speaker labels (60%+ accurate, many false positives)

Coverage

  • Current: ~4% of meetings have extractable officials
  • Reason: Many transcripts lack structured patterns
  • Improvement: Add more patterns, improve OCR quality

Completeness

Not all officials are captured because:

  • Some meetings lack roll calls
  • Some officials only vote (no speaking)
  • OCR errors in source transcripts

🚀 Next Steps

1. Complete Extraction

# Process all 153K meetings
python scripts/manage_contacts.py extract --batch-size 10000

2. Enrich with External Data

  • Open States API: Add state legislators
  • Ballotpedia: Add elected official bios
  • Google Civic API: Add contact info

3. Upload to HuggingFace

# After extraction completes
python -m hosting.huggingface contacts

4. Create Search Index

Build search index for fast contact lookup:

# TODO: Create elasticsearch/algolia index

🎯 Success Metrics

  • Extraction complete: All 153K meetings processed
  • Contact quality: < 5% false positives
  • Coverage: > 10% of meetings have officials extracted
  • Published: Datasets available on HuggingFace