Unified Contacts & Meetings Management

Purpose: Extract contact information (elected officials, speakers) from 153K meeting transcripts and build relationships between contacts and meetings.

🗂️ Data Model

Three Tables

meetings_transcripts.parquet (2.8 GB)
- 153,452 meeting transcripts
- Columns: meeting_id, jurisdiction, date, transcript_text, etc.
- Source: Scraped from city/county government websites
contacts_local_officials.parquet
- Unique officials aggregated from all meetings
- Columns: name, title, jurisdiction, meetings_count, first_seen, last_updated
- Deduplicated by (name, jurisdiction)
contacts_meeting_attendance.parquet (Junction Table)
- Many-to-many relationship: meetings ↔ contacts
- Columns: meeting_id, name, title, jurisdiction, source, recorded_at
- Enables queries like "Which officials attended meeting X?" and "Which meetings did official Y attend?"

Relationship

meetings_transcripts (1) ──< (many) contacts_meeting_attendance (many) >── (1) contacts_local_officials
         │                                    │                                      │
     meeting_id                         meeting_id, name                          name

🚀 Quick Start

Check Current State

python scripts/manage_contacts.py stats

Output:

📅 MEETINGS:
   Total: 153,452
   Jurisdictions: 1

👥 CONTACTS (Local Officials):
   Total: 186
   Avg meetings per official: 1.4
   
   By Title:
      Council Member: 119
      Mayor: 42
      Commissioner: 25

📋 MEETING ATTENDANCE (Relationships):
   Total records: 262
   Unique meetings: 183
   Unique contacts: 186
   Avg attendees per meeting: 1.4

Extract Contacts (Incremental)

# Test on 5,000 meetings
python scripts/manage_contacts.py extract --batch-size 1000 --limit 5000

# Process next 10,000
python scripts/manage_contacts.py extract --batch-size 1000 --limit 15000

# Process all 153K (takes ~6 hours)
python scripts/manage_contacts.py extract --batch-size 10000

Performance: ~2 minutes per 5,000 meetings = ~60 minutes for 153K meetings

Full Refresh

# Delete existing and re-extract from scratch
python scripts/manage_contacts.py refresh-all --confirm

📊 Extraction Method

NLP Patterns

The extraction uses 3 regex patterns to find official names:

1. Roll Call (Most Reliable)

"Jerry Schultz here, Ted Nelson here, Stephanie Briggs present"

Pattern: ([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,2})\s+(?:here|present|aye)

2. Title Mentions

"Mayor Smith called the meeting to order"
"Councilmember Jones seconded the motion"

Pattern: (Mayor|Councilmember|Commissioner)\s+([A-Z][a-z]+...)

3. Speaker Labels

John Doe: Thank you Mr. Mayor
Jane Smith: I move to approve

Pattern: ^([A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,2}):\s+

Name Validation

Filters out false positives:

❌ "Thank You" (contains common words: thank, you, good, etc.)
❌ "Vice Chair" (contains title words: chair, mayor, council, etc.)
❌ "City Council" (contains government words)
✅ "Stephanie Briggs" (2-4 words, capitalized, no false positive words)
✅ "Jerry Wayne Wright" (valid 3-word name)

🔄 Processing Strategy

Incremental Batches

Process meetings in batches to avoid memory issues:

# Phase 1: Test (5K meetings, 2 minutes)
python scripts/manage_contacts.py extract --limit 5000

# Phase 2: Small batch (50K meetings, 20 minutes)
python scripts/manage_contacts.py extract --limit 50000

# Phase 3: All meetings (153K, ~60 minutes)
python scripts/manage_contacts.py extract

Why Batches?

Meetings file: 2.8 GB (too big to load all at once)
Memory efficiency: Load 10K meetings at a time
Resumable: Can stop and restart without losing progress (merges with existing)

Auto-Merge

The extraction automatically merges with existing data:

Contacts: Updates meetings_count for existing contacts
Attendance: Deduplicates by (meeting_id, name)

📈 Expected Results

Based on 5,000 meeting sample:

Coverage: ~3.7% of meetings have extractable officials (183/5000)
Extraction rate: 186 unique contacts from 5,000 meetings
Avg per meeting: 1.4 officials per meeting (where found)

Projection for 153K Meetings

153,452 meetings × 3.7% coverage = ~5,677 meetings with extractables
186 contacts per 5K meetings = ~5,700 unique contacts total
262 attendance records per 5K = ~8,000 attendance records total

Note: Coverage improves over time as NLP patterns improve.

🗃️ File Structure

data/gold/
├── meetings_transcripts.parquet          # 2.8 GB - Source data
├── contacts_local_officials.parquet      # < 1 MB - Aggregated contacts
└── contacts_meeting_attendance.parquet   # < 1 MB - Junction table

📚 Use Cases

1. Find Officials in a Specific Jurisdiction

import pandas as pd

contacts = pd.read_parquet('data/gold/contacts_local_officials.parquet')
tuscaloosa = contacts[contacts['jurisdiction'].str.contains('Tuscaloosa', na=False)]

print(f"Found {len(tuscaloosa)} officials in Tuscaloosa")

2. Find All Meetings an Official Attended

attendance = pd.read_parquet('data/gold/contacts_meeting_attendance.parquet')
stephanie_meetings = attendance[attendance['name'] == 'Stephanie Briggs']

print(f"Stephanie Briggs attended {len(stephanie_meetings)} meetings")

3. Find All Officials at a Specific Meeting

meeting_id = 'some-meeting-id'
officials = attendance[attendance['meeting_id'] == meeting_id]

print(f"Meeting had {len(officials)} officials:")
for _, row in officials.iterrows():
    print(f"  - {row['name']} ({row['title']})")

4. Most Active Officials

contacts = pd.read_parquet('data/gold/contacts_local_officials.parquet')
top_10 = contacts.nlargest(10, 'meetings_count')

print("Top 10 Most Active Officials:")
for _, row in top_10.iterrows():
    print(f"  {row['name']} ({row['title']}): {row['meetings_count']} meetings")

🔧 Advanced Options

Custom Batch Size

# Larger batches = faster but more memory
python scripts/manage_contacts.py extract --batch-size 20000

# Smaller batches = slower but safer
python scripts/manage_contacts.py extract --batch-size 5000

Limit Processing

# Process only first 100K meetings
python scripts/manage_contacts.py extract --limit 100000

🐛 Troubleshooting

"No meetings file found"

The source data file is missing:

# Check if file exists
ls -lh data/gold/national/meetings_transcripts.parquet

# If missing, regenerate from pipeline
python scripts/create_all_gold_tables.py --meetings-only

"Out of memory"

Reduce batch size:

python scripts/manage_contacts.py extract --batch-size 5000

"Too many false positives"

The name validation in _is_valid_name() can be tuned. Edit:

false_positive_words = {
    'thank', 'you', 'good', 'evening', ...  # Add more words here
}

"Duplicate contacts"

Contacts are deduplicated by (name, jurisdiction). If you see duplicates with different jurisdictions, that's expected (same person in different cities).

To merge manually:

import pandas as pd

contacts = pd.read_parquet('data/gold/contacts_local_officials.parquet')

# Group by name only (ignoring jurisdiction)
merged = contacts.groupby('name').agg({
    'meetings_count': 'sum',
    'title': 'first',
    'jurisdiction': lambda x: ', '.join(x.unique())
}).reset_index()

merged.to_parquet('data/gold/contacts_local_officials.parquet', index=False)

📊 Data Quality

Accuracy

High confidence: Roll call patterns (95%+ accurate)
Medium confidence: Title mentions (80%+ accurate)
Lower confidence: Speaker labels (60%+ accurate, many false positives)

Coverage

Current: ~4% of meetings have extractable officials
Reason: Many transcripts lack structured patterns
Improvement: Add more patterns, improve OCR quality

Completeness

Not all officials are captured because:

Some meetings lack roll calls
Some officials only vote (no speaking)
OCR errors in source transcripts

🚀 Next Steps

1. Complete Extraction

# Process all 153K meetings
python scripts/manage_contacts.py extract --batch-size 10000

2. Enrich with External Data

Open States API: Add state legislators
Ballotpedia: Add elected official bios
Google Civic API: Add contact info

3. Upload to HuggingFace

# After extraction completes
python -m hosting.huggingface contacts

4. Create Search Index

Build search index for fast contact lookup:

# TODO: Create elasticsearch/algolia index

🎯 Success Metrics

✅ Extraction complete: All 153K meetings processed
✅ Contact quality: < 5% false positives
✅ Coverage: > 10% of meetings have officials extracted
✅ Published: Datasets available on HuggingFace

🗂️ Data Model​

Three Tables​

Relationship​

🚀 Quick Start​

Check Current State​

Extract Contacts (Incremental)​

Full Refresh​

📊 Extraction Method​

NLP Patterns​

1. Roll Call (Most Reliable)​

2. Title Mentions​

3. Speaker Labels​

Name Validation​

🔄 Processing Strategy​

Incremental Batches​

Why Batches?​

Auto-Merge​

📈 Expected Results​

Projection for 153K Meetings​

🗃️ File Structure​

📚 Use Cases​

1. Find Officials in a Specific Jurisdiction​

2. Find All Meetings an Official Attended​

3. Find All Officials at a Specific Meeting​

4. Most Active Officials​

🔧 Advanced Options​

Custom Batch Size​

Limit Processing​

🐛 Troubleshooting​

"No meetings file found"​

"Out of memory"​

"Too many false positives"​

"Duplicate contacts"​

📊 Data Quality​

Accuracy​

Coverage​

Completeness​

🚀 Next Steps​

1. Complete Extraction​

2. Enrich with External Data​

3. Upload to HuggingFace​

4. Create Search Index​

🎯 Success Metrics​

📝 Related Documentation​