Bronze Migration for Events Data
Overview
This migration creates bronze tables for event and events_text_search (renamed to events_text_ai), following the bronze → staging → marts dbt architecture pattern.
Architecture
Bronze Layer (open_navigator_bronze)
↓
Staging Layer (dbt views with cleaning/validation)
↓
Marts Layer (dbt tables - production-ready)
Files Created
Database Migrations
-
003_create_bronze_event.sql
- Creates
bronze_eventtable inopen_navigator_bronzedatabase - Stores raw meeting events from LocalView, YouTube, Legistar, etc.
- Includes all fields from current production
event - Tracks data source (
source,datasource_idcolumns)
- Creates
-
004_create_bronze_event_youtube_transcript.sql
- Creates
bronze_event_youtube_transcripttable inopen_navigator_bronzedatabase - Stores video transcripts and AI-extracted text
- Replaces production
events_text_searchtable - Includes quality flags and AI model tracking
- Creates
dbt Staging Models
-
stg_bronze_event.sql
- Staging view applying basic cleaning to bronze events
- Normalizes state codes (UPPER), cities (INITCAP)
- Adds quality flags:
missing_title,missing_date,missing_state,video_missing_channel - Filters out events without title or date
-
stg_bronze_event_youtube_transcript.sql
- Staging view cleaning video transcripts
- Calculates
word_countandtranscript_length - Adds quality flags:
missing_transcript,very_short_transcript,missing_segments - Filters out transcripts <100 characters
dbt Mart Models (Production)
-
event.sql
- Production-ready events table (replaces current
event) - Deduplicates by video_url (keeps most recent)
- Applies quality filters
- Compatible with current API schema
- Production-ready events table (replaces current
-
events_text_search.sql
- Production-ready transcripts table (replaces current
events_text_search) - Joins to
eventto getevent_id - Deduplicates by video_id (keeps highest quality)
- Quality scoring: prefers manual transcripts, then by word count
- Production-ready transcripts table (replaces current
Configuration Files
-
dbt_project/models/staging/_staging.yml
- Added
bronze_eventsource definition - Added
bronze_event_youtube_transcriptsource definition - Added
stg_bronze_eventmodel documentation - Added
stg_bronze_event_youtube_transcriptmodel documentation
- Added
-
dbt_project/models/marts/_marts.yml
- Added
eventmodel documentation - Added
events_text_searchmodel documentation
- Added
Migration Steps
Step 1: Create Bronze Tables
# Create bronze_event table
psql -h localhost -p 5433 -U postgres -d open_navigator_bronze \
-f packages/hosting/scripts/neon/migrations/003_create_bronze_event.sql
# Create bronze_event_youtube_transcript table
psql -h localhost -p 5433 -U postgres -d open_navigator_bronze \
-f packages/hosting/scripts/neon/migrations/004_create_bronze_event_youtube_transcript.sql
Step 2: Import Foreign Tables
# In open_navigator database, import bronze tables via FDW
psql -h localhost -p 5433 -U postgres -d open_navigator -c "
IMPORT FOREIGN SCHEMA public
LIMIT TO (bronze_event, bronze_event_youtube_transcript)
FROM SERVER bronze_server INTO bronze;
"
Step 3: Load Sample Data (Testing)
# Copy 100 sample events from production to bronze for testing
psql -h localhost -p 5433 -U postgres -d open_navigator_bronze -c "
INSERT INTO bronze_event (
title, description, event_date, event_time,
jurisdiction_id, jurisdiction_name, jurisdiction_type,
state_code, state, city, location, meeting_type, status,
agenda_url, minutes_url, video_url,
channel_id, channel_url, channel_type,
view_count, duration_minutes, like_count, language,
source, datasource_id
)
SELECT
event_title, event_description, event_date, event_time,
jurisdiction_id, jurisdiction_name, jurisdiction_type,
state_code, state, city, location, meeting_type, status,
agenda_url, minutes_url, video_url,
channel_id, channel_url, channel_type,
view_count, duration_minutes, like_count, language,
COALESCE(source, 'unknown') AS source,
CAST(event_id AS VARCHAR) AS datasource_id
FROM open_navigator.public.event
ORDER BY event_date DESC
LIMIT 100;
"
# Copy sample transcripts
psql -h localhost -p 5433 -U postgres -d open_navigator_bronze -c "
INSERT INTO bronze_event_youtube_transcript (
event_id, video_id, raw_text, segments,
language, is_auto_generated, transcript_source,
has_transcript, created_at
)
SELECT
event_id, video_id, raw_text, segments,
language, is_auto_generated, transcript_source,
TRUE AS has_transcript, created_at
FROM open_navigator.public.events_text_search
LIMIT 100;
"
Step 4: Run dbt Models
cd dbt_project
# Test staging models
dbt run --select stg_bronze_event stg_bronze_event_youtube_transcript
# Build production marts
dbt run --select event events_text_search
# Run tests
dbt test --select event events_text_search
Step 5: Verify Results
-- Check events count
SELECT
'bronze_event' AS table_name,
COUNT(*)
FROM bronze.bronze_event
UNION ALL
SELECT
'event (dbt mart)',
COUNT(*)
FROM event;
-- Check transcripts count
SELECT
'bronze_event_youtube_transcript' AS table_name,
COUNT(*)
FROM bronze.bronze_event_youtube_transcript
UNION ALL
SELECT
'events_text_search (dbt mart)',
COUNT(*)
FROM events_text_search;
-- Verify deduplication worked
SELECT
COUNT(*) AS total_bronze_events,
COUNT(DISTINCT video_url) AS unique_video_urls
FROM bronze.bronze_event
WHERE video_url IS NOT NULL;
Data Flow Diagram
┌─────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
├────────────┬────────────┬────────────┬──────────────────────────┤
│ LocalView │ YouTube │ Legistar │ Other (Granicus, etc.) │
└──────┬─────┴──────┬─────┴──────┬─────┴──────┬──────────────────┘
│ │ │ │
↓ ↓ ↓ ↓
┌──────────────────────────────────────────────────────────────────┐
│ BRONZE LAYER (open_navigator_bronze) │
├──────────────────────────────┬───────────────────────────────────┤
│ bronze_event │ bronze_event_youtube_transcript │
│ - Raw events from all │ - Raw transcripts │
│ sources │ - AI extraction metadata │
│ - May contain duplicates │ - Quality flags │
│ - Tracks source system │ │
└──────────────┬───────────────┴──────────────┬────────────────────┘
│ │
↓ (FDW) ↓ (FDW)
┌──────────────────────────────────────────────────────────────────┐
│ STAGING LAYER (dbt views - open_navigator) │
├──────────────────────────────┬───────────────────────────────────┤
│ stg_bronze_event │ stg_bronze_event_youtube_transcript │
│ - Clean & normalize │ - Calculate word count │
│ - Quality flags │ - Filter <100 chars │
│ - No deduplication │ - Quality scoring │
└──────────────┬───────────────┴──────────────┬────────────────────┘
│ │
↓ ↓
┌──────────────────────────────────────────────────────────────────┐
│ MARTS LAYER (dbt tables - open_navigator) │
├──────────────────────────────┬───────────────────────────────────┤
│ event │ events_text_search │
│ - Deduplicate by video_url │ - Join to get event_id │
│ - Production-ready │ - Deduplicate by video_id │
│ - API-compatible schema │ - Production-ready │
└──────────────┬───────────────┴──────────────┬────────────────────┘
│ │
↓ ↓
┌──────────────────────────────────────────────────────────────────┐
│ API & FRONTEND │
│ (api/routes/search_postgres.py) │
└──────────────────────────────────────────────────────────────────┘
Quality Improvements
Events Deduplication
Before: Direct loading to production could create duplicates After:
- Bronze layer tracks all raw events
- Staging adds quality flags
- Marts deduplicate by
video_url(keeps most recent)
Transcript Quality Scoring
Before: No quality ranking for multiple transcripts After:
- Quality score based on: manual > auto-generated, word count
- Keeps only highest quality transcript per video
- Filters transcripts <100 characters
Data Lineage
Before: Unclear where events came from After:
sourcefield tracks origin (localview, youtube, legistar)datasource_idstores original system ID- Full history in bronze layer
Updating Data Loading Scripts
Current Scripts to Update
-
packages/scrapers/src/scrapers/youtube/load_youtube_events_to_postgres.py
- Change: Insert to
bronze_eventinstead ofevent - Change: Insert to
bronze_event_youtube_transcriptinstead ofevents_text_search
- Change: Insert to
-
scripts/datasources/localview/load_to_postgres.py
- Change: Insert to
bronze_eventinstead ofevent
- Change: Insert to
-
Any other scripts inserting to event
- Search:
grep -r "INSERT INTO event" scripts/ - Update to insert to
bronze_event
- Search:
After Updating Scripts
# Run updated loader script
python packages/scrapers/src/scrapers/youtube/load_youtube_events_to_postgres.py --states AL,MA
# Run dbt to update production tables
cd dbt_project
dbt run --select event events_text_search
# Production tables are now up to date!
Benefits
✅ Version Control - All transformations in SQL tracked by git ✅ Testable - dbt tests ensure data quality ✅ Deduplication - Automatic deduplication in marts layer ✅ Quality Filters - Consistent quality rules applied ✅ Data Lineage - Clear path from source to production ✅ Rollback-able - Can rebuild from bronze at any time
Next Steps
- Update loading scripts - Change insert targets to bronze tables
- Test full pipeline - Load → dbt run → API query
- Schedule dbt runs - Add to cron/Airflow for daily updates
- Monitor quality - Review dbt test results regularly
- Backfill bronze - Load historical data from production to bronze
Questions?
See also: