Skip to main content

Open Navigator dbt Project

Transforms bronze AI extractions into production-ready search tables.

๐ŸŽฏ Purposeโ€‹

This dbt project handles:

  • Bronze โ†’ Production transformations (AI extracted data)
  • Data quality testing
  • Incremental processing (only new records)
  • Entity deduplication
  • Documentation generation

๐Ÿ“ Project Structureโ€‹

dbt_project/
โ”œโ”€โ”€ dbt_project.yml # Project configuration
โ”œโ”€โ”€ profiles.yml.example # Database connection template
โ”œโ”€โ”€ models/
โ”‚ โ”œโ”€โ”€ staging/ # Clean bronze data
โ”‚ โ”‚ โ”œโ”€โ”€ _staging.yml
โ”‚ โ”‚ โ”œโ”€โ”€ stg_bronze_contacts.sql
โ”‚ โ”‚ โ”œโ”€โ”€ stg_bronze_organizations_meetings.sql
โ”‚ โ”‚ โ””โ”€โ”€ stg_bronze_bills.sql
โ”‚ โ”œโ”€โ”€ intermediate/ # Deduplicate
โ”‚ โ”‚ โ”œโ”€โ”€ _intermediate.yml
โ”‚ โ”‚ โ””โ”€โ”€ int_contacts_deduped.sql
โ”‚ โ””โ”€โ”€ marts/ # Production tables
โ”‚ โ”œโ”€โ”€ _marts.yml
โ”‚ โ””โ”€โ”€ contact_ai.sql
โ”œโ”€โ”€ macros/ # Reusable SQL functions
โ”‚ โ”œโ”€โ”€ calculate_confidence.sql
โ”‚ โ”œโ”€โ”€ normalize_bill_number.sql
โ”‚ โ””โ”€โ”€ normalize_name.sql
โ”œโ”€โ”€ tests/ # Custom data quality tests
โ””โ”€โ”€ README.md # This file

๐Ÿš€ Quick Startโ€‹

1. Install dbtโ€‹

# Install dbt-postgres
pip install dbt-postgres

# Verify installation
dbt --version

2. Configure Database Connectionโ€‹

# Copy example profiles
cp profiles.yml.example ~/.dbt/profiles.yml

# Edit with your database credentials
nano ~/.dbt/profiles.yml

Or set environment variables:

export POSTGRES_PASSWORD=your_password
export NEON_HOST=your-neon-host.neon.tech
export NEON_USER=your_user
export NEON_PASSWORD=your_password

3. Test Connectionโ€‹

# Check dbt can connect
dbt debug

# Should show:
# โœ“ Connection test: [OK connection ok]

4. Run Modelsโ€‹

# Run all models
dbt run

# Run specific model
dbt run --select stg_bronze_contacts

# Run with full refresh (rebuild everything)
dbt run --full-refresh

# Run tests
dbt test

# Generate documentation
dbt docs generate
dbt docs serve # Opens in browser

๐Ÿ“Š Model Layersโ€‹

Staging (models/staging/)โ€‹

Purpose: Clean and normalize bronze data

  • stg_bronze_contacts.sql - Clean contact names, filter invalid records
  • stg_bronze_organizations_meetings.sql - Normalize org names, clean EINs
  • stg_bronze_bills.sql - Standardize bill numbers

Materialization: view (no storage, computed on-the-fly)

Intermediate (models/intermediate/)โ€‹

Purpose: Deduplicate and prepare for production

  • int_contacts_deduped.sql - One record per person per org

Materialization: table (stored, fast to query)

Marts (models/marts/)โ€‹

Purpose: Production-ready tables for API

  • contact_ai.sql - AI-extracted contacts (incremental)

Materialization: incremental (only processes new records)

๐Ÿงช Testingโ€‹

Run Testsโ€‹

# Run all tests
dbt test

# Run tests for specific model
dbt test --select contact_ai

# Run specific test type
dbt test --select test_type:unique
dbt test --select test_type:not_null

Available Testsโ€‹

  1. Schema tests (in .yml files)

    • unique - No duplicates
    • not_null - No NULL values
    • accepted_values - Value in allowed list
    • relationships - Foreign key exists
  2. Custom tests (in tests/ folder)

    • Custom SQL assertions

๐Ÿ”„ Incremental Processingโ€‹

Models marked materialized='incremental' only process new records:

{% if is_incremental() %}
WHERE extracted_at > (SELECT MAX(last_updated) FROM {{ this }})
{% endif %}

Full Refreshโ€‹

To rebuild everything from scratch:

dbt run --full-refresh --select contact_ai

๐ŸŽจ Macrosโ€‹

Reusable SQL functions in macros/:

calculate_confidence(datasource)โ€‹

SELECT {{ calculate_confidence('datasource') }} as score
-- Returns 1.0 for authoritative, 0.60 for AI extraction

normalize_bill_number(column)โ€‹

SELECT {{ normalize_bill_number('official_number') }} as bill_num
-- 'HB 123' โ†’ 'HB123'

normalize_name(column)โ€‹

SELECT {{ normalize_name('full_name') }} as name_clean
-- Lowercase, trim, remove special chars

๐Ÿ“‹ Workflow Integrationโ€‹

Combined with Python ETLโ€‹

#!/bin/bash
# Full ETL pipeline

# 1. Python: Load bronze data
python scripts/datasources/gemini/load_meeting_transcripts_bronze.py

# 2. dbt: Transform to production
cd dbt_project
dbt run --select staging+
dbt run --select intermediate+
dbt run --select marts+
dbt test

# 3. Python: Export to parquet (if needed)
cd ..
python scripts/data/export_to_gold_parquet.py

๐Ÿ› Troubleshootingโ€‹

"relation does not exist"โ€‹

Problem: Source table not found

Solution: Check you're connected to the right database

dbt debug
# Look at "target" database

"Compilation Error: macro 'dbt_utils' is not defined"โ€‹

Problem: Missing dbt packages

Solution: Install packages

# Create packages.yml
cat > packages.yml << EOF
packages:
- package: dbt-labs/dbt_utils
version: 1.1.1
EOF

# Install
dbt deps

"Incremental model not updating"โ€‹

Problem: New records not being processed

Solution: Check timestamp logic

# Full refresh to rebuild
dbt run --full-refresh --select contact_ai

๐Ÿ“š Resourcesโ€‹

โญ๏ธ Next Stepsโ€‹

  1. Install packages: dbt deps
  2. Run models: dbt run
  3. Run tests: dbt test
  4. Generate docs: dbt docs generate && dbt docs serve
  5. Iterate: Add more models incrementally