Google Colab Setup Guide

Run Open Navigator data processing workflows in Google Colab with GPU/TPU acceleration and Google Drive storage.

🎯 When to Use Google Colab

✅ Process large datasets without local compute resources
✅ Access free GPU/TPU for ML tasks (Gemini AI analysis, embeddings)
✅ Bypass local disk space limitations
✅ Run long jobs without tying up your local machine
✅ Share notebooks with collaborators

📋 Prerequisites

Google account
Google Drive with available storage (15 GB free tier)
GitHub account (for cloning repository)

🚀 Quick Start

Step 1: Create New Colab Notebook

Go to Google Colab (https://colab.research.google.com)
Click File → New Notebook
Rename notebook: open-navigator-processing.ipynb

Step 2: Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

# Verify mount
!ls /content/drive/MyDrive/

Authorization:

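A permission prompt appears when the cell runs
Click "Connect to Google Drive"
Choose your Google account
Click "Allow" to give Colab access to Drive
Older Colab versions show a link and an authorization code instead: open the link, copy the code, and paste it into the Colab input box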
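
Later cells assume Drive is mounted, so a quick guard can fail early with a clear message. The following is a minimal sketch, assuming the default mount point /content/drive used above; the helper name drive_is_mounted is hypothetical and not part of the repository.

```python
from pathlib import Path


def drive_is_mounted(mount_point="/content/drive"):
    """Return True if a MyDrive folder is visible under the mount point."""
    return (Path(mount_point) / "MyDrive").is_dir()


if not drive_is_mounted():
    print("Drive is not mounted; run drive.mount('/content/drive') first")
```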

Step 3: Clone Repository

# Navigate to Google Drive
%cd /content/drive/MyDrive/

# Clone repository (only needed first time)
# Skip git hooks to avoid permission issues on Google Drive
!git clone --no-checkout https://github.com/getcommunityone/open-navigator.git
%cd open-navigator
!git config core.hooksPath /dev/null
!git checkout main

# Install Git LFS and pull large files (images, etc.)
!apt-get install -y git-lfs
!git lfs install
!git lfs pull

# Verify clone
!ls -lh

Subsequent runs:

# Just navigate to existing directory
%cd /content/drive/MyDrive/open-navigator

# Pull latest changes (skip hooks)
!git config core.hooksPath /dev/null
!git pull origin main
!git lfs pull
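
The first-time and subsequent-run command sets above differ only in whether the repository already exists. One way to pick between them in a single cell is sketched below; sync_commands is a hypothetical helper, and it only builds the command lists so you can inspect them before running anything.

```python
from pathlib import Path

REPO_URL = "https://github.com/getcommunityone/open-navigator.git"


def sync_commands(repo_dir, branch="main"):
    """Return the git commands for a first clone or a later update."""
    repo = str(repo_dir)
    disable_hooks = ["git", "-C", repo, "config", "core.hooksPath", "/dev/null"]
    lfs_pull = ["git", "-C", repo, "lfs", "pull"]
    if (Path(repo) / ".git").is_dir():
        # Existing clone: pull latest changes with hooks disabled
        return [disable_hooks, ["git", "-C", repo, "pull", "origin", branch], lfs_pull]
    # First run: clone without checkout, disable hooks, then check out
    return [
        ["git", "clone", "--no-checkout", REPO_URL, repo],
        disable_hooks,
        ["git", "-C", repo, "checkout", branch],
        lfs_pull,
    ]


for cmd in sync_commands("/content/drive/MyDrive/open-navigator"):
    print(" ".join(cmd))  # pass each list to subprocess.run(cmd, check=True) to execute
```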

⚠️ Common Issues:

"Permission denied" on git hooks:

# Fix hook permissions
!git config core.hooksPath /dev/null
# This disables git hooks which don't work on Google Drive

"Files should have been pointers" (Git LFS):

# Install Git LFS
!apt-get install -y git-lfs
!git lfs install

# Pull actual files (replaces pointers with real files)
!git lfs pull

# Verify images loaded correctly
!ls -lh frontend/public/*.png

Step 4: Install Dependencies

# Install Python packages
!pip install -q -r requirements.txt

# Verify installation
!pip list | grep pandas
!pip list | grep psycopg2
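
The same check can be done from Python, which is handy when a later cell should stop early if something is missing. This is a small sketch using only the standard library; missing_modules is a hypothetical helper name, and it checks importable module names rather than pip distribution names.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be imported in this runtime."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pandas and psycopg2 are the two packages checked above
print(missing_modules(["pandas", "psycopg2"]))
```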

Step 5: Configure Environment

# Set secrets as environment variables for this session (replace the placeholder values)
import os

# Database connection
os.environ['NEON_DATABASE_URL_DEV'] = 'postgresql://user:pass@host:5432/dbname'

# API keys
os.environ['YOUTUBE_API_KEY'] = 'your_youtube_api_key_here'
os.environ['GEMINI_API_KEY'] = 'your_gemini_api_key_here'

# Verify
!echo "Environment configured"

Or upload .env file:

from google.colab import files

# Upload .env from your computer (it is saved to the current working directory)
uploaded = files.upload()

# Move to project root (skip this if you are already in the project root)
!mv .env /content/drive/MyDrive/open-navigator/
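
An uploaded .env file is not read automatically by the notebook kernel, so its values still need to be loaded into os.environ. The sketch below is a minimal stand-in that handles plain KEY=VALUE lines only; load_env_file is a hypothetical helper, and a maintained package such as python-dotenv is likely a better choice if the project already depends on it.

```python
import os


def load_env_file(path):
    """Load KEY=VALUE lines from a .env file into os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            loaded[key.strip()] = value.strip().strip("'\"")
    os.environ.update(loaded)
    return sorted(loaded)  # names only, so values are never printed
```

Calling load_env_file('/content/drive/MyDrive/open-navigator/.env') returns the list of variable names that were set.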

📊 Common Workflows

Load YouTube Videos to Bronze

%cd /content/drive/MyDrive/open-navigator

# Load recent videos (last 180 days)
!python packages/scrapers/src/scrapers/youtube/load_youtube_events_to_postgres.py \
  --days 180 \
  --max-videos 200 \
  --skip-transcripts \
  --states AL,MA,WI,GA,IN,WA

Process with Gemini AI

%cd /content/drive/MyDrive/open-navigator

# Analyze meetings with Gemini (the model runs remotely via the API, so no GPU runtime is needed)
!python scripts/datasources/gemini/load_meeting_transcripts.py \
  --meetings-per-channel 10 \
  --delay 3

Run dbt Models

%cd /content/drive/MyDrive/open-navigator/dbt_project

# Install dbt if needed
!pip install -q dbt-postgres

# Run models
!dbt run --select tag:bronze
!dbt run --select tag:ai-extraction

Download YouTube Audio to Google Drive

%cd /content/drive/MyDrive/open-navigator

# Install yt-dlp if needed
!pip install -q yt-dlp

# Download audio from recent videos (last 30 days)
!python packages/scrapers/src/scrapers/youtube/download_audio_to_drive.py \
  --output-dir /content/drive/MyDrive/CommunityOne/youtube_audio \
  --days 30 \
  --limit 50

# Download from specific channels
!python packages/scrapers/src/scrapers/youtube/download_audio_to_drive.py \
  --output-dir /content/drive/MyDrive/CommunityOne/youtube_audio \
  --channels "City of Seattle,Portland" \
  --limit 100

# Download from specific states
!python packages/scrapers/src/scrapers/youtube/download_audio_to_drive.py \
  --output-dir /content/drive/MyDrive/CommunityOne/youtube_audio \
  --states AL,MA,WI \
  --limit 200

Output Structure:

youtube_audio/
├── City-of-Seattle_UCxxx/
│   ├── 2026-05-01_City Council Meeting.opus
│   ├── 2026-04-28_Planning Commission.opus
│   └── ...
├── Portland-City-Hall_UCyyy/
│   ├── 2026-05-02_Council Session.opus
│   └── ...

Audio Format:

Opus codec (good quality at a small size, roughly 10-20 MB per hour)
Audio-only (no video)
Organized by channel → date_title.opus
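
Because audio accumulates quickly against the Drive quota, it can help to total the downloads per channel. The sketch below assumes the channel-folder layout shown above (one folder per channel containing .opus files); summarize_audio is a hypothetical helper, not part of the download script.

```python
from collections import defaultdict
from pathlib import Path


def summarize_audio(root):
    """Return {channel_folder: (file_count, total_bytes)} for .opus files."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(root).glob("*/*.opus"):
        entry = totals[path.parent.name]
        entry[0] += 1
        entry[1] += path.stat().st_size
    return {channel: tuple(values) for channel, values in totals.items()}
```

For example, summarize_audio('/content/drive/MyDrive/CommunityOne/youtube_audio') returns one entry per channel folder.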

Export Data to Google Drive

# Create output directory in Drive
!mkdir -p /content/drive/MyDrive/open-navigator-exports

# Export query results
import os
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(os.environ['NEON_DATABASE_URL_DEV'])

df = pd.read_sql("""
    SELECT * FROM bronze.bronze_event_youtube
    WHERE event_date >= '2026-01-01'
""", engine)

# Save to Drive
df.to_parquet('/content/drive/MyDrive/open-navigator-exports/youtube_2026.parquet')
df.to_csv('/content/drive/MyDrive/open-navigator-exports/youtube_2026.csv', index=False)

print(f"✅ Exported {len(df):,} rows to Google Drive")
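
Loading a whole table into a DataFrame can exhaust RAM on large exports. As an alternative, the sketch below streams rows to CSV in chunks through a generic DB-API cursor. It is an illustration only: export_query_to_csv is a hypothetical helper, and an in-memory SQLite table stands in for the real database so the example runs anywhere. With psycopg2, a named (server-side) cursor would be needed for rows to actually stream from Postgres.

```python
import csv
import sqlite3


def export_query_to_csv(conn, sql, out_path, chunk_size=1000):
    """Stream query results to CSV in chunks and return the row count."""
    cursor = conn.cursor()
    cursor.execute(sql)
    total = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column[0] for column in cursor.description])
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break
            writer.writerows(rows)
            total += len(rows)
    cursor.close()
    return total


# Demo with an in-memory SQLite table standing in for the real database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE events (id INTEGER, title TEXT)")
demo.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
```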

⚙️ Advanced Configuration

Use GPU Runtime

Click Runtime → Change runtime type
Select GPU from Hardware accelerator dropdown
Choose T4 (free tier) or A100 (paid tier)
Click Save

Verify GPU:

import torch
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")

Increase RAM

For large datasets:

Runtime → Change runtime type
Select High-RAM from Runtime shape (available on paid plans)
Click Save

Keep Session Alive

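Colab disconnects idle sessions after roughly 90 minutes (the exact limit varies and is not guaranteed). The loop below keeps a cell running; note that it blocks all other cells until you interrupt it:

# Add as the last notebook cell: the loop occupies the kernel until interrupted
import time
from IPython.display import clear_output

while True:
    print("❤️ Keeping session alive...")
    time.sleep(60)  # Print a heartbeat every minute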
    clear_output(wait=True)

Or use browser extension:

Colab Auto Reconnect

Monitor Progress

# Install progress bars
!pip install -q tqdm

# Use in Python scripts
from tqdm import tqdm
import time

for i in tqdm(range(100), desc="Processing"):
    time.sleep(0.1)

📁 Organize Google Drive

Recommended folder structure:

MyDrive/
├── open-navigator/              # Git repository
│   ├── scripts/
│   ├── dbt_project/
│   └── .env
├── open-navigator-exports/      # Output files
│   ├── youtube_2026.parquet
│   └── reports/
└── open-navigator-cache/        # Large cache files
    └── gemini/

Create structure:

!mkdir -p /content/drive/MyDrive/open-navigator-exports
!mkdir -p /content/drive/MyDrive/open-navigator-cache
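
The same layout can be created from Python, which also covers the nested reports and gemini folders from the tree above. This is a minimal sketch; ensure_drive_layout is a hypothetical helper, and the base path assumes the default Drive mount.

```python
from pathlib import Path

FOLDERS = ["open-navigator-exports/reports", "open-navigator-cache/gemini"]


def ensure_drive_layout(base="/content/drive/MyDrive"):
    """Create the export and cache folders if they do not exist yet."""
    created = []
    for name in FOLDERS:
        folder = Path(base) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(str(folder))
    return created
```

The call is safe to repeat, since existing folders are left untouched.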

🔒 Security Best Practices

Don't Hardcode Secrets

❌ Bad:

api_key = "AIzaSyABC123..."  # Visible in notebook history!

✅ Good:

from google.colab import userdata
api_key = userdata.get('GEMINI_API_KEY')

Store secrets in Colab:

Click 🔑 (Secrets) in left sidebar
Click Add new secret
Name: GEMINI_API_KEY, Value: your key
Click Save
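
To keep the same notebook code working both inside and outside Colab, a small wrapper can try Colab Secrets first and fall back to environment variables. The sketch below is one possible approach; get_secret is a hypothetical helper, and the broad exception handler is an assumption made to cover a missing secret or notebook access that has not been granted.

```python
import os


def get_secret(name):
    """Read a secret from Colab Secrets, falling back to os.environ."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted
        return os.environ.get(name)
```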

Use Read-Only Database Credentials

For analytics/exports, use read-only credentials:

# Read-only user (safer for Colab), stored as a Colab secret
import os
from google.colab import userdata
os.environ['DATABASE_URL_READONLY'] = userdata.get('DATABASE_URL_READONLY')
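
Connection URLs embed the password, so printing one in a shared notebook leaks it into the cell output. A small sketch for masking the password before logging follows; redact_db_url is a hypothetical helper built on the standard library URL parser.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_db_url(url):
    """Return the URL with its password replaced by ***."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.hostname or ""
    if parts.port:
        host += f":{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(redact_db_url("postgresql://readonly:pass@host:5432/db"))
```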

🐛 Troubleshooting

Git Hooks Permission Error

Error:

fatal: cannot exec '/content/drive/MyDrive/.../open-navigator/.git/hooks/post-checkout': Permission denied

Fix:

%cd /content/drive/MyDrive/open-navigator

# Disable git hooks (they don't work on Google Drive FUSE filesystem)
!git config core.hooksPath /dev/null

# Verify
!git config --get core.hooksPath
# Should output: /dev/null

Why this happens:

Google Drive FUSE filesystem doesn't preserve execute permissions
Git hooks need executable permissions to run
Solution: Disable hooks entirely (they're for pre-commit checks, not needed in Colab)
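
To confirm that missing execute permissions are the cause, the hook files can be inspected directly. The sketch below lists hooks in .git/hooks that have no execute bit set; hooks_without_exec_bit is a hypothetical helper, and it ignores the .sample files that Git ships by default.

```python
import stat
from pathlib import Path


def hooks_without_exec_bit(repo_dir):
    """List hook scripts in .git/hooks that have no execute permission bit."""
    hooks_dir = Path(repo_dir) / ".git" / "hooks"
    if not hooks_dir.is_dir():
        return []
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return sorted(
        path.name
        for path in hooks_dir.iterdir()
        if path.is_file()
        and path.suffix != ".sample"
        and not path.stat().st_mode & exec_bits
    )
```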

Git LFS Files Missing (Image Pointers)

Error:

Encountered 20 file(s) that should have been pointers, but weren't:
    frontend/public/communityone_logo.jpg
    website/static/img/favicon.ico
    ...

Fix:

# Install Git LFS
!apt-get update -qq
!apt-get install -y git-lfs

# Initialize Git LFS for this repository
%cd /content/drive/MyDrive/open-navigator
!git lfs install

# Pull actual files (replaces text pointers with real binary files)
!git lfs pull

# Verify images loaded correctly
!file frontend/public/communityone_logo.jpg
# Should say: "JPEG image data", not "ASCII text"

Why this happens:

Repository uses Git LFS (Large File Storage) for images/binaries
Without Git LFS installed, git clone downloads text pointers instead of the actual files
Need git lfs pull to fetch the real files
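
A Git LFS pointer is a short text file whose first line starts with a fixed version string, so it can be detected without the file command. The sketch below checks for that header; is_lfs_pointer is a hypothetical helper name.

```python
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file is a Git LFS text pointer, not real content."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_HEADER)) == LFS_HEADER
```

If is_lfs_pointer('frontend/public/communityone_logo.jpg') returns True, the real file has not been pulled yet.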

To avoid this in future clones:

# Install Git LFS BEFORE cloning
!apt-get install -y git-lfs
!git lfs install

# Then clone (will automatically fetch LFS files)
!git clone https://github.com/getcommunityone/open-navigator.git

"No module named 'xyz'"

# Install missing package
!pip install xyz

# Or reinstall all requirements
!pip install -q -r requirements.txt --force-reinstall

"Permission denied" when writing to Drive

# Remount Drive
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/drive')

"Runtime disconnected"

Save work frequently: Ctrl+S or File → Save
Download important outputs: files.download('output.csv')
Use checkpoints in long-running scripts
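
A checkpoint can be as simple as a JSON file on Drive that records which items are finished, so a reconnected session resumes instead of starting over. The sketch below is one minimal way to do this; the function names and the checkpoint format are assumptions for illustration, not part of the repository scripts.

```python
import json
from pathlib import Path


def load_done(checkpoint_path):
    """Return the set of item IDs already processed, or an empty set."""
    path = Path(checkpoint_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def process_with_checkpoint(items, work, checkpoint_path):
    """Run work(item) for each unprocessed item, saving progress after each."""
    done = load_done(checkpoint_path)
    for item in items:
        if item in done:
            continue  # finished in an earlier session
        work(item)
        done.add(item)
        Path(checkpoint_path).write_text(json.dumps(sorted(done)), encoding="utf-8")
    return done
```

Storing the checkpoint file under /content/drive/MyDrive/ keeps it available after the runtime is recycled.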

"Out of memory"

# Delete references to large objects first (for example: del df),
# then run the garbage collector to release the memory
import gc
gc.collect()

# Or restart runtime
# Runtime → Restart runtime

📝 Example Notebook Template

# ========================================
# Open Navigator - Google Colab Workflow
# ========================================

# 1. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 2. Navigate to project
%cd /content/drive/MyDrive/open-navigator

# 3. Pull latest code (if repo exists)
!git pull origin main

# 4. Install dependencies
!pip install -q -r requirements.txt

# 5. Set environment variables
from google.colab import userdata
import os

os.environ['NEON_DATABASE_URL_DEV'] = userdata.get('DATABASE_URL')
os.environ['GEMINI_API_KEY'] = userdata.get('GEMINI_API_KEY')

# 6. Run data processing
print("🚀 Starting data load...")
!python packages/scrapers/src/scrapers/youtube/load_youtube_events_to_postgres.py \
  --days 30 \
  --max-videos 100 \
  --skip-transcripts

print("✅ Data load complete!")

# 7. Export results
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(os.environ['NEON_DATABASE_URL_DEV'])

df = pd.read_sql("""
    SELECT COUNT(*) as total_videos, 
           MAX(event_date) as latest_date
    FROM bronze.bronze_event_youtube
""", engine)

print(df)

🔗 Related Resources

Google Colab Documentation
Colab Pro Features - Faster GPUs, longer runtimes
Colab Tips & Tricks

💡 Tips

Save notebooks in Drive - File → Save a copy in Drive
Use markdown cells - Document your workflow
Share notebooks - Click Share button, send link
Run unattended - Colab Pro+ background execution keeps a notebook running after the browser closes
Download outputs - files.download('filename') before session ends
Check quotas - Free tier has usage limits, upgrade to Pro if needed

🆘 Getting Help

If you encounter issues:

Check Troubleshooting above
Search Colab FAQ
Ask in GitHub Discussions
File bug report with notebook link
