
For Developers & Technical Users

Welcome! This section contains technical documentation for developers, data scientists, and system administrators working with Open Navigator.

Platform Scale & Data Volume

Open Navigator processes data at scale across the United States:

Category                  Count        Source
Total Jurisdictions       90,000+      Census Bureau Gazetteer 2024
Counties                  3,144        All U.S. counties (FIPS coded)
Municipalities            19,500+      Cities, towns, villages, boroughs
Townships                 36,000+      County subdivisions, census divisions
School Districts          13,000+      NCES Common Core of Data
Nonprofit Organizations   3,000,000+   IRS TEOS + ProPublica Nonprofit Explorer
State Legislatures        50           All U.S. states
Video Channels            50+          YouTube state legislature channels
Meeting Datasets          1,000+       MeetingBank, LocalView, City Scrapers
.gov Domains              15,000+      CISA validated government websites

Storage & Processing Requirements

Estimated Data Volumes:

  • Meeting Minutes: 10-100 MB per municipality × 1,000+ cities = 10-100 GB
  • Financial Documents: 5-50 MB per jurisdiction × 90,000 = 450 GB - 4.5 TB
  • Nonprofit 990s: 1-5 MB per org × 3M = 3-15 TB
  • Video Content: Variable (streaming recommended over storage)
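
The ranges above are simple products of per-item size and item count. The sketch below reproduces that arithmetic so the estimates can be re-run with different inputs. It assumes decimal units (1 GB = 1,000 MB), and the function names are illustrative rather than part of the Open Navigator codebase.

```python
# Back-of-the-envelope storage estimate from the per-item sizes listed above.
# Assumption: decimal units (1 GB = 1,000 MB, 1 TB = 1,000,000 MB).

MB_PER_GB = 1_000
MB_PER_TB = 1_000_000

# (label, min MB per item, max MB per item, item count)
SOURCES = [
    ("Meeting minutes", 10, 100, 1_000),
    ("Financial documents", 5, 50, 90_000),
    ("Nonprofit 990s", 1, 5, 3_000_000),
]


def estimate_range_mb(min_mb, max_mb, count):
    """Return the (low, high) total size in MB for `count` items."""
    return min_mb * count, max_mb * count


def format_size(mb):
    """Render a size given in MB as GB, or as TB once it reaches 1 TB."""
    if mb >= MB_PER_TB:
        return f"{mb / MB_PER_TB:g} TB"
    return f"{mb / MB_PER_GB:g} GB"


for label, min_mb, max_mb, count in SOURCES:
    low, high = estimate_range_mb(min_mb, max_mb, count)
    print(f"{label}: {format_size(low)} to {format_size(high)}")
```

Video content is left out because the text recommends streaming it rather than storing it.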

Medallion Architecture (Delta Lake):

  • Bronze Layer: Raw scraped data (largest storage footprint)
  • Silver Layer: Cleaned/standardized (50-70% compression)
  • Gold Layer: Analyzed/aggregated (90%+ compression)

API Rate Limits & Quotas

Free Tier (No Cost):

  • Census Bureau: Unlimited downloads
  • NCES: Unlimited bulk downloads
  • ProPublica API: Respectful use (~1 req/sec suggested)
  • IRS TEOS: Bulk data downloads (monthly updates)
  • CISA .gov Domains: GitHub dataset (updated daily)
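
One way to honor the suggested ~1 request per second for the ProPublica API is a small client-side throttle. This is a minimal sketch using only the standard library. The `Throttle` class is a hypothetical helper, not something shipped with Open Navigator, and the project's scrapers may handle rate limiting differently.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (illustrative helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
        return now
```

Calling `throttle.wait()` immediately before each HTTP request keeps a loop at or below one request per `min_interval` seconds.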

Paid/Limited:

  • OpenAI API: Pay per token (required for LLM features)
  • Harvard Dataverse: API key recommended (free registration)

Complete Technical Citations & Standards

For full citations, licenses, API documentation, and technical specifications:

Citations & Data Sources

Includes:

  • Academic Research: MeetingBank (ACL 2023), LocalView (Harvard), Council Data Project, City Scrapers
  • Government APIs: U.S. Census, NCES, IRS, Open States
  • Standards: OCD-ID (OCDEP 2), Popolo Project, Schema.org, CEDS, OMOP CDM (OHDSI), IATI v2.03
  • Data Models: Microsoft CDM for Nonprofits, OMOP vocabulary system
  • Fact-Checking: N/A (not currently integrated)
  • Nonprofit Data: IRS BMF (43,726 orgs from 5 states)
  • Churches & Faith-Based: 4,372 congregations from IRS data
  • Enterprise Tech: Microsoft (Nonprofit CDM), Google (Data Commons), AWS (Open Data), Databricks (Unity Catalog, MLflow), Snowflake, Salesforce (NPSP)
  • BibTeX citations for academic papers and research use

What You'll Find Here

🚀 Setup & Installation

Get the platform running:

📊 Data Sources (Technical)

Technical details on data ingestion:

🛠️ How-To Guides

Step-by-step technical guides:

🔌 Integrations

Connect external services:

🚀 Deployment

Production deployment:

💻 Development

Contributing and development:

Quick Start (TL;DR)

# Clone and install
git clone https://github.com/getcommunityone/open-navigator-for-engagement.git
cd open-navigator-for-engagement
./install.sh

# Install frontend and docs
cd frontend && npm install && cd ..
cd website && npm install && cd ..

# Start all services
./start-all.sh

# Visit:
# - Main App: http://localhost:5173
# - API Docs: http://localhost:8000/docs
# - This Site: http://localhost:3000

Architecture Overview

┌─────────────────────────────────────────┐
│         Open Navigator Platform         │
├─────────────────────────────────────────┤
│                                         │
│  ┌──────────────┐   ┌──────────────┐    │
│  │  React App   │   │   FastAPI    │    │
│  │  (Frontend)  │──▶│  (Backend)   │    │
│  │  Port 5173   │   │  Port 8000   │    │
│  └──────────────┘   └──────┬───────┘    │
│                            │            │
│  ┌─────────────────────────▼─────────┐  │
│  │  Delta Lake (Data Storage)        │  │
│  │  • Bronze: Raw data               │  │
│  │  • Silver: Cleaned data           │  │
│  │  • Gold: Analyzed data            │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Common Tasks

Run Jurisdiction Discovery

source .venv/bin/activate

# Test run (100 jurisdictions)
python main.py discover-jurisdictions --limit 100

# Single state
python main.py discover-jurisdictions --state CA

# Full discovery (~30k jurisdictions)
python main.py discover-jurisdictions

Ingest Reference Data

# Census jurisdictions (90,000+ entities)
python -m discovery.census_ingestion

# NCES school districts (13,000+)
python -m discovery.nces_ingestion

# Pre-built datasets
python discovery/meetingbank_ingestion.py
python discovery/city_scrapers_urls.py
python discovery/openstates_sources.py

Scrape Meeting Minutes

# Batch scraping from discovered sites
python main.py scrape-batch --source discovered --limit 50

# Single jurisdiction
python main.py scrape --url "https://chicago.legistar.com" \
    --state "IL" \
    --municipality "Chicago"

Publish to HuggingFace

# Requires HUGGINGFACE_TOKEN in .env
python main.py publish-to-hf --dataset all
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset census --sample

Technology Stack

Backend

  • Python 3.11+ - Core language
  • FastAPI - REST API framework
  • Delta Lake - Data lakehouse storage
  • Databricks - Production data platform
  • OpenAI API - LLM capabilities

Frontend

  • React 18 - UI framework
  • Vite - Build tool
  • TypeScript - Type safety
  • Leaflet - Interactive maps

Data Processing

  • Pandas - Data manipulation
  • BeautifulSoup - HTML parsing
  • PyPDF2 - PDF extraction
  • Tesseract OCR - Image to text

Deployment

  • Docker - Containerization
  • tmux - Session management
  • Databricks Apps - Production hosting

API Reference

Start API Server

python main.py serve --host 0.0.0.0 --port 8000

Visit http://localhost:8000/docs for interactive API documentation.

Example: Start Workflow

curl -X POST "http://localhost:8000/workflow/start" \
  -H "Content-Type: application/json" \
  -d '{
    "scrape_targets": [
      {
        "url": "https://chicago.legistar.com",
        "municipality": "Chicago",
        "state": "IL",
        "platform": "legistar"
      }
    ]
  }'
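
The same request can be issued from Python. The sketch below mirrors the curl call above using only the standard library. The helper names are hypothetical, and because the response schema is not documented here, the body is returned as raw text.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # local development server from this guide


def build_workflow_request(targets, base_url=API_BASE):
    """Build (but do not send) the POST /workflow/start request."""
    body = json.dumps({"scrape_targets": targets}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/workflow/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_workflow(targets, base_url=API_BASE, timeout=30):
    """Send the request and return the raw response body as text."""
    request = build_workflow_request(targets, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Same payload as the curl example above.
CHICAGO_TARGET = {
    "url": "https://chicago.legistar.com",
    "municipality": "Chicago",
    "state": "IL",
    "platform": "legistar",
}
```

With the API server running, `print(start_workflow([CHICAGO_TARGET]))` submits the workflow.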

Example: Query Opportunities

curl "http://localhost:8000/opportunities?state=CA&urgency=critical"
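
When filters are assembled programmatically, it helps to build the query string with proper URL encoding. This is a small sketch with a hypothetical helper name. Only the `state` and `urgency` parameters appear in this guide, so any other filter name passed in is an assumption about the API.

```python
from urllib.parse import urlencode


def opportunities_url(base_url="http://localhost:8000", **filters):
    """Build the GET /opportunities URL, skipping filters set to None."""
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)
    return f"{base_url}/opportunities" + (f"?{query}" if query else "")


print(opportunities_url(state="CA", urgency="critical"))
```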

Development Workflow

1. Local Development

# Terminal 1: API (with hot reload)
source .venv/bin/activate
python main.py serve --reload

# Terminal 2: Frontend (with hot reload)
cd frontend
npm run dev

# Terminal 3: Documentation
cd website
npm start

2. Testing

# Run all tests
pytest

# With coverage
pytest --cov=agents --cov=pipeline --cov=visualization

# Specific test file
pytest tests/test_agents.py

3. Deployment

# Deploy to Databricks
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...
./scripts/deploy-databricks-app.sh

Data Pipeline

Medallion Architecture

Bronze (Raw)         Silver (Cleaned)      Gold (Analyzed)
────────────────────────────────────────────────────────────
Scraped PDFs      →  Extracted text     →  Classifications
Meeting videos    →  Transcripts        →  Sentiment scores
Budget docs       →  Line items         →  Budget analysis
Form 990s         →  Financial data     →  Spending patterns

File Locations

  • Bronze: data/bronze/ - Raw downloaded files
  • Silver: data/silver/ - Cleaned and standardized
  • Gold: data/gold/ - Enriched with analysis
  • Cache: cache/ - Temporary processing files
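
Scripts that read or write these directories can resolve them through one helper instead of hard-coding strings. The sketch below assumes paths relative to the repository root. The `layer_path` function and the state/file sub-layout in the example are hypothetical, not the project's actual layout.

```python
from pathlib import Path

LAYERS = ("bronze", "silver", "gold")


def layer_path(layer, *parts, root=Path("data")):
    """Return the path for a medallion layer, e.g. data/bronze/<parts...>."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}; expected one of {LAYERS}")
    return root.joinpath(layer, *parts)


# Hypothetical sub-layout: one folder per state under each layer.
print(layer_path("bronze", "IL", "chicago-minutes.pdf"))
```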

Configuration

Environment Variables

Create a .env file:

# Required
OPENAI_API_KEY=sk-...

# Optional (for production)
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi...

# Optional (for publishing)
HUGGINGFACE_TOKEN=hf_...

# Optional (for Harvard Dataverse)
DATAVERSE_API_KEY=...
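
A quick sanity check before starting the services can catch a missing key early. This sketch parses simple KEY=VALUE lines and reports which of the variables listed above are unset. How Open Navigator itself loads .env is not specified here, so both functions are illustrative assumptions.

```python
REQUIRED = ("OPENAI_API_KEY",)
OPTIONAL = (
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
    "HUGGINGFACE_TOKEN",
    "DATAVERSE_API_KEY",
)


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_config(env):
    """Return (missing_required, unset_optional) for a mapping of settings."""
    missing = [name for name in REQUIRED if not env.get(name)]
    unset = [name for name in OPTIONAL if not env.get(name)]
    return missing, unset
```

For example, `check_config(parse_env_text(open(".env").read()))` returns an empty first list when the required key is present.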

Settings File

Edit config/settings.py for:

  • Delta Lake paths
  • Scraping rate limits
  • Batch sizes
  • Model configurations

Contributing

1. Fork & Clone

git clone https://github.com/YOUR-USERNAME/open-navigator-for-engagement.git
cd open-navigator-for-engagement
git remote add upstream https://github.com/getcommunityone/open-navigator-for-engagement.git

2. Create Branch

git checkout -b feature/your-feature-name

3. Make Changes

  • Add tests for new features
  • Update documentation
  • Follow existing code style
  • Keep commits focused and atomic

4. Submit PR

git push origin feature/your-feature-name
# Then create PR on GitHub

See CONTRIBUTING.md for details.

Troubleshooting

Port Already in Use

# Find process using port
lsof -i :8000
lsof -i :5173
lsof -i :3000

# Kill process
kill -9 <PID>
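
Where lsof is not available, a port can also be probed from Python. This is a minimal sketch that attempts a TCP connection on localhost. The function name is illustrative, and the check only tells you that something is listening, not which process owns the port.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# The three ports used by the API, the frontend, and the docs site.
for port in (8000, 5173, 3000):
    print(port, "in use" if port_in_use(port) else "free")
```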

Dependencies Not Installing

# Recreate the virtual environment and reinstall dependencies
rm -rf .venv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Scraping Failures

Check logs:

tail -f logs/scraper.log

Adjust rate limits in config/settings.py.

Next Steps

  1. Read Architecture: System Design
  2. Set Up Environment: Quick Start
  3. Run Discovery: Jurisdiction Setup
  4. Deploy to Production: Databricks Apps
  5. Contribute: GitHub Issues

Support