For Developers & Technical Users
Welcome! This section contains technical documentation for developers, data scientists, and system administrators working with Open Navigator.
Platform Scale & Data Volume
Open Navigator processes data at scale across the United States:
| Category | Count | Source |
|---|---|---|
| Total Jurisdictions | 90,000+ | Census Bureau Gazetteer 2024 |
| Counties | 3,144 | All U.S. counties (FIPS coded) |
| Municipalities | 19,500+ | Cities, towns, villages, boroughs |
| Townships | 36,000+ | County subdivisions, census divisions |
| School Districts | 13,000+ | NCES Common Core of Data |
| Nonprofit Organizations | 3,000,000+ | IRS TEOS + ProPublica Nonprofit Explorer |
| State Legislatures | 50 | All U.S. states |
| Video Channels | 50+ | YouTube state legislature channels |
| Meeting Datasets | 1,000+ | MeetingBank, LocalView, City Scrapers |
| .gov Domains | 15,000+ | CISA validated government websites |
Storage & Processing Requirements
Estimated Data Volumes:
- Meeting Minutes: 10-100 MB per municipality × 1,000+ cities = 10-100 GB
- Financial Documents: 5-50 MB per jurisdiction × 90,000 = 450 GB - 4.5 TB
- Nonprofit 990s: 1-5 MB per org × 3M = 3-15 TB
- Video Content: Variable (streaming recommended over storage)
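The ranges above are simple products of per-entity size and entity count. A minimal sketch that reproduces the arithmetic, assuming decimal units (1 GB = 1,000 MB), which is what the listed totals imply; the dictionary keys and helper names are illustrative only:

```python
# Back-of-the-envelope storage estimate from the per-entity figures above.
# Decimal units are assumed (1 GB = 1,000 MB, 1 TB = 1,000,000 MB).
MB_PER_GB = 1_000
MB_PER_TB = 1_000_000

# (low MB per entity, high MB per entity, entity count), copied from the list above
ESTIMATES = {
    "meeting_minutes": (10, 100, 1_000),
    "financial_documents": (5, 50, 90_000),
    "nonprofit_990s": (1, 5, 3_000_000),
}


def total_range_mb(low_mb, high_mb, count):
    """Return the (low, high) total size in MB for `count` entities."""
    return low_mb * count, high_mb * count


def format_mb(size_mb):
    """Render a size in MB as GB or TB, whichever reads better."""
    if size_mb >= MB_PER_TB:
        return f"{size_mb / MB_PER_TB:g} TB"
    return f"{size_mb / MB_PER_GB:g} GB"


for name, (low, high, count) in ESTIMATES.items():
    low_total, high_total = total_range_mb(low, high, count)
    print(f"{name}: {format_mb(low_total)} to {format_mb(high_total)}")
```

Changing a count or a per-entity size in `ESTIMATES` is enough to re-run the estimate for a smaller deployment, such as a single state.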
Medallion Architecture (Delta Lake):
- Bronze Layer: Raw scraped data (largest storage footprint)
- Silver Layer: Cleaned/standardized (roughly 50-70% smaller than Bronze)
- Gold Layer: Analyzed/aggregated (90%+ smaller than Bronze)
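To get a feel for what the layer percentages mean in practice, the sketch below estimates Silver and Gold footprints from a Bronze size. It reads the percentages as size reductions relative to Bronze, which is an assumption about the intended meaning; the function name and default values are illustrative and not part of the platform:

```python
def layer_sizes_gb(bronze_gb, silver_reduction_pct=60, gold_reduction_pct=90):
    """Estimate per-layer storage in GB.

    Assumption: each percentage is the size reduction of that layer
    relative to Bronze (60 means the layer is 40% of the Bronze size).
    """
    for pct in (silver_reduction_pct, gold_reduction_pct):
        if not 0 <= pct < 100:
            raise ValueError("reduction percentages must be in [0, 100)")
    return {
        "bronze": bronze_gb,
        "silver": bronze_gb * (100 - silver_reduction_pct) / 100,
        "gold": bronze_gb * (100 - gold_reduction_pct) / 100,
    }


# Example: a 1 TB (1,000 GB) Bronze layer with mid-range Silver savings.
sizes = layer_sizes_gb(1_000)
print(sizes, "total:", sum(sizes.values()), "GB")
```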
API Rate Limits & Quotas
Free Tier (No Cost):
- Census Bureau: Unlimited downloads
- NCES: Unlimited bulk downloads
- ProPublica API: Respectful use (~1 req/sec suggested)
- IRS TEOS: Bulk data downloads (monthly updates)
- CISA .gov Domains: GitHub dataset (updated daily)
Paid/Limited:
- OpenAI API: Pay per token (required for LLM features)
- Harvard Dataverse: API key recommended (free registration)
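One way to honor a "roughly one request per second" guideline is a small client-side throttle. This is a minimal sketch, not the platform's own rate limiter; the `Throttle` class is hypothetical, and the clock and sleep functions are injectable only so the behavior can be tested without waiting:

```python
import time


class Throttle:
    """Space successive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous permitted call

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Typical use (fetch() stands in for whatever HTTP call you make):
#
#     throttle = Throttle(min_interval=1.0)
#     for url in urls:
#         throttle.wait()
#         fetch(url)
```

The first call passes immediately; later calls sleep only for the remainder of the interval, so slow responses do not add extra delay.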
Complete Technical Citations & Standards
For full citations, licenses, API documentation, and technical specifications:
Includes:
- Academic Research: MeetingBank (ACL 2023), LocalView (Harvard), Council Data Project, City Scrapers
- Government APIs: U.S. Census, NCES, IRS, Open States
- Standards: OCD-ID (OCDEP 2), Popolo Project, Schema.org, CEDS, OMOP CDM (OHDSI), IATI v2.03
- Data Models: Microsoft CDM for Nonprofits, OMOP vocabulary system
- Fact-Checking: N/A (not currently integrated)
- Nonprofit Data: IRS BMF (43,726 orgs from 5 states)
- Churches & Faith-Based: 4,372 congregations from IRS data
- Enterprise Tech: Microsoft (Nonprofit CDM), Google (Data Commons), AWS (Open Data), Databricks (Unity Catalog, MLflow), Snowflake, Salesforce (NPSP)
- BibTeX citations for academic papers and research use
What You'll Find Here
🚀 Setup & Installation
Get the platform running:
- Quick Start - Detailed installation instructions
- Quick Reference - CLI commands cheat sheet
- Architecture - System design and components
📊 Data Sources (Technical)
Technical details on data ingestion:
- Jurisdiction Discovery - Finding 90,000+ government websites
- Census Data - Ingesting Census Bureau datasets
- HuggingFace Datasets - Pre-built meeting collections
- YouTube Discovery - Video channel scraping
🛠️ How-To Guides
Step-by-step technical guides:
- Jurisdiction Setup - Configure discovery for your area
- HuggingFace Publishing - Publish datasets to HuggingFace Hub
- Handling Formats - Process different document types
- Scraper Improvements - Enhance scraping capabilities
🔌 Integrations
Connect external services:
- Dataverse Integration - Harvard Dataverse API
- Frontend Integration - React application setup
- LocalView - LocalView dataset ingestion
🚀 Deployment
Production deployment:
- Databricks Apps - Deploy to Databricks
- Scale Deployment - Handle large datasets
- Cost Management - Optimize expenses
💻 Development
Contributing and development:
- Changelog - Version history
- Migration Guides - Upgrading between versions
- Refactoring Summary - Recent changes
Quick Start (TL;DR)
# Clone and install
git clone https://github.com/getcommunityone/open-navigator-for-engagement.git
cd open-navigator-for-engagement
./install.sh
# Install frontend and docs
cd frontend && npm install && cd ..
cd website && npm install && cd ..
# Start all services
./start-all.sh
# Visit:
# - Main App: http://localhost:5173
# - API Docs: http://localhost:8000/docs
# - This Site: http://localhost:3000
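After `./start-all.sh`, it can be handy to confirm that all three local services answer. The sketch below uses only the standard library and the URLs listed above; `is_up` and `report` are illustrative helper names, and the script is not part of the repository:

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from the Quick Start notes above.
SERVICES = {
    "Main App": "http://localhost:5173",
    "API Docs": "http://localhost:8000/docs",
    "This Site": "http://localhost:3000",
}


def is_up(url, timeout=2.0):
    """Return True if `url` answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP 4xx/5xx, or a malformed reply.
        return False


def report(services, probe=is_up):
    """Map each service name to True/False using `probe`."""
    return {name: probe(url) for name, url in services.items()}
```

Calling `report(SERVICES)` returns a dictionary of service names to booleans; a `False` entry usually means the service has not finished starting or its port is taken by another process.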
Architecture Overview
┌─────────────────────────────────────────┐
│ Open Navigator Platform │
├─────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ React App │ │ FastAPI │ │
│ │ (Frontend) │──▶│ (Backend) │ │
│ │ Port 5173 │ │ Port 8000 │ │
│ └──────────────┘ └──────┬───────┘ │
│ │ │
│ ┌──────────────────────────▼────────┐ │
│ │ Delta Lake (Data Storage) │ │
│ │ • Bronze: Raw data │ │
│ │ • Silver: Cleaned data │ │
│ │ • Gold: Analyzed data │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘
Common Tasks
Run Jurisdiction Discovery
source .venv/bin/activate
# Test run (100 jurisdictions)
python main.py discover-jurisdictions --limit 100
# Single state
python main.py discover-jurisdictions --state CA
# Full discovery (~30k jurisdictions)
python main.py discover-jurisdictions
Ingest Reference Data
# Census jurisdictions (90,000+ entities)
python -m discovery.census_ingestion
# NCES school districts (13,000+)
python -m discovery.nces_ingestion
# Pre-built datasets
python discovery/meetingbank_ingestion.py
python discovery/city_scrapers_urls.py
python discovery/openstates_sources.py
Scrape Meeting Minutes
# Batch scraping from discovered sites
python main.py scrape-batch --source discovered --limit 50
# Single jurisdiction
python main.py scrape --url "https://chicago.legistar.com" \
--state "IL" \
--municipality "Chicago"
Publish to HuggingFace
# Requires HUGGINGFACE_TOKEN in .env
python main.py publish-to-hf --dataset all
python main.py publish-to-hf --dataset discovered-urls
python main.py publish-to-hf --dataset census --sample
Technology Stack
Backend
- Python 3.11+ - Core language
- FastAPI - REST API framework
- Delta Lake - Data lakehouse storage
- Databricks - Production data platform
- OpenAI API - LLM capabilities
Frontend
- React 18 - UI framework
- Vite - Build tool
- TypeScript - Type safety
- Leaflet - Interactive maps
Data Processing
- Pandas - Data manipulation
- BeautifulSoup - HTML parsing
- PyPDF2 - PDF extraction
- Tesseract OCR - Image to text
Deployment
- Docker - Containerization
- tmux - Session management
- Databricks Apps - Production hosting
API Reference
Start API Server
python main.py serve --host 0.0.0.0 --port 8000
Visit http://localhost:8000/docs for interactive API documentation.
Example: Start Workflow
curl -X POST "http://localhost:8000/workflow/start" \
-H "Content-Type: application/json" \
-d '{
"scrape_targets": [
{
"url": "https://chicago.legistar.com",
"municipality": "Chicago",
"state": "IL",
"platform": "legistar"
}
]
}'
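The same request can be sent from Python with the standard library. This is a sketch that mirrors the curl call above; the helper names are illustrative, and it assumes the endpoint replies with a JSON body, which is not confirmed here:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"


def build_workflow_request(targets, base_url=API_BASE):
    """Build the POST request for /workflow/start without sending it."""
    body = json.dumps({"scrape_targets": targets}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/workflow/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_workflow(targets, base_url=API_BASE, timeout=30):
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_workflow_request(targets, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Same target as the curl example above.
CHICAGO = {
    "url": "https://chicago.legistar.com",
    "municipality": "Chicago",
    "state": "IL",
    "platform": "legistar",
}
# start_workflow([CHICAGO])  # requires the API server to be running
```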
Example: Query Opportunities
curl "http://localhost:8000/opportunities?state=CA&urgency=critical"
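When building such queries in code, it is safer to let the standard library encode the parameters than to concatenate strings. A minimal sketch, assuming only the `state` and `urgency` filters shown above; the helper name is illustrative, and any other filter names would need to be checked against the interactive API docs:

```python
from urllib.parse import urlencode


def opportunities_url(base_url="http://localhost:8000", **filters):
    """Build the /opportunities URL, skipping filters set to None."""
    query = urlencode(
        {key: value for key, value in sorted(filters.items()) if value is not None}
    )
    return f"{base_url}/opportunities" + (f"?{query}" if query else "")


print(opportunities_url(state="CA", urgency="critical"))
```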
Development Workflow
1. Local Development
# Terminal 1: API (with hot reload)
source .venv/bin/activate
python main.py serve --reload
# Terminal 2: Frontend (with hot reload)
cd frontend
npm run dev
# Terminal 3: Documentation
cd website
npm start
2. Testing
# Run all tests
pytest
# With coverage
pytest --cov=agents --cov=pipeline --cov=visualization
# Specific test file
pytest tests/test_agents.py
3. Deployment
# Deploy to Databricks
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...
./scripts/deploy-databricks-app.sh
Data Pipeline
Medallion Architecture
Bronze (Raw) Silver (Cleaned) Gold (Analyzed)
────────────────────────────────────────────────────────────
Scraped PDFs → Extracted text → Classifications
Meeting videos → Transcripts → Sentiment scores
Budget docs → Line items → Budget analysis
Form 990s → Financial data → Spending patterns
File Locations
- Bronze: data/bronze/ - Raw downloaded files
- Silver: data/silver/ - Cleaned and standardized
- Gold: data/gold/ - Enriched with analysis
- Cache: cache/ - Temporary processing files
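Scripts that read or write these directories can resolve paths through one small helper instead of hard-coding strings. The sketch below uses the directory names listed above; the helper itself and the `IL/chicago.txt` sub-path are hypothetical and do not reflect the project's actual file layout:

```python
from pathlib import Path

# Directory names as listed above, relative to the repository root.
LAYER_DIRS = {
    "bronze": Path("data/bronze"),
    "silver": Path("data/silver"),
    "gold": Path("data/gold"),
    "cache": Path("cache"),
}


def layer_path(layer, *parts, root=Path(".")):
    """Return the path for `parts` inside the given layer directory."""
    try:
        base = LAYER_DIRS[layer]
    except KeyError:
        raise ValueError(f"unknown layer: {layer!r}") from None
    return root.joinpath(base, *parts)


# Hypothetical sub-path, for illustration only.
print(layer_path("silver", "IL", "chicago.txt").as_posix())
```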
Configuration
Environment Variables
Create .env file:
# Required
OPENAI_API_KEY=sk-...
# Optional (for production)
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi...
# Optional (for publishing)
HUGGINGFACE_TOKEN=hf_...
# Optional (for Harvard Dataverse)
DATAVERSE_API_KEY=...
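A quick pre-flight check can catch a missing key before a long job starts. This sketch inspects the process environment for the variable names listed above; it assumes the values from `.env` have already been loaded into the environment (how the project loads them is not shown here), and `check_env` is an illustrative helper:

```python
import os

# Variable names from the .env example above.
REQUIRED = ("OPENAI_API_KEY",)
OPTIONAL = (
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
    "HUGGINGFACE_TOKEN",
    "DATAVERSE_API_KEY",
)


def check_env(env=None):
    """Return (missing_required, missing_optional) as lists of names."""
    env = os.environ if env is None else env
    missing_required = [name for name in REQUIRED if not env.get(name)]
    missing_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing_required, missing_optional


missing_required, missing_optional = check_env()
if missing_required:
    print("Missing required variables:", ", ".join(missing_required))
if missing_optional:
    print("Optional variables not set:", ", ".join(missing_optional))
```

Empty values are treated as missing, so a line such as `OPENAI_API_KEY=` in `.env` is reported as well.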
Settings File
Edit config/settings.py for:
- Delta Lake paths
- Scraping rate limits
- Batch sizes
- Model configurations
Contributing
1. Fork & Clone
git clone https://github.com/YOUR-USERNAME/open-navigator-for-engagement.git
cd open-navigator-for-engagement
git remote add upstream https://github.com/getcommunityone/open-navigator-for-engagement.git
2. Create Branch
git checkout -b feature/your-feature-name
3. Make Changes
- Add tests for new features
- Update documentation
- Follow existing code style
- Keep commits focused and atomic
4. Submit PR
git push origin feature/your-feature-name
# Then create PR on GitHub
See CONTRIBUTING.md for details.
Troubleshooting
Port Already in Use
# Find process using port
lsof -i :8000
lsof -i :5173
lsof -i :3000
# Kill process
kill -9 <PID>
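On systems without `lsof`, a port can also be probed from Python. The sketch below is a generic standard-library check, not a project script; it only reports whether something is listening and does not identify the owning process:

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


# Ports used by the API, the frontend, and the docs site.
for port in (8000, 5173, 3000):
    state = "in use" if port_in_use(port) else "free"
    print(f"port {port}: {state}")
```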
Dependencies Not Installing
# Recreate the virtual environment and reinstall dependencies
rm -rf .venv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Scraping Failures
Check logs:
tail -f logs/scraper.log
Adjust rate limits in config/settings.py.
Next Steps
- Read Architecture → System Design
- Set Up Environment → Quick Start
- Run Discovery → Jurisdiction Setup
- Deploy to Production → Databricks Apps
- Contribute → GitHub Issues
Support
- GitHub Issues: Report bugs or request features
- Documentation: Browse the sidebar
- API Docs: http://localhost:8000/docs
- Email: johnbowyer@communityone.com