
D Drive Configuration for Large Datasets

Configure Open Navigator to store large datasets (ACS census data, IRS 990s, etc.) on an external drive or secondary volume to avoid filling your primary disk.

Why Use External Storage?

Open Navigator downloads and caches large datasets:

| Dataset | Size | Records |
|---|---|---|
| ACS 5-Year (All States) | ~15 GB | 85,000 tracts |
| IRS Form 990s | ~100 GB | 1.8M nonprofits |
| Meeting Minutes (PDFs) | ~500 GB | 90,000 jurisdictions |
| Legislative Bills | ~50 GB | Millions of bills |

Total potential storage: Up to 1 TB of data!

📁 Recommended Directory Structure

Create this structure on your D drive (or external storage):

D:/open-navigator-data/
├── acs/                # American Community Survey demographic data
│   ├── B19013_county_*_2022.parquet
│   ├── B27010_county_*_2022.parquet
│   └── acs_2022_ALL/   # Bulk downloads
├── irs/                # IRS nonprofit data
│   ├── bmf/            # Business Master File
│   └── 990s/           # Form 990 PDFs and XML
├── legislative/        # Legislative data
│   ├── bills/
│   ├── votes/
│   └── legislators/
├── meetings/           # Meeting minutes and agendas
│   ├── pdfs/
│   └── transcripts/
└── cache/              # General cache
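
If you prefer to script this, here is a small cross-platform Python sketch that builds the same tree. The base path is an example; substitute your own platform's path from the sections below:

from pathlib import Path

# Example base path: "D:/open-navigator-data" on Windows,
# "/mnt/d/open-navigator-data" on Linux/WSL,
# "/Volumes/<YourDrive>/open-navigator-data" on macOS.
base = Path("D:/open-navigator-data")

subdirs = [
    "acs/acs_2022_ALL",
    "irs/bmf", "irs/990s",
    "legislative/bills", "legislative/votes", "legislative/legislators",
    "meetings/pdfs", "meetings/transcripts",
    "cache",
]
for sub in subdirs:
    # parents=True also creates acs/, irs/, etc. along the way
    (base / sub).mkdir(parents=True, exist_ok=True)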

🖥️ Platform-Specific Setup

Windows (D Drive)

1. Create the data directory:

# PowerShell
New-Item -Path "D:\open-navigator-data" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\acs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\irs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\legislative" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\meetings" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\cache" -ItemType Directory -Force

2. Set permissions (if needed):

# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T

3. Use in Python scripts:

from pathlib import Path

# Option 1: Explicit Windows path
data_dir = Path("D:/open-navigator-data/acs")

# Option 2: Using environment variable
import os
data_dir = Path(os.environ.get("OPEN_NAV_DATA_DIR", "D:/open-navigator-data/acs"))

Linux (Mounted Drive)

1. Mount the drive:

# Find your drive
lsblk

# Create mount point
sudo mkdir -p /mnt/d

# Mount (replace /dev/sdb1 with your drive)
sudo mount /dev/sdb1 /mnt/d

# Auto-mount on boot (add to /etc/fstab)
echo "/dev/sdb1 /mnt/d ext4 defaults 0 0" | sudo tee -a /etc/fstab

2. Create directories:

mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}

3. Set ownership:

# Give yourself ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data
chmod -R 755 /mnt/d/open-navigator-data

4. Use in scripts:

from pathlib import Path

data_dir = Path("/mnt/d/open-navigator-data/acs")

macOS (External Drive)

1. External drives auto-mount to /Volumes:

# Your drive might be at:
ls /Volumes/
# Example: /Volumes/MyExternalDrive

2. Create directories:

mkdir -p "/Volumes/MyExternalDrive/open-navigator-data"/{acs,irs,legislative,meetings,cache}

3. Use in scripts:

from pathlib import Path

data_dir = Path("/Volumes/MyExternalDrive/open-navigator-data/acs")

WSL (Windows Subsystem for Linux)

Windows drives are auto-mounted in WSL:

# D drive is at /mnt/d
cd /mnt/d

# Create directories
mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}

Use in scripts (WSL path):

from pathlib import Path

# WSL path to Windows D drive
data_dir = Path("/mnt/d/open-navigator-data/acs")

🔧 Configuration Methods

Method 1: Environment Variable

Advantages: single configuration that all scripts respect.

Setup:

Add to .env file in project root:

# .env
OPEN_NAV_DATA_DIR=D:/open-navigator-data

# Or Linux/Mac:
# OPEN_NAV_DATA_DIR=/mnt/d/open-navigator-data
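
Note that .env files are not read by Python automatically. One common approach is the python-dotenv package; a minimal sketch, assuming python-dotenv is installed:

# pip install python-dotenv
from dotenv import load_dotenv
import os
from pathlib import Path

load_dotenv()  # reads key=value pairs from a nearby .env file into os.environ

data_dir = Path(os.environ.get("OPEN_NAV_DATA_DIR", "data/cache"))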

Update scripts to use it:

# scripts/datasources/census/acs_ingestion.py
import os
from pathlib import Path

# Use environment variable with fallback
base_data_dir = os.environ.get("OPEN_NAV_DATA_DIR", "data/cache")
acs_dir = Path(base_data_dir) / "acs"

acs = ACSDataIngestion(data_dir=acs_dir)

Method 2: Config File

Create config/paths.py:

from pathlib import Path
import os

# Base data directory (customizable)
BASE_DATA_DIR = Path(os.environ.get("OPEN_NAV_DATA_DIR", "data"))

# Subdirectories
ACS_DATA_DIR = BASE_DATA_DIR / "acs"
IRS_DATA_DIR = BASE_DATA_DIR / "irs"
LEGISLATIVE_DATA_DIR = BASE_DATA_DIR / "legislative"
MEETINGS_DATA_DIR = BASE_DATA_DIR / "meetings"
CACHE_DIR = BASE_DATA_DIR / "cache"

# Create directories if they don't exist
for directory in [ACS_DATA_DIR, IRS_DATA_DIR, LEGISLATIVE_DATA_DIR,
                  MEETINGS_DATA_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

Use in scripts:

from config.paths import ACS_DATA_DIR

acs = ACSDataIngestion(data_dir=ACS_DATA_DIR)

Method 3: Command-Line Argument

For one-off downloads:

# scripts/datasources/census/download_acs.py
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
# Note: argparse does not apply type= to the default, so pass a Path directly
parser.add_argument("--data-dir", type=Path, default=Path("data/cache/acs"),
                    help="Directory to store ACS data")
args = parser.parse_args()

acs = ACSDataIngestion(data_dir=args.data_dir)

Usage:

# Use D drive
python download_acs.py --data-dir D:/open-navigator-data/acs

# Use default
python download_acs.py
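
These methods also compose: use the environment variable as the argparse default, so a command-line flag overrides the env var, which overrides the built-in fallback. A sketch:

import argparse
import os
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--data-dir",
    type=Path,
    # Precedence: --data-dir flag > OPEN_NAV_DATA_DIR > local fallback
    default=Path(os.environ.get("OPEN_NAV_DATA_DIR", "data/cache")) / "acs",
    help="Directory to store ACS data",
)
args = parser.parse_args()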

🗂️ Example Configurations

Small Project (Default)

Storage: Local project directory
Size: < 10 GB
Speed: Fastest (same disk as code)

# Uses data/cache/ in project directory
acs = ACSDataIngestion()

Medium Project (D Drive)

Storage: D drive or external SSD
Size: 10-100 GB
Speed: Fast

from pathlib import Path

acs = ACSDataIngestion(data_dir=Path("D:/open-navigator-data/acs"))

Large Project (Network Storage)

Storage: NAS or cloud-mounted drive
Size: 100+ GB
Speed: Slower but shared

from pathlib import Path

# Windows network path
data_dir = Path("//server/open-navigator/acs")

# Or mounted network drive (Linux)
# data_dir = Path("/mnt/nas/open-navigator/acs")

acs = ACSDataIngestion(data_dir=data_dir)

📊 Storage Requirements by Dataset

ACS Census Data

| Download Type | Size | Time (50 Mbps) |
|---|---|---|
| Single table, single state | ~5 MB | < 1 sec |
| Single table, all states | ~50 MB | ~10 sec |
| All tables, all states (API) | ~500 MB | ~2 min |
| Bulk download (all data) | ~15 GB | ~40 min |

Recommended: Start with API downloads (targeted), only use bulk if needed.

IRS Nonprofit Data

| Dataset | Size | Records |
|---|---|---|
| Business Master File (BMF) | ~500 MB | 1.8M nonprofits |
| Form 990 XML (1 year) | ~20 GB | ~300K filings |
| Form 990 PDFs (1 year) | ~50 GB | ~300K PDFs |
| Full 990 archive (10 years) | ~500 GB | 3M+ filings |

Recommended: D drive or external storage required.

Legislative Data

| Dataset | Size | Records |
|---|---|---|
| OpenStates bills (1 state, 1 year) | ~100 MB | ~5K bills |
| OpenStates bills (all states, 1 year) | ~5 GB | ~250K bills |
| OpenStates bills (all states, 10 years) | ~50 GB | 2.5M bills |

Recommended: D drive for multi-year data.

Meeting Minutes

| Dataset | Size | Records |
|---|---|---|
| Meeting agendas (text) | ~1 GB | 100K meetings |
| Meeting PDFs (cached) | ~500 GB | 1M documents |

Recommended: External storage for PDF archive.

🔍 Verify Configuration

Test your setup:

from pathlib import Path
import shutil

import pandas as pd

def test_storage_config(data_dir: Path):
    """Test that we can write/read data."""
    # Create test directory
    test_dir = data_dir / "test"
    test_dir.mkdir(parents=True, exist_ok=True)

    # Write test file
    test_file = test_dir / "test.parquet"
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df.to_parquet(test_file)

    # Read test file
    df2 = pd.read_parquet(test_file)

    # Verify
    assert df.equals(df2), "Data mismatch!"

    # Cleanup
    test_file.unlink()
    test_dir.rmdir()

    print(f"✅ Storage configuration verified: {data_dir.absolute()}")
    print(f"✅ Free space: {get_free_space(data_dir):.2f} GB")

def get_free_space(path: Path) -> float:
    """Get free disk space in GB."""
    stat = shutil.disk_usage(path)
    return stat.free / (1024**3)

# Test
test_storage_config(Path("D:/open-navigator-data"))

⚡ Performance Tips

1. Use SSD for Cache

  • C drive (SSD): Small, frequently accessed files
  • D drive (HDD): Large, rarely accessed archives

2. Organize by Access Frequency

D:/open-navigator-data/
├── hot/                # Frequently accessed (keep on SSD if possible)
│   └── acs/            # Census data for current analysis
└── cold/               # Rarely accessed (OK on HDD)
    ├── irs/990s/       # Historical 990 PDFs
    └── archive/        # Old datasets
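
One way to wire this into scripts is a small lookup that routes each dataset to its tier. The tier assignments and paths below are illustrative, not part of Open Navigator:

from pathlib import Path

# Hypothetical tier assignments -- adjust to your own access patterns.
HOT = Path("D:/open-navigator-data/hot")
COLD = Path("D:/open-navigator-data/cold")
TIERS = {"acs": HOT, "cache": HOT, "irs": COLD, "meetings": COLD}

def dataset_dir(name: str) -> Path:
    """Return the storage directory for a dataset, defaulting to cold."""
    return TIERS.get(name, COLD) / name

print(dataset_dir("acs"))  # resolves to the hot tier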

3. Use Compression for Archives

# Compress old data
import gzip
import shutil

with open("data.parquet", "rb") as f_in:
    with gzip.open("data.parquet.gz", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
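
Note that Parquet also supports internal columnar compression, which typically compresses better than gzipping the whole file and keeps it directly readable with pandas. A minimal sketch:

import pandas as pd

# Rewrite a file with gzip-compressed columns; the result still opens
# with pd.read_parquet(), no manual decompression needed.
df = pd.read_parquet("data.parquet")
df.to_parquet("data_compressed.parquet", compression="gzip")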

4. Clean Up Old Cache

from pathlib import Path
from datetime import datetime, timedelta

def cleanup_old_cache(cache_dir: Path, days: int = 30):
    """Delete cache files older than N days."""
    cutoff = datetime.now() - timedelta(days=days)

    for file in cache_dir.rglob("*.parquet"):
        if datetime.fromtimestamp(file.stat().st_mtime) < cutoff:
            print(f"Deleting old cache: {file}")
            file.unlink()

cleanup_old_cache(Path("D:/open-navigator-data/cache"))

🆘 Troubleshooting

"Permission denied" on D drive

Windows:

# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T

Linux:

# Change ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data

"No space left on device"

Check disk space:

# Windows (PowerShell)
Get-PSDrive D | Select-Object Used,Free

# Linux/Mac
df -h /mnt/d

Free up space (see the usage-report sketch below the list):

  1. Delete old cache files
  2. Compress archive data
  3. Move rarely used data to cloud storage
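
Before deleting anything, it helps to see which datasets are using the space. A rough sketch that totals each top-level subdirectory (the base path is an example):

from pathlib import Path

def report_usage(base: Path):
    """Print the total size of each top-level subdirectory, largest first."""
    sizes = {
        sub.name: sum(f.stat().st_size for f in sub.rglob("*") if f.is_file())
        for sub in base.iterdir() if sub.is_dir()
    }
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:15s} {size / 1024**3:8.2f} GB")

report_usage(Path("D:/open-navigator-data"))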

Slow read/write performance

Causes:

  • USB 2.0 external drive (upgrade to USB 3.0+)
  • Network drive over slow connection
  • HDD vs SSD (SSD is 10-100x faster)

Solutions:

  • Use SSD for frequently accessed data
  • Enable write caching (Windows)
  • Use local storage for processing, archive to external (see the sketch below)
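
For the last point, a minimal pattern: write to fast local scratch while processing, then move the finished file to external storage. The paths here are examples:

import shutil
from pathlib import Path

scratch = Path("data/scratch/acs_2022.parquet")  # fast local disk
archive = Path("D:/open-navigator-data/acs")     # external storage

archive.mkdir(parents=True, exist_ok=True)
shutil.move(str(scratch), str(archive / scratch.name))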

WSL can't access D drive

Fix:

# Verify drive is mounted
ls /mnt/d

# If not, mount it manually:
sudo mkdir -p /mnt/d
sudo mount -t drvfs D: /mnt/d

# To mount automatically at startup, add this line to /etc/fstab:
# D: /mnt/d drvfs defaults 0 0

🔮 Next Steps

  1. Create data directory on your D drive or external storage
  2. Set environment variable OPEN_NAV_DATA_DIR in .env
  3. Test configuration using the verification script above
  4. Download ACS data to test storage setup
  5. Monitor disk usage as you add more datasets