
D Drive Configuration for Large Datasets

Configure Open Navigator to store large datasets (ACS census data, IRS 990s, etc.) on an external drive or secondary volume to avoid filling your primary disk.

Why Use External Storage?

Open Navigator downloads and caches large datasets:

| Dataset | Size | Records |
|---|---|---|
| ACS 5-Year (All States) | ~15 GB | 85,000 tracts |
| IRS Form 990s | ~100 GB | 1.8M nonprofits |
| Meeting Minutes (PDFs) | ~500 GB | 90,000 jurisdictions |
| Legislative Bills | ~50 GB | Millions of bills |

Total potential storage: Up to 1 TB of data!

📁 Recommended Directory Structure

Create this structure on your D drive (or external storage):

D:/open-navigator-data/
├── acs/                # American Community Survey demographic data
│   ├── B19013_county_*_2022.parquet
│   ├── B27010_county_*_2022.parquet
│   └── acs_2022_ALL/   # Bulk downloads
├── irs/                # IRS nonprofit data
│   ├── bmf/            # Business Master File
│   └── 990s/           # Form 990 PDFs and XML
├── legislative/        # Legislative data
│   ├── bills/
│   ├── votes/
│   └── legislators/
├── meetings/           # Meeting minutes and agendas
│   ├── pdfs/
│   └── transcripts/
└── cache/              # General cache
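
If you prefer to script this, here is a small cross-platform Python sketch that builds the same tree. The base path is an example; substitute your own platform's path from the sections below:

from pathlib import Path

# Example base path: "D:/open-navigator-data" on Windows,
# "/mnt/d/open-navigator-data" on Linux/WSL,
# "/Volumes/<YourDrive>/open-navigator-data" on macOS.
base = Path("D:/open-navigator-data")

subdirs = [
    "acs/acs_2022_ALL",
    "irs/bmf", "irs/990s",
    "legislative/bills", "legislative/votes", "legislative/legislators",
    "meetings/pdfs", "meetings/transcripts",
    "cache",
]
for sub in subdirs:
    # parents=True also creates acs/, irs/, etc. along the way
    (base / sub).mkdir(parents=True, exist_ok=True)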

🖥️ Platform-Specific Setup

Windows (D Drive)

1. Create the data directory:

# PowerShell
New-Item -Path "D:\open-navigator-data" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\acs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\irs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\legislative" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\meetings" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\cache" -ItemType Directory -Force

2. Set permissions (if needed):

# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T

3. Use in Python scripts:

from pathlib import Path

# Option 1: Explicit Windows path
data_dir = Path("D:/open-navigator-data/acs")

# Option 2: Using environment variable
import os
data_dir = Path(os.environ.get("OPEN_NAV_DATA_DIR", "D:/open-navigator-data/acs"))

Linux (Mounted Drive)

1. Mount the drive:

# Find your drive
lsblk

# Create mount point
sudo mkdir -p /mnt/d

# Mount (replace /dev/sdb1 with your drive)
sudo mount /dev/sdb1 /mnt/d

# Auto-mount on boot (add to /etc/fstab)
echo "/dev/sdb1 /mnt/d ext4 defaults 0 0" | sudo tee -a /etc/fstab

2. Create directories:

mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}

3. Set ownership:

# Give yourself ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data
chmod -R 755 /mnt/d/open-navigator-data

4. Use in scripts:

from pathlib import Path

data_dir = Path("/mnt/d/open-navigator-data/acs")

macOS (External Drive)

1. External drives auto-mount to /Volumes:

# Your drive might be at:
ls /Volumes/
# Example: /Volumes/MyExternalDrive

2. Create directories:

mkdir -p "/Volumes/MyExternalDrive/open-navigator-data"/{acs,irs,legislative,meetings,cache}

3. Use in scripts:

from pathlib import Path

data_dir = Path("/Volumes/MyExternalDrive/open-navigator-data/acs")

WSL (Windows Subsystem for Linux)

Windows drives are auto-mounted in WSL:

# D drive is at /mnt/d
cd /mnt/d

# Create directories
mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}

Use in scripts (WSL path):

from pathlib import Path

# WSL path to Windows D drive
data_dir = Path("/mnt/d/open-navigator-data/acs")

🔧 Configuration Methods

Method 1: Environment Variable

Advantages: single configuration that all scripts respect.

Setup:

Add to .env file in project root:

# .env
OPEN_NAV_DATA_DIR=D:/open-navigator-data

# Or Linux/Mac:
# OPEN_NAV_DATA_DIR=/mnt/d/open-navigator-data
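
Note that .env files are not read by Python automatically. One common approach is the python-dotenv package; a minimal sketch, assuming python-dotenv is installed:

# pip install python-dotenv
from dotenv import load_dotenv
import os
from pathlib import Path

load_dotenv()  # reads key=value pairs from a nearby .env file into os.environ

data_dir = Path(os.environ.get("OPEN_NAV_DATA_DIR", "data/cache"))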

Update scripts to use it:

# scripts/datasources/census/acs_ingestion.py
import os
from pathlib import Path

# Use environment variable with fallback
base_data_dir = os.environ.get("OPEN_NAV_DATA_DIR", "data/cache")
acs_dir = Path(base_data_dir) / "acs"

acs = ACSDataIngestion(data_dir=acs_dir)

Method 2: Config File

Create config/paths.py:

from pathlib import Path
import os

# Base data directory (customizable)
BASE_DATA_DIR = Path(os.environ.get("OPEN_NAV_DATA_DIR", "data"))

# Subdirectories
ACS_DATA_DIR = BASE_DATA_DIR / "acs"
IRS_DATA_DIR = BASE_DATA_DIR / "irs"
LEGISLATIVE_DATA_DIR = BASE_DATA_DIR / "legislative"
MEETINGS_DATA_DIR = BASE_DATA_DIR / "meetings"
CACHE_DIR = BASE_DATA_DIR / "cache"

# Create directories if they don't exist
for directory in [ACS_DATA_DIR, IRS_DATA_DIR, LEGISLATIVE_DATA_DIR,
                  MEETINGS_DATA_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

Use in scripts:

from config.paths import ACS_DATA_DIR

acs = ACSDataIngestion(data_dir=ACS_DATA_DIR)

Method 3: Command-Line Argument

For one-off downloads:

# scripts/datasources/census/download_acs.py
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
# Note: argparse does not apply type= to the default, so pass a Path directly
parser.add_argument("--data-dir", type=Path, default=Path("data/cache/acs"),
                    help="Directory to store ACS data")
args = parser.parse_args()

acs = ACSDataIngestion(data_dir=args.data_dir)

Usage:

# Use D drive
python download_acs.py --data-dir D:/open-navigator-data/acs

# Use default
python download_acs.py
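
These methods also compose: use the environment variable as the argparse default, so a command-line flag overrides the env var, which overrides the built-in fallback. A sketch:

import argparse
import os
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--data-dir",
    type=Path,
    # Precedence: --data-dir flag > OPEN_NAV_DATA_DIR > local fallback
    default=Path(os.environ.get("OPEN_NAV_DATA_DIR", "data/cache")) / "acs",
    help="Directory to store ACS data",
)
args = parser.parse_args()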

🗂️ Example Configurations

Small Project (Default)

Storage: Local project directory
Size: < 10 GB
Speed: Fastest (same disk as code)

# Uses data/cache/ in project directory
acs = ACSDataIngestion()

Medium Project (D Drive)

Storage: D drive or external SSD
Size: 10-100 GB
Speed: Fast

from pathlib import Path

acs = ACSDataIngestion(data_dir=Path("D:/open-navigator-data/acs"))

Large Project (Network Storage)

Storage: NAS or cloud-mounted drive
Size: 100+ GB
Speed: Slower but shared

from pathlib import Path

# Windows network path
data_dir = Path("//server/open-navigator/acs")

# Or mounted network drive (Linux)
# data_dir = Path("/mnt/nas/open-navigator/acs")

acs = ACSDataIngestion(data_dir=data_dir)

📊 Storage Requirements by Dataset

ACS Census Data

| Download Type | Size | Time (50 Mbps) |
|---|---|---|
| Single table, single state | ~5 MB | < 1 sec |
| Single table, all states | ~50 MB | ~10 sec |
| All tables, all states (API) | ~500 MB | ~2 min |
| Bulk download (all data) | ~15 GB | ~40 min |

Recommended: Start with API downloads (targeted), only use bulk if needed.

IRS Nonprofit Data

| Dataset | Size | Records |
|---|---|---|
| Business Master File (BMF) | ~500 MB | 1.8M nonprofits |
| Form 990 XML (1 year) | ~20 GB | ~300K filings |
| Form 990 PDFs (1 year) | ~50 GB | ~300K PDFs |
| Full 990 archive (10 years) | ~500 GB | 3M+ filings |

Recommended: D drive or external storage required.

Legislative Data

| Dataset | Size | Records |
|---|---|---|
| OpenStates bills (1 state, 1 year) | ~100 MB | ~5K bills |
| OpenStates bills (all states, 1 year) | ~5 GB | ~250K bills |
| OpenStates bills (all states, 10 years) | ~50 GB | 2.5M bills |

Recommended: D drive for multi-year data.

Meeting Minutes

| Dataset | Size | Records |
|---|---|---|
| Meeting agendas (text) | ~1 GB | 100K meetings |
| Meeting PDFs (cached) | ~500 GB | 1M documents |

Recommended: External storage for PDF archive.

🔍 Verify Configuration

Test your setup:

from pathlib import Path
import shutil

import pandas as pd

def test_storage_config(data_dir: Path):
    """Test that we can write/read data."""
    # Create test directory
    test_dir = data_dir / "test"
    test_dir.mkdir(parents=True, exist_ok=True)

    # Write test file
    test_file = test_dir / "test.parquet"
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df.to_parquet(test_file)

    # Read test file
    df2 = pd.read_parquet(test_file)

    # Verify
    assert df.equals(df2), "Data mismatch!"

    # Cleanup
    test_file.unlink()
    test_dir.rmdir()

    print(f"✅ Storage configuration verified: {data_dir.absolute()}")
    print(f"✅ Free space: {get_free_space(data_dir):.2f} GB")

def get_free_space(path: Path) -> float:
    """Get free disk space in GB."""
    stat = shutil.disk_usage(path)
    return stat.free / (1024**3)

# Test
test_storage_config(Path("D:/open-navigator-data"))

⚡ Performance Tips

1. Use SSD for Cache

  • C drive (SSD): Small, frequently accessed files
  • D drive (HDD): Large, rarely accessed archives

2. Organize by Access Frequency

D:/open-navigator-data/
├── hot/                # Frequently accessed (keep on SSD if possible)
│   └── acs/            # Census data for current analysis
└── cold/               # Rarely accessed (OK on HDD)
    ├── irs/990s/       # Historical 990 PDFs
    └── archive/        # Old datasets
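
One way to wire this into scripts is a small lookup that routes each dataset to its tier. The tier assignments and paths below are illustrative, not part of Open Navigator:

from pathlib import Path

# Hypothetical tier assignments -- adjust to your own access patterns.
HOT = Path("D:/open-navigator-data/hot")
COLD = Path("D:/open-navigator-data/cold")
TIERS = {"acs": HOT, "cache": HOT, "irs": COLD, "meetings": COLD}

def dataset_dir(name: str) -> Path:
    """Return the storage directory for a dataset, defaulting to cold."""
    return TIERS.get(name, COLD) / name

print(dataset_dir("acs"))  # resolves to the hot tier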

3. Use Compression for Archives

# Compress old data
import gzip
import shutil

with open("data.parquet", "rb") as f_in:
    with gzip.open("data.parquet.gz", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
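
Note that Parquet also supports internal columnar compression, which typically compresses better than gzipping the whole file and keeps it directly readable with pandas. A minimal sketch:

import pandas as pd

# Rewrite a file with gzip-compressed columns; the result still opens
# with pd.read_parquet(), no manual decompression needed.
df = pd.read_parquet("data.parquet")
df.to_parquet("data_compressed.parquet", compression="gzip")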

4. Clean Up Old Cache

from pathlib import Path
from datetime import datetime, timedelta

def cleanup_old_cache(cache_dir: Path, days: int = 30):
    """Delete cache files older than N days."""
    cutoff = datetime.now() - timedelta(days=days)

    for file in cache_dir.rglob("*.parquet"):
        if datetime.fromtimestamp(file.stat().st_mtime) < cutoff:
            print(f"Deleting old cache: {file}")
            file.unlink()

cleanup_old_cache(Path("D:/open-navigator-data/cache"))

🆘 Troubleshooting

"Permission denied" on D drive

Windows:

# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T

Linux:

# Change ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data

"No space left on device"

Check disk space:

# Windows (PowerShell)
Get-PSDrive D | Select-Object Used,Free

# Linux/Mac
df -h /mnt/d

Free up space (see the usage-report sketch below the list):

  1. Delete old cache files
  2. Compress archive data
  3. Move rarely used data to cloud storage
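
Before deleting anything, it helps to see which datasets are using the space. A rough sketch that totals each top-level subdirectory (the base path is an example):

from pathlib import Path

def report_usage(base: Path):
    """Print the total size of each top-level subdirectory, largest first."""
    sizes = {
        sub.name: sum(f.stat().st_size for f in sub.rglob("*") if f.is_file())
        for sub in base.iterdir() if sub.is_dir()
    }
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:15s} {size / 1024**3:8.2f} GB")

report_usage(Path("D:/open-navigator-data"))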

Slow read/write performance

Causes:

  • USB 2.0 external drive (upgrade to USB 3.0+)
  • Network drive over slow connection
  • HDD vs SSD (SSD is 10-100x faster)

Solutions:

  • Use SSD for frequently accessed data
  • Enable write caching (Windows)
  • Use local storage for processing, archive to external (see the sketch below)
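
For the last point, a minimal pattern: write to fast local scratch while processing, then move the finished file to external storage. The paths here are examples:

import shutil
from pathlib import Path

scratch = Path("data/scratch/acs_2022.parquet")  # fast local disk
archive = Path("D:/open-navigator-data/acs")     # external storage

archive.mkdir(parents=True, exist_ok=True)
shutil.move(str(scratch), str(archive / scratch.name))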

WSL can't access D drive

Fix:

# Verify drive is mounted
ls /mnt/d

# If not, mount it manually:
sudo mkdir -p /mnt/d
sudo mount -t drvfs D: /mnt/d

# To mount automatically at startup, add this line to /etc/fstab:
# D: /mnt/d drvfs defaults 0 0

🔮 Next Steps

  1. Create data directory on your D drive or external storage
  2. Set environment variable OPEN_NAV_DATA_DIR in .env
  3. Test configuration using the verification script above
  4. Download ACS data to test storage setup
  5. Monitor disk usage as you add more datasets