D Drive Configuration for Large Datasets
Configure Open Navigator to store large datasets (ACS census data, IRS 990s, etc.) on an external drive or secondary volume to avoid filling your primary disk.
Why Use External Storage?
Open Navigator downloads and caches large datasets:
| Dataset | Size | Records |
|---|---|---|
| ACS 5-Year (All States) | ~15 GB | 85,000 tracts |
| IRS Form 990s | ~100 GB | 1.8M nonprofits |
| Meeting Minutes (PDFs) | ~500 GB | 90,000 jurisdictions |
| Legislative Bills | ~50 GB | Millions of bills |
Total potential storage: the datasets above add up to roughly 665 GB, so plan for up to 1 TB of data.
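Before picking a drive, it can help to add up the datasets you actually plan to download and compare the total with the free space on the target volume. The following is a minimal sketch, not part of Open Navigator: the sizes are copied from the table above, and `DATASET_SIZES_GB`, `required_gb`, `fits_on_drive`, and the `headroom` factor are hypothetical names chosen for illustration.

```python
import shutil
from pathlib import Path

# Approximate sizes in GB, copied from the table above.
DATASET_SIZES_GB = {
    "acs": 15,
    "irs_990": 100,
    "meeting_pdfs": 500,
    "legislative_bills": 50,
}


def required_gb(datasets: list[str]) -> int:
    """Sum the approximate sizes of the selected datasets."""
    return sum(DATASET_SIZES_GB[name] for name in datasets)


def fits_on_drive(datasets: list[str], path: Path, headroom: float = 1.2) -> bool:
    """Return True if the volume holding `path` has room for the datasets.

    `headroom` is an assumed safety factor (20% extra), not a project setting.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb(datasets) * headroom
```

For example, `fits_on_drive(["acs", "irs_990"], Path("D:/"))` checks whether about 115 GB plus headroom is available on the D drive.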
📁 Recommended Directory Structure
Create this structure on your D drive (or external storage):
D:/open-navigator-data/
├── acs/ # American Community Survey demographic data
│ ├── B19013_county_*_2022.parquet
│ ├── B27010_county_*_2022.parquet
│ └── acs_2022_ALL/ # Bulk downloads
├── irs/ # IRS nonprofit data
│ ├── bmf/ # Business Master File
│ └── 990s/ # Form 990 PDFs and XML
├── legislative/ # Legislative data
│ ├── bills/
│ ├── votes/
│ └── legislators/
├── meetings/ # Meeting minutes and agendas
│ ├── pdfs/
│ └── transcripts/
└── cache/ # General cache
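The layout above can also be created from Python, which avoids maintaining separate shell commands per platform. This is a sketch under the assumption that you want every folder shown in the tree; `SUBDIRS` and `create_layout` are illustrative names, not part of the project.

```python
from pathlib import Path

# Subdirectories from the layout above, relative to the data root.
SUBDIRS = [
    "acs/acs_2022_ALL",
    "irs/bmf",
    "irs/990s",
    "legislative/bills",
    "legislative/votes",
    "legislative/legislators",
    "meetings/pdfs",
    "meetings/transcripts",
    "cache",
]


def create_layout(root: Path) -> list[Path]:
    """Create every subdirectory under `root` and return the resulting paths."""
    created = []
    for sub in SUBDIRS:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created
```

Calling `create_layout(Path("D:/open-navigator-data"))` builds the whole tree in one step; on Linux or macOS, pass the mounted path instead.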
🖥️ Platform-Specific Setup
Windows (D Drive)
1. Create the data directory:
# PowerShell
New-Item -Path "D:\open-navigator-data" -ItemType Directory -Force
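New-Item -Path "D:\open-navigator-data\acs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\irs" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\legislative" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\meetings" -ItemType Directory -Force
New-Item -Path "D:\open-navigator-data\cache" -ItemType Directory -Force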
2. Set permissions (if needed):
# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T
3. Use in Python scripts:
from pathlib import Path
# Option 1: Explicit Windows path
data_dir = Path("D:/open-navigator-data/acs")
# Option 2: Using environment variable (OPEN_NAV_DATA_DIR holds the base directory)
import os
data_dir = Path(os.environ.get("OPEN_NAV_DATA_DIR", "D:/open-navigator-data")) / "acs"
Linux (Mounted Drive)
1. Mount the drive:
# Find your drive
lsblk
# Create mount point
sudo mkdir -p /mnt/d
# Mount (replace /dev/sdb1 with your drive)
sudo mount /dev/sdb1 /mnt/d
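# Auto-mount on boot (add to /etc/fstab). Replace ext4 with your filesystem type.
# Device names such as /dev/sdb1 can change between boots, so consider using the
# UUID reported by blkid instead. nofail lets the system boot if the drive is absent.
echo "/dev/sdb1 /mnt/d ext4 defaults,nofail 0 0" | sudo tee -a /etc/fstab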
2. Create directories:
mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}
3. Set ownership:
# Give yourself ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data
# Capital X makes directories traversable without marking data files executable
chmod -R u=rwX,go=rX /mnt/d/open-navigator-data
4. Use in scripts:
from pathlib import Path
data_dir = Path("/mnt/d/open-navigator-data/acs")
macOS (External Drive)
1. External drives auto-mount to /Volumes:
# Your drive might be at:
ls /Volumes/
# Example: /Volumes/MyExternalDrive
2. Create directories:
mkdir -p "/Volumes/MyExternalDrive/open-navigator-data"/{acs,irs,legislative,meetings,cache}
3. Use in scripts:
from pathlib import Path
data_dir = Path("/Volumes/MyExternalDrive/open-navigator-data/acs")
WSL (Windows Subsystem for Linux)
Windows drives are auto-mounted in WSL:
# D drive is at /mnt/d
cd /mnt/d
# Create directories
mkdir -p /mnt/d/open-navigator-data/{acs,irs,legislative,meetings,cache}
Use in scripts (WSL path):
from pathlib import Path
# WSL path to Windows D drive
data_dir = Path("/mnt/d/open-navigator-data/acs")
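The four platform paths above differ only in their root, so a script can choose one at runtime instead of hard-coding it. The sketch below assumes the locations used in this guide (`D:/`, `/mnt/d`, and a macOS volume named `MyExternalDrive`); `default_data_root` and `is_wsl` are hypothetical helpers, and the WSL check is a common heuristic rather than an official API.

```python
import platform
import sys
from pathlib import Path


def default_data_root(drive_name: str = "MyExternalDrive") -> Path:
    """Guess a data root for the current platform.

    `drive_name` is the macOS volume name; the default is the placeholder
    used in this guide, so replace it with your own volume.
    """
    if sys.platform == "win32":
        return Path("D:/open-navigator-data")
    if sys.platform == "darwin":
        return Path("/Volumes") / drive_name / "open-navigator-data"
    # Native Linux and WSL both use /mnt/d in this guide.
    return Path("/mnt/d/open-navigator-data")


def is_wsl() -> bool:
    """Heuristic: WSL kernels usually report 'microsoft' in the release string."""
    return sys.platform == "linux" and "microsoft" in platform.release().lower()
```

A script can then use `default_data_root() / "acs"` and behave the same way on every machine that follows the layout in this guide.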
🔧 Configuration Methods
Method 1: Environment Variable (Recommended)
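Advantages: A single setting that every script can read through os.environ
Setup:
Add to .env file in project root (Python does not read .env on its own, so load it with a tool such as python-dotenv or export the variable in your shell):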
# .env
OPEN_NAV_DATA_DIR=D:/open-navigator-data
# Or Linux/Mac:
# OPEN_NAV_DATA_DIR=/mnt/d/open-navigator-data
Update scripts to use it:
# scripts/datasources/census/acs_ingestion.py
import os
from pathlib import Path
# Use environment variable with fallback
base_data_dir = os.environ.get("OPEN_NAV_DATA_DIR", "data/cache")
acs_dir = Path(base_data_dir) / "acs"
acs = ACSDataIngestion(data_dir=acs_dir)
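`os.environ` only contains variables that are set in the running process, so the `.env` file has to be loaded before the lookup above can see it. The `python-dotenv` package offers `load_dotenv()` for this. If you prefer no extra dependency, a small loader is enough for simple `KEY=VALUE` files. This is a sketch, and `load_env_file` is a hypothetical helper, not a project function; it does not handle multi-line values or variable interpolation.

```python
import os
from pathlib import Path


def load_env_file(env_path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines and add them to os.environ.

    Variables already set in the environment are left unchanged, so a value
    exported in the shell takes precedence over the file.
    """
    loaded = {}
    if not env_path.exists():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded
```

Call `load_env_file(Path(".env"))` once at the start of a script, before reading `OPEN_NAV_DATA_DIR`.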
Method 2: Config File
Create config/paths.py:
from pathlib import Path
import os
# Base data directory (customizable)
BASE_DATA_DIR = Path(os.environ.get("OPEN_NAV_DATA_DIR", "data"))
# Subdirectories
ACS_DATA_DIR = BASE_DATA_DIR / "acs"
IRS_DATA_DIR = BASE_DATA_DIR / "irs"
LEGISLATIVE_DATA_DIR = BASE_DATA_DIR / "legislative"
MEETINGS_DATA_DIR = BASE_DATA_DIR / "meetings"
CACHE_DIR = BASE_DATA_DIR / "cache"
# Create directories if they don't exist
for directory in [ACS_DATA_DIR, IRS_DATA_DIR, LEGISLATIVE_DATA_DIR,
                  MEETINGS_DATA_DIR, CACHE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
Use in scripts:
from config.paths import ACS_DATA_DIR
acs = ACSDataIngestion(data_dir=ACS_DATA_DIR)
Method 3: Command-Line Argument
For one-off downloads:
# scripts/datasources/census/download_acs.py
import argparse
from pathlib import Path
parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=Path, default="data/cache/acs",
                    help="Directory to store ACS data")
args = parser.parse_args()
acs = ACSDataIngestion(data_dir=args.data_dir)
Usage:
# Use D drive
python download_acs.py --data-dir D:/open-navigator-data/acs
# Use default
python download_acs.py
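The three methods can be combined so that a command-line flag overrides the environment variable, which in turn overrides the built-in default. The sketch below shows one possible precedence order; `resolve_data_dir` is a hypothetical helper, and the order itself is an assumption rather than documented project behavior.

```python
import argparse
import os
from pathlib import Path
from typing import Optional, Sequence


def resolve_data_dir(argv: Optional[Sequence[str]] = None) -> Path:
    """Resolve the ACS directory: --data-dir, then OPEN_NAV_DATA_DIR, then default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", type=Path, default=None,
                        help="Directory to store ACS data")
    args = parser.parse_args(argv)

    if args.data_dir is not None:
        return args.data_dir  # an explicit flag wins
    env_base = os.environ.get("OPEN_NAV_DATA_DIR")
    if env_base:
        return Path(env_base) / "acs"  # the variable holds the base directory
    return Path("data/cache/acs")  # project-local fallback
```

With this in place, `python download_acs.py` follows the environment variable when it is set, and `--data-dir` remains available for one-off runs.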
🗂️ Example Configurations
Small Project (Default)
Storage: Local project directory
Size: < 10 GB
Speed: Fastest (same disk as code)
# Uses data/cache/ in project directory
acs = ACSDataIngestion()
Medium Project (D Drive)
Storage: D drive or external SSD
Size: 10-100 GB
Speed: Fast
from pathlib import Path
acs = ACSDataIngestion(data_dir=Path("D:/open-navigator-data/acs"))
Large Project (Network Storage)
Storage: NAS or cloud-mounted drive
Size: 100+ GB
Speed: Slower but shared
from pathlib import Path
# Windows network path
data_dir = Path("//server/open-navigator/acs")
# Or mounted network drive (Linux)
# data_dir = Path("/mnt/nas/open-navigator/acs")
acs = ACSDataIngestion(data_dir=data_dir)
📊 Storage Requirements by Dataset
ACS Census Data
| Download Type | Size | Time (50 Mbps) |
|---|---|---|
| Single table, single state | ~5 MB | < 1 sec |
| Single table, all states | ~50 MB | ~10 sec |
| All tables, all states (API) | ~500 MB | ~2 min |
| Bulk download (all data) | ~15 GB | ~40 min |
Recommended: Start with API downloads (targeted), only use bulk if needed.
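The times in the table follow from dividing the download size by the link speed. The sketch below reproduces that arithmetic so you can plug in your own connection speed. It assumes decimal units (1 GB = 1000 MB) and ignores protocol overhead and API rate limits, so real downloads will take somewhat longer; `download_seconds` is an illustrative name.

```python
def download_seconds(size_mb: float, link_mbps: float = 50.0) -> float:
    """Estimate transfer time in seconds.

    size in megabytes * 8 bits per byte / link speed in megabits per second
    """
    return size_mb * 8 / link_mbps


# Sizes from the table above, in MB
for label, size_mb in [
    ("Single table, single state", 5),
    ("Single table, all states", 50),
    ("All tables, all states (API)", 500),
    ("Bulk download (all data)", 15_000),
]:
    print(f"{label}: {download_seconds(size_mb):.1f} s")
```

At 50 Mbps the bulk download works out to 2400 seconds, which matches the 40 minutes listed in the table.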
IRS Nonprofit Data
| Dataset | Size | Records |
|---|---|---|
| Business Master File (BMF) | ~500 MB | 1.8M nonprofits |
| Form 990 XML (1 year) | ~20 GB | ~300K filings |
| Form 990 PDFs (1 year) | ~50 GB | ~300K PDFs |
| Full 990 archive (10 years) | ~500 GB | 3M+ filings |
Recommended: D drive or external storage required.
Legislative Data
| Dataset | Size | Records |
|---|---|---|
| OpenStates bills (1 state, 1 year) | ~100 MB | ~5K bills |
| OpenStates bills (all states, 1 year) | ~5 GB | ~250K bills |
| OpenStates bills (all states, 10 years) | ~50 GB | 2.5M bills |
Recommended: D drive for multi-year data.
Meeting Minutes
| Dataset | Size | Records |
|---|---|---|
| Meeting agendas (text) | ~1 GB | 100K meetings |
| Meeting PDFs (cached) | ~500 GB | 1M documents |
Recommended: External storage for PDF archive.
🔍 Verify Configuration
Test your setup:
from pathlib import Path
import shutil
import pandas as pd
def test_storage_config(data_dir: Path):
    """Test that we can write/read data."""
    # Create test directory
    test_dir = data_dir / "test"
    test_dir.mkdir(parents=True, exist_ok=True)
    # Write test file (to_parquet needs pyarrow or fastparquet installed)
    test_file = test_dir / "test.parquet"
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df.to_parquet(test_file)
    # Read test file
    df2 = pd.read_parquet(test_file)
    # Verify
    assert df.equals(df2), "Data mismatch!"
    # Cleanup
    test_file.unlink()
    test_dir.rmdir()
    print(f"✅ Storage configuration verified: {data_dir.absolute()}")
    print(f"✅ Free space: {get_free_space(data_dir):.2f} GB")
def get_free_space(path: Path) -> float:
    """Get free disk space in GB."""
    stat = shutil.disk_usage(path)
    return stat.free / (1024**3)
# Test
test_storage_config(Path("D:/open-navigator-data"))
⚡ Performance Tips
1. Use SSD for Cache
- C drive (SSD): Small, frequently accessed files
- D drive (HDD): Large, rarely accessed archives
2. Organize by Access Frequency
D:/open-navigator-data/
├── hot/ # Frequently accessed (keep on SSD if possible)
│ └── acs/ # Census data for current analysis
└── cold/ # Rarely accessed (OK on HDD)
├── irs/990s/ # Historical 990 PDFs
└── archive/ # Old datasets
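Moving files between the hot and cold areas can be automated. The sketch below demotes files that have not been modified for a given number of days, keeping their relative paths. It uses modification time as a stand-in for access frequency, because access times are often not updated on modern systems. `demote_stale_files` and the 90-day default are assumptions for illustration.

```python
import shutil
import time
from pathlib import Path


def demote_stale_files(hot: Path, cold: Path, days: int = 90) -> list[Path]:
    """Move files not modified in `days` days from `hot` to `cold`.

    Relative paths are preserved, so hot/acs/x.parquet becomes cold/acs/x.parquet.
    """
    cutoff = time.time() - days * 86400
    moved = []
    # Collect the files first so moving them does not disturb the directory walk.
    candidates = [p for p in hot.rglob("*") if p.is_file()]
    for src in candidates:
        if src.stat().st_mtime < cutoff:
            dest = cold / src.relative_to(hot)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved
```

For example, `demote_stale_files(Path("D:/open-navigator-data/hot"), Path("D:/open-navigator-data/cold"))` would archive anything untouched for three months.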
3. Use Compression for Archives
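# Compress old data (Parquet is usually compressed internally already, so expect
# the largest savings on raw files such as XML, CSV, or text)
import gzip
import shutil
with open("data.parquet", "rb") as f_in:
    with gzip.open("data.parquet.gz", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)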
4. Clean Up Old Cache
from pathlib import Path
from datetime import datetime, timedelta
def cleanup_old_cache(cache_dir: Path, days: int = 30):
    """Delete cache files older than N days."""
    cutoff = datetime.now() - timedelta(days=days)
    for file in cache_dir.rglob("*.parquet"):
        if datetime.fromtimestamp(file.stat().st_mtime) < cutoff:
            print(f"Deleting old cache: {file}")
            file.unlink()
cleanup_old_cache(Path("D:/open-navigator-data/cache"))
🆘 Troubleshooting
"Permission denied" on D drive
Windows:
# Give yourself full control
icacls "D:\open-navigator-data" /grant "${env:USERNAME}:(OI)(CI)F" /T
Linux:
# Change ownership
sudo chown -R $USER:$USER /mnt/d/open-navigator-data
"No space left on device"
Check disk space:
# Windows (PowerShell)
Get-PSDrive D | Select-Object Used,Free
# Linux/WSL
df -h /mnt/d
# macOS (replace with your volume name)
df -h /Volumes/MyExternalDrive
Free up space:
- Delete old cache files
- Compress archive data
- Move rarely used data to cloud storage
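To decide what to delete, compress, or move, it helps to know which dataset folders take up the most space. The sketch below totals file sizes per top-level subdirectory; `dir_size_bytes` and `largest_subdirs` are hypothetical helpers, and a full scan of a very large archive may take a while.

```python
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Total size of all regular files under `path`, in bytes."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def largest_subdirs(root: Path, top: int = 5) -> list[tuple[str, float]]:
    """Return (name, size in GB) for the largest immediate subdirectories."""
    sizes = [
        (d.name, dir_size_bytes(d) / 1024**3)
        for d in root.iterdir()
        if d.is_dir()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Running `largest_subdirs(Path("D:/open-navigator-data"))` lists the heaviest folders first, which usually points at the PDF or 990 archives.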
Slow read/write performance
Causes:
- USB 2.0 external drive (upgrade to USB 3.0+)
- Network drive over slow connection
- HDD instead of SSD (SSDs are typically several times faster for sequential transfers and far faster for random access)
Solutions:
- Use SSD for frequently accessed data
- Enable write caching (Windows)
- Use local storage for processing, archive to external
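To tell which of these causes applies, you can measure the write speed of the target directory and compare it with your local disk. This is a rough sketch: `write_throughput_mb_s` is a hypothetical helper, a single sequential write is not a full benchmark, and results vary between runs.

```python
import os
import tempfile
import time
from pathlib import Path


def write_throughput_mb_s(directory: Path, size_mb: int = 64) -> float:
    """Write `size_mb` MiB to a temporary file in `directory` and return MiB/s."""
    chunk = os.urandom(1024 * 1024)  # 1 MiB of incompressible data
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        start = time.perf_counter()
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # push data to the device so the OS cache does not hide latency
        elapsed = time.perf_counter() - start
    # The temporary file is removed when the with block exits.
    return size_mb / elapsed
```

Comparing `write_throughput_mb_s(Path("D:/open-navigator-data"))` with the same call on a local project directory gives a quick indication of whether the drive or its connection is the bottleneck.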
WSL can't access D drive
Fix:
# Verify drive is mounted
ls /mnt/d
# If not, mount it manually:
sudo mkdir -p /mnt/d
sudo mount -t drvfs D: /mnt/d
# To mount it on every WSL start, add this line to /etc/fstab:
# D: /mnt/d drvfs defaults 0 0
🔮 Next Steps
- Create data directory on your D drive or external storage
- Set environment variable OPEN_NAV_DATA_DIR in .env
- Test configuration using the verification script above
- Download ACS data to test storage setup
- Monitor disk usage as you add more datasets
Related Documentation
- Census ACS Integration - Download demographic data
- IRS 990 Downloads - Nonprofit data
- Data Architecture - Overall data flow