Skip to main content

scripts/datasources/ Reconciliation Report

Generated 2026-05-28. Purpose: determine which scripts/datasources/<source>/ trees are safe to retire now that the core-lib refactor ported sources into packages/ingestion/src/ingestion/.

Method​

For each source that exists in both packages/ingestion and scripts/datasources, we measured:

  1. LOC still living under scripts/datasources/<source>/.
  2. Live code dependencies — actual from/import scripts.datasources.<source> statements in packages/, api/, pipeline/, agents/, tests/, and shell/Makefile entrypoints that invoke the scripts directly.

Path mentions inside docstrings / .md / .sql comments (provenance notes) were treated as noise, not dependencies — the already-shimmed bls and tpc set that baseline (they each show 1–2 such mentions and nothing more).

Headline​

This is not simply "cleanup was skipped." For several sources the port is incomplete: the new packages/ingestion/<source> module exists, but live api/ / pipeline/ code — and in one case the new package itself — still imports the old scripts/ implementation. Those cannot be archived without finishing the port.


Category A — DONE: migrated to the new packages/scrapers library​

These were LAND-ported but not FETCH-ported: the ingestion.<source> module reads a data/cache/<source>/ that a scraper still living in scripts/ produced. Web-crawling (httpx / BeautifulSoup / Playwright) does not fit the DataSourcePipeline cache->bronze contract, so per the project decision the scrapers moved into a new workspace package packages/scrapers (scrapers.<source>), not into archive/. The scripts/datasources/<source>/ trees are now empty.

sourcescraper now atexternal importer repointed
ballotpediascrapers.ballotpediajurisdiction_pilot/county_municipality_websites.py → scrapers.ballotpedia
dotscrapers.dot—
google_civicscrapers.google_civic—
leagueofcitiesscrapers.leagueofcities(intra-source only)
nacoscrapers.nacorunner → packages/scrapers/scripts/run_naco.sh
uscmscrapers.uscmvendorsearch/vendor_meeting_portal_search.py → scrapers.uscm

All migrated modules import cleanly; ingestion docstrings updated to the scrapers.<source> (FETCH) + ingestion.<source> (LAND) two-step.

Category A.2 — renamed ports also migrated (per maintainer correction)​

Two sources mis-classified as "never ported" in the first pass were actually ported under a different package name. Their scrapers were migrated too:

scripts sourceLAND ported toscraper now at
parcelsingestion.arcgis (data/cache/parcels → bronze.bronze_addresses)scrapers.parcels
powerbi_ballot_measuresingestion.ncls (data/cache/ncls → bronze.bronze_ballot_measures_powerbi)scrapers.powerbi_ballot_measures

Note: scrapers.parcels.batch_state_parcels was a combined FETCH+LAND orchestrator; its inline load_csv_to_bronze (now ported to ingestion.arcgis.addresses) was removed — it is FETCH/extract-only now and raises a clear pointer to python -m ingestion.arcgis.addresses if invoked with loading enabled. TODO: add a thin LAND adapter if the per-county batch load is still wanted in one command.


Category B — Port INCOMPLETE, do NOT archive yet​

Live code still depends on the scripts/ version. Finish the port (move the imported symbols into packages/ingestion, repoint callers) before retiring.

sourceblocking dependency
ncespackages/ingestion/.../nces/school_districts.py imports scripts.datasources.nces.download_nces.NCESSchoolDistrictIngestion — the new module wraps the old one
irsapi/app.py, api/main.py, pipeline/create_nonprofits_gold_tables.py import scripts.datasources.irs.nonprofit_discovery.NonprofitDiscovery
youtubeapi/routes/batch_jobs.py imports many api.batch_jobs.batch_job_* modules; 17 test files import youtube scripts (18,450 LOC — largest)
openstatespipeline/create_contacts_gold_tables.py imports scripts.datasources.openstates.openstates_sources; several .sh runners + tests
censusoperator entrypoint packages/hosting/scripts/neon/run_bronze_jurisdictions_to_cloud.sh invokes load_census_gazetteer.py
fecscripts/datasources/fec/run_bulk_download.sh invokes load_fec_bulk.py
wikidatapackages/hosting/scripts/neon/run_jurisdiction_id_migration.sh invokes materialize_bronze_jurisdictions_wikidata_tables.py

Category C — Never ported, never archived (separate triage)​

scripts/datasources/ subdirs with no packages/ingestion counterpart:

  • Load-bearing (live api/ + tests import them, despite no package): jurisdictions (api/routes/jurisdiction_mapping.py + tests), jurisdiction_pilot (many tests).
  • Renamed ports (LAND lives in packages/ingestion under a different name; scrapers migrated in Category A.2 above): parcels → ingestion.arcgis, powerbi_ballot_measures → ingestion.ncls, data_gov → ingestion.data_gov.organizations, google_data_commons → ingestion.google_data_commons.{client,bronze}.
  • Still to triage (no live refs found; may yet be renamed ports — maintainer to confirm): cityscrapers, dbpedia, govwebsites, grants_gov, ma_pilot, meetingbank, netronline, social_media, vendorsearch, voter_data.

Caveat learned this pass: "no same-name packages/ingestion dir" does not mean "never ported" — match by cache dir / bronze table, not by name. The only ingestion modules with no same-name scripts dir are arcgis, everyorg, ncls, ntee, wikimedia; of those, arcgis↔parcels and ncls↔powerbi_ballot_measures are the Category-C renames, while everyorg (HuggingFace parquet), wikimedia (scripts/wikimedia/), and ntee (already archived) do not come from a remaining Category-C dir.


Already clean (for reference)​

  • Thin shims (correct end state): bls (22 LOC), tpc (25 LOC).
  • Fully removed from scripts/: gsa, hifld, hud, nccs, osf, wikicommons.

On archive/​

Only 12 sources have an archive/datasources/<source>/ entry, and several (census, fec, nccs, nces, wikidata) also still have a scripts/ copy — so "archived" did not imply "removed from scripts". Archiving was applied ad hoc rather than as the closing step of each port.