HuggingFace Datasets

Public, ready-to-load civic datasets that bootstrap the meeting and transcript pipelines.

HuggingFace hosts several openly licensed datasets of local-government meetings, including transcripts, audio, and human-written summaries. We use them as a head start: they supply real meeting text for validating extraction and keyword detection before the live scrapers fill in current coverage.

:::info At a glance

| | |
|---|---|
| Provider | Community researchers, via the HuggingFace Hub |
| Coverage | A handful of large US cities; historical (varies by dataset) |
| Update cadence | Static research releases |
| License | Per-dataset (CC-BY / CDLA) · see Terms and Privacy |
| Cost | Free |
| Access method | datasets library / bulk download |
| Our pipeline | bronze.meetingbank_meetings → staging → meeting marts |
:::

Overview

The most directly useful dataset is MeetingBank, a benchmark built for meeting summarization. The others below either overlap with sources we already ingest (LocalView, Council Data Project) or are not available as bulk downloads (CivicBand). This page covers what each one offers and which we actually load.

| Dataset | Available here | Role in Open Navigator |
|---|---|---|
| MeetingBank | Yes (HuggingFace) | Primary: transcripts + reference summaries |
| LocalView | Via Harvard Dataverse | Covered on URL Datasets |
| Council Data Project | Via project deployments | Covered on URL Datasets |
| CivicBand | Platform only | Validation list, not bulk URLs |

Data available

MeetingBank

A benchmark dataset of 1,366 city-council meetings from six US cities — Alameda CA, Boston MA, Denver CO, King County WA, Long Beach CA, and Seattle WA. Each meeting ships with a full transcript (≈28k tokens on average), human-written summaries used as evaluation ground truth, and links back to the source city.

| Field | Description | Type | Coverage |
|---|---|---|---|
| id | Meeting identifier | string | 100% |
| transcript | Full meeting transcript | string | 100% |
| summary | Human-written summary (ground truth) | string | 100% |
| city / state | Source jurisdiction | string | 100% |
| source_url | Link to the city's record | string | Partial |

Grain & keys

  • Grain: one row per meeting (segment-level instances also available).
  • Primary key: id
  • Joins to: our meeting marts via source_url / jurisdiction match.
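
The grain and key above can be checked with a few lines of code before rows move downstream. The following is a minimal sketch, assuming rows are available as plain dictionaries; the function name `check_meeting_grain` and the sample rows are hypothetical and are not part of the pipeline.

```python
from collections import Counter


def check_meeting_grain(rows, key="id"):
    """Return the key values that appear on more than one row.

    An empty result means the one-row-per-meeting grain holds.
    """
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample rows; real rows would come from bronze.meetingbank_meetings.
sample_rows = [
    {"id": "meeting_a", "city": "Seattle", "state": "WA"},
    {"id": "meeting_b", "city": "Denver", "state": "CO"},
    {"id": "meeting_a", "city": "Seattle", "state": "WA"},
]

duplicates = check_meeting_grain(sample_rows)
print(duplicates)  # ['meeting_a']
```

If segment-level instances are loaded instead of whole meetings, `id` alone may no longer be unique, so a check like this would need a different key.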

How we ingest it

```bash
# Pull MeetingBank and land it in the bronze layer.
python -m ingestion.huggingface.load_meetingbank
```

  • Source: HuggingFace Hub (huuuyeah/meetingbank).
  • Lands in: bronze.meetingbank_meetings → meeting staging models → meeting marts.
  • Refresh: static dataset; re-run only to pick up an upstream revision.
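
For readers who want to inspect the data outside the pipeline, the sketch below pulls the dataset with the `datasets` library and keeps only fields named on this page. It is not the contents of `ingestion.huggingface.load_meetingbank`, which this page does not show. The helper names `to_bronze_row` and `iter_bronze_rows` are hypothetical, and the `train` split name is an assumption to confirm on the dataset card.

```python
def to_bronze_row(record):
    """Map one raw record to a flat dict, keeping only fields named on this page.

    Fields missing from the record come back as None rather than raising.
    """
    wanted = ("id", "transcript", "summary")
    return {name: record.get(name) for name in wanted}


def iter_bronze_rows(split="train"):
    """Yield flattened rows from the Hub dataset (needs network access).

    The split name is an assumption; check the dataset card for the real ones.
    """
    # Imported lazily so to_bronze_row can be used without the library installed.
    from datasets import load_dataset

    dataset = load_dataset("huuuyeah/meetingbank", split=split)
    for record in dataset:
        yield to_bronze_row(record)
```

Calling `next(iter_bronze_rows())` would download the dataset on first use and return a single flattened row.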

The transcripts double as a fixture for evaluating keyword detection and AI summarization: the human-written summaries give us a reference to score generated output against.
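
As an illustration of scoring generated output against a reference summary, here is a minimal sketch of a unigram-overlap F1, in the spirit of ROUGE-1. This page does not say which metric the pipeline uses, so the metric choice, the tokenizer, and the function names are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(generated, reference):
    """Unigram-overlap F1 between a generated and a reference summary."""
    gen = Counter(tokenize(generated))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((gen & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f1(
    "council approved budget",
    "The council approved the annual budget",
)
print(round(score, 3))  # 0.667
```

In the example, all three generated tokens appear in the six-token reference, so precision is 1.0, recall is 0.5, and F1 is 2/3.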

Coverage & known gaps

  • Six cities only — large metros, useful for prototyping, not national coverage.
  • Historical snapshots; current meetings come from the live scrapers and YouTube discovery.
  • CivicBand (≈1,031 municipalities at civic.band) is browsable but offers no bulk export; we use its municipality list only to validate jurisdiction matches, not as a URL source.
  • LocalView and Council Data Project are richer for URLs and are documented on the URL Datasets page rather than duplicated here.
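
The jurisdiction validation mentioned above can be sketched as a normalized lookup against a list of known municipalities. This is a minimal sketch, assuming the municipality list is available as (city, state) pairs; that shape, the function names, and the sample data are hypothetical, since this page does not describe how the CivicBand list is stored.

```python
def normalize_jurisdiction(city, state):
    """Build a comparable key such as 'long beach|ca'."""
    city_key = " ".join(city.lower().split())  # collapse repeated whitespace
    return f"{city_key}|{state.strip().lower()}"


def validate_jurisdictions(meetings, known_municipalities):
    """Split meeting ids into those whose jurisdiction is known and those not."""
    known = {normalize_jurisdiction(c, s) for c, s in known_municipalities}
    matched, unmatched = [], []
    for meeting in meetings:
        key = normalize_jurisdiction(meeting["city"], meeting["state"])
        (matched if key in known else unmatched).append(meeting["id"])
    return matched, unmatched


# Hypothetical inputs for illustration only.
known_list = [("Long Beach", "CA"), ("Seattle", "WA")]
meetings = [
    {"id": "m1", "city": "long  beach", "state": "ca"},
    {"id": "m2", "city": "Springfield", "state": "IL"},
]

matched, unmatched = validate_jurisdictions(meetings, known_list)
print(matched, unmatched)  # ['m1'] ['m2']
```

Exact matching on a normalized key is deliberately strict; names that differ by more than case or spacing would land in the unmatched list for manual review.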

Licensing & attribution

MeetingBank is released for research use; cite the ACL 2023 paper (arXiv:2305.17529) when redistributing derived data. Confirm each dataset's license on its HuggingFace card before republishing. See Terms and Privacy for our redistribution policy.