📚 LocalView Integration Guide
Overview
LocalView is a Harvard University dataset containing 1,000-10,000 municipality URLs with meeting videos and transcripts. It's the largest known database of municipal meeting video archives.
Challenge: The Harvard Dataverse requires JavaScript and may have CAPTCHA verification, so we need to download the files manually.
Step-by-Step Download Instructions
1. Visit the Harvard Dataverse Website
URL: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM
2. Navigate to the Files Section
Once the page loads:
- Scroll down to the "Files" section
- Look for downloadable CSV/TAB files with names like:
  - municipalities.csv or municipalities.tab
  - meetings.csv or meetings.tab
  - videos.csv or videos.tab
  - Or similar naming patterns
3. Download the Files
Click the "Download" button for each file:
- Download all CSV/TAB files related to municipalities, meetings, and videos
- Save them to:
/home/developer/projects/open-navigator/data/cache/localview/
Expected files (names may vary):
data/cache/localview/
├── municipalities.csv # List of municipalities with URLs
├── meetings.csv # Meeting metadata
├── videos.csv # Video URLs and metadata
└── README.txt # Dataset documentation (if available)
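Before running the ingestion script, you can sanity-check that the downloaded files landed where expected. The helper below is a sketch, not part of the repo; the cache path and the expected file stems are assumptions based on the layout above, so adjust them to what you actually downloaded.

```python
from pathlib import Path

# Hypothetical check that the expected LocalView files are in the cache.
# Adjust CACHE_DIR and EXPECTED to match your actual download.
CACHE_DIR = Path("data/cache/localview")
EXPECTED = ["municipalities", "meetings", "videos"]

def check_cache(cache_dir: Path = CACHE_DIR) -> dict:
    """Return {stem: matching file or None} for each expected dataset file."""
    found = {}
    for stem in EXPECTED:
        # Accept either .csv or .tab, since Dataverse exports both formats
        matches = list(cache_dir.glob(f"{stem}.*"))
        found[stem] = matches[0] if matches else None
    return found

if __name__ == "__main__":
    for stem, path in check_cache().items():
        print(f"{stem}: {'OK -> ' + path.name if path else 'MISSING'}")
```

Run it from the project root; any `MISSING` entry means a file still needs to be downloaded or renamed.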
4. Expected Data Structure
The LocalView dataset typically includes:
Municipalities file (municipalities.csv):
- municipality_name - City/town name
- state - Two-letter state code
- county - County name
- population - Population count
- website_url - Official government website
- meeting_page_url - Link to meetings/agendas page
- video_archive_url - Link to video archive
Meetings file (meetings.csv):
- meeting_id - Unique identifier
- municipality_name - City/town name
- meeting_date - Date of meeting
- meeting_type - Type (Council, Planning, etc.)
- video_url - Direct link to video
- transcript_available - Boolean flag
- transcript_url - Link to transcript (if available)
Videos file (videos.csv):
- video_id - Unique identifier
- video_url - Direct video link
- platform - Platform (YouTube, Granicus, etc.)
- duration_minutes - Video length
- has_captions - Caption availability
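Files with this structure can be read with nothing beyond the standard library. A minimal sketch, assuming the column names listed in this guide (verify them against the actual files before relying on this):

```python
import csv
from pathlib import Path

def load_rows(path: Path) -> list[dict]:
    """Read a LocalView export as a list of dicts, accepting CSV or TAB files."""
    # Dataverse often ships .tab exports; switch delimiter on the extension
    delimiter = "\t" if path.suffix == ".tab" else ","
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=delimiter))

# Example usage (column names are the ones listed in this guide, which
# may not match the real file exactly):
# rows = load_rows(Path("data/cache/localview/videos.csv"))
# youtube = [r for r in rows if r.get("platform") == "YouTube"]
```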
Integration Script Usage
After Downloading Files
Once you've downloaded the files to data/cache/localview/, run:
cd /home/developer/projects/open-navigator
source venv/bin/activate
# Run the ingestion script
python discovery/localview_ingestion.py
What the Script Does
- Reads downloaded CSV files from cache directory
- Parses municipality data - Names, states, URLs
- Extracts video URLs - Direct links to meeting videos
- Identifies platforms - YouTube, Granicus, Vimeo, Archive.org
- Writes to Bronze layer - bronze/localview_municipalities and bronze/localview_videos
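The platform-detection step can be sketched roughly as follows. This is not the actual logic in discovery/localview_ingestion.py; the domain patterns are assumptions covering the platforms named above:

```python
from urllib.parse import urlparse

# Assumed domain -> platform mapping; extend as new hosts show up in the data.
PLATFORM_DOMAINS = {
    "youtube.com": "YouTube",
    "youtu.be": "YouTube",
    "granicus.com": "Granicus",
    "vimeo.com": "Vimeo",
    "archive.org": "Archive.org",
}

def detect_platform(url: str) -> str:
    """Classify a video URL by its host, falling back to 'Other'."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    for domain, platform in PLATFORM_DOMAINS.items():
        # Match the bare domain and any subdomain (e.g. cityofx.granicus.com)
        if host == domain or host.endswith("." + domain):
            return platform
    return "Other"
```

Matching on the full host (rather than substring search) avoids false positives like a city page that merely links to youtube.com in its path.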
Expected Output
[INFO] Loading LocalView data from cache...
[INFO] Found 1,247 municipalities
[INFO] Found 8,453 meeting videos
[INFO] Platforms detected:
- YouTube: 3,421 videos
- Granicus: 2,876 videos
- Vimeo: 1,234 videos
- Other: 922 videos
[SUCCESS] ✓ Written 1,247 municipalities to bronze/localview_municipalities
[SUCCESS] ✓ Written 8,453 videos to bronze/localview_videos
Alternative: API Access (If Available)
Check if LocalView provides API access:
Some Harvard Dataverse datasets offer API access. Try:
# Check for API availability
curl -I "https://dataverse.harvard.edu/api/datasets/:persistentId/?persistentId=doi:10.7910/DVN/NJTBEM"
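If the HEAD request succeeds, the same endpoint returns JSON describing the dataset, including its file list. The sketch below parses that response following the usual Dataverse Native API layout; the exact field names can vary across Dataverse versions, so verify against the live response before wiring this into the script:

```python
import json
from urllib.request import urlopen

DATASET_API = (
    "https://dataverse.harvard.edu/api/datasets/:persistentId/"
    "?persistentId=doi:10.7910/DVN/NJTBEM"
)

def list_files(payload: dict) -> list[tuple[int, str]]:
    """Extract (file_id, filename) pairs from a Dataverse dataset response.

    Assumes the Native API's data.latestVersion.files layout; check the
    live JSON, since this is an assumption, not verified against NJTBEM.
    """
    files = payload["data"]["latestVersion"]["files"]
    return [(f["dataFile"]["id"], f["dataFile"]["filename"]) for f in files]

# Live usage (may fail if the dataset requires auth or CAPTCHA):
# with urlopen(DATASET_API) as resp:
#     payload = json.load(resp)
# for file_id, name in list_files(payload):
#     print(file_id, name)  # files are fetchable via /api/access/datafile/{id}
```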
If successful, we can update the script to use the API instead of manual download.
Troubleshooting
Problem: Can't Find CSV Files
Solution: The files might be in TAB format. The ingestion script handles both CSV and TAB files automatically.
Problem: Files Have Different Names
Solution: Edit the EXPECTED_FILES dictionary in discovery/localview_ingestion.py to match the actual filenames.
Problem: Data Format is Different
Solution: Check the README.txt or dataset documentation on Harvard Dataverse. Update the column mappings in the script to match.
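As a starting point, the two pieces of configuration might look something like this. This is a hypothetical shape, not the actual contents of discovery/localview_ingestion.py; check the script for the real dictionary names and structure before editing:

```python
# Hypothetical shape of the config in discovery/localview_ingestion.py.
# Rename values to match the files you actually downloaded.
EXPECTED_FILES = {
    "municipalities": "municipalities.csv",
    "meetings": "meetings.csv",
    "videos": "videos.csv",
}

# Map script field names -> the dataset's actual column names.
COLUMN_MAPPINGS = {
    "videos": {
        "url": "video_url",
        "platform": "platform",
        "length": "duration_minutes",
    },
}

def remap(row: dict, mapping: dict) -> dict:
    """Rename a raw row's columns according to a field mapping."""
    return {target: row.get(source) for target, source in mapping.items()}
```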
Problem: CAPTCHA Blocks Download
Solution:
- Use a web browser (not curl/wget)
- Complete the CAPTCHA verification
- Download files manually through the browser
- Save to data/cache/localview/
Data Quality & Coverage
Expected Coverage
Based on the LocalView research paper:
- 1,000-1,500 municipalities with verified meeting archives
- 5,000-10,000 meeting videos with URLs
- Coverage: Major cities + medium-sized municipalities
- Time range: 2015-2024 (approximately)
- Focus states: CA, MA, TX, NY, FL, IL (highest coverage)
Quality Indicators
- ✅ Academic validation - Harvard research project
- ✅ Human verification - URLs manually verified
- ✅ Transcript availability - Many include automated transcripts
- ✅ Continuous updates - Dataset may be updated periodically
Next Steps After Integration
1. Combine with Other Sources
# After running LocalView ingestion
python discovery/meetingbank_ingestion.py # 1,366 meetings
python discovery/city_scrapers_urls.py # 100-500 agencies
python discovery/openstates_sources.py # 50+ legislatures
# Total: 7,000-12,000 verified URLs!
2. Deduplicate URLs
Create a deduplication script to merge URLs from all sources:
# discovery/url_deduplication.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("url_deduplication").getOrCreate()

# Read all source tables
localview = spark.read.format("delta").load("bronze/localview_videos")
meetingbank = spark.read.format("delta").load("bronze/meetingbank_meetings")
city_scrapers = spark.read.format("delta").load("bronze/city_scrapers_urls")

# Union the sources on a shared schema, then deduplicate by URL
unique_urls = (
    localview.select("url", "platform", "municipality", "state")
    .union(meetingbank.select("url", "platform", "municipality", "state"))
    .union(city_scrapers.select("url", "platform", "municipality", "state"))
    .dropDuplicates(["url"])
)

print(f"Total unique URLs: {unique_urls.count()}")
3. Priority Scraping
Use LocalView data to prioritize which municipalities to scrape first:
-- Find municipalities with the most videos
SELECT municipality, state, COUNT(*) as video_count
FROM bronze.localview_videos
GROUP BY municipality, state
ORDER BY video_count DESC
LIMIT 100
Documentation Links
- Harvard Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM
- LocalView Research Paper: Search for "LocalView municipal meetings dataset" on Google Scholar
- Harvard Mellon Urbanism Initiative: https://www.gsd.harvard.edu/project/localview/
Expected Timeline
| Step | Time Required | Priority |
|---|---|---|
| Download files from Harvard Dataverse | 5-10 min | 🔥 HIGH |
| Run ingestion script | 2-5 min | 🔥 HIGH |
| Verify data quality | 5 min | 🟡 MEDIUM |
| Deduplication with other sources | 10 min | 🟡 MEDIUM |
Total time: ~30 minutes for complete integration
Questions?
If you encounter issues:
- Check the dataset documentation on Harvard Dataverse
- Look at example data in the first few rows
- Update column mappings in the script accordingly
- Run with the --sample flag first to test: python discovery/localview_ingestion.py --sample