📚 LocalView Integration Guide

Overview

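LocalView is a Harvard University dataset of municipal meeting videos and transcripts. It is expected to cover roughly 1,000-1,500 municipalities and 5,000-10,000 meeting video URLs (see Expected Coverage), and is described as the largest known database of municipal meeting video archives.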

Challenge: The Harvard Dataverse site requires JavaScript and may present a CAPTCHA, so the files have to be downloaded manually through a browser.

Step-by-Step Download Instructions

1. Visit the Harvard Dataverse Website

URL: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM

2. Navigate to the Files Section

Once the page loads:

Scroll down to the "Files" section
Look for downloadable CSV/TAB files with names like:
- municipalities.csv or municipalities.tab
- meetings.csv or meetings.tab
- videos.csv or videos.tab
- Or similar naming patterns

3. Download the Files

Click the "Download" button for each file:

Download all CSV/TAB files related to municipalities, meetings, and videos
Save them to: /home/developer/projects/open-navigator/data/cache/localview/

Expected files (names may vary):

data/cache/localview/
├── municipalities.csv     # List of municipalities with URLs
├── meetings.csv           # Meeting metadata
├── videos.csv             # Video URLs and metadata
└── README.txt             # Dataset documentation (if available)
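
Before running the ingestion script, it can help to confirm which of the expected files actually landed in the cache directory. The sketch below is a minimal check, assuming the base names from the listing above and either a .csv or .tab extension; `find_localview_files` and both constants are hypothetical names for illustration, not part of the ingestion script.

```python
from pathlib import Path

# Hypothetical base names taken from the "Expected files" listing above;
# the real Dataverse filenames may differ.
EXPECTED_STEMS = ("municipalities", "meetings", "videos")
ALLOWED_SUFFIXES = (".csv", ".tab")


def find_localview_files(cache_dir):
    """Map each expected base name to the first matching file, or None."""
    cache = Path(cache_dir)
    found = {}
    for stem in EXPECTED_STEMS:
        candidates = [cache / f"{stem}{suffix}" for suffix in ALLOWED_SUFFIXES]
        found[stem] = next((p for p in candidates if p.is_file()), None)
    return found


if __name__ == "__main__":
    # Relative path as used elsewhere in this guide; run from the project root.
    for stem, path in find_localview_files("data/cache/localview").items():
        print(f"{stem}: {path if path else 'MISSING'}")
```

A base name reported as MISSING usually means the file was saved under a different name, which is covered in Troubleshooting.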

4. Expected Data Structure

The LocalView dataset typically includes:

Municipalities file (municipalities.csv):

municipality_name - City/town name
state - Two-letter state code
county - County name
population - Population count
website_url - Official government website
meeting_page_url - Link to meetings/agendas page
video_archive_url - Link to video archive

Meetings file (meetings.csv):

meeting_id - Unique identifier
municipality_name - City/town name
meeting_date - Date of meeting
meeting_type - Type (Council, Planning, etc.)
video_url - Direct link to video
transcript_available - Boolean flag
transcript_url - Link to transcript (if available)

Videos file (videos.csv):

video_id - Unique identifier
video_url - Direct video link
platform - Platform (YouTube, Granicus, etc.)
duration_minutes - Video length
has_captions - Caption availability
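
Because the column lists above describe what the dataset typically includes, not a confirmed schema, a quick header check can show where a downloaded file deviates. This is a sketch under that assumption: the column names are copied from the lists above, and `read_header` and `missing_columns` are hypothetical helpers.

```python
import csv
import io

# Column names copied from the lists above. Treat them as expectations to
# verify against the real files, not as a confirmed schema.
EXPECTED_COLUMNS = {
    "municipalities": [
        "municipality_name", "state", "county", "population",
        "website_url", "meeting_page_url", "video_archive_url",
    ],
    "meetings": [
        "meeting_id", "municipality_name", "meeting_date", "meeting_type",
        "video_url", "transcript_available", "transcript_url",
    ],
    "videos": [
        "video_id", "video_url", "platform", "duration_minutes", "has_captions",
    ],
}


def read_header(stream, delimiter=","):
    """Return the first row of a delimited text stream as a list of names."""
    return next(csv.reader(stream, delimiter=delimiter), [])


def missing_columns(table, header):
    """List the expected columns for `table` that are absent from `header`."""
    present = {name.strip() for name in header}
    return [column for column in EXPECTED_COLUMNS[table] if column not in present]


if __name__ == "__main__":
    sample = io.StringIO("video_id,video_url,platform\n")
    print(missing_columns("videos", read_header(sample)))
```

An empty list means every expected column is present; anything else is a candidate for the column mappings discussed in Troubleshooting.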

Integration Script Usage

After Downloading Files

Once you've downloaded the files to data/cache/localview/, run:

cd /home/developer/projects/open-navigator
source venv/bin/activate

# Run the ingestion script
python discovery/localview_ingestion.py

What the Script Does

Reads downloaded CSV files from cache directory
Parses municipality data - Names, states, URLs
Extracts video URLs - Direct links to meeting videos
Identifies platforms - YouTube, Granicus, Vimeo, Archive.org
Writes to Bronze layer - bronze/localview_municipalities and bronze/localview_videos
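
The platform identification step can be pictured as a lookup on the URL's host name. The following is an illustrative sketch only; the domain table and the `detect_platform` function are assumptions, and the actual logic in discovery/localview_ingestion.py may differ.

```python
from urllib.parse import urlparse

# Assumed domain-to-platform table covering the platforms named above.
PLATFORM_DOMAINS = {
    "YouTube": ("youtube.com", "youtu.be"),
    "Granicus": ("granicus.com",),
    "Vimeo": ("vimeo.com",),
    "Archive.org": ("archive.org",),
}


def detect_platform(url):
    """Classify a video URL by host name, falling back to "Other"."""
    host = (urlparse(url).hostname or "").lower()
    for platform, domains in PLATFORM_DOMAINS.items():
        # Match the domain itself or any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in domains):
            return platform
    return "Other"


if __name__ == "__main__":
    print(detect_platform("https://www.youtube.com/watch?v=abc123"))
```

Matching on the parsed host name, not on a substring of the whole URL, avoids misclassifying a page whose path or query string happens to mention a platform. URLs without a scheme have no parsable host and fall through to "Other".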

Expected Output

[INFO] Loading LocalView data from cache...
[INFO] Found 1,247 municipalities
[INFO] Found 8,453 meeting videos
[INFO] Platforms detected:
  - YouTube: 3,421 videos
  - Granicus: 2,876 videos
  - Vimeo: 1,234 videos
  - Other: 922 videos
[SUCCESS] ✓ Written 1,247 municipalities to bronze/localview_municipalities
[SUCCESS] ✓ Written 8,453 videos to bronze/localview_videos

Alternative: API Access (If Available)

Harvard Dataverse exposes a native HTTP API, so the dataset metadata may be readable without a browser. Check whether the LocalView dataset responds:

# Check for API availability
curl -I "https://dataverse.harvard.edu/api/datasets/:persistentId/?persistentId=doi:10.7910/DVN/NJTBEM"

If successful, we can update the script to use the API instead of manual download.
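
If the endpoint does respond, the file listing could be read from the JSON metadata. The sketch below assumes the response nests files under data.latestVersion.files with a dataFile object per entry, which is how Dataverse dataset metadata is commonly laid out; verify this against an actual response before relying on it. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/NJTBEM"


def dataset_metadata_url(doi=DOI, base_url=BASE_URL):
    """Build the same metadata URL used in the curl check above."""
    query = urlencode({"persistentId": doi})
    return f"{base_url}/api/datasets/:persistentId/?{query}"


def list_files(payload):
    """Extract (file id, filename) pairs from a metadata response.

    Assumes files live under data.latestVersion.files; entries without a
    dataFile object are skipped.
    """
    files = payload.get("data", {}).get("latestVersion", {}).get("files", [])
    return [
        (entry["dataFile"]["id"], entry["dataFile"]["filename"])
        for entry in files
        if "dataFile" in entry
    ]


def fetch_file_listing(timeout=30):
    """Fetch and parse the live metadata (requires network access)."""
    with urlopen(dataset_metadata_url(), timeout=timeout) as response:
        return list_files(json.load(response))
```

Calling `fetch_file_listing()` may still fail if the dataset requires accepting terms of use or if the site blocks automated clients, in which case the manual download remains the fallback.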

Troubleshooting

Problem: Can't Find CSV Files

Solution: The files might be in TAB format. The ingestion script handles both CSV and TAB files automatically.
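
For spot-checking a file outside the script, both formats can be read with the standard csv module by choosing the delimiter from the file extension. This is a sketch of one way to do it, not a description of how the ingestion script is implemented; `read_rows` is a hypothetical helper.

```python
import csv
from pathlib import Path

# Delimiter chosen by extension; .tab files from Dataverse are tab-separated.
DELIMITERS = {".csv": ",", ".tab": "\t", ".tsv": "\t"}


def read_rows(path):
    """Read a CSV or TAB file into a list of dicts keyed by column name."""
    path = Path(path)
    delimiter = DELIMITERS.get(path.suffix.lower())
    if delimiter is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

For example, `read_rows("data/cache/localview/videos.tab")[:5]` would return the first five rows for inspection, assuming that file exists.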

Problem: Files Have Different Names

Solution: Edit the EXPECTED_FILES dictionary in discovery/localview_ingestion.py to match the actual filenames.

Problem: Data Format is Different

Solution: Check the README.txt or dataset documentation on Harvard Dataverse. Update the column mappings in the script to match.
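
A column mapping can be as simple as a dictionary from the names found in the downloaded file to the names this guide expects. The source names below ("city", "st", "url") are invented for illustration, and `COLUMN_MAPPINGS` and `rename_columns` are hypothetical; the actual structure used in the script may look different.

```python
# Hypothetical mapping: keys are column names as they appear in the downloaded
# file, values are the names listed under Expected Data Structure.
COLUMN_MAPPINGS = {
    "city": "municipality_name",
    "st": "state",
    "url": "video_url",
}


def rename_columns(row, mapping):
    """Return a copy of `row` with mapped keys renamed and others unchanged."""
    return {mapping.get(key, key): value for key, value in row.items()}


if __name__ == "__main__":
    print(rename_columns({"city": "Boston", "st": "MA"}, COLUMN_MAPPINGS))
```

Columns that are not in the mapping pass through untouched, so the mapping only needs to list the names that actually differ.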

Problem: CAPTCHA Blocks Download

Solution:

Use a web browser (not curl/wget)
Complete the CAPTCHA verification
Download files manually through the browser
Save to data/cache/localview/

Data Quality & Coverage

Expected Coverage

Based on the LocalView research paper:

1,000-1,500 municipalities with verified meeting archives
5,000-10,000 meeting videos with URLs
Coverage: Major cities + medium-sized municipalities
Time range: 2015-2024 (approximately)
Focus states: CA, MA, TX, NY, FL, IL (highest coverage)

Quality Indicators

✅ Academic validation - Harvard research project
✅ Human verification - URLs manually verified
✅ Transcript availability - Many include automated transcripts
✅ Periodic updates - Dataset may be updated over time

Next Steps After Integration

1. Combine with Other Sources

# After running LocalView ingestion
python discovery/meetingbank_ingestion.py      # 1,366 meetings
python discovery/city_scrapers_urls.py         # 100-500 agencies
python discovery/openstates_sources.py         # 50+ legislatures

# Total: 7,000-12,000 verified URLs!

2. Deduplicate URLs

Create a deduplication script to merge URLs from all sources:

# discovery/url_deduplication.py
from pyspark.sql import SparkSession

# Assumes the session is configured with Delta Lake support
spark = SparkSession.builder.appName("url_deduplication").getOrCreate()

# Columns assumed to be shared by every source table; adjust to the actual Bronze schemas
columns = ["url", "platform", "municipality", "state"]

# Read all source tables
localview = spark.read.format("delta").load("bronze/localview_videos")
meetingbank = spark.read.format("delta").load("bronze/meetingbank_meetings")
city_scrapers = spark.read.format("delta").load("bronze/city_scrapers_urls")

# Combine by column name, then keep one row per URL
unique_urls = (
    localview.select(*columns)
    .unionByName(meetingbank.select(*columns))
    .unionByName(city_scrapers.select(*columns))
    .dropDuplicates(["url"])
)

print(f"Total unique URLs: {unique_urls.count()}")
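
Dropping duplicates on the raw URL string treats trivially different spellings of the same address as distinct. One possible refinement, sketched here in plain Python, is to compare URLs after light normalization. The rules chosen (lowercase scheme and host, strip trailing slash, drop fragment) are an assumption about what counts as "the same URL", and `normalize_url` and `dedupe_urls` are hypothetical helpers.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, strip a trailing slash, drop the fragment.

    The path and query keep their case because video IDs can be case-sensitive.
    """
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        parts.query,
        "",  # fragment removed
    ))


def dedupe_urls(urls):
    """Keep the first occurrence of each URL, compared after normalization."""
    seen = {}
    for url in urls:
        seen.setdefault(normalize_url(url), url)
    return list(seen.values())


if __name__ == "__main__":
    print(dedupe_urls([
        "https://Example.gov/meetings/",
        "https://example.gov/meetings#video",
    ]))
```

In the Spark job, the same function could be applied to a derived column before deduplicating on that column; whether the extra matching is worth it depends on how inconsistent the source URLs turn out to be.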

3. Priority Scraping

Use LocalView data to prioritize which municipalities to scrape first:

-- Find municipalities with the most videos
SELECT municipality, state, COUNT(*) as video_count
FROM bronze.localview_videos
GROUP BY municipality, state
ORDER BY video_count DESC
LIMIT 100

Documentation Links

Harvard Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/NJTBEM
LocalView Research Paper: Search for "LocalView municipal meetings dataset" on Google Scholar
Harvard Mellon Urbanism Initiative: https://www.gsd.harvard.edu/project/localview/

Expected Timeline

Step	Time Required	Priority
Download files from Harvard Dataverse	5-10 min	🔥 HIGH
Run ingestion script	2-5 min	🔥 HIGH
Verify data quality	5 min	🟡 MEDIUM
Deduplication with other sources	10 min	🟡 MEDIUM

Total time: ~30 minutes for complete integration

Questions?

If you encounter issues:

Check the dataset documentation on Harvard Dataverse
Look at example data in the first few rows
Update column mappings in the script accordingly
Run with --sample flag first to test: python discovery/localview_ingestion.py --sample
