This application scrapes meeting information from the Tulsa Government (TGov) website, downloads and processes videos and documents, transcribes audio to text, and exposes the results through a set of APIs. The system is designed as a set of microservices, each with its own responsibility and data store.
The application is organized into four main services:
TGov service
Purpose: Scrape and provide access to Tulsa Government meeting data
- Scrapes the TGov website for meeting information
- Stores committee and meeting data
- Provides APIs for accessing meeting information
- Extracts video download URLs from viewer pages
Key Endpoints:
- GET /scrape/tgov - Trigger a scrape of the TGov website
- GET /tgov/meetings - List meetings with filtering options
- GET /tgov/committees - List all committees
- POST /tgov/extract-video-url - Extract a video URL from a viewer page
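As a rough illustration of calling the meetings endpoint with filters, the sketch below builds a request URL for GET /tgov/meetings. The source only says the endpoint supports "filtering options", so the filter names (`committeeId`, `limit`) and the helper `buildMeetingsUrl` are hypothetical.

```typescript
// Hypothetical filter shape: these field names are assumptions, not the
// service's documented query parameters.
interface MeetingFilter {
  committeeId?: string;
  limit?: number;
}

// Build the URL for GET /tgov/meetings, skipping filters that are not set.
function buildMeetingsUrl(baseUrl: string, filter: MeetingFilter = {}): string {
  const params: string[] = [];
  for (const [key, value] of Object.entries(filter)) {
    if (value !== undefined) {
      params.push(`${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`);
    }
  }
  const query = params.length > 0 ? `?${params.join("&")}` : "";
  // Strip trailing slashes from the base so the path is joined exactly once.
  return `${baseUrl.replace(/\/+$/, "")}/tgov/meetings${query}`;
}
```

For example, `buildMeetingsUrl("http://localhost:4000", { limit: 10 })` would target a locally running instance, assuming the API is served on that address.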
Media service
Purpose: Handle video downloading, processing, and storage
- Downloads videos from URLs
- Extracts audio tracks from videos
- Processes video batches in the background
- Provides APIs for accessing processed media
Key Endpoints:
- POST /api/videos/download - Download videos from URLs
- GET /api/media/:blobId/info - Get information about a media file
- GET /api/videos - List all stored videos
- GET /api/audio - List all stored audio files
- POST /api/videos/batch/queue - Queue a batch of videos for processing
- GET /api/videos/batch/:batchId - Get the status of a batch
- POST /api/videos/batch/process - Process the next batch of videos
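The three batch endpoints suggest a queue, check, process lifecycle. The in-memory sketch below models that lifecycle only; the class `BatchQueue`, the status names, and the ID format are assumptions for illustration, and the real service presumably persists batches in the Media database.

```typescript
// Assumed status values; the source does not list the actual ones.
type BatchStatus = "queued" | "processing" | "completed" | "failed";

interface Batch {
  id: string;
  urls: string[];
  status: BatchStatus;
}

class BatchQueue {
  private batches: Batch[] = [];
  private nextId = 1;

  // Models POST /api/videos/batch/queue: register a batch and return its ID.
  queue(urls: string[]): string {
    const id = `batch-${this.nextId++}`;
    this.batches.push({ id, urls: [...urls], status: "queued" });
    return id;
  }

  // Models GET /api/videos/batch/:batchId: look up the current status.
  status(id: string): BatchStatus | undefined {
    return this.batches.find((b) => b.id === id)?.status;
  }

  // Models POST /api/videos/batch/process: handle the oldest queued batch.
  // Returns the processed batch ID, or undefined if nothing was queued.
  processNext(handle: (url: string) => void): string | undefined {
    const batch = this.batches.find((b) => b.status === "queued");
    if (!batch) return undefined;
    batch.status = "processing";
    try {
      for (const url of batch.urls) handle(url);
      batch.status = "completed";
    } catch {
      batch.status = "failed";
    }
    return batch.id;
  }
}
```

Processing one batch per call keeps each invocation short, which fits a job that is triggered repeatedly on a schedule.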
Documents service
Purpose: Handle document storage and retrieval
- Downloads and stores documents from URLs
- Manages document metadata
- Links documents to meeting records
Key Endpoints:
- POST /api/documents/download - Download and store a document
- GET /api/documents - List documents with filtering options
- GET /api/documents/:id - Get a specific document
- POST /api/meeting-documents - Download and link meeting agenda documents
Transcription service
Purpose: Convert audio to text and manage transcriptions
- Transcribes audio files using the OpenAI Whisper API
- Stores and manages transcription results with time-aligned segments
- Processes transcription jobs asynchronously
- Provides APIs for accessing transcriptions
Key Endpoints:
- POST /transcribe - Request transcription for an audio file
- GET /jobs/:jobId - Get the status of a transcription job
- GET /transcriptions/:transcriptionId - Get a transcription by ID
- GET /meetings/:meetingId/transcriptions - Get all transcriptions for a meeting
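To make "time-aligned segments" concrete, here is a minimal sketch of how a stored transcription might be queried by playback time. The `Segment` shape (start and end in seconds, plus text) is an assumption about the stored data, and `segmentAt` is a hypothetical helper, not part of the service's API.

```typescript
// Assumed segment shape: start and end are offsets in seconds.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Binary search for the segment covering time t.
// Assumes segments are sorted by start time and do not overlap.
function segmentAt(segments: Segment[], t: number): Segment | undefined {
  let lo = 0;
  let hi = segments.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const s = segments[mid];
    if (t < s.start) hi = mid - 1;
    else if (t >= s.end) lo = mid + 1;
    else return s;
  }
  return undefined;
}
```

A lookup like this is what would let a client jump from a point in a meeting video to the matching text.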
Services communicate with each other using type-safe API calls through the Encore client library:
- Media → TGov: Media service calls TGov's extractVideoUrl endpoint to get download URLs
- Documents → TGov: Documents service calls TGov's listMeetings endpoint to get meeting data
- Media → TGov: Media service uses TGov's meeting data for processing videos
- Transcription → Media: Transcription service calls Media's getMediaFile endpoint to get audio file information
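The calls listed above might look roughly like the following from the caller's side. In an Encore.ts app the typed client would normally be imported from the generated `~encore/clients` module; to keep this sketch self-contained, a hand-written interface stands in for it. The request and response field names (`viewerUrl`, `videoUrl`) and the helper `resolveDownloadUrls` are assumptions.

```typescript
// Stand-in for the generated TGov client; field names are assumed.
interface TGovClient {
  extractVideoUrl(req: { viewerUrl: string }): Promise<{ videoUrl: string }>;
}

// Media-side helper: resolve each viewer page to a direct download URL
// by calling TGov's extractVideoUrl endpoint once per page.
async function resolveDownloadUrls(
  tgov: TGovClient,
  viewerUrls: string[],
): Promise<string[]> {
  const results = await Promise.all(
    viewerUrls.map((viewerUrl) => tgov.extractVideoUrl({ viewerUrl })),
  );
  return results.map((r) => r.videoUrl);
}

// Stub client for local experimentation; a real call would cross the
// service boundary instead of building a fake URL.
const stubTGov: TGovClient = {
  async extractVideoUrl({ viewerUrl }) {
    return { videoUrl: `${viewerUrl}/video.mp4` };
  },
};
```

Depending on an interface rather than a concrete client also makes the calling service easy to test without running TGov.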
Data flows through the services as follows:
- TGov service scrapes meeting information from the Tulsa Government website
- Media service obtains download URLs from TGov and processes videos, including audio extraction
- Documents service downloads and links agenda documents to meetings
- Transcription service converts audio files to text and stores the transcriptions
Each service has its own database:
- TGov Database: Stores committee and meeting information
- Media Database: Stores media file metadata and processing tasks
- Documents Database: Stores document metadata
- Transcription Database: Stores transcription jobs and results
Storage buckets:
- recordings: Video and audio files (managed by Media service)
- agendas: Document files (managed by Documents service)
- bucket-meta: Metadata for storage buckets
Scheduled jobs:
- daily-tgov-scrape: Daily scrape of the TGov website (12:01 AM)
- process-video-batches: Process video batches every 5 minutes
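The two schedules can be written as standard five-field cron expressions. The expressions below are inferred from the descriptions (the source does not show them or name a timezone), and `firesAt` is a hypothetical, deliberately tiny matcher that understands only the minute and hour fields.

```typescript
// Cron expressions assumed from the job descriptions
// (fields: minute hour day-of-month month day-of-week).
const schedules: Record<string, string> = {
  "daily-tgov-scrape": "1 0 * * *", // 12:01 AM every day
  "process-video-batches": "*/5 * * * *", // every 5 minutes
};

// Match one cron field; supports only "*", "*/n", and a plain number.
function fieldMatches(field: string, value: number): boolean {
  if (field === "*") return true;
  if (field.startsWith("*/")) return value % Number(field.slice(2)) === 0;
  return Number(field) === value;
}

// Check only the minute and hour fields, which is enough for these two jobs.
function firesAt(expr: string, hour: number, minute: number): boolean {
  const [min, hr] = expr.trim().split(/\s+/);
  return fieldMatches(min, minute) && fieldMatches(hr, hour);
}
```

With these expressions, the scrape job matches only at hour 0, minute 1, while the batch job matches any minute divisible by 5.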