A Rails application for monitoring websites and automating the accessibility remediation of PDF files.
Before you begin, ensure you have the following installed:
- Ruby 3.2.2 (we recommend using a version manager like rbenv or rvm)
- Node.js 18.17.0 (we recommend using nvm for version management)
- Yarn (latest version)
- Redis (for Sidekiq background jobs)
- SQLite3 (default database)
- Docker and Docker Compose (for LocalStack S3 in development)
- Clone the repository:

  ```shell
  git clone https://github.com/codeforamerica/asap_pdf.git
  cd asap_pdf
  ```
- Install Ruby dependencies:

  ```shell
  bundle install
  ```
- Install JavaScript dependencies:

  ```shell
  yarn install
  ```
- Set up the database:

  ```shell
  bin/rails db:setup
  ```
Start the development server and all required processes:

```shell
bin/dev
```

This command starts the following processes (defined in `Procfile.dev`):
- Rails server
- JavaScript build process (with esbuild)
- CSS build process (with Tailwind CSS)
- Sidekiq worker for background jobs
The application will be available at http://localhost:3000
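A `Procfile.dev` wiring up those four processes typically looks something like the sketch below; the exact commands are assumptions for illustration, not copied from the repository:

```
web: bin/rails server
js: yarn build --watch
css: yarn build:css --watch
worker: bundle exec sidekiq
```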
- Frontend: Built with Hotwired (Turbo + Stimulus) and Tailwind CSS
- Backend: Ruby on Rails 7.0
- Background Jobs: Sidekiq with Redis
- Testing: RSpec
The application includes several Python components for PDF processing:
- Site Crawler: Discovers PDF files on government websites
- Metadata Downloader: Downloads and extracts PDF metadata
- Document Classifier: Determines document types using LLM
- Policy Reviewer: Checks WCAG 2.1 compliance
- Document Transformer: Converts PDFs to MD/HTML
To set up the Python components, follow these instructions.
Background jobs are processed using Sidekiq; the Redis server must be running for Sidekiq to work. Queues are defined in `config/sidekiq.yml` in the following order of priority:
- critical
- default
- low
- mailers
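A `config/sidekiq.yml` declaring those queues might look like the sketch below; the concurrency setting and queue weights are assumptions to illustrate relative priority:

```yaml
:concurrency: 5
:queues:
  - [critical, 4]
  - [default, 2]
  - [low, 1]
  - [mailers, 1]
```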
Example of enqueuing a background job:

```ruby
# The job class lives in app/jobs/example_job.rb
ExampleJob.perform_later("argument")
```
Run the test suite:

```shell
bundle exec rails test:prepare
bundle exec rspec
```
The project includes several development tools:
- Brakeman: Security analysis (`bin/brakeman`)
- RuboCop: Code style checking (`bin/rubocop`)
- Overcommit: Git hooks management
- Better Errors: Enhanced error pages in development
- Bullet: N+1 query detection
The following environment variables can be configured:
- `REDIS_URL`: Redis connection URL (default: `redis://localhost:6379/0`)
- `AWS_ACCESS_KEY_ID`: AWS access key for S3 in production
- `AWS_SECRET_ACCESS_KEY`: AWS secret key for S3 in production
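In application code, these variables can be read with `ENV.fetch`, falling back to the documented default; the helper below is illustrative, not code from the app:

```ruby
# Illustrative helper: resolve the Redis URL, falling back to the
# documented default when REDIS_URL is not set in the environment.
def redis_url
  ENV.fetch("REDIS_URL", "redis://localhost:6379/0")
end
```

Recent Sidekiq versions read `REDIS_URL` from the environment by default, so in many setups no extra Redis configuration is needed.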
The application provides a RESTful API using Grape. API documentation is available through Swagger UI.
Once the application is running:
- Visit http://localhost:3000/api-docs to access the Swagger UI
- Browse and test the available endpoints directly in your browser
- `GET /api/v1/sites`
  - Lists all sites
  - Returns site details excluding `user_id`, `created_at`, `updated_at`
  - Includes `s3_endpoint` for each site
- `GET /api/v1/sites/:id`
  - Retrieves a specific site
  - Returns site details excluding `user_id`, `created_at`, `updated_at`
  - Includes `s3_endpoint`
- `POST /api/v1/documents/inference`
  - Adds or updates a document inference record
  - Intended for storing AI results for documents
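Assuming the app is running locally on port 3000, the sites endpoints can be exercised with plain `Net::HTTP`; the helper below only builds the request URIs, and its name is illustrative:

```ruby
require "net/http"
require "json"

API_BASE = "http://localhost:3000"

# Build the URI for the sites index, or for a single site when an id is given.
def sites_uri(id = nil)
  path = id ? "/api/v1/sites/#{id}" : "/api/v1/sites"
  URI("#{API_BASE}#{path}")
end

# Usage (requires a running server):
#   response = Net::HTTP.get_response(sites_uri)
#   sites = JSON.parse(response.body) # each entry includes "s3_endpoint"
```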
The application uses S3 with automatic versioning to track document changes over time. This allows us to maintain a complete history of each document as it changes externally (e.g., when a city updates their PDF) or internally (e.g., after accessibility improvements).
See `architecture.md` for detailed documentation of the storage system.
- Development Setup:

  ```shell
  docker-compose up
  ```

  This starts LocalStack with S3 versioning enabled, matching production behavior.
- Production Setup:

  See `deployment.md` for detailed documentation on deployment.
- S3 Path Structure: The application uses a standardized S3 path structure for all documents:

  ```ruby
  # Global S3 bucket
  S3_BUCKET = "s3://cfa-aistudio-asap-pdf"

  # Site-specific paths
  site.s3_endpoint        # Returns "s3://cfa-aistudio-asap-pdf/site-url-host"
  site.s3_endpoint_prefix # Returns "site-url-host" (sanitized hostname)
  site.s3_key_for("file") # Returns "site-url-host/file"
  ```

  Each site gets its own prefix based on its primary URL's hostname, ensuring unique and organized storage paths.
- Working with Versions:

  ```ruby
  # Get document versions
  document.latest_file       # Most recent version
  document.file_versions     # All versions
  document.file_version(id)  # Specific version

  # Get version details
  version = document.latest_file
  document.version_metadata(version) # Version metadata
  ```
The system automatically maintains version history as files change, with no additional configuration needed. Each document has its own path in S3 based on its site's URL and document ID.
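As a rough sketch of the path convention described above, a site's S3 prefix can be derived from its primary URL's hostname with the standard `URI` library. The sanitization rule below is an assumption for illustration; the real model methods are instance methods on `Site`:

```ruby
require "uri"

S3_BUCKET = "s3://cfa-aistudio-asap-pdf"

# Derive a site's S3 prefix from its primary URL's hostname.
# The exact sanitization rule here is assumed, not taken from the app.
def s3_endpoint_prefix(primary_url)
  URI.parse(primary_url).host.downcase.gsub(/[^a-z0-9.-]/, "-")
end

def s3_key_for(primary_url, file)
  "#{s3_endpoint_prefix(primary_url)}/#{file}"
end

def s3_endpoint(primary_url)
  "#{S3_BUCKET}/#{s3_endpoint_prefix(primary_url)}"
end
```

Because each key is namespaced under the site's hostname, two sites can publish a file with the same name without colliding in the shared bucket.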
- Ensure all tests pass and no new RuboCop violations are introduced
- Update documentation as needed
- Follow the existing code style and conventions
This project is licensed under CC0 1.0 Universal. See the LICENSE file for details.