
Wire preservation workflow to Archival Packaging Tool (APT) #1465

Open · wants to merge 3 commits into main
1 change: 1 addition & 0 deletions Gemfile
@@ -36,6 +36,7 @@ gem 'sentry-ruby'
gem 'simple_form'
gem 'skylight'
gem 'terser'
gem 'webmock'
gem 'zip_tricks'

group :production do
9 changes: 9 additions & 0 deletions Gemfile.lock
@@ -154,6 +154,9 @@ GEM
cocoon (1.2.15)
concurrent-ruby (1.3.5)
connection_pool (2.5.3)
crack (1.0.0)
bigdecimal
rexml
crass (1.0.6)
date (3.4.1)
delayed_job (4.1.13)
@@ -182,6 +185,7 @@ GEM
terminal-table (>= 1.8)
globalid (1.2.1)
activesupport (>= 6.1)
hashdiff (1.2.0)
hashie (5.0.0)
i18n (1.14.7)
concurrent-ruby (~> 1.0)
@@ -445,6 +449,10 @@ GEM
activemodel (>= 6.0.0)
bindex (>= 0.4.0)
railties (>= 6.0.0)
webmock (3.25.1)
addressable (>= 2.8.0)
crack (>= 0.3.2)
hashdiff (>= 0.4.0, < 2.0.0)
websocket (1.2.11)
websocket-driver (0.7.7)
base64
@@ -509,6 +517,7 @@ DEPENDENCIES
terser
timecop
web-console
webmock
zip_tricks

RUBY VERSION
27 changes: 20 additions & 7 deletions README.md
@@ -159,6 +159,19 @@ polling by specifying a longer queue wait time. Defaults to 10 if unset.
`SQS_RESULT_IDLE_TIMEOUT` - Configures the :idle_timeout arg of the AWS poll method, which specifies the maximum time
in seconds to wait for a new message before the polling loop exits. Defaults to 0 if unset.

### Archival Packaging Tool (APT) configuration

The following environment variables are needed to communicate with [APT](https://github.com/MITLibraries/archival-packaging-tool), which is used in the
[preservation workflow](#preservation-workflow).

`APT_CHALLENGE_SECRET` - Secret value used to authenticate requests to the APT Lambda endpoint.
`APT_VERBOSE` - If set to `true`, enables verbose logging for APT requests.
`APT_CHECKSUMS_TO_GENERATE` - Array of checksum algorithms to generate for files (default: ['md5']).
`APT_COMPRESS_ZIP` - Boolean value to indicate whether the output bag should be compressed as a zip
file (default: true).
`APT_S3_BUCKET` - S3 bucket URI where APT output bags are stored.
`APT_LAMBDA_URL` - The URL of the APT Lambda endpoint for preservation requests.
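
For local development, the values might look like the following sketch (all values are illustrative, not real credentials; `APT_CHECKSUMS_TO_GENERATE` is left unset here so the code's `['md5']` default applies):

```shell
APT_CHALLENGE_SECRET=some-shared-secret
APT_VERBOSE=true
APT_COMPRESS_ZIP=true
APT_S3_BUCKET=s3://example-apt-bucket
APT_LAMBDA_URL=https://example.lambda-url.us-east-1.on.aws/
```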

### Email configuration

`SMTP_ADDRESS`, `SMTP_PASSWORD`, `SMTP_PORT`, `SMTP_USER` - all required to send mail.
@@ -397,15 +410,15 @@ Note: `Pending publication` is allowed here, but not expected to be a normal occ
## Preservation workflow

The publishing workflow will automatically trigger preservation for all of the published theses in the results queue.
At this point, the preservation job will generate an Archivematica payload for each thesis, which is then POSTed to
[APT](https://github.com/MITLibraries/archival-packaging-tool) for further processing. Each payload includes a metadata CSV and a JSON object containing structural information about the thesis files.
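
A sketch of the JSON payload shape assembled by the `ArchivematicaPayload` model (bucket names, keys, paths, and checksums are illustrative):

```json
{
  "action": "create-bagit-zip",
  "challenge_secret": "some-shared-secret",
  "verbose": false,
  "input_files": [
    {
      "uri": "s3://example-etd-bucket/active-storage-blob-key",
      "filepath": "thesis.pdf",
      "checksums": { "md5": "b10a8db164e0754105b7a99be72e3fe5" }
    },
    {
      "uri": "s3://example-etd-bucket/another-blob-key",
      "filepath": "metadata/metadata.csv",
      "checksums": { "md5": "0cc175b9c0f1b6a831c399e269772661" }
    }
  ],
  "checksums_to_generate": ["md5"],
  "output_zip_s3_uri": "s3://example-apt-bucket/etdsip/2025/June-2025.123/1721.1_123456-thesis-1.zip",
  "compress_zip": true
}
```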

Once the payloads are sent to APT, each thesis is structured as a BagIt bag and saved to an S3
bucket, where it can be ingested into Archivematica.

A thesis can be sent to preservation more than once. In order to track provenance across multiple preservation events, we persist certain data about the Archivematica payload and audit the model
using `paper_trail`.
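
Because the model calls `has_paper_trail`, earlier preservation events can be inspected through paper_trail's standard versions API. A console sketch (the records shown are hypothetical):

```ruby
payload = thesis.archivematica_payloads.last
payload.versions.map(&:event) # => ["create", "update"]

# reify returns the payload as it was before the most recent change,
# e.g. before preservation_status flipped to "preserved"
payload.versions.last.reify
```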

### Preserving a single thesis

41 changes: 34 additions & 7 deletions app/jobs/preservation_submission_job.rb
@@ -1,13 +1,16 @@
require 'net/http'
require 'uri'

class PreservationSubmissionJob < ActiveJob::Base

queue_as :default

def perform(theses)
Rails.logger.info("Preparing to send #{theses.count} theses to preservation")
results = { total: theses.count, processed: 0, errors: [] }
theses.each do |thesis|
Rails.logger.info("Thesis #{thesis.id} is now being prepared for preservation")
payload = thesis.archivematica_payloads.create!
preserve_payload(payload)
Rails.logger.info("Thesis #{thesis.id} has been sent to preservation")
results[:processed] += 1
rescue StandardError, Aws::Errors => e
@@ -20,10 +23,34 @@ def perform(theses)

private

def preserve_payload(payload)
post_payload(payload)
payload.preservation_status = 'preserved'
payload.preserved_at = DateTime.now
payload.save!
end

def post_payload(payload)
apt_url = ENV.fetch('APT_LAMBDA_URL') # raises KeyError if unset
uri = URI.parse(apt_url)
request = Net::HTTP::Post.new(uri, { 'Content-Type' => 'application/json' })
request.body = payload.payload_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
http.request(request)
end

validate_response(response)
end

def validate_response(response)
unless response.is_a?(Net::HTTPSuccess)
raise "Failed to post Archivematica payload to APT: #{response.code} #{response.body}"
end

result = JSON.parse(response.body)
unless result['success'] == true
raise "APT failed to create a bag: #{response.body}"
end
end
end
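
The `webmock` gem added in this PR suggests the APT endpoint will be stubbed in tests. A minimal sketch using minitest and the fake endpoint configured in `config/environments/test.rb` (the fixture name and its baggable state are assumptions):

```ruby
require 'test_helper'
require 'webmock/minitest'

class PreservationSubmissionJobTest < ActiveJob::TestCase
  test 'posts payload to APT and marks it preserved' do
    # APT_LAMBDA_URL points at this fake host in the test environment
    stub_request(:post, 'https://fake-lambda.example.com/')
      .to_return(status: 200, body: { success: true }.to_json)

    thesis = theses(:one) # hypothetical fixture; must be baggable

    PreservationSubmissionJob.perform_now([thesis])

    assert thesis.archivematica_payloads.last.preserved?
  end
end
```
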
101 changes: 101 additions & 0 deletions app/models/archivematica_payload.rb
@@ -0,0 +1,101 @@
# == Schema Information
#
# Table name: archivematica_payloads
#
# id :integer not null, primary key
# preservation_status :integer default("unpreserved"), not null
# payload_json :text
# preserved_at :datetime
# thesis_id :integer not null
# created_at :datetime not null
# updated_at :datetime not null
#
# This class assembles a payload to send to the Archival Packaging Tool (APT), which then creates a bag for
# preservation. It includes the thesis files, metadata, and checksums. The payload is then serialized to JSON
# for transmission.
#
# Instances of this class are invalid without an associated thesis that has a DSpace handle, a copyright, and
# at least one attached file with no duplicate filenames.
#
# There is some intentional duplication between this and the SubmissionInformationPackage model. The
# SubmissionInformationPackage is the legacy model that was used to create the bag, but it is not
# used in the current APT workflow. We are retaining it for historical purposes.
class ArchivematicaPayload < ApplicationRecord
include Checksums
include Baggable

has_paper_trail
belongs_to :thesis
has_one_attached :metadata_csv

validates :baggable?, presence: true

before_create :set_metadata_csv, :set_payload_json

enum preservation_status: %i[unpreserved preserved]

private

# compress_zip is cast to a boolean because ENV values are strings, and APT strictly requires
# a boolean for this field.
def build_payload
{
action: 'create-bagit-zip',
challenge_secret: ENV.fetch('APT_CHALLENGE_SECRET', nil),
verbose: ActiveModel::Type::Boolean.new.cast(ENV.fetch('APT_VERBOSE', false)),
input_files: build_input_files,
checksums_to_generate: ENV.fetch('APT_CHECKSUMS_TO_GENERATE', ['md5']),
output_zip_s3_uri: bag_output_uri,
compress_zip: ActiveModel::Type::Boolean.new.cast(ENV.fetch('APT_COMPRESS_ZIP', true))
}
end

# Build input_files array from thesis files and attached metadata CSV
def build_input_files
files = thesis.files.map { |file| build_file_entry(file) }
files << build_file_entry(metadata_csv) # Metadata CSV is the only file that is generated in this model
files
end

# Build a file entry for each file, including the metadata CSV.
def build_file_entry(file)
{
uri: ["s3://#{ENV.fetch('AWS_S3_BUCKET')}", file.blob.key].join('/'),
filepath: set_filepath(file),
checksums: {
md5: base64_to_hex(file.blob.checksum)
}
}
end

def set_filepath(file)
file == metadata_csv ? 'metadata/metadata.csv' : file.filename.to_s
end

# The bag_name has to be unique because we use it as the basis of an ActiveStorage key. A UUID
# was not preferred: the target system adds its own UUID to the file when it arrives, which made
# filenames unwieldy with two embedded UUIDs, so we simply increment an integer instead.
def bag_name
safe_handle = thesis.dspace_handle.gsub('/', '_')
"#{safe_handle}-thesis-#{thesis.submission_information_packages.count + 1}"
end

# The bag_output_uri key is constructed to match the expected format for Archivematica.
def bag_output_uri
key = "etdsip/#{thesis.graduation_year}/#{thesis.graduation_month}-#{thesis.accession_number}/#{bag_name}.zip"
[ENV.fetch('APT_S3_BUCKET'), key].join('/')
end

def baggable?
baggable_thesis?(thesis)
end

def set_metadata_csv
csv_data = ArchivematicaMetadata.new(thesis).to_csv
metadata_csv.attach(io: StringIO.new(csv_data), filename: 'metadata.csv', content_type: 'text/csv')
end

def set_payload_json
self.payload_json = build_payload.to_json
end
end
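
The `base64_to_hex` helper comes from the `Checksums` concern, which is not part of this diff. ActiveStorage stores blob checksums as Base64-encoded MD5 digests, while APT expects hex, so the conversion is presumably something like this sketch:

```ruby
require 'base64'

# Decode the Base64 digest to raw bytes, then re-encode as lowercase hex.
def base64_to_hex(base64_checksum)
  Base64.decode64(base64_checksum).unpack1('H*')
end

base64_to_hex('sQqNsWTgdUEFt6mb5y4/5Q==')
# => "b10a8db164e0754105b7a99be72e3fe5" (the MD5 of "Hello World")
```
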
4 changes: 4 additions & 0 deletions app/models/submission_information_package.rb
@@ -14,6 +14,10 @@
# updated_at :datetime not null
#

# This model is no longer used, but it is retained for historical purposes and to preserve existing
# data. Its functionality has been replaced by the ArchivematicaPayload model, which is used in the
# current preservation workflow.
#
# Creates the structure for an individual thesis to be preserved in Archivematica according to the BagIt spec:
# https://datatracker.ietf.org/doc/html/rfc8493.
#
54 changes: 0 additions & 54 deletions app/models/submission_information_package_zipper.rb

This file was deleted.

1 change: 1 addition & 0 deletions app/models/thesis.rb
@@ -48,6 +48,7 @@ class Thesis < ApplicationRecord
has_many :users, through: :authors

has_many :submission_information_packages, dependent: :destroy
has_many :archivematica_payloads, dependent: :destroy

has_many_attached :files
has_one_attached :dspace_metadata
5 changes: 5 additions & 0 deletions config/environments/test.rb
@@ -40,9 +40,14 @@
ENV['SQS_RESULT_WAIT_TIME_SECONDS'] = '10'
ENV['SQS_RESULT_IDLE_TIMEOUT'] = '0'
ENV['AWS_REGION'] = 'us-east-1'
ENV['AWS_S3_BUCKET'] = 'fake-etd-bucket'
ENV['DSPACE_DOCTORAL_HANDLE'] = '1721.1/999999'
ENV['DSPACE_GRADUATE_HANDLE'] = '1721.1/888888'
ENV['DSPACE_UNDERGRADUATE_HANDLE'] = '1721.1/777777'
ENV['APT_CHALLENGE_SECRET'] = 'fake-challenge-secret'
ENV['APT_S3_BUCKET'] = 's3://fake-apt-bucket'
ENV['APT_LAMBDA_URL'] = 'https://fake-lambda.example.com/'
ENV['APT_COMPRESS_ZIP'] = 'true'

# While tests run files are not watched, reloading is not necessary.
config.enable_reloading = false
13 changes: 13 additions & 0 deletions db/migrate/20250624182142_create_archivematica_payloads.rb
@@ -0,0 +1,13 @@
class CreateArchivematicaPayloads < ActiveRecord::Migration[7.1]
def change
create_table :archivematica_payloads do |t|
t.integer :preservation_status, null: false, default: 0
t.text :payload_json
t.datetime :preserved_at

t.references :thesis, null: false, foreign_key: true

t.timestamps
end
end
end
13 changes: 12 additions & 1 deletion db/schema.rb
@@ -10,7 +10,7 @@
#
# It's strongly recommended that you check this file into your version control system.

ActiveRecord::Schema[7.1].define(version: 2025_06_24_182142) do
create_table "active_storage_attachments", force: :cascade do |t|
t.string "name", null: false
t.string "record_type", null: false
@@ -65,6 +65,16 @@
t.index ["degree_period_id"], name: "index_archivematica_accessions_on_degree_period_id", unique: true
end

create_table "archivematica_payloads", force: :cascade do |t|
t.integer "preservation_status", default: 0, null: false
t.text "payload_json"
t.datetime "preserved_at"
t.integer "thesis_id", null: false
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
t.index ["thesis_id"], name: "index_archivematica_payloads_on_thesis_id"
end

create_table "authors", force: :cascade do |t|
t.integer "user_id", null: false
t.integer "thesis_id", null: false
@@ -290,6 +300,7 @@
add_foreign_key "active_storage_attachments", "active_storage_blobs", column: "blob_id"
add_foreign_key "active_storage_variant_records", "active_storage_blobs", column: "blob_id"
add_foreign_key "archivematica_accessions", "degree_periods"
add_foreign_key "archivematica_payloads", "theses"
add_foreign_key "authors", "theses"
add_foreign_key "authors", "users"
add_foreign_key "degrees", "degree_types"