Commit a83339c

Wire preservation workflow to Archival Packaging Tool (APT)
Why these changes are being introduced:

DataEng has developed [APT](https://github.com/MITLibraries/archival-packaging-tool/) as middleware between ETD and Archivematica. This new application handles the BagIt logic, including creating bags in an S3 bucket connected to Archivematica. Thus, much of the SIP logic in ETD is no longer required.

Relevant ticket(s):

* [ETD-669](https://mitlibraries.atlassian.net/browse/ETD-669)

How this addresses that need:

This adds an ArchivematicaPayload model that effectively replaces the SIP model. The new model constructs the payload JSON expected by APT. Instantiations of the model generate and persist this JSON on create, along with the metadata CSV as an ActiveStorage attachment.

The other significant change is in the PreservationSubmissionJob. Previously, this job invoked the SubmissionInformationPackageZipper model to stream a serialized bag to S3. Now, it's responsible for POSTing the JSON data to APT and handling the response.

Side effects of this change:

* The tests that call APT use webmock and stubbed responses. We would normally use VCR for external API calls, but in this case it doesn't seem prudent to pollute the APT S3 bucket, as it's possible the current test bucket will become the bucket we use.
* The SIP model is retained for historical purposes. This is not ideal in terms of maintainability, but it feels important to retain that data, at least for the time being.
1 parent 92c6806 commit a83339c

14 files changed (+353, −135 lines)

Gemfile

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ gem 'sentry-ruby'
 gem 'simple_form'
 gem 'skylight'
 gem 'terser'
+gem 'webmock'
 gem 'zip_tricks'

 group :production do

Gemfile.lock

Lines changed: 9 additions & 0 deletions
@@ -153,6 +153,9 @@ GEM
     cocoon (1.2.15)
     concurrent-ruby (1.3.5)
     connection_pool (2.5.0)
+    crack (1.0.0)
+      bigdecimal
+      rexml
     crass (1.0.6)
     date (3.4.1)
     delayed_job (4.1.13)
@@ -180,6 +183,7 @@ GEM
       terminal-table (>= 1.8)
     globalid (1.2.1)
       activesupport (>= 6.1)
+    hashdiff (1.2.0)
     hashie (5.0.0)
     i18n (1.14.7)
       concurrent-ruby (~> 1.0)
@@ -440,6 +444,10 @@ GEM
       activemodel (>= 6.0.0)
       bindex (>= 0.4.0)
       railties (>= 6.0.0)
+    webmock (3.25.1)
+      addressable (>= 2.8.0)
+      crack (>= 0.3.2)
+      hashdiff (>= 0.4.0, < 2.0.0)
     websocket (1.2.11)
     websocket-driver (0.7.7)
       base64
@@ -504,6 +512,7 @@ DEPENDENCIES
   terser
   timecop
   web-console
+  webmock
   zip_tricks

 RUBY VERSION

README.md

Lines changed: 20 additions & 7 deletions
@@ -159,6 +159,19 @@ polling by specifying a longer queue wait time. Defaults to 10 if unset.
 `SQS_RESULT_IDLE_TIMEOUT` - Configures the :idle_timeout arg of the AWS poll method, which specifies the maximum time
 in seconds to wait for a new message before the polling loop exits. Defaults to 0 if unset.
 
+### Archival Packaging Tool (APT) configuration
+
+The following environment variables are needed to communicate with [APT](https://github.com/MITLibraries/archival-packaging-tool), which is used in the
+[preservation workflow](#preservation-workflow).
+
+`APT_CHALLENGE_SECRET` - Secret value used to authenticate requests to the APT Lambda endpoint.
+`APT_VERBOSE` - If set to `true`, enables verbose logging for APT requests.
+`APT_CHECKSUMS_TO_GENERATE` - Array of checksum algorithms to generate for files (default: ['md5']).
+`APT_COMPRESS_ZIP` - Boolean value indicating whether the output bag should be compressed as a zip
+file (default: true).
+`APT_S3_BUCKET` - S3 bucket URI where APT output bags are stored.
+`APT_LAMBDA_URL` - The URL of the APT Lambda endpoint for preservation requests.
+
 ### Email configuration
 
 `SMTP_ADDRESS`, `SMTP_PASSWORD`, `SMTP_PORT`, `SMTP_USER` - all required to send mail.
@@ -397,15 +410,15 @@ Note: `Pending publication` is allowed here, but not expected to be a normal occ
 ## Preservation workflow
 
 The publishing workflow will automatically trigger preservation for all of the published theses in the results queue.
-At this point a submission information package is generated for each thesis, then a bag is constructed, zipped, and
-streamed to an S3 bucket. (See the SubmissionInformationPackage and SubmissionInformationPackageZipper classes for more
-details on this part of the process.)
 
-Once they are in the S3 bucket, the bags are automatically replicated to the Digital Preservation S3 bucket, where they
-can be ingested into Archivematica.
+At this point, the preservation job generates an Archivematica payload for each thesis, which is
+then POSTed to [APT](https://github.com/MITLibraries/archival-packaging-tool) for further processing. Each payload includes a metadata CSV and a JSON object containing structural information about the thesis files.
+
+Once the payloads are sent to APT, each thesis is structured as a BagIt bag and saved to an S3
+bucket, where it can be ingested into Archivematica.
 
-A thesis can be sent to preservation more than once. In order to track provenance across multiple preservation events,
-we persist certain data about the SIP and audit the model using `paper_trail`.
+A thesis can be sent to preservation more than once. In order to track provenance across multiple
+preservation events, we persist certain data about the Archivematica payload and audit the model
+using `paper_trail`.
 
 ### Preserving a single thesis
Lines changed: 34 additions & 7 deletions
@@ -1,13 +1,16 @@
 class PreservationSubmissionJob < ActiveJob::Base
+  require 'net/http'
+  require 'uri'
+
   queue_as :default
 
   def perform(theses)
     Rails.logger.info("Preparing to send #{theses.count} theses to preservation")
     results = { total: theses.count, processed: 0, errors: [] }
     theses.each do |thesis|
       Rails.logger.info("Thesis #{thesis.id} is now being prepared for preservation")
-      sip = thesis.submission_information_packages.create!
-      preserve_sip(sip)
+      payload = thesis.archivematica_payloads.create!
+      preserve_payload(payload)
       Rails.logger.info("Thesis #{thesis.id} has been sent to preservation")
       results[:processed] += 1
     rescue StandardError, Aws::Errors => e
@@ -20,10 +23,34 @@ def perform(theses)
 
   private
 
-  def preserve_sip(sip)
-    SubmissionInformationPackageZipper.new(sip)
-    sip.preservation_status = 'preserved'
-    sip.preserved_at = DateTime.now
-    sip.save
+  def preserve_payload(payload)
+    post_payload(payload)
+    payload.preservation_status = 'preserved'
+    payload.preserved_at = DateTime.now
+    payload.save
+  end
+
+  def post_payload(payload)
+    s3_url = ENV.fetch('APT_LAMBDA_URL', nil)
+    uri = URI.parse(s3_url)
+    request = Net::HTTP::Post.new(uri, { 'Content-Type' => 'application/json' })
+    request.body = payload.payload_json
+
+    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
+      http.request(request)
+    end
+
+    validate_response(response)
+  end
+
+  def validate_response(response)
+    unless response.is_a?(Net::HTTPSuccess)
+      raise "Failed to post Archivematica payload to APT: #{response.code} #{response.body}"
+    end
+
+    result = JSON.parse(response.body)
+    unless result['success'] == true
+      raise "APT failed to create a bag: #{response.body}"
+    end
   end
 end
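The validation step above has two failure modes: a non-2xx HTTP status, and a 2xx response whose JSON body reports `success: false`. This can be sketched without Net::HTTP or the APT endpoint; `validate_apt_response` below is a hypothetical stand-in for the job's private method, taking the status code and body directly rather than a response object:

```ruby
require 'json'

# Hypothetical stand-in for PreservationSubmissionJob#validate_response:
# rejects non-2xx statuses first, then checks the JSON body's success flag.
def validate_apt_response(code, body)
  raise "Failed to post Archivematica payload to APT: #{code} #{body}" unless code.to_i.between?(200, 299)

  result = JSON.parse(body)
  raise "APT failed to create a bag: #{body}" unless result['success'] == true

  result
end

validate_apt_response('200', '{"success": true}') # accepted
begin
  validate_apt_response('200', '{"success": false, "error": "checksum mismatch"}')
rescue RuntimeError => e
  puts e.message
end
```

Note that both failure modes surface as `StandardError` subclasses, so they are caught by the job's `rescue StandardError, Aws::Errors` clause and recorded in `results[:errors]`.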

app/models/archivematica_payload.rb

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
+# == Schema Information
+#
+# Table name: archivematica_payloads
+#
+#  id                  :integer          not null, primary key
+#  preservation_status :integer          default("unpreserved"), not null
+#  payload_json        :text
+#  preserved_at        :datetime
+#  thesis_id           :integer          not null
+#  created_at          :datetime         not null
+#  updated_at          :datetime         not null
+#
+# This class assembles a payload to send to the Archival Packaging Tool (APT), which then creates a bag for
+# preservation. It includes the thesis files, metadata, and checksums. The payload is then serialized to JSON
+# for transmission.
+#
+# Instances of this class are invalid without an associated thesis that has a DSpace handle, a copyright, and
+# at least one attached file with no duplicate filenames.
+#
+# There is some intentional duplication between this and the SubmissionInformationPackage model. The
+# SubmissionInformationPackage is the legacy model that was used to create the bag, but it is not
+# used in the current APT workflow. We are retaining it for historical purposes.
+class ArchivematicaPayload < ApplicationRecord
+  include Checksums
+  include Baggable
+
+  has_paper_trail
+  belongs_to :thesis
+  has_one_attached :metadata_csv
+
+  validates :baggable?, presence: true
+
+  before_create :set_metadata_csv, :set_payload_json
+
+  enum preservation_status: %i[unpreserved preserved]
+
+  private
+
+  # compress_zip is cast to a boolean to override the string value from ENV. APT strictly requires
+  # a boolean for this field.
+  def build_payload
+    {
+      action: 'create-bagit-zip',
+      challenge_secret: ENV.fetch('APT_CHALLENGE_SECRET', nil),
+      verbose: ENV.fetch('APT_VERBOSE', false),
+      input_files: build_input_files,
+      checksums_to_generate: ENV.fetch('APT_CHECKSUMS_TO_GENERATE', ['md5']),
+      output_zip_s3_uri: bag_output_uri,
+      compress_zip: ActiveModel::Type::Boolean.new.cast(ENV.fetch('APT_COMPRESS_ZIP', true))
+    }
+  end
+
+  # Build the input_files array from the thesis files and the attached metadata CSV.
+  def build_input_files
+    files = thesis.files.map { |file| build_file_entry(file) }
+    files << build_file_entry(metadata_csv) # The metadata CSV is the only file generated in this model.
+    files
+  end
+
+  # Build a file entry for each file, including the metadata CSV.
+  def build_file_entry(file)
+    {
+      uri: ["s3://#{ENV.fetch('AWS_S3_BUCKET')}", file.blob.key].join('/'),
+      filepath: set_filepath(file),
+      checksums: {
+        md5: base64_to_hex(file.blob.checksum)
+      }
+    }
+  end
+
+  def set_filepath(file)
+    file == metadata_csv ? 'metadata/metadata.csv' : file.filename.to_s
+  end
+
+  # The bag_name has to be unique because we use it as the basis of an ActiveStorage key. A UUID was
+  # not preferred: the target system adds its own UUID to the file when it arrives, which made the
+  # filename unwieldy with two embedded UUIDs, so we simply increment integers.
+  def bag_name
+    safe_handle = thesis.dspace_handle.gsub('/', '_')
+    "#{safe_handle}-thesis-#{thesis.submission_information_packages.count + 1}"
+  end
+
+  # The bag_output_uri key is constructed to match the expected format for Archivematica.
+  def bag_output_uri
+    key = "etdsip/#{thesis.graduation_year}/#{thesis.graduation_month}-#{thesis.accession_number}/#{bag_name}.zip"
+    [ENV.fetch('APT_S3_BUCKET'), key].join('/')
+  end
+
+  def baggable?
+    baggable_thesis?(thesis)
+  end
+
+  def set_metadata_csv
+    csv_data = ArchivematicaMetadata.new(thesis).to_csv
+    metadata_csv.attach(io: StringIO.new(csv_data), filename: 'metadata.csv', content_type: 'text/csv')
+  end
+
+  def set_payload_json
+    self.payload_json = build_payload.to_json
+  end
+end
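The model's `build_file_entry` converts each blob's checksum with `base64_to_hex`, because ActiveStorage stores checksums base64-encoded while the payload sends hex digests. The `Checksums` concern is not shown in this diff, but the conversion it implies can be sketched with only the standard library (the method body here is an assumption, not the concern's actual code):

```ruby
require 'base64'
require 'digest'

# Assumed shape of Checksums#base64_to_hex: decode the base64 digest to raw
# bytes, then re-encode those bytes as a lowercase hex string.
def base64_to_hex(base64_checksum)
  Base64.decode64(base64_checksum).unpack1('H*')
end

# ActiveStorage-style base64 MD5 of some file contents...
b64 = Digest::MD5.base64digest('thesis.pdf contents')
# ...converts to the same value as a direct hex digest.
puts base64_to_hex(b64) == Digest::MD5.hexdigest('thesis.pdf contents') # prints "true"
```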

app/models/submission_information_package.rb

Lines changed: 4 additions & 0 deletions
@@ -14,6 +14,10 @@
 # updated_at :datetime not null
 #
+# This model is no longer used, but it is retained for historical purposes and to preserve existing
+# data. Its functionality has been replaced by the ArchivematicaPayload model, which is used in the
+# current preservation workflow.
+#
 # Creates the structure for an individual thesis to be preserved in Archivematica according to the BagIt spec:
 # https://datatracker.ietf.org/doc/html/rfc8493.
 #

app/models/submission_information_package_zipper.rb

Lines changed: 0 additions & 54 deletions
This file was deleted.

app/models/thesis.rb

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ class Thesis < ApplicationRecord
   has_many :users, through: :authors
 
   has_many :submission_information_packages, dependent: :destroy
+  has_many :archivematica_payloads, dependent: :destroy
 
   has_many_attached :files
   has_one_attached :dspace_metadata

config/environments/test.rb

Lines changed: 5 additions & 0 deletions
@@ -40,9 +40,14 @@
 ENV['SQS_RESULT_WAIT_TIME_SECONDS'] = '10'
 ENV['SQS_RESULT_IDLE_TIMEOUT'] = '0'
 ENV['AWS_REGION'] = 'us-east-1'
+ENV['AWS_S3_BUCKET'] = 'fake-etd-bucket'
 ENV['DSPACE_DOCTORAL_HANDLE'] = '1721.1/999999'
 ENV['DSPACE_GRADUATE_HANDLE'] = '1721.1/888888'
 ENV['DSPACE_UNDERGRADUATE_HANDLE'] = '1721.1/777777'
+ENV['APT_CHALLENGE_SECRET'] = 'fake-challenge-secret'
+ENV['APT_S3_BUCKET'] = 's3://fake-apt-bucket'
+ENV['APT_LAMBDA_URL'] = 'https://fake-lambda.example.com/'
+ENV['APT_COMPRESS_ZIP'] = 'true'
 
 # While tests run files are not watched, reloading is not necessary.
 config.enable_reloading = false
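`APT_COMPRESS_ZIP` is set here as the string `'true'`, which is why the ArchivematicaPayload model casts it with `ActiveModel::Type::Boolean` before building the payload. Outside of Rails the same idea can be sketched in plain Ruby; the false-value list below is a simplified assumption, not ActiveModel's exact set:

```ruby
# Simplified stand-in for ActiveModel::Type::Boolean#cast: any string not in
# an explicit "false-like" set is treated as true.
FALSE_VALUES = ['false', 'FALSE', 'f', '0', 'off'].freeze

def cast_boolean(value)
  !FALSE_VALUES.include?(value.to_s)
end

ENV['APT_COMPRESS_ZIP'] ||= 'true'
puts cast_boolean(ENV['APT_COMPRESS_ZIP']) # prints "true"
puts cast_boolean('false')                 # prints "false"
```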
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+class CreateArchivematicaPayloads < ActiveRecord::Migration[7.1]
+  def change
+    create_table :archivematica_payloads do |t|
+      t.integer :preservation_status, null: false, default: 0
+      t.text :payload_json
+      t.datetime :preserved_at
+
+      t.references :thesis, null: false, foreign_key: true
+
+      t.timestamps
+    end
+  end
+end
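The `preservation_status` integer column defaults to 0, which the model's `enum preservation_status: %i[unpreserved preserved]` declaration maps to `unpreserved` (matching the schema annotation's `default("unpreserved")`). Rails derives the mapping positionally, which can be sketched as:

```ruby
# Positional enum mapping as Rails derives it from %i[unpreserved preserved]:
# each symbol is assigned its array index as the stored integer value.
statuses = %i[unpreserved preserved].each_with_index.to_h

puts statuses[:unpreserved] # prints "0"
puts statuses[:preserved]   # prints "1"
```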
