E2e pipeline with schedule #2851
Conversation
☂️ Python Coverage
Overall Coverage: no new covered files; no covered modified files.
ItemProcessor = {
  StartAt = "Extract nodes and edges from source",
  States = {
    "Extract nodes and edges from source" = {
I think the extractors also need to run in a particular order, so this might not work as expected.
You may be able to express the steps more succinctly and parallelise some parts, but some extractors rely on previous extractions (through the source), specifically wikidata, which requires previous steps to have completed. cc @StepanBrychta
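One way to express that dependency, sketched below under assumed names (none of these states or ARNs are from this PR): run the independent extractors inside a parallel Map state, then chain the wikidata extractor after it with `Next`, so it only starts once the earlier extractions have completed.

```hcl
# Illustrative sketch only: independent extractors run in a parallel Map
# state; wikidata, which depends on earlier extractions, runs afterwards
# as a separate sequential step.
definition = jsonencode({
  StartAt = "Extract independent sources"
  States = {
    "Extract independent sources" = {
      Type      = "Map"
      ItemsPath = "$.independent_inputs" # hypothetical input path
      ItemProcessor = {
        StartAt = "Extract nodes and edges from source"
        States = {
          "Extract nodes and edges from source" = {
            Type     = "Task"
            Resource = "arn:aws:states:::ecs:runTask.sync" # hypothetical task
            End      = true
          }
        }
      }
      Next = "Extract wikidata"
    }
    "Extract wikidata" = {
      Type     = "Task"
      Resource = "arn:aws:states:::ecs:runTask.sync" # hypothetical task
      End      = true
    }
  }
})
```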
variable "state_machine_monthly_inputs" {
  type    = list(object({ label : string, transformer_type : string, entity_type : string }))
  default = [
Rather than as variable defaults, I would keep these values in a local.
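A minimal sketch of that suggestion, with an illustrative entry (the actual values are in the PR's variable default): keeping the inputs in a `locals` block means they can't be overridden from outside the module, which suits fixed pipeline configuration.

```hcl
# Illustrative sketch: the same inputs kept in a local rather than a
# variable default. Entry values here are hypothetical.
locals {
  state_machine_monthly_inputs = [
    {
      label            = "Example Source Nodes"
      transformer_type = "example_source"
      entity_type      = "nodes"
    }
  ]
}
```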
@@ -36,7 +36,7 @@ resource "aws_sfn_state_machine" "catalogue_graph_extractor" {
        "--entity-type",
        "{% $states.input.entity_type %}",
        "--stream-destination",
        "{% $states.input.stream_destination %}"
These variables are to allow the manual triggering of pipeline steps for testing.
          "entity_type" : "edges"
        },
        {
          "label" : "Catalogue Concept Nodes",
I think these can be moved into the daily inputs: the available catalogue concepts may change from day to day as new works with those concepts are added.
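The suggestion above could look something like the sketch below, assuming a daily-inputs variable exists alongside the monthly one (the variable name and `transformer_type` value are assumptions, not from this PR):

```hcl
# Illustrative sketch: the "Catalogue Concept Nodes" entry moved from the
# monthly inputs into a daily inputs list, so it is re-extracted each day.
variable "state_machine_daily_inputs" {
  type = list(object({ label : string, transformer_type : string, entity_type : string }))
  default = [
    {
      label            = "Catalogue Concept Nodes"
      transformer_type = "catalogue_concepts" # hypothetical value
      entity_type      = "nodes"
    }
  ]
}
```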
A couple of comments, but otherwise looks good.
What does this change?
wellcomecollection/platform#5969
This brings together the extraction and loading of nodes and edges from sources into the graph, and the ingestion of the resulting concepts data into Elasticsearch.
Once we've asserted that this works nicely, I reckon we can remove catalogue_graph_bulk_loaders and catalogue_graph_extractors, which are replaced by the Map states of the concepts_pipeline. Not sure about keeping catalogue_graph_pipeline around?

How to test
We have the pipeline-2025-03-06 that is not yet live. We could change the schedule to run this against it to test?

How can we measure success?
The concepts pipeline runs end to end on the desired schedule.

Have we considered potential risks?
We should do a test run on an index that is not yet live.

terraform plan