Convert the v03 pipeline queue to a real queue #1136
Conversation
@@ -92,6 +92,7 @@ async def test_loading_pipeline_enqueue(self):
     'projects_to_run': ['project_a'],
     'reference_genome': 'GRCh38',
     'sample_type': 'WGS',
+    'skip_check_sex_and_relatedness': False,
Was this just a stray line?
The tests were failing because that line was missing. No idea why, but it didn't seem related to my changes.
Interesting, I will take a look into this as it is unexpected!
Thank you!
v03_pipeline/lib/paths.py
Outdated
@@ -1,6 +1,8 @@
 import hashlib
 import os
+
+from uuid import uuid1
I think it might actually be better to take the request_id (currently a timestamp) being assigned by the pipeline_worker and instead pass it in to loading_pipeline_queue_path(). That value is lexically sortable, for example:
20250805-123456
20250805-223000
20250806-001500
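A minimal sketch of the sortability property described above. The helper name make_request_id is hypothetical; the point is that a `%Y%m%d-%H%M%S` timestamp sorts lexically in the same order as chronologically, so plain string comparison orders queued requests correctly:

```python
from datetime import datetime

def make_request_id(now: datetime) -> str:
    # A %Y%m%d-%H%M%S timestamp is zero-padded and fixed-width, so plain
    # string comparison orders request ids chronologically.
    return now.strftime('%Y%m%d-%H%M%S')

ids = [
    make_request_id(datetime(2025, 8, 5, 12, 34, 56)),
    make_request_id(datetime(2025, 8, 6, 0, 15, 0)),
    make_request_id(datetime(2025, 8, 5, 22, 30, 0)),
]
assert sorted(ids) == ['20250805-123456', '20250805-223000', '20250806-001500']
```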
Good idea! I'll give it a go
I've ultimately decided to do a combination of both methods. A queue file would look something like this: request_20250805-123456_<uuid>.json. I chose to keep the uuid in there to prevent losing jobs when multiple jobs are submitted within the same second.
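A rough sketch of the combined naming scheme described above, assuming a hypothetical queue_file_name helper (the real code uses uuid1; uuid4 is used here purely for illustration):

```python
import uuid
from datetime import datetime

def queue_file_name(now: datetime) -> str:
    # Timestamp first so names still sort chronologically; a uuid suffix
    # disambiguates jobs submitted within the same second.
    request_id = now.strftime('%Y%m%d-%H%M%S')
    return f'request_{request_id}_{uuid.uuid4().hex}.json'

name = queue_file_name(datetime(2025, 8, 5, 12, 34, 56))
assert name.startswith('request_20250805-123456_')
assert name.endswith('.json')
```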
v03_pipeline/lib/paths.py
Outdated
        f'request_{uuid1().int}.json',
    )


def get_oldest_queue_path() -> str:
This method might actually fit nicely in runs.py. You'll probably want to abstract the os.path.join(LOCAL_DISK_MOUNT_PATH, 'loading_pipeline_queue') into a helper in some way that would be shared between loading_pipeline_queue_path(run_id: str) and this method!
I moved the method to runs.py :)
@nvnieuwk Thanks for contributing! This has been on our roadmap and is definitely needed. I left a few comments, and we'll need to get the build passing, but the idea looks solid to me.
@@ -35,24 +35,6 @@ async def loading_pipeline_enqueue(request: web.Request) -> web.Response:
     except ValueError as e:
         raise web.HTTPBadRequest from e

     try:
We don't need to do this in this PR, but a back-pressure notion of "there are too many files in the queue" would make this more complete! I can add a ticket.
Good idea, what would be 'too many files' in this case?
what about 5 or 10?
We would probably want to queue more in our case. Would it be bad if the limit is higher? (I'm thinking 1000 even)
We could make it an env var? The reason for the low suggested limit is just the simplicity of the mechanism. If you're reliably queuing up dozens of requests from the UI, you might want to consider joint calling your VCF and loading multiple projects at once.
We're actively trying to make the pipeline more performant and better able to support concurrent loads as well, which will hopefully ease some of this burden!
That's a good idea, thank you! We sadly can't do analysis on large cohorts at our lab for several reasons, but having a way to configure that limit might be the best way to handle this.
Also really looking forward to the concurrent loads!
I added the LOADING_QUEUE_LIMIT environment variable with a default of 10 for this. The app will first check if the queue is full before adding a file to it and will return a 409 error in this case.
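The back-pressure check described above can be sketched as a small predicate that the enqueue handler would consult before writing a new request file (responding with HTTP 409 Conflict when it returns True). The function name queue_is_full is hypothetical:

```python
import os

# Configurable limit, defaulting to 10 as in the PR.
LOADING_QUEUE_LIMIT = int(os.environ.get('LOADING_QUEUE_LIMIT', '10'))

def queue_is_full(queued_files: list[str], limit: int = LOADING_QUEUE_LIMIT) -> bool:
    # The enqueue route would call this before creating a queue file and
    # respond with 409 Conflict when the queue is already at the limit.
    return len(queued_files) >= limit

assert not queue_is_full(['a.json'] * 9)
assert queue_is_full(['a.json'] * 10)
```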
v03_pipeline/lib/paths.py
Outdated
    )
    return os.path.join(
        loading_pipeline_queue_dir(),
        f'request_{run_id}_{uuid.uuid1().int}.json',
It might be simpler to just use random.randint(0, 99) or something similar here. We run into issues with run_ids being too big in our system sometimes, so just keeping it less verbose would be nice!
Ah, I see that you're not including it when you regex parse downstream. I think if you shorten the randomness you should be fine to include it as part of the run id.
Good idea, I'll just use the first 5 characters or something like that
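Truncating the uuid as suggested might look like this sketch (the helper name short_suffix is an assumption; uuid4 is used for illustration):

```python
import uuid

def short_suffix() -> str:
    # First 5 hex characters of a uuid: ~1M combinations, enough to avoid
    # collisions among jobs submitted within the same second while keeping
    # the run id short.
    return uuid.uuid4().hex[:5]

suffix = short_suffix()
assert len(suffix) == 5
assert all(c in '0123456789abcdef' for c in suffix)
```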
@nvnieuwk thanks for incorporating the feedback! Looks like I might need to move a few imports around to get the build passing, but things look fully functional and correct to me!
Awesome! I can also take a look at it tomorrow if you haven't done it by then :)
This PR implements actual queuing in the v03 pipeline by expanding the current system.
Main changes
- When a request hits the loading_pipeline_enqueue route, a new queue file is created with a name structured as follows: request_<uuid>.json, where <uuid> is a unique identifier that prevents any filename collisions