
Consider being agnostic to the backend (currently google cloud) #41

Open
sjfleming opened this issue Jul 28, 2022 · 5 comments
Labels: enhancement (New feature or request)

Comments

@sjfleming (Member)

(Only if people actually want / need this. But I assume some people might. I think the Imaging Platform stores a lot of data on AWS.)

Supposedly Terra will support multiple backends (GCP, AWS, Azure) in the near future. All of our "gsutil" commands (which somewhat break the usual WDL logic) work only on GCP.

We should think about whether we can do everything strictly in WDL, without any gsutil commands, or whether we can have separate "cloud file copying" commands for each backend, calling the right one where appropriate.
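The second option could be sketched as a small dispatcher that picks a copy tool based on the URI scheme. This is purely an illustration, not code from this repo: the wrapper name `cloud_copy_cmd` and its structure are assumptions, though gsutil, `aws s3 cp`, and azcopy are the standard copy CLIs for GCP, AWS, and Azure respectively.

```shell
#!/usr/bin/env bash
# Illustrative sketch: choose a copy command based on the URI scheme, so a
# single WDL task command block could call one wrapper on any backend.
# (Hypothetical wrapper; echoes the command it would run rather than running it.)
cloud_copy_cmd() {
  local src="$1" dst="$2"
  case "$src" in
    gs://*)  echo "gsutil -m cp -r $src $dst" ;;           # Google Cloud
    s3://*)  echo "aws s3 cp --recursive $src $dst" ;;     # AWS
    https://*.blob.core.windows.net/*)
             echo "azcopy copy $src $dst --recursive" ;;   # Azure Blob Storage
    *)       echo "cp -r $src $dst" ;;                     # local fallback
  esac
}

cloud_copy_cmd "gs://bucket/plate" "/cromwell_root/plate"
```

Echoing instead of executing keeps the sketch inspectable; a real task would run the resulting command directly.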

@lynnlangit

If there is a need for this on AWS or Azure, I would be interested in contributing to this work.

@sjfleming (Member, Author)

Interesting that you should say that @lynnlangit ! We recently had a request to help make this work on AWS (from an AWS solutions architect working on Amazon Omics). We don't have many internal Broad users wanting this at the moment, but much of the Imaging Platform does its work on AWS (not using Terra or Cromwell). And, institutionally, there is currently a push within Broad's Data Sciences Platform to get workflows up and running on Azure due to a collaboration with Microsoft.

So we would welcome any contribution you'd be interested in making!

I will mention though: we actively use the current google backend to analyze data, so we want to ensure that part doesn't break / change too much... I think the best path forward is probably to

have separate sorts of "cloud file copying" commands for separate backends

even though this is not the way WDL is supposed to work. But we are open to other opinions! (If we could write one set of WDLs that are really agnostic to the backend, that would be fantastic. The reason we didn't do that at the outset is that there are just so many individual input files - images - involved. There are several ways we could get around this though...)

I also don't think I have a way to test workflows on AWS personally. It would be easier for us to test (using Terra) workflows on Azure, since "Terra on Azure" is now live. I don't really know how I'd review PRs for something running on AWS until I can figure out how to test it...

@carmendv @deflaux

@lynnlangit

@sjfleming - thanks for the info - fyi...

  • I am working with a customer to implement Google Batch in multiple workflow scenarios (an update from the Life Sciences API)
  • I am an MSFT Regional Director (an award, not an employee role), have connections to product teams at MSFT, and have been following your work with Terra on Azure (currently waiting for your team to approve my early access)
  • I am also doing a project for a customer which includes a public repo 'aws-for-bioinformatics' with samples and patterns.

Given this - what is the next step on this project?

@deflaux (Contributor) commented Feb 22, 2023

It's great to hear that you would like to contribute @lynnlangit !

Regarding next steps:

  • We'll make a branch named multicloud and contributors can send pull requests from their forks to that branch.
    • As pull requests get merged there, we can use the code in the multicloud branch to test on the various clouds.
    • As mentioned by @sjfleming, one set of WDLs that are agnostic to the backend is the ideal end state.
  • We'll specify some recommended data for testing and validation.
    • To start, the CellProfiler methods developers have let us know a recommended plate to use from their recent data release.
    • We'll post to branch multicloud some cloud-specific inputs.json files for that plate and document where to find the expected outputs for validation.
  • We'll file some GitHub issues for specific changes needed to update these workflows to be agnostic to the backend.
    • Contributors can chime in on those issues to let others know they are working on it for a particular cloud to avoid duplicate work.
  • But the details in the GitHub issues are just suggestions! There are multiple ways to update these workflows to be agnostic to the backend.

@deflaux (Contributor) commented Mar 22, 2023

@lynnlangit we've completed:

  1. creating a branch for multi-cloud pull request contributions and testing
  2. choosing some specific data for testing and validating the results on GCP, along with corresponding inputs.json files for AWS and GCP.
  3. filing some GitHub issues with concrete suggestions for next steps for one possible way to make these workflows multi-cloud

Is there any other information we can provide to you at this time? Thank you!
