Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TSPS-417 Support environment-specific pipeline input values #188

Merged
merged 11 commits into from
Jan 29, 2025

Conversation

mmorgantaylor
Copy link
Collaborator

@mmorgantaylor mmorgantaylor commented Jan 14, 2025

Description

Here we add support for service-provided pipeline inputs that have environment-specific values (e.g. the reference panel path prefix for imputation). We achieve this via config values that pull from secrets defined in terra-helmfile.

The logic for formatting the pipeline inputs was largely refactored out of the PrepareImputationInputsStep into a method in PipelineInputsOutputsService. Also refactored some other stuff in PipelinesService to PipelineInputsOutputsService.

One functional change is that previously we were not adding in the default value of optional inputs, if the user did not provide a value. (Imputation doesn't have any optional inputs.) With this PR, we now add the default value of non-provided optional inputs to the formatted pipeline inputs.

terra-helmfile PR to define the IMPUTATION_REFERENCE_PANEL_PATH_PREFIX_CUSTOM_VALUE secret: https://github.com/broadinstitute/terra-helmfile/pull/6014

Logs when running locally showing the gathered and formatted inputs:

2025-01-23T15:20:43.192-05:00 pPAJ7J46  INFO 65434 --- [pipelines] [pool-3-thread-2] b.t.p.s.PipelineInputsOutputsService     : Populated raw pipeline inputs: {contigs=["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chr20","chr21","chr22"], refDict=/hg38/ref_dict/Homo_sapiens_assembly38.dict, geneticMapsPath=/hg38/plink-genetic-maps/, multiSampleVcf=../scratch/teaspoons_demo/NA12878_50_duplicate_renamed.clean.vcf.gz, outputBasename=NA12878_50_duplicate_renamed.clean, referencePanelPathPrefix=/hg38/ref_panels/1000G_HGDP_no_singletons/hgdp.tgp.gwaspy.AN_added.bcf.ac2}
2025-01-23T15:20:43.193-05:00 pPAJ7J46  INFO 65434 --- [pipelines] [pool-3-thread-2] b.t.p.s.PipelineInputsOutputsService     : Formatted pipeline inputs: {genetic_maps_path=gs://fc-secure-10efd4d7-392a-4e9e-89ea-d6629fbb06cc/hg38/plink-genetic-maps/, multi_sample_vcf=gs://fc-secure-6970c3a9-dc92-436d-af3d-917bcb4cf05a/user-input-files/c7ed0aab-289b-4d38-87c5-93cc99bec5e4/NA12878_50_duplicate_renamed.clean.vcf.gz, ref_dict=gs://fc-secure-10efd4d7-392a-4e9e-89ea-d6629fbb06cc/hg38/ref_dict/Homo_sapiens_assembly38.dict, output_basename=NA12878_50_duplicate_renamed.clean, contigs=[chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22], reference_panel_path_prefix=gs://fc-secure-10efd4d7-392a-4e9e-89ea-d6629fbb06cc/hg38/ref_panels/1000G_HGDP_no_singletons/hgdp.tgp.gwaspy.AN_added.bcf.ac2}
2025-01-23T15:20:43.193-05:00 pPAJ7J46  INFO 65434 --- [pipelines] [pool-3-thread-2] b.t.p.s.s.i.PrepareImputationInputsStep  : Constructed and formatted ARRAY_IMPUTATION pipeline inputs

Jira Ticket

https://broadworkbench.atlassian.net/browse/TSPS-417

Checklist (provide links to changes)

Copy link
Collaborator

@jsotobroad jsotobroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly what i was thinking as a solution for this so looks good to me. I did have a question about the logic not related to these changes (sorry!) but I could def be wrong about it.

if (keysToPrependWithStorageURL.contains(keyName)) {
rawValue = storageWorkspaceStorageContainerUrl + allPipelineInputs.get(keyName).toString();
processedValue = storageWorkspaceStorageContainerUrl + rawValue;
} else if (userProvidedInputFileKeys.contains(inputDefinition.getName())) {
Copy link
Collaborator

@jsotobroad jsotobroad Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this isnt apart of what you added but I looked at the code a bit and i think if someone provided an input with a key that matched one of inputs not expected from the user, this would overwrite it. I think for the imputation pipeline the only viable input would be contigs because the rest are either requried by the user or need to be prepended with teh storage URL (so they get caught in the above if block). But if someone passed in this value to the prepare API i think their contigs value would overwrite the default one here

  "pipelineInputs": {
    "outputBasename": "min_quota",
    "multiSampleVcf": "palantir_merged_input_samples.liftedover.vcf.gz",
    "contigs": "my value now!"
  },

I think this is becuase nowhere do we exclude any extra inputs from the user provided inputs, we just make sure they have the required entries.

i think https://github.com/DataBiosphere/terra-scientific-pipelines-service/blob/main/service/src/main/java/bio/terra/pipelines/service/PipelinesService.java#L179 coudl be changed to not only log but also remove the extra inputs from the Map

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does not - in pipelinesService.constructRawInputs (https://github.com/DataBiosphere/terra-scientific-pipelines-service/blob/main/service/src/main/java/bio/terra/pipelines/service/PipelinesService.java#L282), we do start with the user-provided inputs, but then we get all the service-provided inputs and add them, which will overwrite anything that the user provides. (we tested this and confirmed that a user can't override a service-provided input value.)

I personally don't mind that we store all the user inputs, even if they provide extra inputs (which they can't do via CLI but can do via API).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup yup makes sense. created https://broadworkbench.atlassian.net/browse/TSPS-427 to discuss if we want to deal with extra inputs in the service vs the cli (or some other option)

@mmorgantaylor mmorgantaylor force-pushed the TSPS-417_env_specific_inputs branch from 66e8df5 to 73911e1 Compare January 22, 2025 19:52
@mmorgantaylor mmorgantaylor marked this pull request as ready for review January 22, 2025 19:59
if (!inputsWithCustomValues.containsKey(keyName)
&& !allPipelineInputs.containsKey(keyName)
&& !inputDefinition.isRequired()) {
continue;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently we don't add default values for optional user-provided inputs to the pipeline inputs. (we don't have optional inputs in the imputation pipeline.) should we add those if they're present?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea i think that makes sense, every optional user-provided input should have a default in our database and that should be used here.

if (!inputsWithCustomValues.containsKey(keyName)
&& !allPipelineInputs.containsKey(keyName)
&& !inputDefinition.isRequired()) {
continue;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea i think that makes sense, every optional user-provided input should have a default in our database and that should be used here.

@@ -18,10 +18,10 @@
@SuppressWarnings("java:S107") // Disable "Methods should not have too many parameters"
public class PipelineInputDefinition extends BasePipelineVariableDefinition {
@Column(name = "is_required", nullable = false)
private Boolean isRequired;
private boolean isRequired;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the change to primitive boolean?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the new method in PipelinesService, when using inputDefinition.isRequired() when it was a Boolean, Sonar wanted me to use that annoying Boolean.equals() syntax to account for possible null values - but since both isRequired and userProvided are non-nullable, it seemed safe to convert to primitive boolean.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense


@Column(name = "user_provided", nullable = false)
private Boolean userProvided;
private boolean userProvided;

@Column(name = "default_value")
private String defaultValue; // must be a String representation of the value
Copy link
Collaborator Author

@mmorgantaylor mmorgantaylor Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsotobroad do you think we should remove the default value in the db for any custom inputs (i.e. refPanelPathPrefix)? either remove or put in "OVERRIDDEN BY CONFIG" as the default value or something?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would rather have the value it has now over "OVERRIDEN BY CONFIG" and i would rather have no value than the other options. If we do remove the value lets make sure we add it to the different environments (including bees/qa) before we merge.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah regardless we need to add to GSM before merge (and that's noted in the PR description, though not as a checklist item).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i lied, i thought it was in the PR description, but it's only in the terra-helmfile PR description. adding to checklist!

String processedValue;

if (keysToPrependWithStorageWorkspaceContainerUrl.contains(keyName)) {
// the rawValue for this field should start with a / so we don't need to add one here
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't really check this anywhere, not sure if we should add more logic for this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the / could come from the raw value or the storage workspace container url (probably safer to have it in the storage container URL value since its just one place per environment and is easier to update than a database migration change. I probably wouldnt put a check here but maybe a comment on where we define storage workspace container url in terra helmfile? and both having a / shouldnt produce an error AFAIK. In fact we could change it to processedValue = storageWorkspaceContainerUrl + "/" + rawValue; just to cover all bases if we want.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does GCP not care if you have an extra slash?

Copy link
Collaborator

@jsotobroad jsotobroad Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope you are totally right it does, i was thinking of http requests not stupid object storage paths. Hmm i kind of like having the / exist in either the code or the storageWorkspaceContainerUrl (basically in one place) instead of all the values in the database where it'll be added to.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a FileUtils method to take care of this

Copy link
Collaborator

@jsotobroad jsotobroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments of the semantic nature 😂 . The logic all looks good and this is definitely easier to understand now.


/**
* Format the pipeline inputs for a pipeline. Apply the following manipulations: - use custom
* (environment-specific) values for certain service-provided inputs - prepend the storage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we get these manipulations into a list format, like one line per manipulation. makes it easier to read i think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh yes this keeps getting reformatted poorly, will fix

String processedValue;

if (keysToPrependWithStorageWorkspaceContainerUrl.contains(keyName)) {
// the rawValue for this field should start with a / so we don't need to add one here
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the / could come from the raw value or the storage workspace container url (probably safer to have it in the storage container URL value since its just one place per environment and is easier to update than a database migration change. I probably wouldnt put a check here but maybe a comment on where we define storage workspace container url in terra helmfile? and both having a / shouldnt produce an error AFAIK. In fact we could change it to processedValue = storageWorkspaceContainerUrl + "/" + rawValue; just to cover all bases if we want.

private String storageWorkspaceStorageUrl;
private List<String> inputKeysToPrependWithStorageWorkspaceContainerUrl;
private String storageWorkspaceContainerUrl;
private Map<String, Object> inputsWithCustomValues;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to have this be Map<String, String> for now since we thats all we have so far and then if we want to have like a list we can make the necessary changes and test properly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also was thinking we may want to put some validation for these config values so we dont have to have a bunch of checks in the actual logic code. Particularly values that are coming from GSM/helmfile. It'll also give us better messages if we catch things here as opposed to the workflow fails cuz of some weird path thing. We can make a constructor similar to the one in LeonardoServerConfiguration and do some assertions there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually doesnt make sense after talkign about it, ignore me!

@@ -37,7 +33,7 @@

class PrepareImputationInputsStepTest extends BaseEmbeddedDbTest {

@Autowired private PipelinesService pipelinesService;
@Mock PipelineInputsOutputsService pipelineInputsOutputsService;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be mocked?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer it to be mocked since we test the logic of the gathering and formatting of inputs in the service test; for testing the step we just want to make sure it calls the method and puts the output into the working map.

// arguments: user-provided inputs, all pipeline input definitions, expected populated
// inputs
arguments( // one required user input
new HashMap<String, Object>(Map.of("inputName", "user provided value")),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for these arguments does just Map.of() not work? does it have to be wrapped in a HashMap<>() ? Same question about the List.of arguments

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah it does here. in the stairway step, they need to be HashMaps in order to be serialized/deserialized properly from the working map, but here just Map.of and List.of does work. I'll change that to simplify, thanks!

* optional user-provided inputs that weren't specified by the user with default values. Extra
* user inputs that are not defined in the pipeline are not included.
*
* <p>Note that this method does not perform any validation on the inputs. We expect the inputs to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for adding these expectations. I wish we could make these functions private or something to signify you shouldnt probably be calling these willy nilly but i dont think we can

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we can make them package-private and still have them testable

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice i like it!

PipelineVariableTypesEnum pipelineInputType = inputDefinition.getType();

// use custom value if present, otherwise use the value from raw inputs (allRawInputs)
String rawValue =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rawOrCustomValue?

new ArrayList<>(List.of("inputName")), // prepend this key with storage workspace url
new HashMap<String, Object>(
Map.of("input_name", "gs://storage-workspace-bucket/value"))),
arguments( // test casting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice to have one test where there is more than one item in some of the parameters just to get a more
"realistic" test

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do when we test the parent method, but I can add some in these ones too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a two-input-definition test case here and in the previous set


@Column(name = "user_provided", nullable = false)
private Boolean userProvided;
private boolean userProvided;

@Column(name = "default_value")
private String defaultValue; // must be a String representation of the value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would rather have the value it has now over "OVERRIDEN BY CONFIG" and i would rather have no value than the other options. If we do remove the value lets make sure we add it to the different environments (including bees/qa) before we merge.

* @param userProvidedPipelineInputs - the user-provided inputs
* @return Map<String, Object> allPipelineInputs - the combined inputs
*/
Map<String, Object> gatherRawInputs(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apparently not specifying a visibility defaults to package-private: https://docs.oracle.com/javase/tutorial/java/javaOO/accesscontrol.html

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup yup makes sense

Copy link
Collaborator

@jsotobroad jsotobroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple a questiosn around the config assertions and the path fixing method and the liquibase naming (you know i love to nit pick that :D). This looks really good though

this.storageWorkspaceContainerUrl = storageWorkspaceContainerUrl;

for (Map.Entry<String, String> entry : inputsWithCustomValues.entrySet()) {
if (entry.getValue() == null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also check if they it is empty - can prob use .isBlank()

*/
public static String constructFilePath(String basePath, String fileName) {
String pathDelimiter = "/";
if (basePath.endsWith(pathDelimiter) && fileName.startsWith(pathDelimiter)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can use the stripTrailing and stripLeading functions in stringutils to just get rid of all leading/trailing / then just add one ourselves

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that's nice! thanks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those methods don't take parameters and only remove whitespace, but we can achieve the same thing using replace.

this.inputKeysToPrependWithStorageWorkspaceContainerUrl =
inputKeysToPrependWithStorageWorkspaceContainerUrl;

if (storageWorkspaceContainerUrl.endsWith("/"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason why we're checking this? are there other places this could mess with us that we havent accounted for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there shouldn't be other places where this matters, but now that we're constructing paths more smartly this shouldn't be necessary. happy to remove it.

* optional user-provided inputs that weren't specified by the user with default values. Extra
* user inputs that are not defined in the pipeline are not included.
*
* <p>Note that this method does not perform any validation on the inputs. We expect the inputs to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice i like it!

* @param userProvidedPipelineInputs - the user-provided inputs
* @return Map<String, Object> allPipelineInputs - the combined inputs
*/
Map<String, Object> gatherRawInputs(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup yup makes sense

tableName: pipeline_input_definitions
columns:
- column:
name: is_custom_value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to have this be expects_custom_value or is_environment_specific or maybe something else. I feel like i immediately misintepretted this as the value it has is a custom value (which doesnt make sense so maybe not). I dont feel super strongly about it though

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i like expects_custom_value! i shied away from is_environment_specific in case there are future cases that warrant custom values that aren't necessarily env-specific.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Collaborator

@jsotobroad jsotobroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome thanks for all of this!

@mmorgantaylor mmorgantaylor merged commit c8b6504 into main Jan 29, 2025
15 checks passed
@mmorgantaylor mmorgantaylor deleted the TSPS-417_env_specific_inputs branch January 29, 2025 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants