-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding kneaddata pipeline #4
Conversation
Updating to latest pull request version
# --trimmomatic # path to trimmomatic executable | ||
# --bowtie2 # path to bowtie executable | ||
# --bmtagger # path to bmtagger exectuable | ||
# --trf # path to TRF executable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how these executable paths get specified. I assume that's what line 19-23 are referring to, and that we just need to have an executable in the $PATH?
|
||
# Trimmomatic options | ||
'max-memory': ['integer', '500'], # max memory in mb [ DEFAULT : 500 ] | ||
'trimmomatic-options': ['string', 'ILLUMINACLIP:$trimmomatic/adapters/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This depends on a hardcoded filepath. How can we specify this given this option formatting?
|
||
# FastQC options | ||
} | ||
outputs = {'per_sample_FASTQ': 'per_sample_FASTQ'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KneadData should produce a lot of outputs, and the quantity will depend on the other options specified (e.g. if you provide multiple reference dbs, or skip fastQC). how can we deal with this?
# --bowtie2 # path to bowtie executable | ||
# --bmtagger # path to bmtagger exectuable | ||
# --trf # path to TRF executable | ||
'reference-db': ['choice:["human_genome"]', 'human_genome'], # ref db |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we may want to specify additional reference DBs by including multiple iterations of this flag, one per ref (examples might include PhiX + Human + Mouse). How can we make that work?
…to adding-kneaddata-pipeline Updating to Antonio's new code
Ok! Current commit has functional but basic KneadData implementation passing tests. |
@antgonza @josenavas @wasade Hey guys! I think I've got the KneadData pipeline more or less put together. Would you mind doing a code review so we can get this merged in time for the the Qiita workshop next week? Note also that I'm putting together some additional functionality in other parts of the shotgun plugin that I think would be beneficial to the workshop. If you have a chance, it would be awesome if you had a chance to look at those too! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few comments
@@ -86,15 +89,32 @@ def generate_humann2_analysis_commands(forward_seqs, reverse_seqs, map_file, | |||
% ', '.join(sn_by_rp.keys())) | |||
|
|||
cmds = [] | |||
params = ['--%s "%s"' % (k, v) for k, v in viewitems(parameters) if v] | |||
params = ['--%s "%s"' % (k, v) if v is not True else '--%s' % k |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if v is not True
is the "same" as if not v
. However, because of the conditional if v
following the for statement, will if v is not True
ever evaluate True?
msg = "Step 3 of 5: Executing HUMANn2 job (%d/{0})".format(len(commands)) | ||
success, msg = _run_commands(qclient, job_id, commands, msg) | ||
if not success: | ||
return False, None, msg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are these?
for rp in sn_by_rp: | ||
if f_fn.startswith(rp) and run_prefix is None: | ||
run_prefix = rp | ||
elif f_fn.startswith(rp) and run_prefix is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this indicative of a problem in the mapping file? if so, should it be validated in get_sample_names_by_run_prefix
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per above, the mapping file is currently not aware of exactly what the file is going to be called when it looks for it on the filesystem, so validation has to happen when looking at the sequences themselves.
used_prefixes = [] | ||
for i, f_fp in enumerate(forward_seqs): | ||
# f_fp is the fwd read filepath | ||
f_fn = basename(f_fp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are a lot of calls to startswith
below, would it make sense to splitext
on this as well so that a direct comparison can be made? it would also allow for representing sn_by_rp
as a dict
and doing something like:
run_prefix = sn_by_rp.get(f_fn)
if run_prefix is None:
# complain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that there is a bunch of text that sometimes gets appended after the run prefix and before that file extension.
In general though I think this is a pretty weak part of the system, and liable to cause us problems. I think we would be best served to move towards doing our own demultiplexing into rational and programmatically generated filenames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, that's annoying.
raise ValueError('No run prefix matching this fwd read: %s' | ||
% f_fn) | ||
|
||
if run_prefix in used_prefixes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can used_prefixes
be a set
? this will be a list lookup, and while the number of entries probably wont get too large, it's good practice to use a better lookup structure when the thing being searched over is expected to reasonably be > 5 elements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(rule of thumb)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok!
|
||
used_prefixes.append(run_prefix) | ||
# create the tuple for this read set | ||
if reverse_seqs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...with the forloop change then this if statement becomes unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
for param in sorted(parameters): | ||
value = parameters[param] | ||
|
||
if str(value) == 'True': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does if value is True
not work...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue I was trying to get around with this was problems with integer params of 1 and 0 causing incorrect behavior, but I can see that this is preferable as long as we can depend on Qiita passing True and not 'True' in the parameter dict.
list of str | ||
The run prefixes used | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra space
|
||
param_string = format_kneaddata_params(parameters) | ||
prefixes = [] | ||
for run_prefix, sample, f_fp, r_fp in samples: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it ever the case that some samples will have reverse reads and others will not within the same study? right now the forloop implicitly supports this but I don't think that is the case and open the door for a subtle bug in the future as code fluxes over time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and it's a big pain in the ass. This frequently happens with shallow shotgun runs like we have been getting a lot of recently, where the quality is not so good and you end up with the reverse read being trimmed to extinction.
I agree that this kind of thing can set us up for problems and we should keep aware of it! Perhaps for early QC tools like kneaddata we can enforce proper paired reads (assuming no trimming so everything should be consistent) while downstream tools can be more flexible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not forcing paring makes sense and more cause it sounds like is something that is already happening. What about adding a comment here, perhaps in the Notes and on this line so future coders/reviewers are aware of this. Additionally, we could add a test for this but I don't feel strongly about this addition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, adding that comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think that helps. Perhaps adding a comment on top of the for loop, like # see notes before changing!
, might be good.
ArtifactInfo('clean unpaired R2', 'per_sample_FASTQ', | ||
[(join(sam_out_dir, '%s_unmatched_2.fastq' % prefix), | ||
'per_sample_FASTQ')])] | ||
ainfo += sam_a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the intent for an append
or an extend
? would it be possible to use the desired method instead of an overloaded in place operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, i did not know about the extend method!
Can one of the admins verify this patch? |
@jdereus @josenavas Hey guys jenkins seems not to be noticing this PR... Is there something I need to do on my end to enable it? |
ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really nice!! I think once the comments are addressed this is ready to go. Thanks @tanaes
@@ -39,7 +45,7 @@ before_script: | |||
script: | |||
- sleep 10 # give enough time to the webserver to start | |||
- configure_shotgun --env-script "source activate qp-shotgun" --server-cert $QIITA_SERVER_CERT | |||
- nosetests --with-doctest --with-coverage | |||
- nosetests --with-doctest --with-coverage -e humann2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, didn't know this was possible, I guess it makes sense. Could you pull from master and remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will do on final upload!
|
||
param_string = format_kneaddata_params(parameters) | ||
prefixes = [] | ||
for run_prefix, sample, f_fp, r_fp in samples: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not forcing paring makes sense and more cause it sounds like is something that is already happening. What about adding a comment here, perhaps in the Notes and on this line so future coders/reviewers are aware of this. Additionally, we could add a test for this but I don't feel strongly about this addition.
Returns | ||
------- | ||
samples: list of tup | ||
list of 3-tuples with run prefix, sample name, fwd read fp, rev read fp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add a catch so that if the run prefix isn't in the mapping file that it assumes the run prefix is equal to the full file name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea but I think that change will change the behavior compared to other plugins, which points that perhaps we should have a method to do this in qiita_client vs. having ad hoc functions. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree in principle! Probably this is going to change if we make a more sensible fastq file format or artifact structure, though, so maybe wait until then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, could you add an issue so we don't forget ... thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be covered by qiita-spots/qtp-sequencing#8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure as that doesn't deal with run_prefix as the full name, perhaps pointing to this discussion, adding a note or both will make it easier to follow up.
'especially data from microbiome experiments.') | ||
|
||
# Define the HUMAnN2 command | ||
req_params = {'input': ('artifact', ['per_sample_FASTQ'])} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we specify that it needs forward and (or or) reverse fastqs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code will always return all the available ones, in your code you need to check if they exist, for example this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I get it... Currently this check occurs later on in processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tanaes ! Some comments - I understand that Qiita has some limitations and I would be happy to talk with you to see what can we do to remove those limitations, but it is possible that some of them can't be removed so the plugin needs to be adapted.
- source activate qiita_env | ||
- export CONDA_BIN=/home/travis/miniconda3/bin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if you are using this anymore, but this no longer running on travis so this path is incorrect. Check this line for an idea on how to fix it, unless you don't need anymore, in which case can be removed.
__all__ = ['kneaddata'] | ||
|
||
# Initialize the plugin | ||
plugin = QiitaPlugin( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is going to work. Each repository should have a single "QiitaPlugin" instance. In case that we want to have multiple "QiitaPlugin" instances we need a configure and a start_* script for each of them, in which case I don't see the benefit of having all these tools in a single repository. I think this can be added as a command to the shotgun plugin.
'run-trf': ['boolean', 'False'], # run TRF repeat finder tool | ||
'run-fastqc-start': ['boolean', 'True'], # run FastQC on original data | ||
'run-fastqc-end': ['boolean', 'True'], # run FastQC on filtered data | ||
'store-temp-output': ['boolean', 'False'], # store temp output files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this parameter needs to be exposed. In which cases are we going to let the user store the temp output?
'DEBUG'], | ||
|
||
# Trimmomatic options | ||
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I need more info about this parameter. In general we don't expose these types of parameters to the user since it is system dependent (and we don't want the users to request, let's say, 3TB...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, for admin and this is required by JAVA! 😭
|
||
# Trimmomatic options | ||
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ] | ||
'trimmomatic': ['string', '$TRIMMOMATIC_DIR'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parameter and the one below shouldn't be exposed here. The parameters exposed here are parameters that are going to be eventually exposed to the user. I this case, this parameter should never be changed (i.e. it is set up at installation time and then it always is the same).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this.
|
||
if value is True: | ||
params.append('--%s' % param) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since these are if/elif, the continue
statements are not needed - can you remove them?
prefixes = [] | ||
for run_prefix, sample, f_fp, r_fp in samples: | ||
prefixes.append(run_prefix) | ||
if r_fp is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, but I think this if/else block can be reduced to:
r_fp_str = ' --input "%s"' % r_fp if r_fp is not None else ""
cmds.append('kneaddata --input "%s"%s --output "%s" --output-prefix "%s" %s'
% (f_fp, r_fp_str, join(out_dir, run_prefix), run_prefix, param_string))
to avoid some code duplication
return True, "" | ||
|
||
|
||
def _per_sample_ainfo(out_dir, samples): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is adding 1 artifact per output file - I don't think this is the desired behavior right? I think what we want to get is a single artifact with all the sequence file. Note that the number of output artifacts from a command should be deterministic (i.e. it is known in advance how many artifacts a command is going to generate). Note that artifact != file so it shouldn't be a problem.
Now, extending on that, it looks like we can potentially group the output files in 3 different artifacts: (1) clean paired reads with R1 and R2, (2) clean unpaired R1, and (3) clean unpaired R2. However, this doesn't work for the case in which R2 is not provided, since we only have 1 output artifact. To solve this case, I would add 2 commands, 1 for paired reads and 1 for unpaired reads. Note that internally you can be using the same functions if you want, but from the Qiita point of view this simplifies things as the output artifacts are known in advance.
To create the 3 artifacts for the paired case, the code would be something like:
paired_files = []
r1_files = []
r2_files = []
for run_prefix, sample, f_fp, r_fp in samples:
sam_out_dir = join(out_dir, run_prefix)
paired_files.append((join(sam_out_dir, '%s_paired_1.fastq' % run_prefix), 'preprocessed_fastq'))
paired_files.append((join(sam_out_dir, '%s_paired_2.fastq' % run_prefix), 'preprocessed_fastq'))
r1_files.append((join(sam_out_dir, '%s_unmatched_1.fastq' % run_prefix), 'preprocessed_fastq'))
r2_files.append((join(sam_out_dir, '%s_unmatched_2.fastq' % run_prefix), 'preprocessed_fastq'))
ainfo = [ArtifactInfo('clean paired', 'per_sample_FASTQ', paired_files),
ArtifactInfo('clean unmatched R1', 'per_sample_FASTQ', r1_files),
ArtifactInfo('clean unmatched R1', 'per_sample_FASTQ', r2_files)
Note that the definition of the command in the __init__.py
file would need to be update to show that this command outputs 3 artifacts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I understand better now how this works. How can we retain information about the readdirectionality in the paired reads then given the current structure? Is there something besides 'preprocessed_fastq'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a totally valid point and it shows how much we need a small type decorator system that will allow this differentiation.
The results of the job | ||
""" | ||
# Step 1 get the rest of the information need to run kneaddata | ||
qclient.update_job_step(job_id, "Step 1 of 3: Collecting information") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Step 1 of 5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K
@@ -21,7 +22,9 @@ def config(env_script, server_cert): | |||
"""Generates the Qiita configuration files""" | |||
if server_cert == 'None': | |||
server_cert = None | |||
plugin.generate_config(env_script, 'start_shotgun', | |||
kd_plugin.generate_config(env_script, 'start_shotgun', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this may work for the configure case, it will not work for the start_shotgun
case, different scripts will be needed. If that is the route, I would suggest creating 2 different repos, but I think it will be easier to have a single plugin here in which we add the different commands, rather than creating a plugin per command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a few minutes I'll create a new PR that has these changes + addressing @josenavas comments + fixing all errors. So closing this one.
# Trimmomatic options | ||
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ] | ||
'trimmomatic': ['string', '$TRIMMOMATIC_DIR'], | ||
'trimmomatic-options': ['string', 'ILLUMINACLIP:$TRIMMOMATIC_DIR/adapters/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this so admins can change it.
'DEBUG'], | ||
|
||
# Trimmomatic options | ||
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, for admin and this is required by JAVA! 😭
|
||
# Trimmomatic options | ||
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ] | ||
'trimmomatic': ['string', '$TRIMMOMATIC_DIR'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this.
} | ||
outputs = {'per_sample_FASTQ': 'per_sample_FASTQ'} | ||
dflt_param_set = { | ||
'Defaults': { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
The results of the job | ||
""" | ||
# Step 1 get the rest of the information need to run kneaddata | ||
qclient.update_job_step(job_id, "Step 1 of 3: Collecting information") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K
@@ -21,7 +22,9 @@ def config(env_script, server_cert): | |||
"""Generates the Qiita configuration files""" | |||
if server_cert == 'None': | |||
server_cert = None | |||
plugin.generate_config(env_script, 'start_shotgun', | |||
kd_plugin.generate_config(env_script, 'start_shotgun', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K
Just an empty initial commit with some file stubs for kneaddata