Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding kneaddata pipeline #4

Closed
wants to merge 69 commits into from

Conversation

tanaes
Copy link
Contributor

@tanaes tanaes commented Sep 29, 2016

Just an empty initial commit with some file stubs for kneaddata

# --trimmomatic # path to trimmomatic executable
# --bowtie2 # path to bowtie executable
# --bmtagger # path to bmtagger exectuable
# --trf # path to TRF executable
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how these executable paths get specified. I assume that's what line 19-23 are referring to, and that we just need to have an executable in the $PATH?


# Trimmomatic options
'max-memory': ['integer', '500'], # max memory in mb [ DEFAULT : 500 ]
'trimmomatic-options': ['string', 'ILLUMINACLIP:$trimmomatic/adapters/'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This depends on a hardcoded filepath. How can we specify this given this option formatting?


# FastQC options
}
outputs = {'per_sample_FASTQ': 'per_sample_FASTQ'}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KneadData should produce a lot of outputs, and the quantity will depend on the other options specified (e.g. if you provide multiple reference dbs, or skip fastQC). how can we deal with this?

# --bowtie2 # path to bowtie executable
# --bmtagger # path to bmtagger exectuable
# --trf # path to TRF executable
'reference-db': ['choice:["human_genome"]', 'human_genome'], # ref db
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we may want to specify additional reference DBs by including multiple iterations of this flag, one per ref (examples might include PhiX + Human + Mouse). How can we make that work?

@coveralls
Copy link

Coverage Status

Coverage increased (+2.9%) to 95.455% when pulling a22922d on tanaes:adding-kneaddata-pipeline into a35abcf on qiita-spots:master.

@tanaes
Copy link
Contributor Author

tanaes commented Oct 17, 2016

Ok! Current commit has functional but basic KneadData implementation passing tests.

@tanaes
Copy link
Contributor Author

tanaes commented Oct 18, 2016

@antgonza @josenavas @wasade Hey guys! I think I've got the KneadData pipeline more or less put together. Would you mind doing a code review so we can get this merged in time for the the Qiita workshop next week?

Note also that I'm putting together some additional functionality in other parts of the shotgun plugin that I think would be beneficial to the workshop. If you have a chance, it would be awesome if you had a chance to look at those too!

Copy link

@wasade wasade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few comments

@@ -86,15 +89,32 @@ def generate_humann2_analysis_commands(forward_seqs, reverse_seqs, map_file,
% ', '.join(sn_by_rp.keys()))

cmds = []
params = ['--%s "%s"' % (k, v) for k, v in viewitems(parameters) if v]
params = ['--%s "%s"' % (k, v) if v is not True else '--%s' % k
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if v is not True is the "same" as if not v. However, because of the conditional if v following the for statement, will if v is not True ever evaluate True?

msg = "Step 3 of 5: Executing HUMANn2 job (%d/{0})".format(len(commands))
success, msg = _run_commands(qclient, job_id, commands, msg)
if not success:
return False, None, msg
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are these?

for rp in sn_by_rp:
if f_fn.startswith(rp) and run_prefix is None:
run_prefix = rp
elif f_fn.startswith(rp) and run_prefix is not None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this indicative of a problem in the mapping file? if so, should it be validated in get_sample_names_by_run_prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per above, the mapping file is currently not aware of exactly what the file is going to be called when it looks for it on the filesystem, so validation has to happen when looking at the sequences themselves.

used_prefixes = []
for i, f_fp in enumerate(forward_seqs):
# f_fp is the fwd read filepath
f_fn = basename(f_fp)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a lot of calls to startswith below, would it make sense to splitext on this as well so that a direct comparison can be made? it would also allow for representing sn_by_rp as a dict and doing something like:

run_prefix = sn_by_rp.get(f_fn)

if run_prefix is None:
    # complain

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that there is a bunch of text that sometimes gets appended after the run prefix and before that file extension.

In general though I think this is a pretty weak part of the system, and liable to cause us problems. I think we would be best served to move towards doing our own demultiplexing into rational and programmatically generated filenames.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, that's annoying.

raise ValueError('No run prefix matching this fwd read: %s'
% f_fn)

if run_prefix in used_prefixes:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can used_prefixes be a set? this will be a list lookup, and while the number of entries probably wont get too large, it's good practice to use a better lookup structure when the thing being searched over is expected to reasonably be > 5 elements

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(rule of thumb)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!


used_prefixes.append(run_prefix)
# create the tuple for this read set
if reverse_seqs:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...with the forloop change then this if statement becomes unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

for param in sorted(parameters):
value = parameters[param]

if str(value) == 'True':
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does if value is True not work...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue I was trying to get around with this was problems with integer params of 1 and 0 causing incorrect behavior, but I can see that this is preferable as long as we can depend on Qiita passing True and not 'True' in the parameter dict.

list of str
The run prefixes used


Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space


param_string = format_kneaddata_params(parameters)
prefixes = []
for run_prefix, sample, f_fp, r_fp in samples:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it ever the case that some samples will have reverse reads and others will not within the same study? right now the forloop implicitly supports this but I don't think that is the case and open the door for a subtle bug in the future as code fluxes over time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and it's a big pain in the ass. This frequently happens with shallow shotgun runs like we have been getting a lot of recently, where the quality is not so good and you end up with the reverse read being trimmed to extinction.

I agree that this kind of thing can set us up for problems and we should keep aware of it! Perhaps for early QC tools like kneaddata we can enforce proper paired reads (assuming no trimming so everything should be consistent) while downstream tools can be more flexible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not forcing paring makes sense and more cause it sounds like is something that is already happening. What about adding a comment here, perhaps in the Notes and on this line so future coders/reviewers are aware of this. Additionally, we could add a test for this but I don't feel strongly about this addition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, adding that comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think that helps. Perhaps adding a comment on top of the for loop, like # see notes before changing!, might be good.

ArtifactInfo('clean unpaired R2', 'per_sample_FASTQ',
[(join(sam_out_dir, '%s_unmatched_2.fastq' % prefix),
'per_sample_FASTQ')])]
ainfo += sam_a
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the intent for an append or an extend? would it be possible to use the desired method instead of an overloaded in place operator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, i did not know about the extend method!

@jdereus
Copy link
Contributor

jdereus commented Oct 18, 2016

Can one of the admins verify this patch?

@tanaes
Copy link
Contributor Author

tanaes commented Oct 19, 2016

@jdereus @josenavas Hey guys jenkins seems not to be noticing this PR... Is there something I need to do on my end to enable it?

@jdereus
Copy link
Contributor

jdereus commented Oct 19, 2016

ok to test

Copy link
Member

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really nice!! I think once the comments are addressed this is ready to go. Thanks @tanaes

@@ -39,7 +45,7 @@ before_script:
script:
- sleep 10 # give enough time to the webserver to start
- configure_shotgun --env-script "source activate qp-shotgun" --server-cert $QIITA_SERVER_CERT
- nosetests --with-doctest --with-coverage
- nosetests --with-doctest --with-coverage -e humann2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, didn't know this was possible, I guess it makes sense. Could you pull from master and remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will do on final upload!


param_string = format_kneaddata_params(parameters)
prefixes = []
for run_prefix, sample, f_fp, r_fp in samples:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not forcing paring makes sense and more cause it sounds like is something that is already happening. What about adding a comment here, perhaps in the Notes and on this line so future coders/reviewers are aware of this. Additionally, we could add a test for this but I don't feel strongly about this addition.

Returns
-------
samples: list of tup
list of 3-tuples with run prefix, sample name, fwd read fp, rev read fp
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a catch so that if the run prefix isn't in the mapping file that it assumes the run prefix is equal to the full file name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea but I think that change will change the behavior compared to other plugins, which points that perhaps we should have a method to do this in qiita_client vs. having ad hoc functions. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree in principle! Probably this is going to change if we make a more sensible fastq file format or artifact structure, though, so maybe wait until then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, could you add an issue so we don't forget ... thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be covered by qiita-spots/qtp-sequencing#8?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure as that doesn't deal with run_prefix as the full name, perhaps pointing to this discussion, adding a note or both will make it easier to follow up.

'especially data from microbiome experiments.')

# Define the HUMAnN2 command
req_params = {'input': ('artifact', ['per_sample_FASTQ'])}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we specify that it needs forward and (or or) reverse fastqs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code will always return all the available ones, in your code you need to check if they exist, for example this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I get it... Currently this check occurs later on in processing.

Copy link
Contributor

@josenavas josenavas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tanaes ! Some comments - I understand that Qiita has some limitations and I would be happy to talk with you to see what can we do to remove those limitations, but it is possible that some of them can't be removed so the plugin needs to be adapted.

- source activate qiita_env
- export CONDA_BIN=/home/travis/miniconda3/bin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you are using this anymore, but this no longer running on travis so this path is incorrect. Check this line for an idea on how to fix it, unless you don't need anymore, in which case can be removed.

__all__ = ['kneaddata']

# Initialize the plugin
plugin = QiitaPlugin(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is going to work. Each repository should have a single "QiitaPlugin" instance. In case that we want to have multiple "QiitaPlugin" instances we need a configure and a start_* script for each of them, in which case I don't see the benefit of having all these tools in a single repository. I think this can be added as a command to the shotgun plugin.

'run-trf': ['boolean', 'False'], # run TRF repeat finder tool
'run-fastqc-start': ['boolean', 'True'], # run FastQC on original data
'run-fastqc-end': ['boolean', 'True'], # run FastQC on filtered data
'store-temp-output': ['boolean', 'False'], # store temp output files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this parameter needs to be exposed. In which cases are we going to let the user store the temp output?

'DEBUG'],

# Trimmomatic options
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need more info about this parameter. In general we don't expose these types of parameters to the user since it is system dependent (and we don't want the users to request, let's say, 3TB...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, for admin and this is required by JAVA! 😭


# Trimmomatic options
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ]
'trimmomatic': ['string', '$TRIMMOMATIC_DIR'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter and the one below shouldn't be exposed here. The parameters exposed here are parameters that are going to be eventually exposed to the user. I this case, this parameter should never be changed (i.e. it is set up at installation time and then it always is the same).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this.


if value is True:
params.append('--%s' % param)
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since these are if/elif, the continue statements are not needed - can you remove them?

prefixes = []
for run_prefix, sample, f_fp, r_fp in samples:
prefixes.append(run_prefix)
if r_fp is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but I think this if/else block can be reduced to:

r_fp_str = ' --input "%s"' % r_fp if r_fp is not None else ""
cmds.append('kneaddata --input "%s"%s --output "%s" --output-prefix "%s" %s'
            % (f_fp, r_fp_str, join(out_dir, run_prefix), run_prefix, param_string))

to avoid some code duplication

return True, ""


def _per_sample_ainfo(out_dir, samples):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is adding 1 artifact per output file - I don't think this is the desired behavior right? I think what we want to get is a single artifact with all the sequence file. Note that the number of output artifacts from a command should be deterministic (i.e. it is known in advance how many artifacts a command is going to generate). Note that artifact != file so it shouldn't be a problem.

Now, extending on that, it looks like we can potentially group the output files in 3 different artifacts: (1) clean paired reads with R1 and R2, (2) clean unpaired R1, and (3) clean unpaired R2. However, this doesn't work for the case in which R2 is not provided, since we only have 1 output artifact. To solve this case, I would add 2 commands, 1 for paired reads and 1 for unpaired reads. Note that internally you can be using the same functions if you want, but from the Qiita point of view this simplifies things as the output artifacts are known in advance.

To create the 3 artifacts for the paired case, the code would be something like:

paired_files = []
r1_files = []
r2_files = []
for run_prefix, sample, f_fp, r_fp in samples:
    sam_out_dir = join(out_dir, run_prefix)
    paired_files.append((join(sam_out_dir, '%s_paired_1.fastq' % run_prefix), 'preprocessed_fastq'))
    paired_files.append((join(sam_out_dir, '%s_paired_2.fastq' % run_prefix), 'preprocessed_fastq'))
    r1_files.append((join(sam_out_dir, '%s_unmatched_1.fastq' % run_prefix), 'preprocessed_fastq'))
    r2_files.append((join(sam_out_dir, '%s_unmatched_2.fastq' % run_prefix), 'preprocessed_fastq'))

ainfo = [ArtifactInfo('clean paired', 'per_sample_FASTQ', paired_files),
         ArtifactInfo('clean unmatched R1', 'per_sample_FASTQ', r1_files),
         ArtifactInfo('clean unmatched R1', 'per_sample_FASTQ', r2_files)

Note that the definition of the command in the __init__.py file would need to be update to show that this command outputs 3 artifacts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I understand better now how this works. How can we retain information about the readdirectionality in the paired reads then given the current structure? Is there something besides 'preprocessed_fastq'?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a totally valid point and it shows how much we need a small type decorator system that will allow this differentiation.

The results of the job
"""
# Step 1 get the rest of the information need to run kneaddata
qclient.update_job_step(job_id, "Step 1 of 3: Collecting information")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Step 1 of 5

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K

@@ -21,7 +22,9 @@ def config(env_script, server_cert):
"""Generates the Qiita configuration files"""
if server_cert == 'None':
server_cert = None
plugin.generate_config(env_script, 'start_shotgun',
kd_plugin.generate_config(env_script, 'start_shotgun',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this may work for the configure case, it will not work for the start_shotgun case, different scripts will be needed. If that is the route, I would suggest creating 2 different repos, but I think it will be easier to have a single plugin here in which we add the different commands, rather than creating a plugin per command.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K

Copy link
Member

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a few minutes I'll create a new PR that has these changes + addressing @josenavas comments + fixing all errors. So closing this one.

# Trimmomatic options
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ]
'trimmomatic': ['string', '$TRIMMOMATIC_DIR'],
'trimmomatic-options': ['string', 'ILLUMINACLIP:$TRIMMOMATIC_DIR/adapters/'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this so admins can change it.

'DEBUG'],

# Trimmomatic options
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, for admin and this is required by JAVA! 😭


# Trimmomatic options
'max-memory': ['string', '500m'], # max memory in mb [ DEFAULT : 500m ]
'trimmomatic': ['string', '$TRIMMOMATIC_DIR'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this.

}
outputs = {'per_sample_FASTQ': 'per_sample_FASTQ'}
dflt_param_set = {
'Defaults': {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

The results of the job
"""
# Step 1 get the rest of the information need to run kneaddata
qclient.update_job_step(job_id, "Step 1 of 3: Collecting information")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K

@@ -21,7 +22,9 @@ def config(env_script, server_cert):
"""Generates the Qiita configuration files"""
if server_cert == 'None':
server_cert = None
plugin.generate_config(env_script, 'start_shotgun',
kd_plugin.generate_config(env_script, 'start_shotgun',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K

@antgonza antgonza closed this Nov 2, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants