Adding kneaddata pipeline #4

tanaes · 2016-09-29T16:12:45Z

Just an empty initial commit with some file stubs for kneaddata

Updating to latest pull request version

tanaes · 2016-09-29T17:12:06Z

qp_shotgun/kneaddata/__init__.py

+    # --trimmomatic # path to trimmomatic executable
+    # --bowtie2 # path to bowtie executable
+    # --bmtagger # path to bmtagger exectuable
+    # --trf # path to TRF executable


Not sure how these executable paths get specified. I assume that's what line 19-23 are referring to, and that we just need to have an executable in the $PATH?

tanaes · 2016-09-29T17:12:31Z

qp_shotgun/kneaddata/__init__.py

+
+    # Trimmomatic options
+    'max-memory': ['integer', '500'], # max memory in mb [ DEFAULT : 500 ]
+    'trimmomatic-options': ['string', 'ILLUMINACLIP:$trimmomatic/adapters/'


This depends on a hardcoded filepath. How can we specify this given this option formatting?

tanaes · 2016-09-29T17:14:31Z

qp_shotgun/kneaddata/__init__.py

+
+    # FastQC options
+    }
+outputs = {'per_sample_FASTQ': 'per_sample_FASTQ'}


KneadData should produce a lot of outputs, and the quantity will depend on the other options specified (e.g. if you provide multiple reference dbs, or skip fastQC). how can we deal with this?

tanaes · 2016-09-29T17:15:50Z

qp_shotgun/kneaddata/__init__.py

+    # --bowtie2 # path to bowtie executable
+    # --bmtagger # path to bmtagger exectuable
+    # --trf # path to TRF executable
+    'reference-db': ['choice:["human_genome"]', 'human_genome'], # ref db


Here we may want to specify additional reference DBs by including multiple iterations of this flag, one per ref (examples might include PhiX + Human + Mouse). How can we make that work?

…to adding-kneaddata-pipeline Updating to Antonio's new code

coveralls · 2016-10-17T07:21:09Z

Coverage increased (+2.9%) to 95.455% when pulling a22922d on tanaes:adding-kneaddata-pipeline into a35abcf on qiita-spots:master.

tanaes · 2016-10-17T07:23:01Z

Ok! Current commit has functional but basic KneadData implementation passing tests.

tanaes · 2016-10-18T03:16:59Z

@antgonza @josenavas @wasade Hey guys! I think I've got the KneadData pipeline more or less put together. Would you mind doing a code review so we can get this merged in time for the the Qiita workshop next week?

Note also that I'm putting together some additional functionality in other parts of the shotgun plugin that I think would be beneficial to the workshop. If you have a chance, it would be awesome if you had a chance to look at those too!

wasade

few comments

wasade · 2016-10-18T16:48:08Z

qp_shotgun/humann2/humann2.py

@@ -86,15 +89,32 @@ def generate_humann2_analysis_commands(forward_seqs, reverse_seqs, map_file,
            % ', '.join(sn_by_rp.keys()))

    cmds = []
-    params = ['--%s "%s"' % (k, v) for k, v in viewitems(parameters) if v]
+    params = ['--%s "%s"' % (k, v) if v is not True else '--%s' % k


if v is not True is the "same" as if not v. However, because of the conditional if v following the for statement, will if v is not True ever evaluate True?

wasade · 2016-10-18T17:46:58Z

qp_shotgun/humann2/humann2.py

+    msg = "Step 3 of 5: Executing HUMANn2 job (%d/{0})".format(len(commands))
+    success, msg = _run_commands(qclient, job_id, commands, msg)
+    if not success:
+        return False, None, msg


what are these?

wasade · 2016-10-18T17:51:31Z

qp_shotgun/kneaddata/kneaddata.py

+        for rp in sn_by_rp:
+            if f_fn.startswith(rp) and run_prefix is None:
+                run_prefix = rp
+            elif f_fn.startswith(rp) and run_prefix is not None:


is this indicative of a problem in the mapping file? if so, should it be validated in get_sample_names_by_run_prefix?

As per above, the mapping file is currently not aware of exactly what the file is going to be called when it looks for it on the filesystem, so validation has to happen when looking at the sequences themselves.

wasade · 2016-10-18T17:53:08Z

qp_shotgun/kneaddata/kneaddata.py

+    used_prefixes = []
+    for i, f_fp in enumerate(forward_seqs):
+        # f_fp is the fwd read filepath
+        f_fn = basename(f_fp)


there are a lot of calls to startswith below, would it make sense to splitext on this as well so that a direct comparison can be made? it would also allow for representing sn_by_rp as a dict and doing something like:

run_prefix = sn_by_rp.get(f_fn) if run_prefix is None: # complain

The issue is that there is a bunch of text that sometimes gets appended after the run prefix and before that file extension.

In general though I think this is a pretty weak part of the system, and liable to cause us problems. I think we would be best served to move towards doing our own demultiplexing into rational and programmatically generated filenames.

oh, that's annoying.

wasade · 2016-10-18T17:54:27Z

qp_shotgun/kneaddata/kneaddata.py

+            raise ValueError('No run prefix matching this fwd read: %s'
+                             % f_fn)
+
+        if run_prefix in used_prefixes:


can used_prefixes be a set? this will be a list lookup, and while the number of entries probably wont get too large, it's good practice to use a better lookup structure when the thing being searched over is expected to reasonably be > 5 elements

(rule of thumb)

wasade · 2016-10-18T17:57:43Z

qp_shotgun/kneaddata/kneaddata.py

+
+        used_prefixes.append(run_prefix)
+        # create the tuple for this read set
+        if reverse_seqs:


...with the forloop change then this if statement becomes unnecessary

wasade · 2016-10-18T17:58:05Z

qp_shotgun/kneaddata/kneaddata.py

+    for param in sorted(parameters):
+        value = parameters[param]
+
+        if str(value) == 'True':


does if value is True not work...?

The issue I was trying to get around with this was problems with integer params of 1 and 0 causing incorrect behavior, but I can see that this is preferable as long as we can depend on Qiita passing True and not 'True' in the parameter dict.

wasade · 2016-10-18T17:58:15Z

qp_shotgun/kneaddata/kneaddata.py

+    list of str
+        The run prefixes used
+
+


extra space

wasade · 2016-10-18T17:59:55Z

qp_shotgun/kneaddata/kneaddata.py

+
+    param_string = format_kneaddata_params(parameters)
+    prefixes = []
+    for run_prefix, sample, f_fp, r_fp in samples:


is it ever the case that some samples will have reverse reads and others will not within the same study? right now the forloop implicitly supports this but I don't think that is the case and open the door for a subtle bug in the future as code fluxes over time

Yes, and it's a big pain in the ass. This frequently happens with shallow shotgun runs like we have been getting a lot of recently, where the quality is not so good and you end up with the reverse read being trimmed to extinction.

I agree that this kind of thing can set us up for problems and we should keep aware of it! Perhaps for early QC tools like kneaddata we can enforce proper paired reads (assuming no trimming so everything should be consistent) while downstream tools can be more flexible?

I think not forcing paring makes sense and more cause it sounds like is something that is already happening. What about adding a comment here, perhaps in the Notes and on this line so future coders/reviewers are aware of this. Additionally, we could add a test for this but I don't feel strongly about this addition.

Good idea, adding that comment.

Thanks, I think that helps. Perhaps adding a comment on top of the for loop, like # see notes before changing!, might be good.

wasade · 2016-10-18T18:01:32Z

qp_shotgun/kneaddata/kneaddata.py

+            ArtifactInfo('clean unpaired R2', 'per_sample_FASTQ',
+                         [(join(sam_out_dir, '%s_unmatched_2.fastq' % prefix),
+                          'per_sample_FASTQ')])]
+        ainfo += sam_a


is the intent for an append or an extend? would it be possible to use the desired method instead of an overloaded in place operator?

Awesome, i did not know about the extend method!

jdereus · 2016-10-18T23:18:51Z

Can one of the admins verify this patch?

tanaes · 2016-10-19T01:16:13Z

@jdereus @josenavas Hey guys jenkins seems not to be noticing this PR... Is there something I need to do on my end to enable it?

jdereus · 2016-10-19T01:28:07Z

ok to test

antgonza

Looking really nice!! I think once the comments are addressed this is ready to go. Thanks @tanaes

antgonza · 2016-10-20T09:42:12Z

.travis.yml

@@ -39,7 +45,7 @@ before_script:
 script:
  - sleep 10 # give enough time to the webserver to start
  - configure_shotgun --env-script "source activate qp-shotgun" --server-cert $QIITA_SERVER_CERT
-  - nosetests --with-doctest --with-coverage
+  - nosetests --with-doctest --with-coverage -e humann2


Cool, didn't know this was possible, I guess it makes sense. Could you pull from master and remove it?

Yes, will do on final upload!

antgonza · 2016-10-20T09:48:05Z

qp_shotgun/kneaddata/kneaddata.py

+
+    param_string = format_kneaddata_params(parameters)
+    prefixes = []
+    for run_prefix, sample, f_fp, r_fp in samples:


I think not forcing paring makes sense and more cause it sounds like is something that is already happening. What about adding a comment here, perhaps in the Notes and on this line so future coders/reviewers are aware of this. Additionally, we could add a test for this but I don't feel strongly about this addition.

tanaes · 2016-10-20T22:36:33Z

qp_shotgun/kneaddata/kneaddata.py

+    Returns
+    -------
+    samples: list of tup
+        list of 3-tuples with run prefix, sample name, fwd read fp, rev read fp


Do we need to add a catch so that if the run prefix isn't in the mapping file that it assumes the run prefix is equal to the full file name?

Good idea but I think that change will change the behavior compared to other plugins, which points that perhaps we should have a method to do this in qiita_client vs. having ad hoc functions. What do you think?

Agree in principle! Probably this is going to change if we make a more sensible fastq file format or artifact structure, though, so maybe wait until then?

Yup, could you add an issue so we don't forget ... thanks!

Should be covered by qiita-spots/qtp-sequencing#8?

Not sure as that doesn't deal with run_prefix as the full name, perhaps pointing to this discussion, adding a note or both will make it easier to follow up.

tanaes · 2016-10-20T22:37:06Z

qp_shotgun/kneaddata/__init__.py

+    'especially data from microbiome experiments.')
+
+# Define the HUMAnN2 command
+req_params = {'input': ('artifact', ['per_sample_FASTQ'])}


How do we specify that it needs forward and (or or) reverse fastqs?

The code will always return all the available ones, in your code you need to check if they exist, for example this.

Ok I get it... Currently this check occurs later on in processing.

josenavas

Thanks @tanaes ! Some comments - I understand that Qiita has some limitations and I would be happy to talk with you to see what can we do to remove those limitations, but it is possible that some of them can't be removed so the plugin needs to be adapted.

josenavas · 2016-10-25T14:09:31Z

.travis.yml

  - source activate qiita_env
+  - export CONDA_BIN=/home/travis/miniconda3/bin


Not sure if you are using this anymore, but this no longer running on travis so this path is incorrect. Check this line for an idea on how to fix it, unless you don't need anymore, in which case can be removed.

josenavas · 2016-10-25T14:13:52Z

qp_shotgun/kneaddata/__init__.py

+__all__ = ['kneaddata']
+
+# Initialize the plugin
+plugin = QiitaPlugin(


I'm not sure if this is going to work. Each repository should have a single "QiitaPlugin" instance. In case that we want to have multiple "QiitaPlugin" instances we need a configure and a start_* script for each of them, in which case I don't see the benefit of having all these tools in a single repository. I think this can be added as a command to the shotgun plugin.

josenavas · 2016-10-25T14:15:06Z

qp_shotgun/kneaddata/__init__.py

+    'run-trf': ['boolean', 'False'],  # run TRF repeat finder tool
+    'run-fastqc-start': ['boolean', 'True'],  # run FastQC on original data
+    'run-fastqc-end': ['boolean', 'True'],  # run FastQC on filtered data
+    'store-temp-output': ['boolean', 'False'],  # store temp output files


I don't think this parameter needs to be exposed. In which cases are we going to let the user store the temp output?

josenavas · 2016-10-25T14:16:42Z

qp_shotgun/kneaddata/__init__.py

+                  'DEBUG'],
+
+    # Trimmomatic options
+    'max-memory': ['string', '500m'],  # max memory in mb [ DEFAULT : 500m ]


I think I need more info about this parameter. In general we don't expose these types of parameters to the user since it is system dependent (and we don't want the users to request, let's say, 3TB...)

Again, for admin and this is required by JAVA! 😭

josenavas · 2016-10-25T14:18:28Z

qp_shotgun/kneaddata/__init__.py

+
+    # Trimmomatic options
+    'max-memory': ['string', '500m'],  # max memory in mb [ DEFAULT : 500m ]
+    'trimmomatic': ['string', '$TRIMMOMATIC_DIR'],


This parameter and the one below shouldn't be exposed here. The parameters exposed here are parameters that are going to be eventually exposed to the user. I this case, this parameter should never be changed (i.e. it is set up at installation time and then it always is the same).

Removing this.

josenavas · 2016-10-25T14:24:22Z

qp_shotgun/kneaddata/kneaddata.py

+
+        if value is True:
+            params.append('--%s' % param)
+            continue


since these are if/elif, the continue statements are not needed - can you remove them?

josenavas · 2016-10-25T14:30:09Z

qp_shotgun/kneaddata/kneaddata.py

+    prefixes = []
+    for run_prefix, sample, f_fp, r_fp in samples:
+        prefixes.append(run_prefix)
+        if r_fp is None:


Minor, but I think this if/else block can be reduced to:

r_fp_str = ' --input "%s"' % r_fp if r_fp is not None else "" cmds.append('kneaddata --input "%s"%s --output "%s" --output-prefix "%s" %s' % (f_fp, r_fp_str, join(out_dir, run_prefix), run_prefix, param_string))

to avoid some code duplication

josenavas · 2016-10-25T14:44:15Z

qp_shotgun/kneaddata/kneaddata.py

+    return True, ""
+
+
+def _per_sample_ainfo(out_dir, samples):


This is adding 1 artifact per output file - I don't think this is the desired behavior right? I think what we want to get is a single artifact with all the sequence file. Note that the number of output artifacts from a command should be deterministic (i.e. it is known in advance how many artifacts a command is going to generate). Note that artifact != file so it shouldn't be a problem.

Now, extending on that, it looks like we can potentially group the output files in 3 different artifacts: (1) clean paired reads with R1 and R2, (2) clean unpaired R1, and (3) clean unpaired R2. However, this doesn't work for the case in which R2 is not provided, since we only have 1 output artifact. To solve this case, I would add 2 commands, 1 for paired reads and 1 for unpaired reads. Note that internally you can be using the same functions if you want, but from the Qiita point of view this simplifies things as the output artifacts are known in advance.

To create the 3 artifacts for the paired case, the code would be something like:

paired_files = [] r1_files = [] r2_files = [] for run_prefix, sample, f_fp, r_fp in samples: sam_out_dir = join(out_dir, run_prefix) paired_files.append((join(sam_out_dir, '%s_paired_1.fastq' % run_prefix), 'preprocessed_fastq')) paired_files.append((join(sam_out_dir, '%s_paired_2.fastq' % run_prefix), 'preprocessed_fastq')) r1_files.append((join(sam_out_dir, '%s_unmatched_1.fastq' % run_prefix), 'preprocessed_fastq')) r2_files.append((join(sam_out_dir, '%s_unmatched_2.fastq' % run_prefix), 'preprocessed_fastq')) ainfo = [ArtifactInfo('clean paired', 'per_sample_FASTQ', paired_files), ArtifactInfo('clean unmatched R1', 'per_sample_FASTQ', r1_files), ArtifactInfo('clean unmatched R1', 'per_sample_FASTQ', r2_files)

Note that the definition of the command in the __init__.py file would need to be update to show that this command outputs 3 artifacts.

Ok I understand better now how this works. How can we retain information about the readdirectionality in the paired reads then given the current structure? Is there something besides 'preprocessed_fastq'?

That is a totally valid point and it shows how much we need a small type decorator system that will allow this differentiation.

josenavas · 2016-10-25T14:44:35Z

qp_shotgun/kneaddata/kneaddata.py

+        The results of the job
+    """
+    # Step 1 get the rest of the information need to run kneaddata
+    qclient.update_job_step(job_id, "Step 1 of 3: Collecting information")


Step 1 of 5

josenavas · 2016-10-25T14:46:48Z

scripts/configure_shotgun

@@ -21,7 +22,9 @@ def config(env_script, server_cert):
    """Generates the Qiita configuration files"""
    if server_cert == 'None':
        server_cert = None
-    plugin.generate_config(env_script, 'start_shotgun',
+    kd_plugin.generate_config(env_script, 'start_shotgun',


Although this may work for the configure case, it will not work for the start_shotgun case, different scripts will be needed. If that is the route, I would suggest creating 2 different repos, but I think it will be easier to have a single plugin here in which we add the different commands, rather than creating a plugin per command.

antgonza

In a few minutes I'll create a new PR that has these changes + addressing @josenavas comments + fixing all errors. So closing this one.

antgonza · 2016-11-02T20:39:18Z

qp_shotgun/kneaddata/__init__.py

+    # Trimmomatic options
+    'max-memory': ['string', '500m'],  # max memory in mb [ DEFAULT : 500m ]
+    'trimmomatic': ['string', '$TRIMMOMATIC_DIR'],
+    'trimmomatic-options': ['string', 'ILLUMINACLIP:$TRIMMOMATIC_DIR/adapters/'


We need this so admins can change it.

antgonza · 2016-11-02T20:39:54Z

qp_shotgun/kneaddata/__init__.py

+                  'DEBUG'],
+
+    # Trimmomatic options
+    'max-memory': ['string', '500m'],  # max memory in mb [ DEFAULT : 500m ]


Again, for admin and this is required by JAVA! 😭

antgonza · 2016-11-02T20:40:06Z

qp_shotgun/kneaddata/__init__.py

+
+    # Trimmomatic options
+    'max-memory': ['string', '500m'],  # max memory in mb [ DEFAULT : 500m ]
+    'trimmomatic': ['string', '$TRIMMOMATIC_DIR'],


Removing this.

antgonza · 2016-11-02T20:40:15Z

qp_shotgun/kneaddata/__init__.py

+    }
+outputs = {'per_sample_FASTQ': 'per_sample_FASTQ'}
+dflt_param_set = {
+    'Defaults': {


antgonza · 2016-11-02T20:55:59Z

qp_shotgun/kneaddata/kneaddata.py

+        The results of the job
+    """
+    # Step 1 get the rest of the information need to run kneaddata
+    qclient.update_job_step(job_id, "Step 1 of 3: Collecting information")


antgonza · 2016-11-02T20:56:10Z

scripts/configure_shotgun

@@ -21,7 +22,9 @@ def config(env_script, server_cert):
    """Generates the Qiita configuration files"""
    if server_cert == 'None':
        server_cert = None
-    plugin.generate_config(env_script, 'start_shotgun',
+    kd_plugin.generate_config(env_script, 'start_shotgun',


Jon Sanders added 5 commits September 28, 2016 11:53

Initial creation of blank kneaddata scripts

2ac9903

Merge branch 'adding-humann2-pipeline' into adding-kneaddata-pipeline

20495f5

Updating to latest pull request version

Created empty kneaddata command

b142aa8

Added kneaddata import to __init__.py

6ff7909

Added kneaddata default params to __init__.py

7ee38b6

tanaes commented Sep 29, 2016

View reviewed changes

Jon Sanders added 21 commits September 29, 2016 14:15

Merging Antonio's newest pull request

c12f0d0

adding kneaddata functionality

370611b

Merge branch 'master' of https://github.com/qiita-spots/qp-shotgun in…

6715d47

…to adding-kneaddata-pipeline Updating to Antonio's new code

Added method for matching reads and run prefixes

8b5abb3

Added tests for make_read_pairs_per_sample

978f613

Added generate_kneaddata_commands

e65adba

Added tests for generate_kneaddata_commands

d306b90

commented out unused parameter for running TRF

4f1cbf4

Fixed flake8 errors

85a15ee

Added kneaddata to setup

3c2bbda

Added kneaddata plugin

caf10fd

Typos and flake8

38a3724

Fixed plugin name collision

8aa7664

Fixed error in generate_kneaddata_commands

0415af8

Fixed flake8 errors

e7f42ef

Fixing unit tests

47fb1cc

Commented out unused commands for flake8

55c7cb2

Fixing flake8 errors and command tests

e6a41a5

Fixing flake8 errors and command tests

7ba8411

Moved param string formatting to function

2a718f5

Fixed param string formatting

c24774d

Jon Sanders added 6 commits October 16, 2016 23:41

Fixed KneadData execution

6b5bb74

Fixed KneadData execution

9c6635a

Added some more fastq test files

6d03d18

Fixed local db reference

af9278a

Flake8

8b5fc09

hiding humann2 tests from nose

a22922d

wasade reviewed Oct 18, 2016

View reviewed changes

Jon Sanders added 4 commits October 18, 2016 14:36

Fixed to handle single read case in artifact generation

cd5f23f

Fixed merge conflicts with master

8457f93

Updated unit tests

6f0f4a6

Addressing @wasade's comments

830bd81

Jon Sanders added 4 commits October 18, 2016 18:39

Please work jarvis

6faf771

Please work jarvis

04533eb

flake8

d0ec922

Fixed merge conflicts with upstream master

1448263

antgonza reviewed Oct 20, 2016

View reviewed changes

tanaes commented Oct 20, 2016

View reviewed changes

Jon Sanders added 4 commits October 20, 2016 16:18

Adding some comments about read pairing

40e9ed7

Removing humann2 test masking from .travis.yml

5167ad4

flake8

d76b614

Merging Antonios humann2 code

ea6dfff

josenavas suggested changes Oct 25, 2016

View reviewed changes

antgonza reviewed Nov 2, 2016

View reviewed changes

antgonza closed this Nov 2, 2016

		- source activate qiita_env
		- export CONDA_BIN=/home/travis/miniconda3/bin

Adding kneaddata pipeline #4

Adding kneaddata pipeline #4

Conversation

tanaes commented Sep 29, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coveralls commented Oct 17, 2016

tanaes commented Oct 17, 2016

tanaes commented Oct 18, 2016 • edited Loading

wasade left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jdereus commented Oct 18, 2016

tanaes commented Oct 19, 2016

jdereus commented Oct 19, 2016

antgonza left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

josenavas left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

antgonza left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tanaes commented Oct 18, 2016 •

edited

Loading