Start project with readme, environment, and initial snakemake workflow to execute sourmash commands #1

Merged: 9 commits, Aug 15, 2022

@taylorreiter taylorreiter commented Aug 11, 2022

This PR is the first PR for this repository. It does two things:

  1. Starts a README that describes the goals of the repository, gives background on the code in this repo, explains how to run it, and leaves placeholders for future information (e.g. visualization and notebooks).
  2. Adds a Snakefile and associated files that make up a snakemake workflow for sourmash commands (see the README for motivations):
    a. Snakefile: snakemake workflow that coordinates the execution of sourmash commands on metagenome assemblies. I have run this workflow and can confirm it runs correctly :) Eventually, I will add notebooks that visualize the output of the workflow, but I wanted to have this portion reviewed before dumping a bunch more code.
    b. environment.yml: specifies the run environment for the workflow. See README.md for more information.
    c. envs/*yml: environments created and managed by the Snakefile (see the conda: directive in each rule to know which environment is used by each step of the workflow).
    d. scripts/: folder for auxiliary scripts executed by the snakemake workflow. For now it only includes sig_to_csv.py, a python script that converts a sourmash sketch into a csv file.
    e. inputs/metadata.csv: metadata file encoding sample names. Used by the Snakefile to determine file prefixes.
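
For readers unfamiliar with the pattern in item (e), here is a minimal sketch of how sample names might be pulled out of a metadata CSV so that a workflow can use them as file prefixes. The column name `sample`, the helper name `read_sample_names`, and the example rows are all assumptions for illustration; the real inputs/metadata.csv and Snakefile may look different.

```python
import csv
import io


def read_sample_names(handle, column="sample"):
    """Return the non-empty values of one CSV column, in file order."""
    reader = csv.DictReader(handle)
    names = []
    for row in reader:
        # Missing or empty cells are skipped rather than treated as samples.
        value = (row.get(column) or "").strip()
        if value:
            names.append(value)
    return names


# Hypothetical metadata contents; the real file's columns may differ.
example = io.StringIO("sample,description\nSRR0001,gut\nSRR0002,soil\n")
SAMPLES = read_sample_names(example)
print(SAMPLES)
```

A list like `SAMPLES` is the kind of value a workflow would typically expand over to build per-sample output paths.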

@taylorreiter taylorreiter marked this pull request as draft August 11, 2022 21:55
@taylorreiter taylorreiter changed the title from "Init" to "Start project with readme, environment, and initial snakemake workflow to execute sourmash commands" Aug 15, 2022
@taylorreiter taylorreiter marked this pull request as ready for review August 15, 2022 15:35
@taylorreiter
Member Author

and maybe of interest to @mertcelebi...this is a demo of how I'm expecting code pushes and PRs to look for data analysis projects.

## Download sourmash databases & taxonomy files
##########################################################

rule download_genbank_bacteria_zip_k21:

I'm assuming this will change to pulling from S3 once we have the databases bucket figured out

Member Author

yes! or be an ifelse/parameterized statement to avoid download. Although...download ended up being less horrible than I thought...I was using some weird links originally and it was taking FOREVER to get the data, but then I updated to these and it was 50mb/s and that was perfectly acceptable. See sourmash-bio/sourmash#2179 and sourmash-bio/sourmash#2136

@elizabethmcd elizabethmcd left a comment

Looks great and the README looks excellent!

@taylorreiter
Member Author

Awesome, thanks @elizabethmcd!

@taylorreiter taylorreiter merged commit 5ef8412 into main Aug 15, 2022
import argparse

def main():
    p = argparse.ArgumentParser()

Might wanna check click for creating command-line runnable scripts! It's slightly neater than argparse IMO. Or you can try to use the help option for add_argument in argparse to clean up the comments.

for @taylorreiter - we use click in many places - e.g. genome-grist link. I prefer argparse for complex situations (b/c it's been around longer, has more stackoverflow answers for oddball things) but click is way friendlier!
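
To make the argparse suggestion in this thread concrete, here is a rough sketch of documenting each argument with the `help` option of `add_argument` instead of with code comments. The argument names (`ksize`, `signature`, `output`), the `int` type for `ksize`, and the `build_parser` helper are assumptions; only the argument order is taken from the `shell:` call in the Snakefile, and the actual sig_to_csv.py may be organized differently.

```python
import argparse


def build_parser():
    p = argparse.ArgumentParser(
        description="Convert a sourmash sketch (signature) into a CSV file."
    )
    # Argument order mirrors the Snakefile call:
    #   python scripts/sig_to_csv.py {wildcards.ksize} {input} {output}
    p.add_argument("ksize", type=int, help="k-mer size of the sketch to convert")
    p.add_argument("signature", help="path to the input sourmash signature file")
    p.add_argument("output", help="path of the CSV file to write")
    return p


def main(argv=None):
    # argv=None makes argparse fall back to sys.argv[1:].
    args = build_parser().parse_args(argv)
    # The conversion itself is omitted; this only shows argument handling.
    return args


# Example invocation with hypothetical file names.
args = main(["21", "sample.sig", "sample.csv"])
print(args.ksize, args.signature, args.output)
```

With the help strings in place, running the script with `-h` prints the same information the comments used to carry.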

  - defaults
dependencies:
  - sourmash-minimal=4.4.3
  - pandas

@mertcelebi mertcelebi Aug 22, 2022

we should probably pin versions of all of these (like you did in environment.yml)
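
One way to catch unpinned packages like the `pandas` entry above is a small check over the environment file's lines. The sketch below is an illustration only: `unpinned_dependencies` is a hypothetical helper, it does plain line matching and is not a YAML parser, and it ignores nested sections such as a `pip:` list.

```python
def unpinned_dependencies(lines):
    """Return package specs under `dependencies:` that carry no version constraint."""
    unpinned = []
    in_dependencies = False
    for raw in lines:
        line = raw.strip()
        if line.endswith(":"):
            # Track whether we are inside the top-level `dependencies:` section.
            in_dependencies = line == "dependencies:"
            continue
        if in_dependencies and line.startswith("- "):
            spec = line[2:].strip()
            # Any of these characters is taken as evidence of a version constraint.
            if not any(ch in spec for ch in "=<>"):
                unpinned.append(spec)
    return unpinned


# The fragment under review, reproduced as a list of lines.
env_lines = [
    "- defaults",
    "dependencies:",
    "- sourmash-minimal=4.4.3",
    "- pandas",
]
print(unpinned_dependencies(env_lines))
```

For the fragment shown in this review, the check reports `pandas` as the only entry without a pin.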

shell:'''
python scripts/sig_to_csv.py {wildcards.ksize} {input} {output}
'''

nit: this is more of an aesthetic thing, but I like the consistency of it. Using a linter (or a text editor plugin), we should make sure there is only a single trailing newline at EOF
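
For reference, the rule described in this nit (exactly one trailing newline at EOF) can be expressed in a few lines. This is a sketch of the normalization only, with a hypothetical function name; in practice a linter or an editor plugin would normally apply it.

```python
def normalize_trailing_newline(text):
    """Return text ending in exactly one newline; empty or blank-only input becomes empty."""
    # Strip every trailing newline (and carriage return), then add one back.
    stripped = text.rstrip("\r\n")
    return stripped + "\n" if stripped else ""


# A file body that ends with several blank lines.
print(repr(normalize_trailing_newline("shell:'''\n'''\n\n\n")))
```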

@taylorreiter taylorreiter deleted the init branch September 20, 2022 20:56