-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating config file #64
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Major changes:
- New config layout especially for paired sample reads.
- Instead of one workflow at a time, all rules are compiled in Snakemake's DAG graph and executed.
- Instead of creating python objects outside of Snakemake, we now create them in the workflow.
- Cleaned up code significantly.
All of the following changes will make grid support and hybrid assembly easier to implement.
yeat/cli/__init__.py
Outdated
def main(args=None): | ||
if args is None: | ||
args = cli.get_parser().parse_args() # pragma: no cover | ||
workflows.run_workflows(args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved main()
from yeat/cli/cli.py
to here.
illumina.add_argument( | ||
"-l", | ||
"--length-required", | ||
type=int, | ||
metavar="L", | ||
default=50, | ||
help="discard reads shorter than the required L length after pre-preocessing; by default, L=50", | ||
metavar="L", | ||
type=int, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the CLI a bit more organized and fleshed out some metavars.
(yeat) dane.jo$ yeat -h
usage: yeat [--init] [-h] [-v] [-n] [-o DIR] [-t T] [-l L] [-c C] [-d D] [-g G] [-s S] config
positional arguments:
config config file
optional arguments:
--init print a template assembly config file to the terminal (stdout) and exit
-h, --help show this help message and exit
-v, --version show program's version number and exit
workflow arguments:
-n, --dry-run construct workflow DAG and print a summary but do not execute
-o DIR, --outdir DIR output directory; default is current working directory
-t T, --threads T execute workflow with T threads; by default, T=1
fastp arguments:
-l L, --length-required L
discard reads shorter than the required L length after pre-preocessing; by default, L=50
downsample arguments:
-c C, --coverage C target an average depth of coverage Cx when auto-downsampling; by default, C=150
-d D, --downsample D randomly sample D reads from the input rather than assembling the full set; set D=0 to perform auto-downsampling to a
desired level of coverage (see --coverage); set D=-1 to disable downsampling; by default, D=0
-g G, --genome-size G
provide known genome size in base pairs (bp); by default, G=0
-s S, --seed S seed for the random number generator used for downsampling; by default, the seed is chosen randomly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "positional arguments" term isn't wrong, but I'd recommend something like "required arguments" in this case. Actually, I have a few suggestions.
- "options" instead of "optional arguments"
- "workflow configuration"
- "fastp configuration"
- "downsampling configuration"
Names are easy to edit for argument groups that you explicitly create. But for default groups, you have to "get under the hood" to edit the group name. You should be able to find something helpful on StackOverflow or in one of our other projects—something like ._positionals
and ._optionals
. Let me know if you need a pointer in the right direction.
yeat/config/assembly.py
Outdated
class Assembly: | ||
def __init__(self, label, data, threads): | ||
self.label = label | ||
self.algorithm = data["algorithm"] | ||
self.extra_args = data["extra_args"] | ||
self.samples = data["samples"] | ||
self.mode = data["mode"] | ||
self.threads = threads | ||
self.validate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created class Assembly
its own dedicated file. Was clumped up with the config
and sample
classes.
def check_valid_mode(self): | ||
if self.mode not in ALGORITHMS.keys(): | ||
message = f"Invalid assembly mode '{self.mode}' for '{self.label}'" | ||
raise AssemblyConfigError(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New sanity check for sample object.
- Must have a mode set to
paired
,single
,pacbio
, orOxford
.
yeat/config/config.py
Outdated
class AssemblyConfig: | ||
def __init__(self, data, threads): | ||
self.data = data | ||
self.validate() | ||
self.threads = threads | ||
self.create_sample_and_assembly_objects() | ||
self.validate_samples_to_assembly_modes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created class AssemblyConfig
its own dedicated file. Was clumped up with Sample
and Assembly
classes.
@pytest.mark.parametrize("test", ["flat_list", "nested_lists"]) | ||
def test_check_reads(test): | ||
label = "sample1" | ||
if test == "flat_list": | ||
data = {"single": [[]]} | ||
elif test == "nested_lists": | ||
data = {"paired": [[data_file("Animal_289_R1.fq.gz"), []]]} | ||
pattern = rf"Input read is not a string '\[\]' for '{label}'" | ||
with pytest.raises(AssemblyConfigError, match=pattern): | ||
Sample(label, data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a lot of tests to test the sanity checks. Coverage is at 99%.
def test_validate_samples_to_assembly_modes(mode, label): | ||
data = json.load(open(data_file("configs/example.cfg"))) | ||
if mode == "paired": | ||
data["assemblies"]["spades-default"]["samples"] = ["sample4"] | ||
elif mode == "pacbio": | ||
data["assemblies"]["hicanu"]["samples"] = ["sample1"] | ||
elif mode == "oxford": | ||
data["assemblies"]["flye_ONT"]["samples"] = ["sample1"] | ||
pattern = rf"No samples can interact with assembly mode '{mode}' for '{label}'" | ||
with pytest.raises(AssemblyConfigError, match=pattern): | ||
AssemblyConfig(data, 4) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coverage is at 99%. I can't get the last remaining 1%... Not sure why my tests are unable to get the logic jump coverage. If you use # pragma: no cover
coverage will go up to 100%...
---------- coverage: platform darwin, python 3.9.18-final-0 ----------
Name Stmts Miss Branch BrPart Cover Missing
------------------------------------------------------------------------
yeat/__init__.py 3 0 0 0 100%
yeat/cli/__init__.py 5 0 0 0 100%
yeat/cli/cli.py 29 0 2 0 100%
yeat/cli/illumina.py 18 0 2 0 100%
yeat/config/__init__.py 7 0 0 0 100%
yeat/config/assembly.py 35 0 12 0 100%
yeat/config/config.py 52 0 24 1 99% 66->60
yeat/config/sample.py 66 0 32 0 100%
yeat/workflows/__init__.py 21 0 10 0 100%
------------------------------------------------------------------------
TOTAL 236 0 82 1 99%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure, but this might be the result of the lack of an else
clause in the for loop of that case. You could see if that makes a difference. In any case, 100% coverage is a great goal but not always feasible.
yeat/workflows/snakefiles/Workflows
Outdated
from yeat.config.config import AssemblyConfig | ||
from yeat.workflows.snakefiles import get_expected_files | ||
|
||
|
||
cfg = AssemblyConfig(config["data"], config["threads"]) | ||
config["samples"] = cfg.samples | ||
config["assemblies"] = cfg.assemblies | ||
|
||
|
||
rule all: | ||
input: | ||
get_expected_files(config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This snake file is the "bread-and-butter" of this workflow.
Instead of creating a bunch of objects in python and then deflating them into nested dictionaries, we instead create the python objects themselves here in the snake file. This has made my life easier and certainly, the ability to do grid support as well.
In this snake file, we import rules from all snake files and run all combinations of rules to produce YEAT's output. (See get_expected_files()
from yeat/workflows/snakefiles/__init__.py
.)
yeat/workflows/snakefiles/Single
Outdated
p = Path("seq/{wildcards.sample}/downsample") | ||
p.mkdir(parents=True, exist_ok=True) | ||
copyfile(input.reads, output.sub) | ||
return | ||
if params.genome_size == 0: | ||
df = pd.read_csv(input.mash_report, sep="\t") | ||
genome_size = df["Length"].iloc[0] | ||
else: | ||
genome_size = params.genome_size | ||
with open(params.fastp_report, "r") as fh: | ||
qcdata = json.load(fh) | ||
base_count = qcdata["summary"]["after_filtering"]["total_bases"] | ||
read_count = qcdata["summary"]["after_filtering"]["total_reads"] | ||
avg_read_length = base_count / read_count | ||
if params.downsample == 0: | ||
down = int((genome_size * params.coverage) / (2 * avg_read_length)) | ||
else: | ||
down = params.downsample | ||
if params.seed == "None": | ||
seed = randint(1, 2**16-1) | ||
else: | ||
seed = params.seed | ||
print(f"[yeat] genome size: {genome_size}") | ||
print(f"[yeat] average read length: {avg_read_length}") | ||
print(f"[yeat] target depth of coverage: {params.coverage}x") | ||
print(f"[yeat] number of reads to sample: {down}") | ||
print(f"[yeat] random seed for sampling: {seed}") | ||
genome_size = get_genome_size(params.genome_size, input.mash_report) | ||
avg_read_length = get_avg_read_length(params.fastp_report) | ||
down = get_down(params.downsample, genome_size, params.coverage, avg_read_length) | ||
seed = get_seed(params.seed) | ||
print_downsample_values(genome_size, avg_read_length, params.coverage, down, seed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved shared code (Paired.smk
) into __init__.py
.
yeat/workflows/snakefiles/Paired
Outdated
if params.downsample == -1: | ||
p = Path("seq/{wildcards.sample}/downsample") | ||
p.mkdir(parents=True, exist_ok=True) | ||
copyfile(input.read1, output.sub1) | ||
copyfile(input.read2, output.sub2) | ||
return | ||
if params.genome_size == 0: | ||
df = pd.read_csv(input.mash_report, sep="\t") | ||
genome_size = df["Length"].iloc[0] | ||
else: | ||
genome_size = params.genome_size | ||
with open(params.fastp_report, "r") as fh: | ||
qcdata = json.load(fh) | ||
base_count = qcdata["summary"]["after_filtering"]["total_bases"] | ||
read_count = qcdata["summary"]["after_filtering"]["total_reads"] | ||
avg_read_length = base_count / read_count | ||
if params.downsample == 0: | ||
down = int((genome_size * params.coverage) / (2 * avg_read_length)) | ||
else: | ||
down = params.downsample | ||
if params.seed == "None": | ||
seed = randint(1, 2**16-1) | ||
else: | ||
seed = params.seed | ||
print(f"[yeat] genome size: {genome_size}") | ||
print(f"[yeat] average read length: {avg_read_length}") | ||
print(f"[yeat] target depth of coverage: {params.coverage}x") | ||
print(f"[yeat] number of reads to sample: {down}") | ||
print(f"[yeat] random seed for sampling: {seed}") | ||
genome_size = get_genome_size(params.genome_size, input.mash_report) | ||
avg_read_length = get_avg_read_length(params.fastp_report) | ||
down = get_down(params.downsample, genome_size, params.coverage, avg_read_length) | ||
seed = get_seed(params.seed) | ||
print_downsample_values(genome_size, avg_read_length, params.coverage, down, seed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved shared code (Paired.smk
) into __init__.py
.
@standage This PR is ready for review! I'm honestly not sure where you should start. I've made a lot of changes; however, most of the changes were reorganizing and throwing out unnecessary code. Main things:
Once this PR is merged, integration for grid support and hybrid will be pretty straight forward. One of the biggest changes is instead of calling multiple snakemake jobs, we have consolidated it down to 1. Example, Now, in one go, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big improvement overall. Instantiating config objects in the Snakefile eliminates the need to flatten for passing to the API. Programmatically determining the expected output files from the config object enabled you to coordinate all assembly tasks with a single invocation of Snakemake, which will make grid support more impactful. I have a few questions and comments: see below.
illumina.add_argument( | ||
"-l", | ||
"--length-required", | ||
type=int, | ||
metavar="L", | ||
default=50, | ||
help="discard reads shorter than the required L length after pre-preocessing; by default, L=50", | ||
metavar="L", | ||
type=int, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "positional arguments" term isn't wrong, but I'd recommend something like "required arguments" in this case. Actually, I have a few suggestions.
- "options" instead of "optional arguments"
- "workflow configuration"
- "fastp configuration"
- "downsampling configuration"
Names are easy to edit for argument groups that you explicitly create. But for default groups, you have to "get under the hood" to edit the group name. You should be able to find something helpful on StackOverflow or in one of our other projects—something like ._positionals
and ._optionals
. Let me know if you need a pointer in the right direction.
yeat/config/config.py
Outdated
def create_sample_and_assembly_objects(self): | ||
self.samples = {} | ||
self.assemblies = {} | ||
for key, value in self.data["samples"].items(): | ||
self.samples[key] = Sample(key, value) | ||
for key, value in self.data["assemblies"].items(): | ||
self.assemblies[key] = Assembly(key, value, self.threads) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two comments.
- It's good practice to instantiate instance data in the constructor, even if you fill it in with a dedicated function like this. The code here technically works, but it's common to try to get a sense for an object's contents by looking at the constructor. And in this case, they'd miss
.samples
and.assemblies
. - Instead of "key" and "value", I'd use more descriptive variable names like "sample_name", "label", and "config".
yeat/config/sample.py
Outdated
self.long_readtype = next(iter(long)) if long else None | ||
|
||
def check_input_reads(self): | ||
self.all_reads = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same concern here: please instantiate in the constructor.
@patch.dict(os.environ, {"PATH": "ROUGE"}) | ||
@patch.dict(os.environ, {"PATH": "DNE"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[confused caveman grunt]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better readability, I changed the word "ROUGE" to "DNE" which stands for "Does Not Exist".
yeat/tests/test_bandage.py
Outdated
from yeat.workflows import bandage | ||
from yeat.workflows import snakefiles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain your thinking on introducing a "snakefiles" subsubpackage here? The "workflows" subpackage is almost empty except for "snakefiles", which suggests the new layer of nesting might be unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, we had more python code in the workflows dir; however, as you pointed out, we don't have much any more with the new changes! I agree that the nesting is unnecessary. To keep all of the python functions called in the snakemake rules, I've created yeat/workflows/aux.py
to hold them instead of dropping them in yeat/workflows/__init__.py
to keep the __init__.py
as the jumping point to run the snakemake command only.
def test_validate_samples_to_assembly_modes(mode, label): | ||
data = json.load(open(data_file("configs/example.cfg"))) | ||
if mode == "paired": | ||
data["assemblies"]["spades-default"]["samples"] = ["sample4"] | ||
elif mode == "pacbio": | ||
data["assemblies"]["hicanu"]["samples"] = ["sample1"] | ||
elif mode == "oxford": | ||
data["assemblies"]["flye_ONT"]["samples"] = ["sample1"] | ||
pattern = rf"No samples can interact with assembly mode '{mode}' for '{label}'" | ||
with pytest.raises(AssemblyConfigError, match=pattern): | ||
AssemblyConfig(data, 4) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure, but this might be the result of the lack of an else
clause in the for loop of that case. You could see if that makes a difference. In any case, 100% coverage is a great goal but not always feasible.
yeat/workflows/snakefiles/Oxford
Outdated
rule flye: | ||
output: | ||
contigs="analysis/{sample}/{label}/flye/{sample}_contigs.fasta" | ||
contigs="analysis/{sample}/{readtype,nano-raw|nano-corr|nano-hq}/{label}/flye/contigs.fasta" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wildcard constraints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Originally, we did not import all rules from all snakefiles into Workflows.snk
. (We ran each workflow separately—one at-a-time.) Now that we import everything and there are multiple rules that have the same name and output, I needed to make the wildcard constraints even stricter by using regex. If the wildcard constraints aren't made, when you import rules that have the same output, snakemake gets confused because it doesn't know which rule to execute.
Here's the documentation for it: https://snakemake.readthedocs.io/en/stable/tutorial/additional_features.html#constraining-wildcards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added suggestions in recent pushes.
- Removed the unnecessary nesting in
yeat/workflow
directory. - Improved variable names
- More unit tests
Github is getting pretty slow and laggy with all of the changes! 50 files changed!
Cleaned up the test suite and added more tests to test snakemake python code.
All tests passing. It took my computer 4 hrs and 26 mins to run the entire thing. As the project grows, I'm unsure what to do to improve test run time. I understand that running them all is necessary but it's very costly. Hopefully, we can keep it at this time range for the time being.
Ready for another review!
from yeat.tests import data_file | ||
from yeat.workflow import aux | ||
|
||
pytestmark = pytest.mark.short |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the most recent update, I created a bunch of tests that tests the python code executed in the snakemake workflow (see yeat/workflow/aux.py
). Previously, because I left all the python code in the snakemake files, in order to test them, I would have to run the entire workflow and do a bunch of unnecessary calculations. However, by extracting out the python code into their own individual functions in aux.py
, I have been able to confine and test the code's function on an individual level.
All of the tests in test_aux.py
are very quick and can be ran with make tests
.
Improved the test coverage. As you mentioned, the elif
statements causes the misses of logic jump coverages. I added # pragma: no cover
at the end of these if statements. The remaining 1% that is not covered is caused by a bandage test. This is not convered in make test
or make testall
due to the dependency of the user's operating system. Run make testbandage
to check that code's coverage.
---------- coverage: platform darwin, python 3.9.18-final-0 ----------
Name Stmts Miss Branch BrPart Cover Missing
-----------------------------------------------------------------------
yeat/__init__.py 3 0 0 0 100%
yeat/cli/__init__.py 5 0 0 0 100%
yeat/cli/cli.py 31 0 2 0 100%
yeat/cli/illumina.py 18 0 2 0 100%
yeat/config/__init__.py 7 0 0 0 100%
yeat/config/assembly.py 35 0 12 0 100%
yeat/config/config.py 50 0 20 0 100%
yeat/config/sample.py 70 0 36 0 100%
yeat/workflow/__init__.py 21 0 10 0 100%
yeat/workflow/aux.py 79 3 24 1 96% 137-139
-----------------------------------------------------------------------
TOTAL 319 3 106 1 99%
The purpose of this PR is to update the config file as discussed in #56 to enable hybrid assembly and grid support #61.
In this PR, users will need to follow a strict config file format that they must provide.
Large improvements were made to enable hybrid and grid support.
Example of the new config file: