Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating config file #64

Merged
merged 19 commits into from
Dec 11, 2023
Merged

Updating config file #64

merged 19 commits into from
Dec 11, 2023

Conversation

danejo3
Copy link
Collaborator

@danejo3 danejo3 commented Nov 15, 2023

The purpose of this PR is to update the config file as discussed in #56 to enable hybrid assembly and grid support #61.

In this PR, users will need to follow a strict config file format that they must provide.

Large improvements were made to enable hybrid and grid support.

  • Removed unnecessary or redundant code.
  • Refactored and organized code.
  • Provided multiple sanity checks for the user's config file.

Example of the new config file:

{
    "samples": {
        "sample1": {
            "paired": [
                [
                    "yeat/tests/data/short_reads_1.fastq.gz",
                    "yeat/tests/data/short_reads_2.fastq.gz"
                ]
            ]
        },
        "sample2": {
            "paired": [
                [
                    "yeat/tests/data/Animal_289_R1.fq.gz",
                    "yeat/tests/data/Animal_289_R2.fq.gz"
                ]
            ]
        },
        "sample3": {
            "pacbio-hifi": [
                "yeat/tests/data/ecoli.fastq.gz"
            ]
        },
        "sample4": {
            "nano-hq": [
                "yeat/tests/data/ecolk12mg1655_R10_3_guppy_345_HAC.fastq.gz"
            ]
        }
    },
    "assemblies": {
        "spades-default": {
            "algorithm": "spades",
            "extra_args": "",
            "samples": [
                "sample1",
                "sample2"
            ],
            "mode": "paired"
        },
        "hicanu": {
            "algorithm": "canu",
            "extra_args": "genomeSize=4.8m",
            "samples": [
                "sample3"
            ],
            "mode": "pacbio"
        },
        "flye_ONT": {
            "algorithm": "flye",
            "extra_args": "",
            "samples": [
                "sample4"
            ],
            "mode": "oxford"
        }
    }
}

Copy link
Collaborator Author

@danejo3 danejo3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major changes:

  • New config layout especially for paired sample reads.
  • Instead of one workflow at a time, all rules are compiled in Snakemake's DAG graph and executed.
  • Instead of creating python objects outside of Snakemake, we now create them in the workflow.
  • Cleaned up code significantly.

All of the following changes will make grid support and hybrid assembly easier to implement.

Comment on lines 14 to 17
def main(args=None):
if args is None:
args = cli.get_parser().parse_args() # pragma: no cover
workflows.run_workflows(args)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved main() from yeat/cli/cli.py to here.

Comment on lines 15 to 22
illumina.add_argument(
"-l",
"--length-required",
type=int,
metavar="L",
default=50,
help="discard reads shorter than the required L length after pre-preocessing; by default, L=50",
metavar="L",
type=int,
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the CLI a bit more organized and fleshed out some metavars.

(yeat) dane.jo$ yeat -h
usage: yeat [--init] [-h] [-v] [-n] [-o DIR] [-t T] [-l L] [-c C] [-d D] [-g G] [-s S] config

positional arguments:
  config                config file

optional arguments:
  --init                print a template assembly config file to the terminal (stdout) and exit
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

workflow arguments:
  -n, --dry-run         construct workflow DAG and print a summary but do not execute
  -o DIR, --outdir DIR  output directory; default is current working directory
  -t T, --threads T     execute workflow with T threads; by default, T=1

fastp arguments:
  -l L, --length-required L
                        discard reads shorter than the required L length after pre-preocessing; by default, L=50

downsample arguments:
  -c C, --coverage C    target an average depth of coverage Cx when auto-downsampling; by default, C=150
  -d D, --downsample D  randomly sample D reads from the input rather than assembling the full set; set D=0 to perform auto-downsampling to a
                        desired level of coverage (see --coverage); set D=-1 to disable downsampling; by default, D=0
  -g G, --genome-size G
                        provide known genome size in base pairs (bp); by default, G=0
  -s S, --seed S        seed for the random number generator used for downsampling; by default, the seed is chosen randomly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "positional arguments" term isn't wrong, but I'd recommend something like "required arguments" in this case. Actually, I have a few suggestions.

  • "options" instead of "optional arguments"
  • "workflow configuration"
  • "fastp configuration"
  • "downsampling configuration"

Names are easy to edit for argument groups that you explicitly create. But for default groups, you have to "get under the hood" to edit the group name. You should be able to find something helpful on StackOverflow or in one of our other projects—something like ._positionals and ._optionals. Let me know if you need a pointer in the right direction.

Comment on lines 22 to 30
class Assembly:
def __init__(self, label, data, threads):
self.label = label
self.algorithm = data["algorithm"]
self.extra_args = data["extra_args"]
self.samples = data["samples"]
self.mode = data["mode"]
self.threads = threads
self.validate()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created class Assembly its own dedicated file. Was clumped up with the config and sample classes.

Comment on lines +38 to +41
def check_valid_mode(self):
if self.mode not in ALGORITHMS.keys():
message = f"Invalid assembly mode '{self.mode}' for '{self.label}'"
raise AssemblyConfigError(message)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New sanity check for sample object.

  1. Must have a mode set to paired, single, pacbio, or Oxford.

Comment on lines 19 to 25
class AssemblyConfig:
def __init__(self, data, threads):
self.data = data
self.validate()
self.threads = threads
self.create_sample_and_assembly_objects()
self.validate_samples_to_assembly_modes()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created class AssemblyConfig its own dedicated file. Was clumped up with Sample and Assembly classes.

Comment on lines +234 to +243
@pytest.mark.parametrize("test", ["flat_list", "nested_lists"])
def test_check_reads(test):
label = "sample1"
if test == "flat_list":
data = {"single": [[]]}
elif test == "nested_lists":
data = {"paired": [[data_file("Animal_289_R1.fq.gz"), []]]}
pattern = rf"Input read is not a string '\[\]' for '{label}'"
with pytest.raises(AssemblyConfigError, match=pattern):
Sample(label, data)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a lot of tests to test the sanity checks. Coverage is at 99%.

Comment on lines +143 to 154
def test_validate_samples_to_assembly_modes(mode, label):
data = json.load(open(data_file("configs/example.cfg")))
if mode == "paired":
data["assemblies"]["spades-default"]["samples"] = ["sample4"]
elif mode == "pacbio":
data["assemblies"]["hicanu"]["samples"] = ["sample1"]
elif mode == "oxford":
data["assemblies"]["flye_ONT"]["samples"] = ["sample1"]
pattern = rf"No samples can interact with assembly mode '{mode}' for '{label}'"
with pytest.raises(AssemblyConfigError, match=pattern):
AssemblyConfig(data, 4)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coverage is at 99%. I can't get the last remaining 1%... Not sure why my tests are unable to get the logic jump coverage. If you use # pragma: no cover coverage will go up to 100%...

---------- coverage: platform darwin, python 3.9.18-final-0 ----------
Name                         Stmts   Miss Branch BrPart  Cover   Missing
------------------------------------------------------------------------
yeat/__init__.py                 3      0      0      0   100%
yeat/cli/__init__.py             5      0      0      0   100%
yeat/cli/cli.py                 29      0      2      0   100%
yeat/cli/illumina.py            18      0      2      0   100%
yeat/config/__init__.py          7      0      0      0   100%
yeat/config/assembly.py         35      0     12      0   100%
yeat/config/config.py           52      0     24      1    99%   66->60
yeat/config/sample.py           66      0     32      0   100%
yeat/workflows/__init__.py      21      0     10      0   100%
------------------------------------------------------------------------
TOTAL                          236      0     82      1    99%

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure, but this might be the result of the lack of an else clause in the for loop of that case. You could see if that makes a difference. In any case, 100% coverage is a great goal but not always feasible.

Comment on lines 10 to 21
from yeat.config.config import AssemblyConfig
from yeat.workflows.snakefiles import get_expected_files


cfg = AssemblyConfig(config["data"], config["threads"])
config["samples"] = cfg.samples
config["assemblies"] = cfg.assemblies


rule all:
input:
get_expected_files(config)
Copy link
Collaborator Author

@danejo3 danejo3 Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This snake file is the "bread-and-butter" of this workflow.

Instead of creating a bunch of objects in python and then deflating them into nested dictionaries, we instead create the python objects themselves here in the snake file. This has made my life easier and certainly, the ability to do grid support as well.

In this snake file, we import rules from all snake files and run all combinations of rules to produce YEAT's output. (See get_expected_files() from yeat/workflows/snakefiles/__init__.py.)

Comment on lines 83 to 67
p = Path("seq/{wildcards.sample}/downsample")
p.mkdir(parents=True, exist_ok=True)
copyfile(input.reads, output.sub)
return
if params.genome_size == 0:
df = pd.read_csv(input.mash_report, sep="\t")
genome_size = df["Length"].iloc[0]
else:
genome_size = params.genome_size
with open(params.fastp_report, "r") as fh:
qcdata = json.load(fh)
base_count = qcdata["summary"]["after_filtering"]["total_bases"]
read_count = qcdata["summary"]["after_filtering"]["total_reads"]
avg_read_length = base_count / read_count
if params.downsample == 0:
down = int((genome_size * params.coverage) / (2 * avg_read_length))
else:
down = params.downsample
if params.seed == "None":
seed = randint(1, 2**16-1)
else:
seed = params.seed
print(f"[yeat] genome size: {genome_size}")
print(f"[yeat] average read length: {avg_read_length}")
print(f"[yeat] target depth of coverage: {params.coverage}x")
print(f"[yeat] number of reads to sample: {down}")
print(f"[yeat] random seed for sampling: {seed}")
genome_size = get_genome_size(params.genome_size, input.mash_report)
avg_read_length = get_avg_read_length(params.fastp_report)
down = get_down(params.downsample, genome_size, params.coverage, avg_read_length)
seed = get_seed(params.seed)
print_downsample_values(genome_size, avg_read_length, params.coverage, down, seed)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved shared code (Paired.smk) into __init__.py.

Comment on lines 94 to 102
if params.downsample == -1:
p = Path("seq/{wildcards.sample}/downsample")
p.mkdir(parents=True, exist_ok=True)
copyfile(input.read1, output.sub1)
copyfile(input.read2, output.sub2)
return
if params.genome_size == 0:
df = pd.read_csv(input.mash_report, sep="\t")
genome_size = df["Length"].iloc[0]
else:
genome_size = params.genome_size
with open(params.fastp_report, "r") as fh:
qcdata = json.load(fh)
base_count = qcdata["summary"]["after_filtering"]["total_bases"]
read_count = qcdata["summary"]["after_filtering"]["total_reads"]
avg_read_length = base_count / read_count
if params.downsample == 0:
down = int((genome_size * params.coverage) / (2 * avg_read_length))
else:
down = params.downsample
if params.seed == "None":
seed = randint(1, 2**16-1)
else:
seed = params.seed
print(f"[yeat] genome size: {genome_size}")
print(f"[yeat] average read length: {avg_read_length}")
print(f"[yeat] target depth of coverage: {params.coverage}x")
print(f"[yeat] number of reads to sample: {down}")
print(f"[yeat] random seed for sampling: {seed}")
genome_size = get_genome_size(params.genome_size, input.mash_report)
avg_read_length = get_avg_read_length(params.fastp_report)
down = get_down(params.downsample, genome_size, params.coverage, avg_read_length)
seed = get_seed(params.seed)
print_downsample_values(genome_size, avg_read_length, params.coverage, down, seed)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved shared code (Paired.smk) into __init__.py.

@danejo3
Copy link
Collaborator Author

danejo3 commented Nov 27, 2023

@standage This PR is ready for review!

I'm honestly not sure where you should start. I've made a lot of changes; however, most of the changes were reorganizing and throwing out unnecessary code.

Main things:

  • I split up the 3 main class objects (AssemblyConfig, Sample, and Assembly) into their own individual files.
  • I changed the class object instantiation in the snakemake workflow instead of doing it before and doing a bunch of crazy roundabout ways to pass their data into the snakemake workflow.
  • I added a lot of checks to ensure that the user provided the correct config format.

Once this PR is merged, integration for grid support and hybrid will be pretty straight forward.

One of the biggest changes is instead of calling multiple snakemake jobs, we have consolidated it down to 1.

Example,
Previously,
We would have a paired run, then single run, then pacbio run, etc.., then bandage.
Before a workflow could run, the current one has to finish first.

Now, in one go,
config -- to --> various final outputs created by YEAT

@danejo3 danejo3 requested a review from standage November 27, 2023 19:14
Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big improvement overall. Instantiating config objects in the Snakefile eliminates the need to flatten for passing to the API. Programmatically determining the expected output files from the config object enabled you to coordinate all assembly tasks with a single invocation of Snakemake, which will make grid support more impactful. I have a few questions and comments: see below.

Comment on lines 15 to 22
illumina.add_argument(
"-l",
"--length-required",
type=int,
metavar="L",
default=50,
help="discard reads shorter than the required L length after pre-preocessing; by default, L=50",
metavar="L",
type=int,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "positional arguments" term isn't wrong, but I'd recommend something like "required arguments" in this case. Actually, I have a few suggestions.

  • "options" instead of "optional arguments"
  • "workflow configuration"
  • "fastp configuration"
  • "downsampling configuration"

Names are easy to edit for argument groups that you explicitly create. But for default groups, you have to "get under the hood" to edit the group name. You should be able to find something helpful on StackOverflow or in one of our other projects—something like ._positionals and ._optionals. Let me know if you need a pointer in the right direction.

Comment on lines 49 to 55
def create_sample_and_assembly_objects(self):
self.samples = {}
self.assemblies = {}
for key, value in self.data["samples"].items():
self.samples[key] = Sample(key, value)
for key, value in self.data["assemblies"].items():
self.assemblies[key] = Assembly(key, value, self.threads)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments.

  • It's good practice to instantiate instance data in the constructor, even if you fill it in with a dedicated function like this. The code here technically works, but it's common to try to get a sense for an object's contents by looking at the constructor. And in this case, they'd miss .samples and .assemblies.
  • Instead of "key" and "value", I'd use more descriptive variable names like "sample_name", "label", and "config".

self.long_readtype = next(iter(long)) if long else None

def check_input_reads(self):
self.all_reads = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern here: please instantiate in the constructor.

Comment on lines -34 to +35
@patch.dict(os.environ, {"PATH": "ROUGE"})
@patch.dict(os.environ, {"PATH": "DNE"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[confused caveman grunt]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better readability, I changed the word "ROUGE" to "DNE" which stands for "Does Not Exist".

from yeat.workflows import bandage
from yeat.workflows import snakefiles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain your thinking on introducing a "snakefiles" subsubpackage here? The "workflows" subpackage is almost empty except for "snakefiles", which suggests the new layer of nesting might be unnecessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, we had more python code in the workflows dir; however, as you pointed out, we don't have much any more with the new changes! I agree that the nesting is unnecessary. To keep all of the python functions called in the snakemake rules, I've created yeat/workflows/aux.py to hold them instead of dropping them in yeat/workflows/__init__.py to keep the __init__.py as the jumping point to run the snakemake command only.

Comment on lines +143 to 154
def test_validate_samples_to_assembly_modes(mode, label):
data = json.load(open(data_file("configs/example.cfg")))
if mode == "paired":
data["assemblies"]["spades-default"]["samples"] = ["sample4"]
elif mode == "pacbio":
data["assemblies"]["hicanu"]["samples"] = ["sample1"]
elif mode == "oxford":
data["assemblies"]["flye_ONT"]["samples"] = ["sample1"]
pattern = rf"No samples can interact with assembly mode '{mode}' for '{label}'"
with pytest.raises(AssemblyConfigError, match=pattern):
AssemblyConfig(data, 4)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure, but this might be the result of the lack of an else clause in the for loop of that case. You could see if that makes a difference. In any case, 100% coverage is a great goal but not always feasible.

rule flye:
output:
contigs="analysis/{sample}/{label}/flye/{sample}_contigs.fasta"
contigs="analysis/{sample}/{readtype,nano-raw|nano-corr|nano-hq}/{label}/flye/contigs.fasta"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wildcard constraints?

Copy link
Collaborator Author

@danejo3 danejo3 Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Originally, we did not import all rules from all snakefiles into Workflows.snk. (We ran each workflow separately—one at-a-time.) Now that we import everything and there are multiple rules that have the same name and output, I needed to make the wildcard constraints even stricter by using regex. If the wildcard constraints aren't made, when you import rules that have the same output, snakemake gets confused because it doesn't know which rule to execute.

Here's the documentation for it: https://snakemake.readthedocs.io/en/stable/tutorial/additional_features.html#constraining-wildcards

Copy link
Collaborator Author

@danejo3 danejo3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added suggestions in recent pushes.

  • Removed the unnecessary nesting in yeat/workflow directory.
  • Improved variable names
  • More unit tests

Github is getting pretty slow and laggy with all of the changes! 50 files changed!

Cleaned up the test suite and added more tests to test snakemake python code.

All tests passing. It took my computer 4 hrs and 26 mins to run the entire thing. As the project grows, I'm unsure what to do to improve test run time. I understand that running them all is necessary but it's very costly. Hopefully, we can keep it at this time range for the time being.

Ready for another review!

from yeat.tests import data_file
from yeat.workflow import aux

pytestmark = pytest.mark.short
Copy link
Collaborator Author

@danejo3 danejo3 Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the most recent update, I created a bunch of tests that tests the python code executed in the snakemake workflow (see yeat/workflow/aux.py). Previously, because I left all the python code in the snakemake files, in order to test them, I would have to run the entire workflow and do a bunch of unnecessary calculations. However, by extracting out the python code into their own individual functions in aux.py, I have been able to confine and test the code's function on an individual level.

All of the tests in test_aux.py are very quick and can be ran with make tests.

Improved the test coverage. As you mentioned, the elif statements causes the misses of logic jump coverages. I added # pragma: no cover at the end of these if statements. The remaining 1% that is not covered is caused by a bandage test. This is not convered in make test or make testall due to the dependency of the user's operating system. Run make testbandage to check that code's coverage.

---------- coverage: platform darwin, python 3.9.18-final-0 ----------
Name                        Stmts   Miss Branch BrPart  Cover   Missing
-----------------------------------------------------------------------
yeat/__init__.py                3      0      0      0   100%
yeat/cli/__init__.py            5      0      0      0   100%
yeat/cli/cli.py                31      0      2      0   100%
yeat/cli/illumina.py           18      0      2      0   100%
yeat/config/__init__.py         7      0      0      0   100%
yeat/config/assembly.py        35      0     12      0   100%
yeat/config/config.py          50      0     20      0   100%
yeat/config/sample.py          70      0     36      0   100%
yeat/workflow/__init__.py      21      0     10      0   100%
yeat/workflow/aux.py           79      3     24      1    96%   137-139
-----------------------------------------------------------------------
TOTAL                         319      3    106      1    99%

@standage standage merged commit 02f1c1e into main Dec 11, 2023
@standage standage deleted the update-config branch December 11, 2023 20:15
@danejo3 danejo3 mentioned this pull request Jan 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants