Updating config file #64

danejo3 · 2023-11-15T20:29:24Z

This PR updates the config file format, as discussed in #56, to enable hybrid assembly and grid support (#61).

With this PR, users must provide a config file that follows a strict format.

Large improvements were made to enable hybrid and grid support.

Removed unnecessary or redundant code.
Refactored and organized code.
Provided multiple sanity checks for the user's config file.

Example of the new config file:

{
    "samples": {
        "sample1": {
            "paired": [
                [
                    "yeat/tests/data/short_reads_1.fastq.gz",
                    "yeat/tests/data/short_reads_2.fastq.gz"
                ]
            ]
        },
        "sample2": {
            "paired": [
                [
                    "yeat/tests/data/Animal_289_R1.fq.gz",
                    "yeat/tests/data/Animal_289_R2.fq.gz"
                ]
            ]
        },
        "sample3": {
            "pacbio-hifi": [
                "yeat/tests/data/ecoli.fastq.gz"
            ]
        },
        "sample4": {
            "nano-hq": [
                "yeat/tests/data/ecolk12mg1655_R10_3_guppy_345_HAC.fastq.gz"
            ]
        }
    },
    "assemblies": {
        "spades-default": {
            "algorithm": "spades",
            "extra_args": "",
            "samples": [
                "sample1",
                "sample2"
            ],
            "mode": "paired"
        },
        "hicanu": {
            "algorithm": "canu",
            "extra_args": "genomeSize=4.8m",
            "samples": [
                "sample3"
            ],
            "mode": "pacbio"
        },
        "flye_ONT": {
            "algorithm": "flye",
            "extra_args": "",
            "samples": [
                "sample4"
            ],
            "mode": "oxford"
        }
    }
}
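To illustrate how the "samples" and "assemblies" sections relate, here is a minimal sketch that cross-checks a trimmed-down config in this layout. The `MODE_READTYPES` table and the `check_config` helper are hypothetical and are not YEAT's actual validation code; the mapping only lists read types that appear in this PR, and the file names are placeholders.

```python
import json

# Hypothetical, partial mapping from assembly mode to compatible read types;
# only read types mentioned in this PR are listed.
MODE_READTYPES = {
    "paired": {"paired"},
    "single": {"single"},
    "pacbio": {"pacbio-hifi"},
    "oxford": {"nano-raw", "nano-corr", "nano-hq"},
}


def check_config(config):
    """Return a list of human-readable problems found in the config."""
    problems = []
    samples = config["samples"]
    for label, assembly in config["assemblies"].items():
        mode = assembly["mode"]
        if mode not in MODE_READTYPES:
            problems.append(f"Invalid assembly mode '{mode}' for '{label}'")
            continue
        for sample_name in assembly["samples"]:
            if sample_name not in samples:
                problems.append(f"Sample '{sample_name}' not found for '{label}'")
            elif not MODE_READTYPES[mode] & set(samples[sample_name]):
                # The sample exists but has no read type usable in this mode.
                problems.append(
                    f"Sample '{sample_name}' has no reads for mode '{mode}' in '{label}'"
                )
    return problems


config = json.loads("""
{
    "samples": {
        "sample1": {"paired": [["r1.fastq.gz", "r2.fastq.gz"]]},
        "sample4": {"nano-hq": ["ont.fastq.gz"]}
    },
    "assemblies": {
        "spades-default": {
            "algorithm": "spades", "extra_args": "",
            "samples": ["sample1"], "mode": "paired"
        },
        "flye_ONT": {
            "algorithm": "flye", "extra_args": "",
            "samples": ["sample1"], "mode": "oxford"
        }
    }
}
""")
problems = check_config(config)
print(problems)
```

In this sketch, "flye_ONT" is deliberately pointed at a paired-end sample, so exactly one problem is reported.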

danejo3

Major changes:

New config layout, especially for paired-end sample reads.
Instead of running one workflow at a time, all rules are compiled into a single Snakemake DAG and executed together.
Instead of creating Python objects outside of Snakemake, we now create them in the workflow.
Cleaned up the code significantly.

All of the above changes will make grid support and hybrid assembly easier to implement.

danejo3 · 2023-11-27T17:15:46Z

yeat/cli/__init__.py

+def main(args=None):
+    if args is None:
+        args = cli.get_parser().parse_args()  # pragma: no cover
+    workflows.run_workflows(args)


Moved main() from yeat/cli/cli.py to here.

danejo3 · 2023-11-27T17:19:50Z

yeat/cli/illumina.py

    illumina.add_argument(
        "-l",
        "--length-required",
-        type=int,
-        metavar="L",
        default=50,
        help="discard reads shorter than the required L length after pre-preocessing; by default, L=50",
+        metavar="L",
+        type=int,
    )


Made the CLI a bit more organized and fleshed out some metavars.

(yeat) dane.jo$ yeat -h
usage: yeat [--init] [-h] [-v] [-n] [-o DIR] [-t T] [-l L] [-c C] [-d D] [-g G] [-s S] config

positional arguments:
  config                config file

optional arguments:
  --init                print a template assembly config file to the terminal (stdout) and exit
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

workflow arguments:
  -n, --dry-run         construct workflow DAG and print a summary but do not execute
  -o DIR, --outdir DIR  output directory; default is current working directory
  -t T, --threads T     execute workflow with T threads; by default, T=1

fastp arguments:
  -l L, --length-required L
                        discard reads shorter than the required L length after pre-preocessing; by default, L=50

downsample arguments:
  -c C, --coverage C    target an average depth of coverage Cx when auto-downsampling; by default, C=150
  -d D, --downsample D  randomly sample D reads from the input rather than assembling the full set;
                        set D=0 to perform auto-downsampling to a desired level of coverage (see --coverage);
                        set D=-1 to disable downsampling; by default, D=0
  -g G, --genome-size G
                        provide known genome size in base pairs (bp); by default, G=0
  -s S, --seed S        seed for the random number generator used for downsampling; by default, the seed is chosen randomly

danejo3 · 2023-11-27T17:21:59Z

yeat/config/assembly.py

+class Assembly:
+    def __init__(self, label, data, threads):
+        self.label = label
+        self.algorithm = data["algorithm"]
+        self.extra_args = data["extra_args"]
+        self.samples = data["samples"]
+        self.mode = data["mode"]
+        self.threads = threads
+        self.validate()


Moved class Assembly into its own dedicated file. It was previously clumped together with the config and sample classes.

danejo3 · 2023-11-27T17:24:39Z

yeat/config/assembly.py

+    def check_valid_mode(self):
+        if self.mode not in ALGORITHMS.keys():
+            message = f"Invalid assembly mode '{self.mode}' for '{self.label}'"
+            raise AssemblyConfigError(message)


New sanity check for the assembly object.

Its mode must be set to paired, single, pacbio, or oxford.

danejo3 · 2023-11-27T17:25:29Z

yeat/config/config.py

+class AssemblyConfig:
+    def __init__(self, data, threads):
+        self.data = data
+        self.validate()
+        self.threads = threads
+        self.create_sample_and_assembly_objects()
+        self.validate_samples_to_assembly_modes()


Moved class AssemblyConfig into its own dedicated file. It was previously clumped together with the Sample and Assembly classes.

danejo3 · 2023-11-27T17:43:05Z

yeat/tests/test_config.py

+@pytest.mark.parametrize("test", ["flat_list", "nested_lists"])
+def test_check_reads(test):
+    label = "sample1"
+    if test == "flat_list":
+        data = {"single": [[]]}
+    elif test == "nested_lists":
+        data = {"paired": [[data_file("Animal_289_R1.fq.gz"), []]]}
+    pattern = rf"Input read is not a string '\[\]' for '{label}'"
+    with pytest.raises(AssemblyConfigError, match=pattern):
+        Sample(label, data)


Added a lot of tests to test the sanity checks. Coverage is at 99%.

danejo3 · 2023-11-27T17:45:43Z

yeat/tests/test_config.py

+def test_validate_samples_to_assembly_modes(mode, label):
+    data = json.load(open(data_file("configs/example.cfg")))
+    if mode == "paired":
+        data["assemblies"]["spades-default"]["samples"] = ["sample4"]
+    elif mode == "pacbio":
+        data["assemblies"]["hicanu"]["samples"] = ["sample1"]
+    elif mode == "oxford":
+        data["assemblies"]["flye_ONT"]["samples"] = ["sample1"]
+    pattern = rf"No samples can interact with assembly mode '{mode}' for '{label}'"
+    with pytest.raises(AssemblyConfigError, match=pattern):
+        AssemblyConfig(data, 4)



Coverage is at 99%. I can't get the remaining 1%, and I'm not sure why my tests miss that branch jump. Adding # pragma: no cover brings coverage up to 100%.

---------- coverage: platform darwin, python 3.9.18-final-0 ----------
Name                         Stmts   Miss Branch BrPart  Cover   Missing
------------------------------------------------------------------------
yeat/__init__.py                 3      0      0      0   100%
yeat/cli/__init__.py             5      0      0      0   100%
yeat/cli/cli.py                 29      0      2      0   100%
yeat/cli/illumina.py            18      0      2      0   100%
yeat/config/__init__.py          7      0      0      0   100%
yeat/config/assembly.py         35      0     12      0   100%
yeat/config/config.py           52      0     24      1    99%   66->60
yeat/config/sample.py           66      0     32      0   100%
yeat/workflows/__init__.py      21      0     10      0   100%
------------------------------------------------------------------------
TOTAL                          236      0     82      1    99%

danejo3 · 2023-11-27T17:51:37Z

yeat/workflows/snakefiles/Workflows

+from yeat.config.config import AssemblyConfig
+from yeat.workflows.snakefiles import get_expected_files
+
+
+cfg = AssemblyConfig(config["data"], config["threads"])
+config["samples"] = cfg.samples
+config["assemblies"] = cfg.assemblies
+
+
+rule all:
+    input:
+        get_expected_files(config)


This Snakefile is the "bread-and-butter" of this workflow.

Instead of creating a bunch of objects in Python and then flattening them into nested dictionaries, we now create the Python objects directly in the Snakefile. This has made my life easier and should make grid support easier to implement as well.

In this Snakefile, we import the rules from all other Snakefiles and run every combination of rules needed to produce YEAT's output. (See get_expected_files() in yeat/workflows/snakefiles/__init__.py.)
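As a rough sketch of the idea behind get_expected_files(), the function below derives one target file per (assembly, sample) pair from the config. The name `expected_files_sketch`, the `SimpleNamespace` stand-ins for Assembly objects, and the output path pattern are all assumptions for illustration; the real function and its paths may differ.

```python
from types import SimpleNamespace


def expected_files_sketch(config):
    """List one contigs file per (assembly, sample) pair (hypothetical paths)."""
    files = []
    for label, assembly in config["assemblies"].items():
        for sample_name in assembly.samples:
            files.append(
                f"analysis/{sample_name}/{label}/{assembly.algorithm}/contigs.fasta"
            )
    return files


# Stand-ins for the Assembly objects stored in config["assemblies"].
config = {
    "assemblies": {
        "spades-default": SimpleNamespace(algorithm="spades", samples=["sample1", "sample2"]),
        "flye_ONT": SimpleNamespace(algorithm="flye", samples=["sample4"]),
    }
}
expected = expected_files_sketch(config)
for path in expected:
    print(path)
```

Handing a list like this to "rule all" lets Snakemake work out which rules must run, rather than launching each workflow separately.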

danejo3 · 2023-11-27T18:52:16Z

yeat/workflows/snakefiles/Single

-            p = Path("seq/{wildcards.sample}/downsample")
-            p.mkdir(parents=True, exist_ok=True)
            copyfile(input.reads, output.sub)
            return
-        if params.genome_size == 0:
-            df = pd.read_csv(input.mash_report, sep="\t")
-            genome_size = df["Length"].iloc[0]
-        else:
-            genome_size = params.genome_size
-        with open(params.fastp_report, "r") as fh:
-            qcdata = json.load(fh)
-        base_count = qcdata["summary"]["after_filtering"]["total_bases"]
-        read_count = qcdata["summary"]["after_filtering"]["total_reads"]
-        avg_read_length = base_count / read_count
-        if params.downsample == 0:
-            down = int((genome_size * params.coverage) / (2 * avg_read_length))
-        else:
-            down = params.downsample
-        if params.seed == "None":
-            seed = randint(1, 2**16-1)
-        else:
-            seed = params.seed
-        print(f"[yeat] genome size: {genome_size}")
-        print(f"[yeat] average read length: {avg_read_length}")
-        print(f"[yeat] target depth of coverage: {params.coverage}x")
-        print(f"[yeat] number of reads to sample: {down}")
-        print(f"[yeat] random seed for sampling: {seed}")
+        genome_size = get_genome_size(params.genome_size, input.mash_report)
+        avg_read_length = get_avg_read_length(params.fastp_report)
+        down = get_down(params.downsample, genome_size, params.coverage, avg_read_length)
+        seed = get_seed(params.seed)
+        print_downsample_values(genome_size, avg_read_length, params.coverage, down, seed)


Moved the downsampling code shared with Paired.smk into __init__.py.
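For reference, the helpers named in the diff could look roughly like the sketch below. The function names come from the diff, but the bodies are a reconstruction from the removed lines, not the PR's actual code. To keep the sketch standard-library only, the csv module is swapped in for the pandas call used in the removed code, and the "Length" value is assumed to be an integer.

```python
import csv
import json
from random import randint


def get_genome_size(genome_size, mash_report):
    # A user-supplied size wins; 0 means "read the estimate from the Mash report".
    if genome_size != 0:
        return genome_size
    with open(mash_report, newline="") as fh:
        first_row = next(csv.DictReader(fh, delimiter="\t"))
    return int(first_row["Length"])


def get_avg_read_length(fastp_report):
    # Average read length after filtering, taken from the fastp JSON report.
    with open(fastp_report, "r") as fh:
        qcdata = json.load(fh)
    after = qcdata["summary"]["after_filtering"]
    return after["total_bases"] / after["total_reads"]


def get_down(downsample, genome_size, coverage, avg_read_length):
    # downsample == 0 requests auto-downsampling to the target coverage.
    if downsample == 0:
        return int((genome_size * coverage) / (2 * avg_read_length))
    return downsample


def get_seed(seed):
    # Snakemake passes an unset seed as the string "None".
    if seed == "None":
        return randint(1, 2**16 - 1)
    return seed


print(get_down(0, 4_600_000, 150, 150))
```

Because each helper is a plain function, it can be unit tested without running the workflow.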

danejo3 · 2023-11-27T18:52:40Z

yeat/workflows/snakefiles/Paired

        if params.downsample == -1:
-            p = Path("seq/{wildcards.sample}/downsample")
-            p.mkdir(parents=True, exist_ok=True)
            copyfile(input.read1, output.sub1)
            copyfile(input.read2, output.sub2)
            return
-        if params.genome_size == 0:
-            df = pd.read_csv(input.mash_report, sep="\t")
-            genome_size = df["Length"].iloc[0]
-        else:
-            genome_size = params.genome_size
-        with open(params.fastp_report, "r") as fh:
-            qcdata = json.load(fh)
-        base_count = qcdata["summary"]["after_filtering"]["total_bases"]
-        read_count = qcdata["summary"]["after_filtering"]["total_reads"]
-        avg_read_length = base_count / read_count
-        if params.downsample == 0:
-            down = int((genome_size * params.coverage) / (2 * avg_read_length))
-        else:
-            down = params.downsample
-        if params.seed == "None":
-            seed = randint(1, 2**16-1)
-        else:
-            seed = params.seed
-        print(f"[yeat] genome size: {genome_size}")
-        print(f"[yeat] average read length: {avg_read_length}")
-        print(f"[yeat] target depth of coverage: {params.coverage}x")
-        print(f"[yeat] number of reads to sample: {down}")
-        print(f"[yeat] random seed for sampling: {seed}")
+        genome_size = get_genome_size(params.genome_size, input.mash_report)
+        avg_read_length = get_avg_read_length(params.fastp_report)
+        down = get_down(params.downsample, genome_size, params.coverage, avg_read_length)
+        seed = get_seed(params.seed)
+        print_downsample_values(genome_size, avg_read_length, params.coverage, down, seed)


Moved the downsampling code shared with the Single workflow into __init__.py.

danejo3 · 2023-11-27T19:13:56Z

@standage This PR is ready for review!

I'm honestly not sure where you should start. I've made a lot of changes; however, most of them were reorganizing and throwing out unnecessary code.

Main things:

I split up the 3 main class objects (AssemblyConfig, Sample, and Assembly) into their own individual files.
I moved the class object instantiation into the Snakemake workflow, instead of doing it beforehand and passing the data into the workflow in a bunch of roundabout ways.
I added a lot of checks to ensure that the user provided the correct config format.

Once this PR is merged, integration for grid support and hybrid assembly will be pretty straightforward.

One of the biggest changes is that instead of calling multiple Snakemake jobs, we have consolidated everything down to one.

For example, previously:
We would have a paired run, then a single run, then a pacbio run, and so on, followed by Bandage.
Each workflow had to finish before the next one could start.

Now, in one go:
config -- to --> various final outputs created by YEAT

standage

Big improvement overall. Instantiating config objects in the Snakefile eliminates the need to flatten for passing to the API. Programmatically determining the expected output files from the config object enabled you to coordinate all assembly tasks with a single invocation of Snakemake, which will make grid support more impactful. I have a few questions and comments: see below.

standage · 2023-11-29T15:56:25Z

yeat/cli/illumina.py

    illumina.add_argument(
        "-l",
        "--length-required",
-        type=int,
-        metavar="L",
        default=50,
        help="discard reads shorter than the required L length after pre-preocessing; by default, L=50",
+        metavar="L",
+        type=int,
    )


The "positional arguments" term isn't wrong, but I'd recommend something like "required arguments" in this case. Actually, I have a few suggestions.

"options" instead of "optional arguments"

"workflow configuration"

"fastp configuration"

"downsampling configuration"

Names are easy to edit for argument groups that you explicitly create. But for default groups, you have to "get under the hood" to edit the group name. You should be able to find something helpful on StackOverflow or in one of our other projects—something like ._positionals and ._optionals. Let me know if you need a pointer in the right direction.
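A minimal sketch of the renaming described above, using a cut-down parser rather than YEAT's real one. Note that `_positionals` and `_optionals` are private attributes of `argparse.ArgumentParser`, not part of its documented API, so this relies on an implementation detail.

```python
import argparse

parser = argparse.ArgumentParser(prog="yeat")

# Default groups: no public setter exists, so rename them via private attributes.
parser._positionals.title = "required arguments"
parser._optionals.title = "options"
parser.add_argument("config", help="config file")

# Explicitly created groups take their title directly.
workflow = parser.add_argument_group("workflow configuration")
workflow.add_argument(
    "-t",
    "--threads",
    default=1,
    help="execute workflow with T threads; by default, T=1",
    metavar="T",
    type=int,
)

help_text = parser.format_help()
print(help_text)
```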

standage · 2023-11-29T16:07:05Z

yeat/config/config.py

+    def create_sample_and_assembly_objects(self):
+        self.samples = {}
+        self.assemblies = {}
+        for key, value in self.data["samples"].items():
+            self.samples[key] = Sample(key, value)
+        for key, value in self.data["assemblies"].items():
+            self.assemblies[key] = Assembly(key, value, self.threads)


Two comments.

It's good practice to instantiate instance data in the constructor, even if you fill it in with a dedicated function like this. The code here technically works, but it's common to try to get a sense for an object's contents by looking at the constructor. And in this case, they'd miss .samples and .assemblies.

Instead of "key" and "value", I'd use more descriptive variable names like "sample_name", "label", and "config".
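Both suggestions can be sketched together as follows. The `Sample` and `Assembly` classes here are stripped-down stand-ins, and the loop variable names are only one possible choice.

```python
class Sample:
    """Stand-in for the real Sample class."""

    def __init__(self, label, data):
        self.label = label
        self.data = data


class Assembly:
    """Stand-in for the real Assembly class."""

    def __init__(self, label, data, threads):
        self.label = label
        self.data = data
        self.threads = threads


class AssemblyConfig:
    def __init__(self, data, threads):
        self.data = data
        self.threads = threads
        # Declare every attribute in the constructor, even if a helper fills it in.
        self.samples = {}
        self.assemblies = {}
        self.create_sample_and_assembly_objects()

    def create_sample_and_assembly_objects(self):
        for sample_name, sample_config in self.data["samples"].items():
            self.samples[sample_name] = Sample(sample_name, sample_config)
        for label, assembly_config in self.data["assemblies"].items():
            self.assemblies[label] = Assembly(label, assembly_config, self.threads)


data = {
    "samples": {"sample1": {"paired": [["r1.fastq.gz", "r2.fastq.gz"]]}},
    "assemblies": {"spades-default": {"algorithm": "spades", "samples": ["sample1"]}},
}
cfg = AssemblyConfig(data, 4)
print(sorted(cfg.samples), sorted(cfg.assemblies))
```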

standage · 2023-11-29T16:08:21Z

yeat/config/sample.py

+        self.long_readtype = next(iter(long)) if long else None
+
+    def check_input_reads(self):
+        self.all_reads = []


Same concern here: please instantiate in the constructor.

standage · 2023-11-29T16:13:05Z

yeat/tests/test_bandage.py

-@patch.dict(os.environ, {"PATH": "ROUGE"})
+@patch.dict(os.environ, {"PATH": "DNE"})


[confused caveman grunt]

For better readability, I changed the word "ROUGE" to "DNE" which stands for "Does Not Exist".

standage · 2023-11-29T16:16:58Z

yeat/tests/test_bandage.py

-from yeat.workflows import bandage
+from yeat.workflows import snakefiles


Could you explain your thinking on introducing a "snakefiles" subsubpackage here? The "workflows" subpackage is almost empty except for "snakefiles", which suggests the new layer of nesting might be unnecessary.

Originally, we had more Python code in the workflows dir; however, as you pointed out, we don't have much anymore with the new changes! I agree that the nesting is unnecessary. I've created yeat/workflows/aux.py to hold all of the Python functions called in the Snakemake rules, rather than dropping them in yeat/workflows/__init__.py, so that __init__.py remains solely the entry point for running the snakemake command.

standage · 2023-11-29T16:33:36Z

yeat/tests/test_config.py

+def test_validate_samples_to_assembly_modes(mode, label):
+    data = json.load(open(data_file("configs/example.cfg")))
+    if mode == "paired":
+        data["assemblies"]["spades-default"]["samples"] = ["sample4"]
+    elif mode == "pacbio":
+        data["assemblies"]["hicanu"]["samples"] = ["sample1"]
+    elif mode == "oxford":
+        data["assemblies"]["flye_ONT"]["samples"] = ["sample1"]
+    pattern = rf"No samples can interact with assembly mode '{mode}' for '{label}'"
+    with pytest.raises(AssemblyConfigError, match=pattern):
+        AssemblyConfig(data, 4)



I'm not 100% sure, but this might be the result of the lack of an else clause in the for loop of that case. You could see if that makes a difference. In any case, 100% coverage is a great goal but not always feasible.

standage · 2023-11-29T16:36:15Z

yeat/workflows/snakefiles/Oxford

 rule flye:
    output:
-        contigs="analysis/{sample}/{label}/flye/{sample}_contigs.fasta"
+        contigs="analysis/{sample}/{readtype,nano-raw|nano-corr|nano-hq}/{label}/flye/contigs.fasta"


Wildcard constraints?

Yes. Originally, we did not import all rules from all Snakefiles into Workflows.snk; we ran each workflow separately, one at a time. Now that we import everything and multiple rules share the same name and output, I needed to make the wildcard constraints stricter by using regular expressions. Without the constraints, importing rules that have the same output pattern confuses Snakemake because it cannot tell which rule to execute.

Here's the documentation for it: https://snakemake.readthedocs.io/en/stable/tutorial/additional_features.html#constraining-wildcards
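The effect of the constraint can be illustrated with plain regular expressions. Snakemake matches an unconstrained wildcard with the regex `.+`, while `{readtype,nano-raw|nano-corr|nano-hq}` restricts the match to those three values. The sketch below approximates that matching with Python's `re` module. It is not how Snakemake is invoked, and the second target path with its "flye_hifi" label is hypothetical.

```python
import re

# Regex equivalent of the constrained output pattern from the diff above.
constrained = re.compile(
    r"analysis/(?P<sample>.+)/(?P<readtype>nano-raw|nano-corr|nano-hq)"
    r"/(?P<label>.+)/flye/contigs\.fasta"
)
# The same pattern with an unconstrained readtype wildcard.
unconstrained = re.compile(
    r"analysis/(?P<sample>.+)/(?P<readtype>.+)/(?P<label>.+)/flye/contigs\.fasta"
)

targets = [
    "analysis/sample4/nano-hq/flye_ONT/flye/contigs.fasta",
    "analysis/sample3/pacbio-hifi/flye_hifi/flye/contigs.fasta",  # hypothetical
]
results = {
    target: (bool(constrained.fullmatch(target)), bool(unconstrained.fullmatch(target)))
    for target in targets
}
for target, (strict, loose) in results.items():
    print(f"{target}: constrained={strict} unconstrained={loose}")
```

The unconstrained pattern claims both targets, so two rules written that way would be ambiguous. The constrained pattern claims only the Nanopore target.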

danejo3

Added suggestions in recent pushes.

Removed the unnecessary nesting in yeat/workflow directory.
Improved variable names
More unit tests

GitHub is getting pretty slow and laggy with all of the changes! 50 files changed!

Cleaned up the test suite and added more tests to test snakemake python code.

All tests passing. It took my computer 4 hrs and 26 mins to run the entire thing. As the project grows, I'm unsure what to do to improve test run time. I understand that running them all is necessary but it's very costly. Hopefully, we can keep it at this time range for the time being.

Ready for another review!

danejo3 · 2023-12-08T14:37:54Z

yeat/tests/test_aux.py

+from yeat.tests import data_file
+from yeat.workflow import aux
+
+pytestmark = pytest.mark.short


In the most recent update, I created a bunch of tests that test the Python code executed in the Snakemake workflow (see yeat/workflow/aux.py). Previously, because all of the Python code lived in the Snakemake files, testing it meant running the entire workflow and doing a bunch of unnecessary calculations. By extracting the Python code into individual functions in aux.py, I can now isolate and test each function on its own.

All of the tests in test_aux.py are very quick and can be run with make test.

Improved the test coverage. As you mentioned, the elif statements cause the missed branch jumps. I added # pragma: no cover at the end of these if statements. The remaining 1% that is not covered comes from a Bandage test. It is not covered by make test or make testall because it depends on the user's operating system. Run make testbandage to check that code's coverage.

---------- coverage: platform darwin, python 3.9.18-final-0 ----------
Name                        Stmts   Miss Branch BrPart  Cover   Missing
-----------------------------------------------------------------------
yeat/__init__.py                3      0      0      0   100%
yeat/cli/__init__.py            5      0      0      0   100%
yeat/cli/cli.py                31      0      2      0   100%
yeat/cli/illumina.py           18      0      2      0   100%
yeat/config/__init__.py         7      0      0      0   100%
yeat/config/assembly.py        35      0     12      0   100%
yeat/config/config.py          50      0     20      0   100%
yeat/config/sample.py          70      0     36      0   100%
yeat/workflow/__init__.py      21      0     10      0   100%
yeat/workflow/aux.py           79      3     24      1    96%   137-139
-----------------------------------------------------------------------
TOTAL                         319      3    106      1    99%

danejo3 added 16 commits November 6, 2023 14:34

first commit; saving work

6e78b13

saving work; removing hybrid

149838c

saving work; all snakemake files updated

c463ac4

saving work; major updates

d5b098d

saving work; cleaning up code

9428ddb

saving work; cleaning up code

566d5ae

saving work; updating tests

97abafd

saving work; cleaning up tests

8135134

cleaning up json

237ed77

cleaning up code

58a14a6

saving work; cleaning up snakefiles

7d138ec

saving work; cleaning up snakefiles

0871a8e

saving work; cleaning up tests

428de55

saving work; cleaning up tests

7494173

updated error messages

9971fff

finished tests; coverage 99%

085fb31

danejo3 commented Nov 27, 2023

danejo3 requested a review from standage November 27, 2023 19:14

hifiasm_meta instead of hifiasm-meta

450509d

standage reviewed Nov 29, 2023

danejo3 added 2 commits December 6, 2023 11:40

added suggestions

dab5a39

added more tests; added suggestions

cd8e6c6

danejo3 commented Dec 8, 2023

standage approved these changes Dec 11, 2023

standage merged commit 02f1c1e into main Dec 11, 2023

standage deleted the update-config branch December 11, 2023 20:15

danejo3 mentioned this pull request Jan 8, 2024

0.5-RC #67

Merged
