-
Notifications
You must be signed in to change notification settings - Fork 2
merge-haplotypes for simulating linkedreads #221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
imputeparams -> template impute hpc xxx -> template hpc-xxx unified simulate clarified docstrings
📝 WalkthroughWalkthroughThe changes update multiple components of the project by removing deprecated cleanup and HPC commands and standardizing file path management. In the GitHub workflow, cleanup commands are removed, and the HPC scheduler command order is altered. In the Python modules, support for HTCondor is removed in favor of SLURM, with corresponding simplifications in documentation. The simulation command now accepts a new Changes
Sequence Diagram(s)sequenceDiagram
participant U as User
participant CLI as Harpy CLI
participant SW as Snakemake Workflow
U->>CLI: Execute `harpy simulate linkedreads [--merge-haplotypes]`
CLI->>SW: Initiate simulation (flag: merge_haplotypes)
alt merge_haplotypes true
SW->>SW: Run rule `combine_haplotypes` to merge haplotype files
SW->>CLI: Output merged read file
else merge_haplotypes false
SW->>CLI: Output separate haplotype files for each haplotype
end
Possibly related PRs
Suggested labels
Tip ⚡🧪 Multi-step agentic review comment chat (experimental)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (1)
resources/changelog.md (1)
13-14
: Fix grammatical mistake in the bullet for--merge-haplotypes
.
Currently states “as a convenience features.” Recommend singular:- - `harpy simulate linkedreads` adds `--merge-haplotypes` as a convenience features to merge R1 reads for hap0 and hap1 (same for R2) + - `harpy simulate linkedreads` adds `--merge-haplotypes` as a convenience feature to merge R1 reads for hap0 and hap1 (same for R2)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (28)
.github/filters.yml
(3 hunks).github/workflows/tests.yml
(2 hunks)harpy/__main__.py
(3 hunks)harpy/_validations.py
(2 hunks)harpy/align.py
(1 hunks)harpy/hpc.py
(6 hunks)harpy/impute.py
(1 hunks)harpy/imputeparams.py
(0 hunks)harpy/simulate.py
(1 hunks)harpy/simulate_linkedreads.py
(5 hunks)harpy/simulate_variants.py
(4 hunks)harpy/snakefiles/align_bwa.smk
(12 hunks)harpy/snakefiles/align_ema.smk
(13 hunks)harpy/snakefiles/align_strobealign.smk
(10 hunks)harpy/snakefiles/phase.smk
(11 hunks)harpy/snakefiles/simulate_linkedreads.smk
(2 hunks)harpy/snakefiles/snp_freebayes.smk
(10 hunks)harpy/snakefiles/snp_mpileup.smk
(12 hunks)harpy/snakefiles/sv_leviathan.smk
(7 hunks)harpy/snakefiles/sv_leviathan_pop.smk
(10 hunks)harpy/snakefiles/sv_naibr.smk
(7 hunks)harpy/snakefiles/sv_naibr_phase.smk
(9 hunks)harpy/snakefiles/sv_naibr_pop.smk
(10 hunks)harpy/snakefiles/sv_naibr_pop_phase.smk
(8 hunks)harpy/snp.py
(4 hunks)harpy/sv.py
(4 hunks)harpy/template.py
(2 hunks)resources/changelog.md
(3 hunks)
💤 Files with no reviewable changes (1)
- harpy/imputeparams.py
🧰 Additional context used
🪛 LanguageTool
resources/changelog.md
[grammar] ~18-~18: This phrase is duplicated. You should probably use “now is” only once.
Context: ... Renamed Commands - harpy popgroup
is now harpy template groupings
- harpy imputeparams
is now harpy template impute
- harpy hpc
is now harpy template hpc-*
- where `*...
(PHRASE_REPETITION)
[duplication] ~61-~61: Possible typo: you repeated a word.
Context: ... more [direct] reliance on subprocess
calls - calls pygmentize
directly through python AP...
(ENGLISH_WORD_REPEAT_RULE)
🪛 Ruff (0.8.2)
harpy/simulate_linkedreads.py
95-95: Undefined name combine_haplotypes
(F821)
🔇 Additional comments (140)
harpy/align.py (2)
23-27
: Documentation improvement with tabular format 👍The change to a tabular format for aligner documentation enhances readability and makes it easier to compare the different aligners' capabilities at a glance.
33-41
: Well-structured module documentationAdding the
module_docstring
dictionary provides a consistent and structured way to document the available commands in the align module. This enhancement aligns with similar documentation improvements in other files.harpy/snp.py (4)
28-36
: Good addition of structured documentationThe
module_docstring
dictionary provides a clean, consistent way to document the available commands in the SNP module, following the same pattern used in other modules.
63-63
: Appropriate integration of module documentationUsing the dictionary merging operator to combine the existing docstring with the new module_docstring maintains code clarity while adding the new structured documentation.
94-94
: Command reference updated for consistencyReplacing
harpy popgroup
withharpy template
aligns with the PR's restructuring of commands and ensures documentation consistency.
198-198
: Command reference updated for consistencyReplacing
harpy popgroup
withharpy template
in this docstring maintains consistency with the changes elsewhere in the codebase.harpy/sv.py (5)
24-27
: Enhanced documentation with feature comparison tableThe tabular format clearly illustrates the capabilities of each SV caller, making it easier for users to choose the appropriate tool for their needs.
33-41
: Well-structured module documentationAdding the
module_docstring
dictionary provides a consistent documentation approach across modules, improving overall code organization.
68-68
: Appropriate integration of module documentationUsing the dictionary merging operator to combine the existing docstring with the new module_docstring maintains code clarity.
97-97
: Command reference updated for consistencyReplacing
harpy popgroup
withharpy template
aligns with the PR's command restructuring.
203-203
: Command reference updated for consistencyReplacing
harpy popgroup
withharpy template
maintains documentation consistency.harpy/impute.py (1)
54-54
: Command reference updated for consistencyReplacing
harpy stitchparams
withharpy template
reflects the consolidation of related functionalities under the template command group as described in the PR objectives.harpy/_validations.py (2)
67-67
: Command reference updatedThe error message has been updated to use the new command name
harpy template
instead of the oldharpy stitchparams
. This aligns with the larger CLI restructuring in the PR.
97-97
: Command reference updatedUpdated the error message to use the new command name
harpy template
instead ofharpy stitchparams
, consistent with the command structure changes.harpy/simulate_variants.py (4)
149-149
: Improved documentation clarityThe help text for the
--only-vcf
option has been updated to be more precise about what the option does. It now clearly states that it will "only generate the second VCF file and not simulate a second FASTA file" rather than the previous wording about generating only VCF files without creating the diploid genome.
273-273
: Consistent help text updateThe help text for the
--only-vcf
option has been updated in theinversion
function to match the same clarification made in thesnpindel
function.
378-378
: Consistent help text updateThe help text for the
--only-vcf
option has been updated in thecnv
function to match the same clarification made in the other functions.
487-487
: Consistent help text updateThe help text for the
--only-vcf
option has been updated in thetranslocation
function to match the same clarification made in the other functions..github/filters.yml (3)
142-142
: Added new module file to simvars sectionAdded
harpy/simulate.py
to the simvars section. This file likely consolidates simulation commands for genomic variants, replacing or complementing the existing simulation functionality.
151-151
: Added new module file to simreads sectionAdded
harpy/simulate.py
to the simreads section, indicating that this file contains functionality related to simulating reads as well as variants.
168-168
: Updated module files in other sectionAdded
harpy/template.py
to the other section, likely replacing the removed filesharpy/imputeparams.py
andharpy/popgroup.py
as part of the CLI restructuring mentioned in the PR objectives.harpy/snakefiles/simulate_linkedreads.smk (3)
16-16
: Added new configuration parameterAdded a new configuration parameter
combine_haplotypes
to support the new functionality mentioned in the PR objectives that enhances simulation of linked reads.
177-187
: Added new rule to merge haplotype outputsImplemented a new rule
combine_haplotypes
that concatenates the haplotype files from both simulated haplotypes into single output files. This adds the capability to merge haplotypes for simulating linked reads, which was a key objective of this PR.
191-192
: Updated workflow summary to include merged haplotypesModified the
workflow_summary
rule to conditionally include the outputs from thecombine_haplotypes
rule when that feature is enabled. This properly integrates the new functionality into the workflow's summary mechanism.harpy/snakefiles/sv_leviathan_pop.smk (3)
18-18
: Introduction of centralized workflow directory pathThe addition of
workflowdir
as a centralized variable for workflow-related paths is a good architectural improvement that will help with maintainability.
35-37
: Improved genome file path organizationThe genome path is now properly organized under the workflow directory structure instead of the previous "Genome/" folder, which aligns with the PR objective to update genome processing to occur on a per-execution basis.
333-333
: Consistent path structure for workflow summaryThe workflow summary path has been updated to use the new directory structure, maintaining consistency with other path changes throughout the file.
harpy/snakefiles/sv_naibr_pop.smk (3)
19-19
: Standardized workflow directory definitionThe
workflowdir
variable definition is consistent with the changes in other Snakemake files, ensuring uniformity across the codebase.
33-35
: Properly handled genome path constructionThe conditional logic for handling gzipped genomes is correctly maintained while updating the path structure to use the workflow directory.
316-316
: Updated workflow summary pathThis change ensures the workflow summary is stored in a consistent location with the other Snakemake files.
harpy/snakefiles/phase.smk (3)
17-18
: Consistent workflow directory implementationThe
workflowdir
variable is properly introduced, maintaining the same pattern seen in other Snakemake files in this PR.
46-47
: Updated genome path variablesThe genome path variables have been properly updated to use the new workflow directory structure, consistent with the changes in other files.
314-314
: Consistent update to summary file locationThe workflow summary file is now properly placed in the workflow directory, which aligns with the PR's goal of restructuring the output organization.
harpy/snakefiles/snp_freebayes.smk (3)
17-17
: Standardized workflow directory definitionThe introduction of the
workflowdir
variable follows the same pattern as other Snakemake files in this PR, ensuring consistency.
33-33
: Consolidated genome path definitionThe
workflow_geno
variable now properly uses the new directory structure, eliminating the hardcoded "Genome/" path used previously.
260-260
: Consistent workflow summary path updateThe location of the workflow summary file has been updated to use the new directory structure, maintaining consistency with other Snakemake files.
harpy/simulate.py (3)
1-40
: Well-structured command group for simulationThe new
simulate
command group effectively unifies the simulation commands for genomic variants and linked-reads under a single interface. The docstring provides clear instructions on the purpose and usage of each subcommand.
19-32
: Good organization of commands by categoryThe docstring dictionary effectively categorizes commands into "Linked Read Sequences" and "Genomic Variants" with appropriate visual styling, making the CLI more user-friendly.
36-40
: Clean command registration approachThe code properly registers all simulation subcommands from their respective modules using
add_command()
, maintaining modularity while providing a unified interface.harpy/simulate_linkedreads.py (4)
24-24
: Appropriate placement of new option in UI documentationThe new
--merge-haplotypes
option is correctly added to the "Workflow Options" section in the docstring dictionary.
40-40
: Well-defined flag for the new merge-haplotypes featureThe CLI option is correctly defined with appropriate default value (False) and helpful description.
49-49
: Function signature correctly updatedThe
linkedreads
function signature has been properly updated to include the newmerge_haplotypes
parameter.
58-59
: Clear documentation for the new featureThe docstring clearly explains the effect of using the
--merge-haplotypes
option, which is helpful for users.harpy/snakefiles/align_strobealign.smk (3)
25-25
: Good approach for standardizing genome file pathsThe
workflow_geno
variable effectively centralizes the path construction for genome files using the new workflow directory structure.
47-51
: Updated output path for process_genome ruleThe output path now correctly uses the
workflow_geno
variable, which aligns with the PR objective of updating genome processing to occur on a per-execution basis.
356-357
: Path updated for workflow summary fileThe path for the workflow summary file now uses the
workflowdir
variable, which is consistent with the other path changes in this file.harpy/snakefiles/snp_mpileup.smk (4)
17-17
: Proper definition of workflowdirThe
workflowdir
variable is correctly defined asf"{outdir}/workflow"
, which aligns with the PR objective of organizing output under outdir/workflow/genome.
28-30
: Well-structured genome file path variablesThe creation of
workflow_geno
andworkflow_geno_idx
variables provides a clean way to handle both the genome file and its index, with proper handling of compressed files.
50-51
: Updated output path for process_genome ruleThe output path now correctly uses the
workflow_geno
variable, which aligns with the PR objective of updating genome processing.
299-300
: Path updated for workflow summary fileThe path for the workflow summary file now uses the
workflowdir
variable, which is consistent with the other path changes in this file.harpy/snakefiles/align_ema.smk (14)
17-17
: Good addition of workflowdir variable for better organizationAdding this variable is a good practice for improving the organization of output files by having a consistent workflow directory structure.
25-29
: Good refactoring of genome file pathsThe modification of genome file paths to use the new workflowdir variable creates a more organized directory structure, matching the PR objectives of updating genome processing to occur on a per-execution basis.
49-49
: Path updated correctly to use workflow_genoThis path update is consistent with the structural changes in the codebase, moving away from hardcoded paths to the new workflow directory structure.
64-68
: Consistent path usage in index_genome ruleGood job updating both input and output paths consistently to use the new workflow_geno variable. This ensures that all references to the genome are properly aligned with the new structure.
84-89
: Consistent path usage in bwa_index ruleFile paths have been correctly updated to use the workflow_geno variable consistently for both input and outputs.
97-97
: Consistent usage in make_depth_intervals rule inputThe path for the genome faidx file has been correctly updated to use the workflow_geno variable.
158-160
: Properly updated genome references in align_ema ruleAll genome references in the align_ema rule have been updated to use the workflow_geno and related variables, ensuring consistency with the new structure.
192-194
: Alignment with new directory structure in align_bwa ruleThe genome paths in align_bwa rule have been correctly updated to use the workflow_geno variable and its related indices.
216-217
: Consistent path usage in mark_duplicates ruleThe genome references in mark_duplicates rule have been properly updated to use workflow_geno.
303-303
: Proper path reference in molecule_coverage ruleThe faidx path has been correctly updated to use the workflow_geno variable.
315-316
: Configuration file paths updated in report_config ruleThe report configuration file paths have been correctly updated to reference the workflowdir.
332-332
: Updated QMD path in sample_reports ruleThe QMD template path has been correctly updated to use workflowdir.
389-389
: Updated QMD path in barcode_report ruleThe QMD template path has been correctly updated to use workflowdir.
450-451
: Updated summary file location in workflow_summaryThe summary file path has been correctly updated to use workflowdir, consistent with the overall restructuring approach.
harpy/snakefiles/align_bwa.smk (13)
17-17
: Good addition of workflowdir variable for consistent directory structureAdding this variable establishes a clear and consistent directory structure for workflow outputs, aligning with the PR's objective to improve file organization.
24-28
: Good refactoring of genome file variablesThe modification of genome file paths to use the new workflowdir variable creates a more organized directory structure and encapsulates the path construction logic effectively.
45-45
: Path updated correctly in process_genome ruleOutput path is correctly updated to use workflow_geno in the process_genome rule.
60-65
: Consistent path usage in samtools_faidx ruleBoth input and output paths are correctly updated to use workflow_geno in the samtools_faidx rule, maintaining consistency with the new directory structure.
81-81
: Updated path in make_depth_intervals ruleThe input path is correctly updated to use workflow_geno in the make_depth_intervals rule.
99-101
: Consistent path usage in bwa_index ruleBoth input and output paths are correctly updated to use workflow_geno in the bwa_index rule.
112-113
: Updated paths in align ruleThe genome paths are correctly updated to use workflow_geno in the align rule.
137-138
: Consistent path usage in mark_duplicates ruleThe genome paths are correctly updated to use workflow_geno in the mark_duplicates rule.
199-199
: Updated path in molecule_coverage ruleThe faidx path is correctly updated to use workflow_geno in the molecule_coverage rule.
223-224
: Updated configuration file pathsThe paths for report configuration files are correctly updated to use workflowdir.
240-240
: Updated QMD path in sample_reports ruleThe QMD template path is correctly updated to use workflowdir.
312-312
: Updated QMD path in barcode_report ruleThe QMD template path is correctly updated to use workflowdir.
357-357
: Updated summary file locationThe summary file path is correctly updated to use workflowdir in the workflow_summary rule.
.github/workflows/tests.yml (3)
827-829
: Command renamed from imputeparams to template imputeThis change is part of the command-line interface restructuring mentioned in the PR. The command has been renamed from
harpy imputeparams
toharpy template impute
, making it more consistent with a template-based approach.
830-832
: Command renamed from popgroup to template groupingsThe
harpy popgroup
command has been renamed toharpy template groupings
, which better describes its purpose and aligns with the template-based naming convention.
842-845
: HPC commands updated to use template-based approachThe HPC-related commands have been updated to use a template-based approach, replacing individual commands with template variants. This change is consistent with the overall restructuring of the command-line interface.
harpy/snakefiles/sv_leviathan.smk (9)
18-18
: Good addition of workflowdir variable for consistent structureAdding this variable establishes a consistent directory structure for the workflow outputs, aligning with the PR's objective to improve file organization.
35-38
: Improved genome file path handlingThe conditional logic for determining the genome filename has been updated to use the workflowdir variable, ensuring that the genome is stored in a consistent location within the workflow directory.
67-67
: Path updated correctly in process_genome ruleThe output path is correctly updated to use workflow_geno in the process_genome rule.
75-79
: Consistent path usage in index_genome ruleBoth input and output paths are correctly updated to use workflow_geno in the index_genome rule.
87-91
: Updated paths in bwa_index_genome ruleThe genome paths are correctly updated to use workflow_geno in the bwa_index_genome rule.
101-102
: Properly updated genome references in call_variants ruleAll genome references in the call_variants rule have been updated to use the workflow_geno variable, ensuring consistent path usage throughout the workflow.
190-191
: Updated configuration file pathsThe paths for report configuration files are correctly updated to use workflowdir.
204-206
: Updated paths in sample_reports ruleThe faidx and QMD paths are correctly updated to use workflow_geno and workflowdir in the sample_reports rule.
249-249
: Updated summary file locationThe summary file path is correctly updated to use workflowdir in the workflow_summary rule.
resources/changelog.md (9)
8-8
: Looks good for the new template creation note.
No issues found with this addition referencingharpy template
.
17-21
: Changelog duplication flagged by static analysis can be ignored.
LanguageTool warns about “is now” repeated, but it’s a standard way to highlight renaming in release notes.🧰 Tools
🪛 LanguageTool
[grammar] ~18-~18: This phrase is duplicated. You should probably use “now is” only once.
Context: ... Renamed Commands -harpy popgroup
is nowharpy template groupings
-harpy imputeparams
is nowharpy template impute
-harpy hpc
is nowharpy template hpc-*
- where `*...(PHRASE_REPETITION)
23-23
: Documentation is consistent with HPC changes.
No further suggestions—this bullet clearly explains the new output path.
36-36
: Removal note is clear.
The line about dropping direct HTCondor support aligns with the PR changes.
50-51
: Clear explanation of the new genome folder structure.
The updated bullet accurately warns about the trade-off of increased disk usage.
54-54
: Error-handling improvement is well-noted.
No further changes needed.
61-61
: Directly callingpygmentize
is fine.
No duplication or style issues found.🧰 Tools
🪛 LanguageTool
[duplication] ~61-~61: Possible typo: you repeated a word.
Context: ... more [direct] reliance onsubprocess
calls - callspygmentize
directly through python AP...(ENGLISH_WORD_REPEAT_RULE)
63-63
: References topopgroup
andstitchparams
are properly updated.
No concerns; matches the rename patterns in the code.
65-68
: Internal merges are documented well.
All items align with the broader refactoring efforts.harpy/__main__.py (7)
14-14
: Import statement forsimulate
.
No issues; it accurately references the new simulation module.
18-18
: Import statement fortemplate
.
This supports the new HPC template commands.
53-53
:simulate.simulate
command registration is consistent.
No refactoring concerns; the repeated name is clear in context.
60-60
:template.template
command registration is consistent.
Same pattern assimulate.simulate
, no issues to address.
71-72
: “Other Commands” group settings.
Correctly includestemplate
. No concerns.
80-80
: Dictionary union for docstring references.
Innovative usage of Python’s dictionary merge operator. No problems found.
83-83
: Looping through modules for docstring merges.
Correctly addssimulate
and other modules. No issues to report.harpy/hpc.py (6)
20-21
: Helpful notice for missing plugins.
The instructions about installing the Snakemake plugin are clear and accurate.
31-37
:hpc_generic
docstring is straightforward.
No improvements needed; the usage instructions align with the new HPC approach.
72-78
: LSF template configuration guidance looks good.
All references to the new plugin remain consistent. No suggestions.
108-113
: SLURM template is fully aligned with the HPC changes.
No further edits needed.
142-147
: Google Batch template approach is correct.
Clear instructions for installingsnakemake-executor-plugin-googlebatch
.
204-204
:googlebatch_snippets
option is well-defined.
No issues with adding this advanced configuration field.harpy/template.py (5)
10-10
: Good import organization.The import statement properly imports HPC-related functions from
.hpc
module, maintaining a clean module organization.
12-25
: Well-structured docstring dictionary for the CLI.The docstring dictionary provides clear organization of commands with appropriate categorization into "Input Files" and "HPC Configurations". This enhances usability and follows good CLI design practices.
27-36
: Clean implementation of the template command group.The template command group is well-implemented with comprehensive documentation that clearly explains the purpose of each subcommand.
87-111
: Good implementation of the impute command.The impute command properly:
- Includes clear documentation
- Validates existing files to prevent accidental overwrites
- Creates a structured template with example parameters
- Provides informative user feedback
The parameter file generation follows best practices with appropriate error handling.
113-118
: Proper command registration.All commands are correctly added to the template group, maintaining a consistent CLI structure.
harpy/snakefiles/sv_naibr_pop_phase.smk (8)
19-19
: Good standardization of workflow directory.The addition of
workflowdir
centralizes path construction and aligns with the PR objective of reorganizing output under outdir/workflow/genome.
37-39
: Improved genome path handling.The code now properly handles both gzipped and non-gzipped genome files, constructing appropriate paths under the workflow directory structure.
95-95
: Output path standardization in process_genome rule.Updated to use the
workflow_geno
variable, eliminating hardcoded paths and improving consistency.
103-105
: Consistent path updates in index_genome rule.Both input and output paths now use the
workflow_geno
variable, maintaining consistency with the new directory structure.
126-130
: Standardized reference paths in phase_alignments rule.The genome reference paths now use the
workflow_geno
variable, ensuring consistency throughout the workflow.
259-260
: Consistent report configuration paths.Updated to use
workflowdir
for report configuration files, maintaining the standardized directory structure.
273-275
: Standardized paths in sample_reports rule.Both the genome index and report template paths now use the appropriate variables, enhancing consistency.
317-317
: Updated summary file path.The summary file is now written to the workflow directory, consistent with the new structure.
harpy/snakefiles/sv_naibr.smk (6)
18-18
: Good standardization of workflow directory.The addition of
workflowdir
centralizes path construction and aligns with the PR objective of reorganizing output.
31-32
: Well-designed genome path variables.The creation of
workflow_geno
andworkflow_geno_idx
variables properly handles both the genome file and its index, ensuring consistent path references throughout the workflow.
181-184
: Consistent file paths in index_genome rule.Both input and output paths now use the
workflow_geno
variable, maintaining consistency with the new directory structure.
202-203
: Standardized report configuration paths.The report configuration paths now use
workflowdir
, maintaining consistency with the new structure.
216-218
: Updated paths in sample_reports rule.The paths for genome index and report template now use appropriate variables, enhancing consistency.
256-256
: Consistent summary file path.The summary file is now written to the workflow directory, aligning with the new structure.
harpy/snakefiles/sv_naibr_phase.smk (10)
18-18
: Good standardization of workflow directory.The addition of
workflowdir
centralizes path construction and aligns with the PR objective of reorganizing output.
34-36
: Improved genome path handling.The code now properly handles both gzipped and non-gzipped genome files, constructing appropriate paths under the workflow directory structure.
74-74
: Output path standardization in process_genome rule.Updated to use the
workflow_geno
variable, eliminating hardcoded paths and improving consistency.
82-84
: Consistent path updates in index_genome rule.Both input and output paths now use the
workflow_geno
variable, maintaining consistency with the new directory structure.
126-130
: Standardized reference paths in phase_alignments rule.The genome reference paths now use the
workflow_geno
variable, ensuring consistency throughout the workflow.
163-163
: Updated naibr configuration path.The configuration file path now uses
workflowdir
, maintaining consistency with the new structure.
190-190
: Consistent configuration path in call_variants rule.The configuration path now uses
workflowdir
, maintaining the standardized directory structure.
259-260
: Standardized report configuration paths.Updated to use
workflowdir
for report configuration files, maintaining the standardized directory structure.
273-275
: Consistent paths in sample_reports rule.Both the genome index and report template paths now use the appropriate variables, enhancing consistency.
317-317
: Updated summary file path.The summary file is now written to the workflow directory, consistent with the new structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
.github/workflows/tests.yml (1)
841-846
:⚠️ Potential issueUpdated HPC template commands
The ordering of HPC template commands has been adjusted, and
hpc-htcondor
is still included even though HTCondor support has been removed from the rest of the codebase.This command will fail since
hpc_htcondor
function has been removed from the imported modules. Remove line 845 to maintain consistency with the changes in the rest of the codebase:- harpy template hpc-htcondor
🧹 Nitpick comments (1)
harpy/hpc.py (1)
108-108
: Simplified SLURM configuration docstringThe docstring for the
hpc_slurm
function has been simplified. While this maintains functionality, consider providing more detailed documentation to match the thoroughness of other function docstrings in this file.- """Configuration for SLURM""" + """ + Create a template config for SLURM + + This command creates a configuration (`hpc/slurm.yaml`) for the SLURM HPC scheduler. + You will also need to install `snakemake-executor-plugin-slurm` for the HPC job submission to work. + """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
.github/workflows/tests.yml
(2 hunks)harpy/hpc.py
(1 hunks)harpy/simulate_linkedreads.py
(5 hunks)harpy/snakefiles/align_bwa.smk
(12 hunks)harpy/snakefiles/align_ema.smk
(13 hunks)harpy/snakefiles/align_strobealign.smk
(10 hunks)harpy/snakefiles/phase.smk
(11 hunks)harpy/snakefiles/simulate_linkedreads.smk
(2 hunks)harpy/snakefiles/snp_freebayes.smk
(10 hunks)harpy/snakefiles/snp_mpileup.smk
(12 hunks)harpy/snakefiles/sv_leviathan.smk
(7 hunks)harpy/snakefiles/sv_leviathan_pop.smk
(10 hunks)harpy/snakefiles/sv_naibr.smk
(7 hunks)harpy/snakefiles/sv_naibr_phase.smk
(9 hunks)harpy/snakefiles/sv_naibr_pop.smk
(10 hunks)harpy/snakefiles/sv_naibr_pop_phase.smk
(8 hunks)harpy/template.py
(2 hunks)resources/changelog.md
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: impute
- GitHub Check: simulate linkedreads
- GitHub Check: assembly
🔇 Additional comments (45)
harpy/template.py (2)
10-10
: Import modification correctly removes HTCondor supportThe import statement has been updated to remove the
hpc_htcondor
module, which aligns with the PR objective of removing HTCondor support.
21-21
: Updated HPC Configurations documentationThe documentation for available HPC configuration commands has been properly updated to exclude the HTCondor option, maintaining consistency with the code changes.
harpy/snakefiles/simulate_linkedreads.smk (3)
16-16
: New configuration variable for merging haplotypesThe new
merge_haplotypes
variable correctly pulls from the configuration dictionary, enabling the new merge haplotypes functionality.
177-187
: New rule for combining haplotypesExcellent implementation of the new
combine_haplotypes
rule that concatenates the two haplotype outputs into a single file. The rule structure is clear with well-defined inputs and outputs.
192-192
: Conditional inclusion of merged haplotypes in workflow summaryThe workflow summary correctly includes the combined haplotype files conditionally based on the
merge_haplotypes
flag. This ensures that the workflow only attempts to collect files that actually exist..github/workflows/tests.yml (4)
639-641
: Removed cleanup commands from naibr workflowThe cleanup command (
rm -r Genome
) has been removed, which aligns with the PR objective of changing how genomes are processed. Genomes are now processed on a per-execution basis with output organized under outdir/workflow/genome.
643-645
: Removed cleanup commands from naibr pop workflowSimilar to previous changes, removing the cleanup command ensures consistent processing of genomes across the workflow.
647-650
: Removed cleanup commands from naibr phasing workflowThe cleanup command has been consistently removed across all naibr-related steps, maintaining the new genome processing approach.
652-654
: Removed cleanup commands from naibr pop with phasing workflowConsistent implementation of the change in genome handling approach across all naibr variants.
harpy/snakefiles/sv_leviathan_pop.smk (4)
18-18
: Good addition of workflowdir variable to standardize paths.This change introduces a centralized workflow directory path definition, which will improve maintainability by reducing hardcoded path strings throughout the file.
35-37
: Improved workflow_geno path handling.The updates to the workflow_geno variable logic ensure consistent path handling regardless of whether the genome file is gzipped, providing better organization and isolation between workflow runs.
127-127
: Consistent output path for processed genome.Using workflow_geno for the output path of the process_genome rule aligns with the goal of standardizing file paths and eliminates the need for a shared Genome folder.
333-333
: Updated summary file location.This change maintains consistency with the new directory structure organization and preserves the summary information in the appropriate workflow directory.
harpy/snakefiles/sv_naibr_pop.smk (3)
19-19
: Good addition of workflowdir variable to standardize paths.This change introduces a centralized workflow directory path that will improve maintainability by reducing hardcoded path strings throughout the file.
33-35
: Improved workflow_geno path handling.The updates to the workflow_geno variable logic ensure consistent path handling regardless of whether the genome file is gzipped, providing better organization and isolation between workflow runs.
316-316
: Updated summary file location.This change maintains consistency with the new directory structure organization and preserves the summary information in the appropriate workflow directory.
harpy/snakefiles/align_strobealign.smk (2)
25-25
: Good workflow_geno path definition.The change to define workflow_geno using the new workflowdir variable structure helps maintain consistency across workflows. This aligns with the PR's goal of eliminating the shared Genome folder.
356-356
: Updated summary file location.This change maintains consistency with the new directory structure organization and preserves the summary information in the appropriate workflow directory.
resources/changelog.md (3)
13-14
: Clear documentation of new features.The changelog clearly documents the new features: cluster submission for downsampling and the merge-haplotypes functionality for simulating linked reads. These additions align well with the PR objective of enhancing simulation of linked reads.
36-36
: Good documentation of HTCondor support removal.The changelog clearly notes the breaking change regarding HTCondor support removal, which aligns with the PR objective mentioned in issue #220 about removing HTCondor support due to the missing Snakemake plugin.
50-51
: Clear explanation of the Genome folder restructuring.The changelog provides a clear explanation of the change from a unified
Genome/
folder to a per-output-folder structure, addressing the issue #219 mentioned in the PR objectives. It also provides a good explanation of the trade-off (more disk space usage vs. more reliable behavior).harpy/snakefiles/sv_naibr_pop_phase.smk (3)
19-19
: Central workflow directory standardization looks good.Adding a standardized
workflowdir
variable improves path organization and creates a consistent pattern for file management across the codebase.
37-39
: Clean implementation of workflow genome path handling.The conditional path construction properly accounts for gzipped files by removing the extension in the output path.
95-95
: Path standardization looks consistent.All paths have been properly updated to use the new
workflowdir
andworkflow_geno
variables, maintaining consistency throughout the file.Also applies to: 103-103, 105-105, 109-109, 147-147, 150-150, 319-320, 333-333, 335-335, 358-358, 406-406
harpy/snakefiles/snp_mpileup.smk (2)
17-17
: Good implementation of workflow directory variables.The introduction of
workflowdir
,workflow_geno
, andworkflow_geno_idx
variables creates a consistent pattern for file management. The conditional logic for handling gzipped files is properly implemented.Also applies to: 28-28, 30-30
50-50
: Path standardization is consistent.All file paths have been properly updated to use the new workflow directory variables, ensuring consistency across the file.
Also applies to: 65-68, 70-70, 88-88, 119-120, 134-134, 235-236, 299-299
harpy/snakefiles/phase.smk (2)
17-18
: Good directory structure standardization.The introduction of
workflowdir
variable is consistent with changes in other files, supporting a unified approach to path management.
46-47
: Path updates are consistent throughout the file.All relevant paths have been properly updated to use the new
workflowdir
variable, maintaining consistency with other files in the codebase.Also applies to: 74-74, 129-129, 149-149, 164-164, 192-192, 200-200, 256-257, 271-271, 314-314
harpy/snakefiles/align_bwa.smk (2)
17-17
: Clean implementation of workflow directory variables.The introduction of
workflowdir
,workflow_geno
, andworkflow_geno_idx
follows the same pattern as other files, maintaining consistency across the codebase.Also applies to: 24-24, 26-28
45-45
: Path standardization is comprehensive.All file paths have been properly updated to use the new workflow directory variables. The changes are consistent with the project's new file organization strategy.
Also applies to: 60-64, 81-81, 99-102, 112-113, 137-138, 199-199, 223-224, 240-240, 312-312, 357-357
harpy/snakefiles/snp_freebayes.smk (3)
17-17
: Good addition of a centralized workflow directory path variable.This addition of the
workflowdir
variable helps standardize paths across the workflow, making the code more maintainable.
33-33
: Good refactoring for genome file path management.Creating a consistent reference for the genome file path using the
workflow_geno
variable improves maintainability by centralizing path construction.
53-53
: Consistent path standardization across rules.The systematic replacement of hardcoded paths with the new path variables (
workflowdir
andworkflow_geno
) throughout all rules creates a more maintainable codebase.Also applies to: 62-62, 70-72, 74-74, 106-107, 167-167, 184-185, 199-200, 214-214, 260-260
harpy/snakefiles/align_ema.smk (2)
17-17
: Good standardization of path variables.The introduction of
workflowdir
and related variables creates a consistent approach to path management throughout the workflow.Also applies to: 27-29
49-49
: Comprehensive path refactoring across rules.The systematic replacement of hardcoded paths with the standardized variables improves maintainability and reduces the potential for path-related errors.
Also applies to: 64-69, 85-89, 97-97, 158-160, 192-194, 216-217, 261-261, 303-303, 315-316, 332-332, 389-389, 450-450
harpy/snakefiles/sv_naibr_phase.smk (2)
18-18
: Good introduction of workflow directory variable and genome path.Replacing the
validgenome
approach with the standardizedworkflow_geno
variable improves consistency with other workflow files.Also applies to: 34-36
74-74
: Consistent path standardization across rules.The systematic update of paths across all rules ensures consistency with the path management approach used in other workflow files.
Also applies to: 82-86, 126-126, 129-129, 163-163, 190-190, 259-260, 273-275, 317-317
harpy/simulate_linkedreads.py (3)
24-24
: Good implementation of the new merge-haplotypes feature.The addition of the
--merge-haplotypes
flag follows the existing CLI pattern and properly updates both the workflow options list and function signature.Also applies to: 40-40, 49-49
58-59
: Clear documentation of the new feature.The docstring update clearly explains the purpose and effect of the new
--merge-haplotypes
option, helping users understand when to use it.
95-95
: Proper configuration integration for the new feature.Adding the
merge_haplotypes
parameter to the configuration dictionary ensures the option is properly passed to the Snakemake workflow.harpy/snakefiles/sv_leviathan.smk (3)
18-18
: Improved directory structure with workflowdir variableAdding a dedicated variable for workflow directory structure creates better organization and makes the file path management more maintainable.
35-38
: Good refactoring of genome file path handlingThe conditional logic for handling gzipped genomes is well-structured, creating a consistent path based on the workflowdir variable and properly handling file extensions.
67-67
: Consistent path updates throughout the workflowAll path references have been correctly updated to use the new workflowdir and workflow_geno variables, ensuring consistency throughout the workflow.
Also applies to: 75-76, 77-77, 79-79, 87-88, 89-89, 91-91, 101-102, 190-191, 204-204, 206-206, 249-249
harpy/snakefiles/sv_naibr.smk (2)
18-18
: Good standardization of workflow directory structureImplementing workflowdir and deriving workflow_geno paths creates consistency with other workflow files in the project.
Also applies to: 31-32
181-186
: Consistent path updates throughout the workflowAll path references have been correctly updated to use the new workflowdir and workflow_geno variables.
Also applies to: 202-203, 216-216, 218-218, 256-256
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
harpy/snakefiles/align_strobealign.smk (2)
18-18
: Consider using workflowdir in envdir definition for consistencyFor improved consistency with the new path standardization approach, consider refactoring the envdir definition to use the workflowdir variable.
-envdir = os.path.join(os.getcwd(), outdir, "workflow", "envs") +envdir = os.path.join(os.getcwd(), workflowdir, "envs")
336-350
: Consider updating genomefile references in summary textThe workflow summary still references the original
genomefile
rather than the newworkflow_geno
path. If the intention is to report the input parameters as provided by the user, this is correct. If the intention is to report the actual file used in the workflow, consider updating these references.- summary.append(f"The provided genome: {genomefile}") + summary.append(f"The provided genome: {genomefile} (processed as {workflow_geno})") - duplicates += f"\tsamtools sort -T SAMPLE --reference {genomefile} -m 2000M |\n" + duplicates += f"\tsamtools sort -T SAMPLE --reference {workflow_geno} -m 2000M |\n"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/tests.yml
(2 hunks)harpy/snakefiles/align_strobealign.smk
(10 hunks)harpy/snakefiles/sv_naibr.smk
(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/tests.yml
- harpy/snakefiles/sv_naibr.smk
⏰ Context from checks skipped due to timeout of 90000ms (16)
- GitHub Check: align EMA
- GitHub Check: align BWA
- GitHub Check: deconvolve
- GitHub Check: sv naibr
- GitHub Check: simulate variants
- GitHub Check: impute
- GitHub Check: sv leviathan
- GitHub Check: simulate linkedreads
- GitHub Check: assembly
- GitHub Check: align strobe
- GitHub Check: phase
- GitHub Check: freebayes
- GitHub Check: mpileup
- GitHub Check: demux gen1
- GitHub Check: Rebuild Container
- GitHub Check: qc
🔇 Additional comments (4)
harpy/snakefiles/align_strobealign.smk (4)
17-25
: LGTM: Standardized file path management with workflow variablesThe addition of
workflowdir
andworkflow_geno
variables provides a consistent approach to file path management, addressing issue #219 and fixing the previously reported circular reference issue.
356-356
: LGTM: Workflow summary output path updatedThe path for writing the workflow summary now correctly uses the workflowdir variable, maintaining consistency with other path changes.
47-191
: LGTM: Consistent use of workflow_geno throughout rulesThe changes to use the
workflow_geno
variable across multiple rules (process_genome, index_genome, strobe_index, align, mark_duplicates, molecule_coverage) ensure consistent file path handling and implement the per-execution genome processing approach outlined in the PR objectives.
214-303
: LGTM: Standardized report paths using workflowdirThe reports configuration now consistently uses the
workflowdir
variable for all template paths, improving maintainability and organization.
--merge-haplotypes
routine for linked read sims (convenience features)Summary by CodeRabbit
New Features
Chores