merge-haplotypes for simulating linkedreads #221

pdimens · 2025-03-14T19:19:33Z

adds new --merge-haplotypes routine for linked read sims (convenience features)
drops HTCondor support b/c snakemake plugin is gone fixes htcondor executor plugin no longer exists #220
genomes are now processing per-execution, so there's no longer a Genome/ folder and instead outdir/workflow/genome fixes Make genome processing cache-able #219

Summary by CodeRabbit

New Features
- Introduced a new command-line option to merge haplotypes during read simulation.
- Enhanced the downsample command with improved cluster submission capabilities.
- Improved output file organization with a dedicated per-run structure to avoid naming conflicts.
Chores
- Removed legacy support for an alternative HPC scheduler, now standardizing on SLURM for job submissions.

imputeparams -> template impute hpc xxx -> template hpc-xxx unified simulate clarified docstrings

coderabbitai · 2025-03-14T19:19:41Z

📝 Walkthrough

Walkthrough

The changes update multiple components of the project by removing deprecated cleanup and HPC commands and standardizing file path management. In the GitHub workflow, cleanup commands are removed, and the HPC scheduler command order is altered. In the Python modules, support for HTCondor is removed in favor of SLURM, with corresponding simplifications in documentation. The simulation command now accepts a new --merge-haplotypes flag. Additionally, a broad refactor in the Snakemake workflows introduces a unified workflowdir variable to streamline and centralize the file and report output paths across many rules.

Changes

File(s)	Change Summary
`.github/workflows/tests.yml`	Removed cleanup commands (`rm -r Genome`) in naibr-related steps; updated HPC template commands by replacing `hpc-generic` with `hpc-slurm` and swapping the order of HPC job submissions.
`harpy/hpc.py`, `harpy/template.py`	Removed the `hpc_htcondor` function/command and associated code; simplified the `hpc_slurm` function’s docstring; updated command registration by eliminating HTCondor support.
`harpy/simulate_linkedreads.py`	Added a new CLI option `--merge-haplotypes` with a default of False; updated the function signature and internal configuration to enable the merging of haplotype outputs into single read files.
`harpy/snakefiles/*.smk` (multiple snakefile updates)	Introduced a new `workflowdir` variable (and related variables such as `workflow_geno` and `workflow_geno_idx`) to centralize file path construction for genome, index, and report outputs; updated numerous rules accordingly.
`resources/changelog.md`	Updated changelog entries to document the new `--merge-haplotypes` option, changes to HPC configuration (removal of direct HTCondor support), and alterations from a unified `Genome/` folder to a per-output-folder workflow structure.

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant CLI as Harpy CLI
    participant SW as Snakemake Workflow
    U->>CLI: Execute `harpy simulate linkedreads [--merge-haplotypes]`
    CLI->>SW: Initiate simulation (flag: merge_haplotypes)
    alt merge_haplotypes true
        SW->>SW: Run rule `combine_haplotypes` to merge haplotype files
        SW->>CLI: Output merged read file
    else merge_haplotypes false
        SW->>CLI: Output separate haplotype files for each haplotype
    end

Possibly related PRs

add assembly #146: The changes in the main PR regarding the .github/workflows/tests.yml file, which involve restructuring the workflow for the assembly job, are related to the modifications in the retrieved PR that also focus on the assembly functionality, including renaming and updating job steps in the workflow. Both PRs address the assembly process, indicating a direct connection in their code changes.

Suggested labels

bug, enhancement

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)

We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
- To enable this feature, set early_access to true under in the settings.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

resources/changelog.md (1)
13-14: Fix grammatical mistake in the bullet for --merge-haplotypes.
Currently states “as a convenience features.” Recommend singular:
- - `harpy simulate linkedreads` adds `--merge-haplotypes` as a convenience features to merge R1 reads for hap0 and hap1 (same for R2)
+ - `harpy simulate linkedreads` adds `--merge-haplotypes` as a convenience feature to merge R1 reads for hap0 and hap1 (same for R2)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3558706 and a520368.

📒 Files selected for processing (28)

.github/filters.yml (3 hunks)
.github/workflows/tests.yml (2 hunks)
harpy/__main__.py (3 hunks)
harpy/_validations.py (2 hunks)
harpy/align.py (1 hunks)
harpy/hpc.py (6 hunks)
harpy/impute.py (1 hunks)
harpy/imputeparams.py (0 hunks)
harpy/simulate.py (1 hunks)
harpy/simulate_linkedreads.py (5 hunks)
harpy/simulate_variants.py (4 hunks)
harpy/snakefiles/align_bwa.smk (12 hunks)
harpy/snakefiles/align_ema.smk (13 hunks)
harpy/snakefiles/align_strobealign.smk (10 hunks)
harpy/snakefiles/phase.smk (11 hunks)
harpy/snakefiles/simulate_linkedreads.smk (2 hunks)
harpy/snakefiles/snp_freebayes.smk (10 hunks)
harpy/snakefiles/snp_mpileup.smk (12 hunks)
harpy/snakefiles/sv_leviathan.smk (7 hunks)
harpy/snakefiles/sv_leviathan_pop.smk (10 hunks)
harpy/snakefiles/sv_naibr.smk (7 hunks)
harpy/snakefiles/sv_naibr_phase.smk (9 hunks)
harpy/snakefiles/sv_naibr_pop.smk (10 hunks)
harpy/snakefiles/sv_naibr_pop_phase.smk (8 hunks)
harpy/snp.py (4 hunks)
harpy/sv.py (4 hunks)
harpy/template.py (2 hunks)
resources/changelog.md (3 hunks)

💤 Files with no reviewable changes (1)

harpy/imputeparams.py

🧰 Additional context used

🪛 LanguageTool

resources/changelog.md

[grammar] ~18-~18: This phrase is duplicated. You should probably use “now is” only once.
Context: ... Renamed Commands - harpy popgroup is now harpy template groupings - harpy imputeparams is now harpy template impute - harpy hpc is now harpy template hpc-* - where `*...

(PHRASE_REPETITION)

[duplication] ~61-~61: Possible typo: you repeated a word.
Context: ... more [direct] reliance on subprocess calls - calls pygmentize directly through python AP...

(ENGLISH_WORD_REPEAT_RULE)

🪛 Ruff (0.8.2)

harpy/simulate_linkedreads.py

95-95: Undefined name combine_haplotypes

(F821)

🔇 Additional comments (140)

harpy/align.py (2)

23-27: Documentation improvement with tabular format 👍

The change to a tabular format for aligner documentation enhances readability and makes it easier to compare the different aligners' capabilities at a glance.

33-41: Well-structured module documentation

Adding the module_docstring dictionary provides a consistent and structured way to document the available commands in the align module. This enhancement aligns with similar documentation improvements in other files.

harpy/snp.py (4)

28-36: Good addition of structured documentation

The module_docstring dictionary provides a clean, consistent way to document the available commands in the SNP module, following the same pattern used in other modules.

63-63: Appropriate integration of module documentation

Using the dictionary merging operator to combine the existing docstring with the new module_docstring maintains code clarity while adding the new structured documentation.

94-94: Command reference updated for consistency

Replacing harpy popgroup with harpy template aligns with the PR's restructuring of commands and ensures documentation consistency.

198-198: Command reference updated for consistency

Replacing harpy popgroup with harpy template in this docstring maintains consistency with the changes elsewhere in the codebase.

harpy/sv.py (5)

24-27: Enhanced documentation with feature comparison table

The tabular format clearly illustrates the capabilities of each SV caller, making it easier for users to choose the appropriate tool for their needs.

33-41: Well-structured module documentation

Adding the module_docstring dictionary provides a consistent documentation approach across modules, improving overall code organization.

68-68: Appropriate integration of module documentation

Using the dictionary merging operator to combine the existing docstring with the new module_docstring maintains code clarity.

97-97: Command reference updated for consistency

Replacing harpy popgroup with harpy template aligns with the PR's command restructuring.

203-203: Command reference updated for consistency

Replacing harpy popgroup with harpy template maintains documentation consistency.

harpy/impute.py (1)

54-54: Command reference updated for consistency

Replacing harpy stitchparams with harpy template reflects the consolidation of related functionalities under the template command group as described in the PR objectives.

harpy/_validations.py (2)

67-67: Command reference updated

The error message has been updated to use the new command name harpy template instead of the old harpy stitchparams. This aligns with the larger CLI restructuring in the PR.

97-97: Command reference updated

Updated the error message to use the new command name harpy template instead of harpy stitchparams, consistent with the command structure changes.

harpy/simulate_variants.py (4)

149-149: Improved documentation clarity

The help text for the --only-vcf option has been updated to be more precise about what the option does. It now clearly states that it will "only generate the second VCF file and not simulate a second FASTA file" rather than the previous wording about generating only VCF files without creating the diploid genome.

273-273: Consistent help text update

The help text for the --only-vcf option has been updated in the inversion function to match the same clarification made in the snpindel function.

378-378: Consistent help text update

The help text for the --only-vcf option has been updated in the cnv function to match the same clarification made in the other functions.

487-487: Consistent help text update

The help text for the --only-vcf option has been updated in the translocation function to match the same clarification made in the other functions.

.github/filters.yml (3)

142-142: Added new module file to simvars section

Added harpy/simulate.py to the simvars section. This file likely consolidates simulation commands for genomic variants, replacing or complementing the existing simulation functionality.

151-151: Added new module file to simreads section

Added harpy/simulate.py to the simreads section, indicating that this file contains functionality related to simulating reads as well as variants.

168-168: Updated module files in other section

Added harpy/template.py to the other section, likely replacing the removed files harpy/imputeparams.py and harpy/popgroup.py as part of the CLI restructuring mentioned in the PR objectives.

harpy/snakefiles/simulate_linkedreads.smk (3)

16-16: Added new configuration parameter

Added a new configuration parameter combine_haplotypes to support the new functionality mentioned in the PR objectives that enhances simulation of linked reads.

177-187: Added new rule to merge haplotype outputs

Implemented a new rule combine_haplotypes that concatenates the haplotype files from both simulated haplotypes into single output files. This adds the capability to merge haplotypes for simulating linked reads, which was a key objective of this PR.

191-192: Updated workflow summary to include merged haplotypes

Modified the workflow_summary rule to conditionally include the outputs from the combine_haplotypes rule when that feature is enabled. This properly integrates the new functionality into the workflow's summary mechanism.

harpy/snakefiles/sv_leviathan_pop.smk (3)

18-18: Introduction of centralized workflow directory path

The addition of workflowdir as a centralized variable for workflow-related paths is a good architectural improvement that will help with maintainability.

35-37: Improved genome file path organization

The genome path is now properly organized under the workflow directory structure instead of the previous "Genome/" folder, which aligns with the PR objective to update genome processing to occur on a per-execution basis.

333-333: Consistent path structure for workflow summary

The workflow summary path has been updated to use the new directory structure, maintaining consistency with other path changes throughout the file.

harpy/snakefiles/sv_naibr_pop.smk (3)

19-19: Standardized workflow directory definition

The workflowdir variable definition is consistent with the changes in other Snakemake files, ensuring uniformity across the codebase.

33-35: Properly handled genome path construction

The conditional logic for handling gzipped genomes is correctly maintained while updating the path structure to use the workflow directory.

316-316: Updated workflow summary path

This change ensures the workflow summary is stored in a consistent location with the other Snakemake files.

harpy/snakefiles/phase.smk (3)

17-18: Consistent workflow directory implementation

The workflowdir variable is properly introduced, maintaining the same pattern seen in other Snakemake files in this PR.

46-47: Updated genome path variables

The genome path variables have been properly updated to use the new workflow directory structure, consistent with the changes in other files.

314-314: Consistent update to summary file location

The workflow summary file is now properly placed in the workflow directory, which aligns with the PR's goal of restructuring the output organization.

harpy/snakefiles/snp_freebayes.smk (3)

17-17: Standardized workflow directory definition

The introduction of the workflowdir variable follows the same pattern as other Snakemake files in this PR, ensuring consistency.

33-33: Consolidated genome path definition

The workflow_geno variable now properly uses the new directory structure, eliminating the hardcoded "Genome/" path used previously.

260-260: Consistent workflow summary path update

The location of the workflow summary file has been updated to use the new directory structure, maintaining consistency with other Snakemake files.

harpy/simulate.py (3)

1-40: Well-structured command group for simulation

The new simulate command group effectively unifies the simulation commands for genomic variants and linked-reads under a single interface. The docstring provides clear instructions on the purpose and usage of each subcommand.

19-32: Good organization of commands by category

The docstring dictionary effectively categorizes commands into "Linked Read Sequences" and "Genomic Variants" with appropriate visual styling, making the CLI more user-friendly.

36-40: Clean command registration approach

The code properly registers all simulation subcommands from their respective modules using add_command(), maintaining modularity while providing a unified interface.

harpy/simulate_linkedreads.py (4)

24-24: Appropriate placement of new option in UI documentation

The new --merge-haplotypes option is correctly added to the "Workflow Options" section in the docstring dictionary.

40-40: Well-defined flag for the new merge-haplotypes feature

The CLI option is correctly defined with appropriate default value (False) and helpful description.

49-49: Function signature correctly updated

The linkedreads function signature has been properly updated to include the new merge_haplotypes parameter.

58-59: Clear documentation for the new feature

The docstring clearly explains the effect of using the --merge-haplotypes option, which is helpful for users.

harpy/snakefiles/align_strobealign.smk (3)

25-25: Good approach for standardizing genome file paths

The workflow_geno variable effectively centralizes the path construction for genome files using the new workflow directory structure.

47-51: Updated output path for process_genome rule

The output path now correctly uses the workflow_geno variable, which aligns with the PR objective of updating genome processing to occur on a per-execution basis.

356-357: Path updated for workflow summary file

The path for the workflow summary file now uses the workflowdir variable, which is consistent with the other path changes in this file.

harpy/snakefiles/snp_mpileup.smk (4)

17-17: Proper definition of workflowdir

The workflowdir variable is correctly defined as f"{outdir}/workflow", which aligns with the PR objective of organizing output under outdir/workflow/genome.

28-30: Well-structured genome file path variables

The creation of workflow_geno and workflow_geno_idx variables provides a clean way to handle both the genome file and its index, with proper handling of compressed files.

50-51: Updated output path for process_genome rule

The output path now correctly uses the workflow_geno variable, which aligns with the PR objective of updating genome processing.

299-300: Path updated for workflow summary file

The path for the workflow summary file now uses the workflowdir variable, which is consistent with the other path changes in this file.

harpy/snakefiles/align_ema.smk (14)

17-17: Good addition of workflowdir variable for better organization

Adding this variable is a good practice for improving the organization of output files by having a consistent workflow directory structure.

25-29: Good refactoring of genome file paths

The modification of genome file paths to use the new workflowdir variable creates a more organized directory structure, matching the PR objectives of updating genome processing to occur on a per-execution basis.

49-49: Path updated correctly to use workflow_geno

This path update is consistent with the structural changes in the codebase, moving away from hardcoded paths to the new workflow directory structure.

64-68: Consistent path usage in index_genome rule

Good job updating both input and output paths consistently to use the new workflow_geno variable. This ensures that all references to the genome are properly aligned with the new structure.

84-89: Consistent path usage in bwa_index rule

File paths have been correctly updated to use the workflow_geno variable consistently for both input and outputs.

97-97: Consistent usage in make_depth_intervals rule input

The path for the genome faidx file has been correctly updated to use the workflow_geno variable.

158-160: Properly updated genome references in align_ema rule

All genome references in the align_ema rule have been updated to use the workflow_geno and related variables, ensuring consistency with the new structure.

192-194: Alignment with new directory structure in align_bwa rule

The genome paths in align_bwa rule have been correctly updated to use the workflow_geno variable and its related indices.

216-217: Consistent path usage in mark_duplicates rule

The genome references in mark_duplicates rule have been properly updated to use workflow_geno.

303-303: Proper path reference in molecule_coverage rule

The faidx path has been correctly updated to use the workflow_geno variable.

315-316: Configuration file paths updated in report_config rule

The report configuration file paths have been correctly updated to reference the workflowdir.

332-332: Updated QMD path in sample_reports rule

The QMD template path has been correctly updated to use workflowdir.

389-389: Updated QMD path in barcode_report rule

The QMD template path has been correctly updated to use workflowdir.

450-451: Updated summary file location in workflow_summary

The summary file path has been correctly updated to use workflowdir, consistent with the overall restructuring approach.

harpy/snakefiles/align_bwa.smk (13)

17-17: Good addition of workflowdir variable for consistent directory structure

Adding this variable establishes a clear and consistent directory structure for workflow outputs, aligning with the PR's objective to improve file organization.

24-28: Good refactoring of genome file variables

The modification of genome file paths to use the new workflowdir variable creates a more organized directory structure and encapsulates the path construction logic effectively.

45-45: Path updated correctly in process_genome rule

Output path is correctly updated to use workflow_geno in the process_genome rule.

60-65: Consistent path usage in samtools_faidx rule

Both input and output paths are correctly updated to use workflow_geno in the samtools_faidx rule, maintaining consistency with the new directory structure.

81-81: Updated path in make_depth_intervals rule

The input path is correctly updated to use workflow_geno in the make_depth_intervals rule.

99-101: Consistent path usage in bwa_index rule

Both input and output paths are correctly updated to use workflow_geno in the bwa_index rule.

112-113: Updated paths in align rule

The genome paths are correctly updated to use workflow_geno in the align rule.

137-138: Consistent path usage in mark_duplicates rule

The genome paths are correctly updated to use workflow_geno in the mark_duplicates rule.

199-199: Updated path in molecule_coverage rule

The faidx path is correctly updated to use workflow_geno in the molecule_coverage rule.

223-224: Updated configuration file paths

The paths for report configuration files are correctly updated to use workflowdir.

240-240: Updated QMD path in sample_reports rule

The QMD template path is correctly updated to use workflowdir.

312-312: Updated QMD path in barcode_report rule

The QMD template path is correctly updated to use workflowdir.

357-357: Updated summary file location

The summary file path is correctly updated to use workflowdir in the workflow_summary rule.

.github/workflows/tests.yml (3)

827-829: Command renamed from imputeparams to template impute

This change is part of the command-line interface restructuring mentioned in the PR. The command has been renamed from harpy imputeparams to harpy template impute, making it more consistent with a template-based approach.

830-832: Command renamed from popgroup to template groupings

The harpy popgroup command has been renamed to harpy template groupings, which better describes its purpose and aligns with the template-based naming convention.

842-845: HPC commands updated to use template-based approach

The HPC-related commands have been updated to use a template-based approach, replacing individual commands with template variants. This change is consistent with the overall restructuring of the command-line interface.

harpy/snakefiles/sv_leviathan.smk (9)

18-18: Good addition of workflowdir variable for consistent structure

Adding this variable establishes a consistent directory structure for the workflow outputs, aligning with the PR's objective to improve file organization.

35-38: Improved genome file path handling

The conditional logic for determining the genome filename has been updated to use the workflowdir variable, ensuring that the genome is stored in a consistent location within the workflow directory.

67-67: Path updated correctly in process_genome rule

The output path is correctly updated to use workflow_geno in the process_genome rule.

75-79: Consistent path usage in index_genome rule

Both input and output paths are correctly updated to use workflow_geno in the index_genome rule.

87-91: Updated paths in bwa_index_genome rule

The genome paths are correctly updated to use workflow_geno in the bwa_index_genome rule.

101-102: Properly updated genome references in call_variants rule

All genome references in the call_variants rule have been updated to use the workflow_geno variable, ensuring consistent path usage throughout the workflow.

190-191: Updated configuration file paths

The paths for report configuration files are correctly updated to use workflowdir.

204-206: Updated paths in sample_reports rule

The faidx and QMD paths are correctly updated to use workflow_geno and workflowdir in the sample_reports rule.

249-249: Updated summary file location

The summary file path is correctly updated to use workflowdir in the workflow_summary rule.

resources/changelog.md (9)

8-8: Looks good for the new template creation note.
No issues found with this addition referencing harpy template.

17-21: Changelog duplication flagged by static analysis can be ignored.
LanguageTool warns about “is now” repeated, but it’s a standard way to highlight renaming in release notes.

🧰 Tools

🪛 LanguageTool

[grammar] ~18-~18: This phrase is duplicated. You should probably use “now is” only once.
Context: ... Renamed Commands - harpy popgroup is now harpy template groupings - harpy imputeparams is now harpy template impute - harpy hpc is now harpy template hpc-* - where `*...

(PHRASE_REPETITION)

23-23: Documentation is consistent with HPC changes.
No further suggestions—this bullet clearly explains the new output path.

36-36: Removal note is clear.
The line about dropping direct HTCondor support aligns with the PR changes.

50-51: Clear explanation of the new genome folder structure.
The updated bullet accurately warns about the trade-off of increased disk usage.

54-54: Error-handling improvement is well-noted.
No further changes needed.

61-61: Directly calling pygmentize is fine.
No duplication or style issues found.

🧰 Tools

🪛 LanguageTool

[duplication] ~61-~61: Possible typo: you repeated a word.
Context: ... more [direct] reliance on subprocess calls - calls pygmentize directly through python AP...

(ENGLISH_WORD_REPEAT_RULE)

63-63: References to popgroup and stitchparams are properly updated.
No concerns; matches the rename patterns in the code.

65-68: Internal merges are documented well.
All items align with the broader refactoring efforts.

harpy/__main__.py (7)

14-14: Import statement for simulate.
No issues; it accurately references the new simulation module.

18-18: Import statement for template.
This supports the new HPC template commands.

53-53: simulate.simulate command registration is consistent.
No refactoring concerns; the repeated name is clear in context.

60-60: template.template command registration is consistent.
Same pattern as simulate.simulate, no issues to address.

71-72: “Other Commands” group settings.
Correctly includes template. No concerns.

80-80: Dictionary union for docstring references.
Innovative usage of Python’s dictionary merge operator. No problems found.

83-83: Looping through modules for docstring merges.
Correctly adds simulate and other modules. No issues to report.

harpy/hpc.py (6)

20-21: Helpful notice for missing plugins.
The instructions about installing the Snakemake plugin are clear and accurate.

31-37: hpc_generic docstring is straightforward.
No improvements needed; the usage instructions align with the new HPC approach.

72-78: LSF template configuration guidance looks good.
All references to the new plugin remain consistent. No suggestions.

108-113: SLURM template is fully aligned with the HPC changes.
No further edits needed.

142-147: Google Batch template approach is correct.
Clear instructions for installing snakemake-executor-plugin-googlebatch.

204-204: googlebatch_snippets option is well-defined.
No issues with adding this advanced configuration field.

harpy/template.py (5)

10-10: Good import organization.

The import statement properly imports HPC-related functions from .hpc module, maintaining a clean module organization.

12-25: Well-structured docstring dictionary for the CLI.

The docstring dictionary provides clear organization of commands with appropriate categorization into "Input Files" and "HPC Configurations". This enhances usability and follows good CLI design practices.

27-36: Clean implementation of the template command group.

The template command group is well-implemented with comprehensive documentation that clearly explains the purpose of each subcommand.

87-111: Good implementation of the impute command.

The impute command properly:

Includes clear documentation

Validates existing files to prevent accidental overwrites

Creates a structured template with example parameters

Provides informative user feedback

The parameter file generation follows best practices with appropriate error handling.

113-118: Proper command registration.

All commands are correctly added to the template group, maintaining a consistent CLI structure.

harpy/snakefiles/sv_naibr_pop_phase.smk (8)

19-19: Good standardization of workflow directory.

The addition of workflowdir centralizes path construction and aligns with the PR objective of reorganizing output under outdir/workflow/genome.

37-39: Improved genome path handling.

The code now properly handles both gzipped and non-gzipped genome files, constructing appropriate paths under the workflow directory structure.

95-95: Output path standardization in process_genome rule.

Updated to use the workflow_geno variable, eliminating hardcoded paths and improving consistency.

103-105: Consistent path updates in index_genome rule.

Both input and output paths now use the workflow_geno variable, maintaining consistency with the new directory structure.

126-130: Standardized reference paths in phase_alignments rule.

The genome reference paths now use the workflow_geno variable, ensuring consistency throughout the workflow.

259-260: Consistent report configuration paths.

Updated to use workflowdir for report configuration files, maintaining the standardized directory structure.

273-275: Standardized paths in sample_reports rule.

Both the genome index and report template paths now use the appropriate variables, enhancing consistency.

317-317: Updated summary file path.

The summary file is now written to the workflow directory, consistent with the new structure.

harpy/snakefiles/sv_naibr.smk (6)

18-18: Good standardization of workflow directory.

The addition of workflowdir centralizes path construction and aligns with the PR objective of reorganizing output.

31-32: Well-designed genome path variables.

The creation of workflow_geno and workflow_geno_idx variables properly handles both the genome file and its index, ensuring consistent path references throughout the workflow.

181-184: Consistent file paths in index_genome rule.

Both input and output paths now use the workflow_geno variable, maintaining consistency with the new directory structure.

202-203: Standardized report configuration paths.

The report configuration paths now use workflowdir, maintaining consistency with the new structure.

216-218: Updated paths in sample_reports rule.

The paths for genome index and report template now use appropriate variables, enhancing consistency.

256-256: Consistent summary file path.

The summary file is now written to the workflow directory, aligning with the new structure.

harpy/snakefiles/sv_naibr_phase.smk (10)

18-18: Good standardization of workflow directory.

The addition of workflowdir centralizes path construction and aligns with the PR objective of reorganizing output.

34-36: Improved genome path handling.

The code now properly handles both gzipped and non-gzipped genome files, constructing appropriate paths under the workflow directory structure.

74-74: Output path standardization in process_genome rule.

Updated to use the workflow_geno variable, eliminating hardcoded paths and improving consistency.

82-84: Consistent path updates in index_genome rule.

Both input and output paths now use the workflow_geno variable, maintaining consistency with the new directory structure.

126-130: Standardized reference paths in phase_alignments rule.

The genome reference paths now use the workflow_geno variable, ensuring consistency throughout the workflow.

163-163: Updated naibr configuration path.

The configuration file path now uses workflowdir, maintaining consistency with the new structure.

190-190: Consistent configuration path in call_variants rule.

The configuration path now uses workflowdir, maintaining the standardized directory structure.

259-260: Standardized report configuration paths.

Updated to use workflowdir for report configuration files, maintaining the standardized directory structure.

273-275: Consistent paths in sample_reports rule.

Both the genome index and report template paths now use the appropriate variables, enhancing consistency.

317-317: Updated summary file path.

The summary file is now written to the workflow directory, consistent with the new structure.

harpy/simulate_linkedreads.py

harpy/snakefiles/align_strobealign.smk

harpy/snakefiles/sv_naibr.smk

coderabbitai

Actionable comments posted: 2

🔭 Outside diff range comments (1)

.github/workflows/tests.yml (1)
841-846: ⚠️ Potential issue

Updated HPC template commands

The ordering of HPC template commands has been adjusted, and hpc-htcondor is still included even though HTCondor support has been removed from the rest of the codebase.

This command will fail since hpc_htcondor function has been removed from the imported modules. Remove line 845 to maintain consistency with the changes in the rest of the codebase:
-          harpy template hpc-htcondor

🧹 Nitpick comments (1)

harpy/hpc.py (1)
108-108: Simplified SLURM configuration docstring

The docstring for the hpc_slurm function has been simplified. While this maintains functionality, consider providing more detailed documentation to match the thoroughness of other function docstrings in this file.
-    """Configuration for SLURM"""
+    """
+    Create a template config for SLURM
+    
+    This command creates a configuration (`hpc/slurm.yaml`) for the SLURM HPC scheduler.
+    You will also need to install `snakemake-executor-plugin-slurm` for the HPC job submission to work.
+    """

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3558706 and 87073ab.

📒 Files selected for processing (18)

.github/workflows/tests.yml (2 hunks)
harpy/hpc.py (1 hunks)
harpy/simulate_linkedreads.py (5 hunks)
harpy/snakefiles/align_bwa.smk (12 hunks)
harpy/snakefiles/align_ema.smk (13 hunks)
harpy/snakefiles/align_strobealign.smk (10 hunks)
harpy/snakefiles/phase.smk (11 hunks)
harpy/snakefiles/simulate_linkedreads.smk (2 hunks)
harpy/snakefiles/snp_freebayes.smk (10 hunks)
harpy/snakefiles/snp_mpileup.smk (12 hunks)
harpy/snakefiles/sv_leviathan.smk (7 hunks)
harpy/snakefiles/sv_leviathan_pop.smk (10 hunks)
harpy/snakefiles/sv_naibr.smk (7 hunks)
harpy/snakefiles/sv_naibr_phase.smk (9 hunks)
harpy/snakefiles/sv_naibr_pop.smk (10 hunks)
harpy/snakefiles/sv_naibr_pop_phase.smk (8 hunks)
harpy/template.py (2 hunks)
resources/changelog.md (3 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: impute
GitHub Check: simulate linkedreads
GitHub Check: assembly

🔇 Additional comments (45)

harpy/template.py (2)

10-10: Import modification correctly removes HTCondor support

The import statement has been updated to remove the hpc_htcondor module, which aligns with the PR objective of removing HTCondor support.

21-21: Updated HPC Configurations documentation

The documentation for available HPC configuration commands has been properly updated to exclude the HTCondor option, maintaining consistency with the code changes.

harpy/snakefiles/simulate_linkedreads.smk (3)

16-16: New configuration variable for merging haplotypes

The new merge_haplotypes variable correctly pulls from the configuration dictionary, enabling the new merge haplotypes functionality.

177-187: New rule for combining haplotypes

Excellent implementation of the new combine_haplotypes rule that concatenates the two haplotype outputs into a single file. The rule structure is clear with well-defined inputs and outputs.

192-192: Conditional inclusion of merged haplotypes in workflow summary

The workflow summary correctly includes the combined haplotype files conditionally based on the merge_haplotypes flag. This ensures that the workflow only attempts to collect files that actually exist.

.github/workflows/tests.yml (4)

639-641: Removed cleanup commands from naibr workflow

The cleanup command (rm -r Genome) has been removed, which aligns with the PR objective of changing how genomes are processed. Genomes are now processed on a per-execution basis with output organized under outdir/workflow/genome.

643-645: Removed cleanup commands from naibr pop workflow

Similar to previous changes, removing the cleanup command ensures consistent processing of genomes across the workflow.

647-650: Removed cleanup commands from naibr phasing workflow

The cleanup command has been consistently removed across all naibr-related steps, maintaining the new genome processing approach.

652-654: Removed cleanup commands from naibr pop with phasing workflow

Consistent implementation of the change in genome handling approach across all naibr variants.

harpy/snakefiles/sv_leviathan_pop.smk (4)

18-18: Good addition of workflowdir variable to standardize paths.

This change introduces a centralized workflow directory path definition, which will improve maintainability by reducing hardcoded path strings throughout the file.

35-37: Improved workflow_geno path handling.

The updates to the workflow_geno variable logic ensure consistent path handling regardless of whether the genome file is gzipped, providing better organization and isolation between workflow runs.

127-127: Consistent output path for processed genome.

Using workflow_geno for the output path of the process_genome rule aligns with the goal of standardizing file paths and eliminates the need for a shared Genome folder.

333-333: Updated summary file location.

This change maintains consistency with the new directory structure organization and preserves the summary information in the appropriate workflow directory.

harpy/snakefiles/sv_naibr_pop.smk (3)

19-19: Good addition of workflowdir variable to standardize paths.

This change introduces a centralized workflow directory path that will improve maintainability by reducing hardcoded path strings throughout the file.

33-35: Improved workflow_geno path handling.

The updates to the workflow_geno variable logic ensure consistent path handling regardless of whether the genome file is gzipped, providing better organization and isolation between workflow runs.

316-316: Updated summary file location.

This change maintains consistency with the new directory structure organization and preserves the summary information in the appropriate workflow directory.

harpy/snakefiles/align_strobealign.smk (2)

25-25: Good workflow_geno path definition.

The change to define workflow_geno using the new workflowdir variable structure helps maintain consistency across workflows. This aligns with the PR's goal of eliminating the shared Genome folder.

356-356: Updated summary file location.

This change maintains consistency with the new directory structure organization and preserves the summary information in the appropriate workflow directory.

resources/changelog.md (3)

13-14: Clear documentation of new features.

The changelog clearly documents the new features: cluster submission for downsampling and the merge-haplotypes functionality for simulating linked reads. These additions align well with the PR objective of enhancing simulation of linked reads.

36-36: Good documentation of HTCondor support removal.

The changelog clearly notes the breaking change regarding HTCondor support removal, which aligns with the PR objective mentioned in issue #220 about removing HTCondor support due to the missing Snakemake plugin.

50-51: Clear explanation of the Genome folder restructuring.

The changelog provides a clear explanation of the change from a unified Genome/ folder to a per-output-folder structure, addressing the issue #219 mentioned in the PR objectives. It also provides a good explanation of the trade-off (more disk space usage vs. more reliable behavior).

harpy/snakefiles/sv_naibr_pop_phase.smk (3)

19-19: Central workflow directory standardization looks good.

Adding a standardized workflowdir variable improves path organization and creates a consistent pattern for file management across the codebase.

37-39: Clean implementation of workflow genome path handling.

The conditional path construction properly accounts for gzipped files by removing the extension in the output path.

95-95: Path standardization looks consistent.

All paths have been properly updated to use the new workflowdir and workflow_geno variables, maintaining consistency throughout the file.

Also applies to: 103-103, 105-105, 109-109, 147-147, 150-150, 319-320, 333-333, 335-335, 358-358, 406-406

harpy/snakefiles/snp_mpileup.smk (2)

17-17: Good implementation of workflow directory variables.

The introduction of workflowdir, workflow_geno, and workflow_geno_idx variables creates a consistent pattern for file management. The conditional logic for handling gzipped files is properly implemented.

Also applies to: 28-28, 30-30

50-50: Path standardization is consistent.

All file paths have been properly updated to use the new workflow directory variables, ensuring consistency across the file.

Also applies to: 65-68, 70-70, 88-88, 119-120, 134-134, 235-236, 299-299

harpy/snakefiles/phase.smk (2)

17-18: Good directory structure standardization.

The introduction of workflowdir variable is consistent with changes in other files, supporting a unified approach to path management.

46-47: Path updates are consistent throughout the file.

All relevant paths have been properly updated to use the new workflowdir variable, maintaining consistency with other files in the codebase.

Also applies to: 74-74, 129-129, 149-149, 164-164, 192-192, 200-200, 256-257, 271-271, 314-314

harpy/snakefiles/align_bwa.smk (2)

17-17: Clean implementation of workflow directory variables.

The introduction of workflowdir, workflow_geno, and workflow_geno_idx follows the same pattern as other files, maintaining consistency across the codebase.

Also applies to: 24-24, 26-28

45-45: Path standardization is comprehensive.

All file paths have been properly updated to use the new workflow directory variables. The changes are consistent with the project's new file organization strategy.

Also applies to: 60-64, 81-81, 99-102, 112-113, 137-138, 199-199, 223-224, 240-240, 312-312, 357-357

harpy/snakefiles/snp_freebayes.smk (3)

17-17: Good addition of a centralized workflow directory path variable.

This addition of the workflowdir variable helps standardize paths across the workflow, making the code more maintainable.

33-33: Good refactoring for genome file path management.

Creating a consistent reference for the genome file path using the workflow_geno variable improves maintainability by centralizing path construction.

53-53: Consistent path standardization across rules.

The systematic replacement of hardcoded paths with the new path variables (workflowdir and workflow_geno) throughout all rules creates a more maintainable codebase.

Also applies to: 62-62, 70-72, 74-74, 106-107, 167-167, 184-185, 199-200, 214-214, 260-260

harpy/snakefiles/align_ema.smk (2)

17-17: Good standardization of path variables.

The introduction of workflowdir and related variables creates a consistent approach to path management throughout the workflow.

Also applies to: 27-29

49-49: Comprehensive path refactoring across rules.

The systematic replacement of hardcoded paths with the standardized variables improves maintainability and reduces the potential for path-related errors.

Also applies to: 64-69, 85-89, 97-97, 158-160, 192-194, 216-217, 261-261, 303-303, 315-316, 332-332, 389-389, 450-450

harpy/snakefiles/sv_naibr_phase.smk (2)

18-18: Good introduction of workflow directory variable and genome path.

Replacing the validgenome approach with the standardized workflow_geno variable improves consistency with other workflow files.

Also applies to: 34-36

74-74: Consistent path standardization across rules.

The systematic update of paths across all rules ensures consistency with the path management approach used in other workflow files.

Also applies to: 82-86, 126-126, 129-129, 163-163, 190-190, 259-260, 273-275, 317-317

harpy/simulate_linkedreads.py (3)

24-24: Good implementation of the new merge-haplotypes feature.

The addition of the --merge-haplotypes flag follows the existing CLI pattern and properly updates both the workflow options list and function signature.

Also applies to: 40-40, 49-49

58-59: Clear documentation of the new feature.

The docstring update clearly explains the purpose and effect of the new --merge-haplotypes option, helping users understand when to use it.

95-95: Proper configuration integration for the new feature.

Adding the merge_haplotypes parameter to the configuration dictionary ensures the option is properly passed to the Snakemake workflow.

harpy/snakefiles/sv_leviathan.smk (3)

18-18: Improved directory structure with workflowdir variable

Adding a dedicated variable for workflow directory structure creates better organization and makes the file path management more maintainable.

35-38: Good refactoring of genome file path handling

The conditional logic for handling gzipped genomes is well-structured, creating a consistent path based on the workflowdir variable and properly handling file extensions.

67-67: Consistent path updates throughout the workflow

All path references have been correctly updated to use the new workflowdir and workflow_geno variables, ensuring consistency throughout the workflow.

Also applies to: 75-76, 77-77, 79-79, 87-88, 89-89, 91-91, 101-102, 190-191, 204-204, 206-206, 249-249

harpy/snakefiles/sv_naibr.smk (2)

18-18: Good standardization of workflow directory structure

Implementing workflowdir and deriving workflow_geno paths creates consistency with other workflow files in the project.

Also applies to: 31-32

181-186: Consistent path updates throughout the workflow

All path references have been correctly updated to use the new workflowdir and workflow_geno variables.

Also applies to: 202-203, 216-216, 218-218, 256-256

harpy/snakefiles/align_strobealign.smk

harpy/snakefiles/sv_naibr.smk

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

harpy/snakefiles/align_strobealign.smk (2)
18-18: Consider using workflowdir in envdir definition for consistency

For improved consistency with the new path standardization approach, consider refactoring the envdir definition to use the workflowdir variable.
-envdir      = os.path.join(os.getcwd(), outdir, "workflow", "envs")
+envdir      = os.path.join(os.getcwd(), workflowdir, "envs")
336-350: Consider updating genomefile references in summary text

The workflow summary still references the original genomefile rather than the new workflow_geno path. If the intention is to report the input parameters as provided by the user, this is correct. If the intention is to report the actual file used in the workflow, consider updating these references.
-        summary.append(f"The provided genome: {genomefile}")
+        summary.append(f"The provided genome: {genomefile} (processed as {workflow_geno})")

-        duplicates += f"\tsamtools sort -T SAMPLE --reference {genomefile} -m 2000M |\n"
+        duplicates += f"\tsamtools sort -T SAMPLE --reference {workflow_geno} -m 2000M |\n"

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87073ab and 9026ac3.

📒 Files selected for processing (3)

.github/workflows/tests.yml (2 hunks)
harpy/snakefiles/align_strobealign.smk (10 hunks)
harpy/snakefiles/sv_naibr.smk (7 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

.github/workflows/tests.yml
harpy/snakefiles/sv_naibr.smk

⏰ Context from checks skipped due to timeout of 90000ms (16)

GitHub Check: align EMA
GitHub Check: align BWA
GitHub Check: deconvolve
GitHub Check: sv naibr
GitHub Check: simulate variants
GitHub Check: impute
GitHub Check: sv leviathan
GitHub Check: simulate linkedreads
GitHub Check: assembly
GitHub Check: align strobe
GitHub Check: phase
GitHub Check: freebayes
GitHub Check: mpileup
GitHub Check: demux gen1
GitHub Check: Rebuild Container
GitHub Check: qc

🔇 Additional comments (4)

harpy/snakefiles/align_strobealign.smk (4)

17-25: LGTM: Standardized file path management with workflow variables

The addition of workflowdir and workflow_geno variables provides a consistent approach to file path management, addressing issue #219 and fixing the previously reported circular reference issue.

356-356: LGTM: Workflow summary output path updated

The path for writing the workflow summary now correctly uses the workflowdir variable, maintaining consistency with other path changes.

47-191: LGTM: Consistent use of workflow_geno throughout rules

The changes to use the workflow_geno variable across multiple rules (process_genome, index_genome, strobe_index, align, mark_duplicates, molecule_coverage) ensure consistent file path handling and implement the per-execution genome processing approach outlined in the PR objectives.

214-303: LGTM: Standardized report paths using workflowdir

The reports configuration now consistently uses the workflowdir variable for all template paths, improving maintainability and organization.

pdimens added 10 commits March 13, 2025 15:50

popgroup -> template groupings

a4943a2

imputeparams -> template impute hpc xxx -> template hpc-xxx unified simulate clarified docstrings

update references to popgroup and imputeparams/stitchparams

a6891c2

update changelog

2f5f7e9

make text succinct

c51c3aa

even more succinct

1a5ecd1

improve docstrings all around

f4deabf

move genome processing to within-workflow dir

b3b75d2

add --merge-haplotypes

ea498ca

htcondor support vanished =/

6f763d8

rm htcondor from test

a520368

pdimens added this to the 2.0 milestone Mar 14, 2025

pdimens added 4 commits March 14, 2025 15:21

Merge branch 'main' into harpy_templates

b3dac8a

fix git wierdness

03f83cc

resolve more merge issues

47575c5

fixes for ci

87073ab

coderabbitai bot reviewed Mar 14, 2025

View reviewed changes

harpy/simulate_linkedreads.py Outdated Show resolved Hide resolved

harpy/snakefiles/align_strobealign.smk Outdated Show resolved Hide resolved

harpy/snakefiles/sv_naibr.smk Outdated Show resolved Hide resolved

coderabbitai bot reviewed Mar 14, 2025

View reviewed changes

harpy/snakefiles/align_strobealign.smk Outdated Show resolved Hide resolved

harpy/snakefiles/sv_naibr.smk Outdated Show resolved Hide resolved

bugfixes

9026ac3

coderabbitai bot reviewed Mar 14, 2025

View reviewed changes

pdimens merged commit c77a901 into main Mar 17, 2025
20 checks passed

pdimens deleted the harpy_templates branch March 17, 2025 17:02

coderabbitai bot mentioned this pull request Apr 8, 2025

add more v2 features #228

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merge-haplotypes for simulating linkedreads #221

merge-haplotypes for simulating linkedreads #221

pdimens commented Mar 14, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 14, 2025 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested labels

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

merge-haplotypes for simulating linkedreads #221

merge-haplotypes for simulating linkedreads #221

Conversation

pdimens commented Mar 14, 2025 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Mar 14, 2025 • edited Loading

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested labels

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

pdimens commented Mar 14, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 14, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)