attempt parallelized demultiplexing #200
Conversation
📝 Walkthrough

The pull request simplifies the demultiplexing process in the primary Python script by removing the schema-reading function and instead relying on Snakemake parameters to obtain sample names and segment identifiers. Output file handling is streamlined to directly produce sample-specific files and a unified log. In parallel, the Snakefile is adjusted to rename variables, remove the schema input, update output management by using pipes, and add new lambda parameters for sample and id_segments. A backup version of the original script with legacy barcode and schema functions is retained.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant S as Snakefile (demultiplex rule)
    participant P as demultiplex_gen1.py
    participant OF as Output Files
    participant L as Log
    S->>P: Invoke demultiplex script with sample and id_segments parameters
    P->>P: Process input reads and determine status using id_segments
    P->>OF: Write processed reads to sample-specific R1 and R2 files
    P->>L: Log total, correct, and corrected reads summary
```
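The per-sample flow described above can be sketched as a small classifier; the function name, the use of Hamming distance as a stand-in for the script's Levenshtein comparison, and the status labels' exact semantics are assumptions for illustration, not the actual demultiplex_gen1.py code:

```python
def demux_status(observed_segment, known_segments, max_dist=1):
    """Classify one observed barcode segment against a sample's expected
    id_segments: 'found' on an exact match, 'corrected' if within max_dist
    mismatches of exactly one known segment, otherwise 'unclear'.
    Hypothetical helper using Hamming distance for simplicity."""
    if observed_segment in known_segments:
        return "found"
    close = [s for s in known_segments
             if len(s) == len(observed_segment)
             and sum(a != b for a, b in zip(observed_segment, s)) <= max_dist]
    return "corrected" if len(close) == 1 else "unclear"
```

A read would then be routed to the sample's R1/R2 outputs only when none of its segment statuses are "unclear", matching the status checks the review discusses below.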
Actionable comments posted: 1
🧹 Nitpick comments (1)

harpy/snakefiles/demultiplex_gen1.smk (1)

37-37: Minor style note.
`samplenames = list(samples)` might be more concise, but this is purely stylistic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- harpy/scripts/demultiplex_gen1.py (2 hunks)
- harpy/scripts/demultiplex_gen1.py.bak (1 hunks)
- harpy/snakefiles/demultiplex_gen1.smk (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/scripts/demultiplex_gen1.py
65-65: Undefined name `snakemake` (F821)
66-66: Undefined name `snakemake` (F821)
93-93: Undefined name `snakemake` (F821)
100-100: Undefined name `status_R1` (F821)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build/Cache Container
🔇 Additional comments (20)
harpy/scripts/demultiplex_gen1.py (9)
65-67: Confirm `snakemake` is defined in the execution environment.
These lines assume that `snakemake` is globally available. Since static analysis flags `snakemake` as undefined, please confirm that this script is only executed in a Snakemake environment where `snakemake` is injected.

🧰 Tools
🪛 Ruff (0.8.2)
65-65: Undefined name `snakemake` (F821)
66-66: Undefined name `snakemake` (F821)
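One general pattern for making that injection explicit (and quieting the F821 warnings) is a guarded lookup; this is a generic sketch, not code from the PR:

```python
def require_snakemake(namespace):
    """Return the `snakemake` object that Snakemake injects into a script's
    globals at run time, or fail with a clear error when the script is
    executed outside a Snakemake rule."""
    if "snakemake" not in namespace:
        raise RuntimeError("This script must be run via a Snakemake rule "
                           "(the `snakemake` object was not injected).")
    return namespace["snakemake"]

# At module top level one would call: smk = require_snakemake(globals())
```

The same effect can be had with a `# noqa: F821` comment, but an explicit check gives users a readable error instead of a NameError.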
91-93: Confirm `snakemake` usage for output file references.
Similar to lines 65-67, these references to `snakemake` might fail if the script is run outside Snakemake. Just verify that your workflow always injects `snakemake.output.bx_info`.

🧰 Tools
🪛 Ruff (0.8.2)
93-93: Undefined name `snakemake` (F821)
110-110: No issues found.
The check for `"unclear"` within `statuses` appears logically correct.
116-116: No issues found.
The check for `"corrected"` in `statuses` is consistent and appears correct.
122-122: No issues found.
Verifying `"found"` for all statuses is appropriate for this logic branch.
126-126: Looks good.
The map initialization with `[1,0]` is consistent with your logic for counting correct vs. corrected reads.
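That counting pattern can be sketched as follows; the function name, tuple layout, and status labels are hypothetical, chosen only to mirror the `[correct, corrected]` pair the review mentions:

```python
def tally_reads(read_statuses):
    """Tally reads per barcode as [correct, corrected] pairs.
    `read_statuses` is an iterable of (barcode, status) tuples; a barcode
    first seen with an exact match starts at [1, 0], matching the
    reviewed initialization."""
    counts = {}
    for barcode, status in read_statuses:
        if barcode not in counts:
            counts[barcode] = [1, 0] if status == "found" else [0, 1]
        elif status == "found":
            counts[barcode][0] += 1
        else:
            counts[barcode][1] += 1
    return counts
```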
128-128: Log header format is clear.
The newly added header clarifies the output columns well.
130-130: LGTM for clear read map output.
Writing the total and the breakdown of correct vs. corrected reads is a helpful summary.
132-132: LGTM for unclear reads output.
This line properly logs unclear barcode reads as requested.

harpy/scripts/demultiplex_gen1.py.bak (7)
1-2: Shebang and blank line.
This looks fine; ensure that the script is marked as executable if needed.
3-6: Imports appear appropriate.
No apparent issues; `pysam` and `Levenshtein` usage likely assumes an environment where these are installed.
7-21: Review of `read_barcodes` function.
The logic reads each line, validates the segment, and stores the data in a dict. The approach is clear and consistent.
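That read-validate-store shape could look roughly like this; the tab-separated `segment_id<TAB>sequence` layout and the leading-letter validation are assumptions for illustration, not harpy's actual barcode file format:

```python
def read_barcodes(path, expected_letter):
    """Load a barcode segment file into a {sequence: segment_id} dict.
    Assumes one tab-separated `segment_id<TAB>sequence` pair per line and
    that every segment id begins with `expected_letter` (e.g. 'A' for the
    A-segment file). This layout is a hypothetical stand-in."""
    barcodes = {}
    with open(path) as handle:
        for line in handle:
            segment_id, sequence = line.rstrip("\n").split("\t")
            if not segment_id.startswith(expected_letter):
                raise ValueError(f"unexpected segment id: {segment_id}")
            barcodes[sequence.upper()] = segment_id
    return barcodes
```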
23-41: Review of `read_schema` function.
This function loads sample-to-segment mappings. It's clear and well-commented. However, note that `id_letter` is assumed to be unique if multiple code letters exist. Confirm that your schema never includes more than one letter type.
42-58: Review of `get_min_dist` function.
The Levenshtein-based approach for nearest barcodes is standard. The logic for mapping a tie (`nbFound > 1`) to the `XX00` code is also clear.
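A minimal sketch of that nearest-barcode logic, using a self-contained edit-distance routine in place of the `Levenshtein` package the script imports (function names and the `XX00` sentinel follow the review's description; the details are assumptions, not the .bak file's code):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def get_min_dist(observed, barcodes, unclear_code="XX00"):
    """Return the segment id of the single closest known barcode, or the
    `unclear_code` sentinel when the minimum distance is tied -- the tie
    check mirroring the reviewed `nbFound > 1` branch."""
    dists = {seg: levenshtein(observed, seq) for seq, seg in barcodes.items()}
    best = min(dists.values())
    hits = [seg for seg, d in dists.items() if d == best]
    return hits[0] if len(hits) == 1 else unclear_code
```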
60-78: Review of `get_read_codes` function.
The code properly splits the index read, checks dictionary membership, and falls back to `get_min_dist`. The use of `status = "found"/"corrected"/"unclear"` is consistent.
80-162: Main execution block.
- The global `bar_codes` dictionary is loaded from segment files; consider clarifying this usage with a comment.
- All references to `snakemake` should be verified if used outside a Snakemake environment.
- Output is well-structured: valid vs invalid logs, plus separate R1/R2 files for each sample.

Otherwise, logic and flow appear correct for demultiplexing.
harpy/snakefiles/demultiplex_gen1.smk (4)
27-31: Check for multiple barcodes per sample.
Appending multiple barcodes to each sample is a good enhancement. This code block looks correct for building a multi-barcode dictionary.
60-62: Pipe-based outputs look good.
Switching to `pipe(...)` for read outputs simplifies intermediate file handling and is consistent with your new approach.
66-66: No issues found.
Retaining `outdir` as a parameter is consistent with the rest of the workflow usage.
68-69: Parameterized sample and segment retrieval.
Defining lambdas to fetch `sample` and `id_segments` from `samples` is a straightforward approach.
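Pieced together from the comments above, the reviewed rule shape looks roughly like the sketch below; the rule name, paths, wildcard names, and the structure of `samples` are guesses for illustration, not the actual Snakefile:

```python
# Hypothetical sketch of the demultiplex rule: pipe() outputs feed a
# downstream gzip rule, and lambdas pull per-sample params from `samples`.
rule demultiplex:
    input:
        R1 = "input/reads.R1.fq.gz",
        R2 = "input/reads.R2.fq.gz"
    output:
        R1 = pipe("{outdir}/{sample}.R1.fq"),
        R2 = pipe("{outdir}/{sample}.R2.fq")
    params:
        sample = lambda wc: wc.sample,
        id_segments = lambda wc: samples[wc.sample]
    script:
        "scripts/demultiplex_gen1.py"
```

Because `pipe()` marks the outputs as named pipes, Snakemake schedules the consuming (compression) job concurrently with this one, which is what enables the per-sample parallelism the PR is after.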
Actionable comments posted: 0
🧹 Nitpick comments (1)
harpy/scripts/demultiplex_gen1.py (1)
65-67: Document Snakemake dependency.
Add a docstring at the beginning of the script to clarify that this script is intended to be run by Snakemake and requires the `snakemake` variable to be injected.

Add this docstring at the beginning of the file:

```diff
 #!/usr/bin/env python
+"""
+This script is designed to be run by Snakemake and requires the `snakemake` variable
+to be injected by the Snakemake workflow engine.
+"""
```

🧰 Tools
🪛 Ruff (0.8.2)
65-65: Undefined name `snakemake` (F821)
66-66: Undefined name `snakemake` (F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- harpy/demultiplex.py (1 hunks)
- harpy/scripts/demultiplex_gen1.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/scripts/demultiplex_gen1.py
65-65: Undefined name `snakemake` (F821)
66-66: Undefined name `snakemake` (F821)
93-93: Undefined name `snakemake` (F821)
🔇 Additional comments (5)
harpy/scripts/demultiplex_gen1.py (4)
91-93: LGTM! Simplified output handling supports parallelization.
The direct file output approach aligns well with the PR's objective of parallelizing demultiplexing by focusing on one sample at a time.

🧰 Tools
🪛 Ruff (0.8.2)
93-93: Undefined name `snakemake` (F821)
96-97: LGTM! Clear and consistent variable naming.
The status variables are well-named and the logic is correct.
98-100: LGTM! Efficient segment filtering.
The filtering logic effectively supports the parallelization strategy by focusing on one sample at a time.
129-133: LGTM! Clear and consistent log format.
The log file output is well-structured and provides all necessary information in a readable format.
harpy/demultiplex.py (1)

42-42: Verify minimum thread count requirement.
The minimum thread count has been increased from 1 to 2, which enforces parallelization. While this aligns with the PR's objective, it may prevent users from running the tool on single-core systems.
Please confirm:
- Is this a deliberate design choice to enforce parallelization?
- Have you considered the impact on users with single-core systems?
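A minimum thread count like the one under discussion is typically enforced by the CLI layer; the sketch below shows the general idea using argparse, purely as an illustrative stand-in (harpy's actual CLI option is defined elsewhere and uses a different framework):

```python
import argparse

def min_threads(value, minimum=2):
    """argparse-style validator that rejects thread counts below `minimum`,
    mirroring the reviewed change from a floor of 1 to a floor of 2."""
    threads = int(value)
    if threads < minimum:
        raise argparse.ArgumentTypeError(f"threads must be >= {minimum}")
    return threads

parser = argparse.ArgumentParser()
parser.add_argument("--threads", type=min_threads, default=2)
```

A floor of 2 makes sense here because the pipe-based design always pairs the demultiplexer with a compressor process; documenting that rationale in the option's help text would address the single-core concern.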
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- harpy/_validations.py (2 hunks)
- test/demux/samples.schema (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build/Cache Container
🔇 Additional comments (3)
harpy/_validations.py (3)
268-268: LGTM! Enhanced error message visibility.
The addition of color formatting for the segment ID improves error message readability.
286-287: LGTM! Improved error message clarity.
The addition of color formatting for the filename and the clear example format enhances user understanding.
289-295: LGTM! Enhanced error message with better guidance.
The updated error message now provides clearer guidance about sample IDs needing to be in a single segment, with helpful examples using color formatting.
The goal of this is to see if the new demultiplexing can be done by parsing the multiplexed data while looking for only one sample at a time, which can be done in parallel. The results are then piped into gzip for fast compression.
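The pipe-into-gzip idea can be sketched as follows; the filenames and record layout are hypothetical, and the external `gzip -c` process stands in for the compressor that consumes the Snakemake pipe:

```python
import gzip
import subprocess

def write_gzipped(path, records):
    """Stream plain-text FASTQ records through an external `gzip -c`
    process, writing the compressed stream to `path` -- the same shape
    as a pipe consumer in the workflow."""
    with open(path, "wb") as out:
        proc = subprocess.Popen(["gzip", "-c"], stdin=subprocess.PIPE, stdout=out)
        for record in records:
            proc.stdin.write(record.encode())
        proc.stdin.close()
        proc.wait()

# Hypothetical usage: one sample's reads, compressed as they are produced.
write_gzipped("sample_01.R1.fq.gz", ["@read1\nACGT\n+\nFFFF\n"])
```

Because each sample's demultiplexer-plus-gzip pair runs independently, samples can be processed concurrently, which is the parallelism the PR aims for.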
Summary by CodeRabbit
- New Features
- Bug Fixes
- Refactor