-
Notifications
You must be signed in to change notification settings - Fork 4
add subsample_reads #138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add subsample_reads #138
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One request for change to an error message, a couple of questions
src/qp_klp/Protocol.py
Outdated
f'| gzip > {f}') | ||
_, se, rv = system_call(cmd) | ||
if rv != 0 or se: | ||
raise ValueError(f'Error during mv: {cmd}. {se}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message seems misleading since it (still) says "Error during mv"; I think it should be changed to say it is reporting on errors that occur during seqtk
.
src/qp_klp/Protocol.py
Outdated
for f in files: | ||
dn = dirname(f) | ||
bn = basename(f) | ||
nbn = join(dn, bn.replace('fastq.gz', 'subsampled.gz')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it going to cause any problems later that the subsampled reads file doesn't end with 'fastq.gz'? I'm not familiar with the workflow but I know I've seen regexes around that expect this suffix ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to keep in the working directory a backup of the original file, just in case we need to debug it. This basically moves the original problematic file to a new name with subsample,, then in the next command the subsample will create a new (smaller) file with the name of the original. In fact, I'm relying on those regex to ignore the subsampled.gz. I'll add a comment about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that makes sense :) However, I wonder if we could name them something other than "subsampled.gz"? That name makes me think that the file with that name IS the one that was subsampled, rather than being the original one. Could we name it "not_subsampled.gz" or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, while writing the comment I realized the same thing so I called it "full".
@@ -166,6 +167,7 @@ def execute_pipeline(self): | |||
self.convert_raw_to_fastq() | |||
self.integrate_results() | |||
self.generate_sequence_counts() | |||
self.subsample_reads() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that we will now ALWAYS subsample?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but really we will always check if subsample is needed and only run it when necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we will always subsample every fastq that has more than the max number of reads, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
No description provided.