Skip to content

add subsample_reads #138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 13, 2025
Merged

add subsample_reads #138

merged 11 commits into from
Jul 13, 2025

Conversation

antgonza
Copy link
Member

No description provided.

@antgonza antgonza changed the title add subsample_reads [WIP] add subsample_reads Jul 10, 2025
Copy link
Contributor

@AmandaBirmingham AmandaBirmingham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One request for change to an error message, a couple of questions

f'| gzip > {f}')
_, se, rv = system_call(cmd)
if rv != 0 or se:
raise ValueError(f'Error during mv: {cmd}. {se}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message seems misleading since it (still) says "Error during mv"; I think it should be changed to say it is reporting on errors that occur during seqtk.

for f in files:
dn = dirname(f)
bn = basename(f)
nbn = join(dn, bn.replace('fastq.gz', 'subsampled.gz'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it going to cause any problems later that the subsampled reads file doesn't end with 'fastq.gz'? I'm not familiar with the workflow but I know I've seen regexes around that expect this suffix ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep in the working directory a backup of the original file, just in case we need to debug it. This basically moves the original problematic file to a new name with subsample,, then in the next command the subsample will create a new (smaller) file with the name of the original. In fact, I'm relying on those regex to ignore the subsampled.gz. I'll add a comment about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense :) However, I wonder if we could name them something other than "subsampled.gz"? That name makes me think that the file with that name IS the one that was subsampled, rather than being the original one. Could we name it "not_subsampled.gz" or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, while writing the comment I realized the same thing so I called it "full".

@@ -166,6 +167,7 @@ def execute_pipeline(self):
self.convert_raw_to_fastq()
self.integrate_results()
self.generate_sequence_counts()
self.subsample_reads()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we will now ALWAYS subsample?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but really we will always check if subsample is needed and only run it when necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we will always subsample every fastq that has more than the max number of reads, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

@antgonza antgonza changed the title [WIP] add subsample_reads add subsample_reads Jul 11, 2025
@antgonza antgonza merged commit 2b7832e into qiita-spots:main Jul 13, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants