bug-fixes after testing on qiita-rc #81

Merged: 3 commits, Feb 27, 2024

charles-cowart
Contributor

Here is a PR with fixes for the bugs encountered during testing. There is also one inconsequential change in mg-scripts, made for debugging purposes on qiita-rc; I did not put it in a second PR.

seqpro has so far taken four hours to count all of the files in our test job, and it is still not finished. My assumption is that if seqpro completes successfully, the downstream steps will succeed as well, since they are relatively small and were not affected by any recent changes.

seqpro keeps the generated sequence count for each file in memory. If the interpreter process dies, we can begin the Step() over again, but all of the previous work will be lost. One possible solution is to alter the demux() function to keep a count and write it out to a JSON file that a modified version of seqpro can read later. This is similar to what seqpro already does with other metadata files found in the run-directory. For existing jobs where NuQCJob has already run, we can write a job script to process them in parallel. That's where my thinking takes me.
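A minimal sketch of the "write counts out to a JSON file" idea, so that a restarted process skips files it has already counted. This is not seqpro or demux() code: `count_with_checkpoint`, its parameters, and the checkpoint layout (a flat mapping of file path to count) are hypothetical names chosen for illustration, and `count_fn` stands in for whatever counter is actually used.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable


def count_with_checkpoint(
    paths: Iterable[str],
    count_fn: Callable[[str], int],
    checkpoint: str,
) -> Dict[str, int]:
    """Count sequences per file, saving results to JSON after every file.

    Hypothetical helper: the checkpoint is a flat {path: count} mapping.
    """
    ckpt = Path(checkpoint)
    counts: Dict[str, int] = {}
    if ckpt.exists():
        # Resume from whatever an earlier (possibly killed) run saved.
        counts = json.loads(ckpt.read_text())

    for path in paths:
        if path in counts:
            continue  # already counted in an earlier run
        counts[path] = count_fn(path)
        # Write to a temp file, then rename over the checkpoint, so a
        # crash mid-write cannot leave a truncated JSON file behind.
        fd, tmp = tempfile.mkstemp(dir=str(ckpt.parent), suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(counts, fh)
        os.replace(tmp, ckpt)
    return counts
```

Rewriting the whole checkpoint after each file is simple but repeats work for very large file lists; one JSON file per input would avoid that and would also suit the parallel job-script case.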

@charles-cowart
Contributor Author

Regarding the slow counting speed: it appears scikit-bio is known to be slow at counting sequences because it cannot assume four lines per record; a FASTQ record can potentially span more or fewer than four lines (https://stackoverflow.com/questions/39150965/fastest-way-to-read-a-fastq-with-scikit-bio).
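For comparison, here is a sketch of the shortcut a full parser cannot take: counting lines and dividing by four. It assumes every record in the file is exactly four lines, which is an assumption about the input rather than something the FASTQ format guarantees, and it assumes gzip compression. `count_fastq_records` is a hypothetical name, not an existing function in mg-scripts or scikit-bio.

```python
import gzip


def count_fastq_records(path: str) -> int:
    """Count records in a gzipped FASTQ, ASSUMING four lines per record."""
    lines = 0
    with gzip.open(path, "rb") as fh:
        for _ in fh:
            lines += 1
    if lines % 4:
        # A multi-line record or a truncated file breaks the assumption.
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4
```

The modulo check catches only some violations of the four-line assumption; a file with wrapped sequence lines could still have a line count divisible by four.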

I selected a 1.2GB file from the dataset, decompressed it, and took a line count, which gave approximately 17M sequences. A small script using scikit-bio took a little over 44 seconds to pull the first 100,000 sequences from the compressed version of this file. If my back-of-the-envelope estimate is correct, counting all 162GB of data will take approximately 280 hours.
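The estimate can be reproduced with the figures quoted above. The sketch assumes that throughput stays constant past the first 100,000 sequences and that every 1.2GB of data holds about as many sequences as the sampled file; both are simplifications, not measurements.

```python
# Figures quoted in the comment above (all approximate).
seqs_per_file = 17_000_000   # sequences in the sampled 1.2GB file
sample_seqs = 100_000        # sequences read in the timing test
sample_seconds = 44.0        # time taken by the scikit-bio script
file_gb = 1.2
total_gb = 162.0

# Scale the timing to one whole file, then to the whole dataset.
seconds_per_file = seqs_per_file / sample_seqs * sample_seconds
files_equivalent = total_gb / file_gb
total_hours = seconds_per_file * files_equivalent / 3600

print(f"{seconds_per_file / 3600:.1f} h per file, {total_hours:.1f} h total")
```

That works out to roughly two hours per file and about 280 hours overall.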

@wasade
Contributor

wasade commented Feb 27, 2024

Why not use seqtk size to count?

@charles-cowart
Contributor Author

seqtk size

Thanks for the tip! The result is 17088755\t2516404944; the first field appears to match the number I got from wc -l divided by 4. The result arrived in a little over 37 seconds:
real 0m37.218s
user 0m36.708s
sys 0m0.507s
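A sketch of how the `seqtk size` call could be wrapped from Python. It is not the eventual mg-scripts change; `parse_seqtk_size` and `seqtk_size` are hypothetical names. It assumes `seqtk` is on the PATH and prints two whitespace-separated integers, as in the output above. As far as I know the first field is the sequence count and the second is the total number of bases.

```python
import subprocess
from typing import Tuple


def parse_seqtk_size(output: str) -> Tuple[int, int]:
    """Parse the two integer fields printed by `seqtk size`."""
    fields = output.split()
    if len(fields) != 2:
        raise ValueError(f"unexpected seqtk size output: {output!r}")
    # Assumed meaning: (number of sequences, number of bases).
    return int(fields[0]), int(fields[1])


def seqtk_size(path: str) -> Tuple[int, int]:
    """Run `seqtk size` on one file; assumes seqtk is installed."""
    result = subprocess.run(
        ["seqtk", "size", path],
        check=True,            # raise CalledProcessError on failure
        capture_output=True,
        text=True,
    )
    return parse_seqtk_size(result.stdout)
```

Keeping the parser separate from the subprocess call lets it be tested without seqtk installed, for example by passing the string `"17088755\t2516404944\n"` reported above.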

@charles-cowart
Contributor Author

I will put together a PR in the morning that changes mg-scripts to use seqtk.

@antgonza merged commit 73a4b40 into qiita-spots:main on Feb 27, 2024
2 checks passed
3 participants