bug-fixes after testing on qiita-rc #81

charles-cowart · 2024-02-27T05:41:34Z

Here is a PR with fixes for the bugs encountered during testing. There is also one inconsequential change in mg-scripts, made for debugging purposes on qiita-rc, so I did not open a second PR for it.

seqpro has so far taken four hours to count the sequences in all of the files in our test job, and it is still not finished. My assumption is that if seqpro completes successfully, the downstream steps will succeed as well, since they are relatively small and were not affected by any recent changes.

seqpro keeps the generated sequence count for each file in memory. If the interpreter process dies, we can restart the Step(), but all of the previous work will be lost. One possible solution might be to alter the demux() function to keep a count and write it out to a JSON file that a modified version of seqpro can read later. This is similar to what seqpro already does with other metadata files found in the run-directory. For existing jobs where NuQCJob has already run, we can write a job script to process them in parallel. That's where my thinking takes me.
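To make the checkpointing idea concrete, here is a minimal sketch of persisting per-file counts to a JSON file so that a restart skips files already counted. It does not reflect the actual demux() or seqpro code; `load_counts`, `save_counts`, `count_all`, and the `count_fn` callback are hypothetical names used only for illustration.

```python
import json
import os
import tempfile


def load_counts(path):
    """Return previously saved counts, or an empty dict if none exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_counts(path, counts):
    """Write counts to a temp file, then rename it over the target.

    The rename is atomic, so a crash never leaves a half-written JSON file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(counts, f, indent=2, sort_keys=True)
    os.replace(tmp, path)


def count_all(fastq_paths, counts_path, count_fn):
    """Count each file once, checkpointing to disk after every file.

    count_fn is a hypothetical callable that returns the number of
    sequences in a single file.
    """
    counts = load_counts(counts_path)
    for fp in fastq_paths:
        if fp in counts:
            continue  # already counted in an earlier run
        counts[fp] = count_fn(fp)
        save_counts(counts_path, counts)
    return counts
```

If the process dies partway through, calling `count_all` again with the same `counts_path` only counts the files that are still missing from the JSON file.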

charles-cowart · 2024-02-27T07:11:35Z

Regarding the slow counting speed: scikit-bio is known to be slow at counting sequences, because it cannot assume that every record spans exactly four lines (https://stackoverflow.com/questions/39150965/fastest-way-to-read-a-fastq-with-scikit-bio).
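For comparison, if the files are known to use strictly four lines per record, the count reduces to a line count divided by four. The sketch below makes that assumption explicit and fails loudly if the line count is not a multiple of four; `count_fastq_records` is a hypothetical helper, not part of scikit-bio or mg-scripts.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file.

    Assumes every record is exactly four lines (header, sequence, '+',
    quality). Multi-line records would give a wrong answer, which is
    the case a general-purpose parser has to handle.
    """
    with gzip.open(path, "rb") as f:
        n_lines = sum(1 for _ in f)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```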

I selected a 1.2GB file from the dataset, decompressed it, and took a line count, which gave approximately 17M sequences. A small script using scikit-bio took a little over 44 seconds to pull the first 100,000 sequences from the zipped version of this file. If my back-of-the-envelope estimate is correct, counting all 162GB of data will take approximately 280 hours.
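The estimate can be reproduced from the figures above. This sketch assumes that sequence count scales linearly with file size across the dataset and that the 44-second timing holds for the whole file; `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(seqs_in_sample, sample_gb, total_gb, seqs_timed, seconds_timed):
    """Extrapolate total counting time from one timed sample."""
    seconds_per_seq = seconds_timed / seqs_timed
    # Assumption: every GB of data holds as many sequences as the sample file.
    total_seqs = seqs_in_sample * total_gb / sample_gb
    return total_seqs * seconds_per_seq / 3600


hours = estimate_hours(
    seqs_in_sample=17_000_000,  # approximate, from the line count
    sample_gb=1.2,
    total_gb=162,
    seqs_timed=100_000,
    seconds_timed=44,
)
print(f"{hours:.1f} hours")  # about 280.5 hours
```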

wasade · 2024-02-27T07:25:23Z

Why not use seqtk size to count?

charles-cowart · 2024-02-27T07:33:36Z

seqtk size

Ty for the tip! The result is 17088755\t2516404944. The first field, the sequence count, appears to match the number I got from wc -l divided by 4. The result arrived in a little over 37 seconds:
real 0m37.218s
user 0m36.708s
sys 0m0.507s
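As a sketch of how this could be called from Python, the snippet below shells out to `seqtk size` and parses its tab-separated output (number of sequences, then number of bases). It assumes seqtk is on the PATH; `parse_seqtk_size` and `seqtk_size` are hypothetical helper names, not existing mg-scripts functions.

```python
import subprocess


def parse_seqtk_size(output):
    """Parse 'n_sequences<TAB>n_bases' as printed by `seqtk size`."""
    n_seqs, n_bases = output.split()
    return int(n_seqs), int(n_bases)


def seqtk_size(path):
    """Run `seqtk size` on one file and return (n_sequences, n_bases).

    Assumes the seqtk executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["seqtk", "size", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if seqtk fails
    )
    return parse_seqtk_size(result.stdout)
```

For the output above, `parse_seqtk_size("17088755\t2516404944\n")` returns `(17088755, 2516404944)`. Taking the two timings at face value, this is roughly 200 times the throughput of the scikit-bio run, although the two measurements were not made under identical conditions.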

charles-cowart · 2024-02-27T08:26:54Z

I will need to put together a PR in the morning that changes mg-scripts to use seqtk.

bug-fixes after testing on qiita-rc

a9ff239

charles-cowart requested a review from antgonza February 27, 2024 05:41

bugfix for error in CI

292704d

updates

d58b055

antgonza merged commit 73a4b40 into qiita-spots:main Feb 27, 2024
