Various updates. There are still test failures but it's the weekend
so I'll come back to it.
Showing 10 changed files with 335 additions and 23 deletions.
@@ -0,0 +1,46 @@
After releasing Hesiod 3.x, which is the Dorado+POD5 update, I got Rob to do a test run:

https://egcloud.bio.ed.ac.uk/hesiod/20240222_EGS2_Is_PromethION_Working/

The run itself is a bit funky (used barcodes but the samples look to be un-barcoded) but
Hesiod processed it fine (note I had to manually set it as an internal run to trigger
getting a report).

A few issues:

We're making a load of empty fastq.gz files. In other pipelines we avoid this. I'm not sure
if this is by accident or design here? Anyway, it works so I'll leave it for now.

The Metadata reports "Guppy Version" and "Guppy Config". Of course we're now using Dorado.

get_pod5_metadata.py now gives me:

Software: MinKNOW 23.11.7 (Bream 7.8.2, Core 5.8.6, Dorado 7.2.13+fba8e8925)

Which seems rather more informative so probably I should use that. Note that even for slightly
older runs (e.g. 20240124_EGS2_27971RLpool01) I see:

Software: MinKNOW 23.04.5 (Bream 7.5.9, Core 5.5.3, Guppy 6.5.7+ca6d6af)

So I should deffo be using this in the Hesiod reports now, not Guppy Version.
But note that this gets used for the deliveries, so I need to modify
disseminate_results.py and the template too!!
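
If the report or the delivery template ever needs the individual versions rather than the whole string, the "Software:" line could be split up along these lines. This is only a sketch: `parse_software_line` and `basecaller_of` are hypothetical helpers, not part of Hesiod, and it assumes the line keeps exactly the shape of the two examples above.

```python
import re


def parse_software_line(line):
    """Split 'Software: MinKNOW X (Name Y, Name Z, ...)' into a dict of versions.

    Assumes the layout of the two example lines in these notes.
    """
    m = re.fullmatch(r"\s*Software:\s*(\S+)\s+(\S+)\s*\((.*)\)\s*", line)
    if not m:
        raise ValueError(f"Unrecognised software line: {line!r}")

    # The part before the brackets, e.g. MinKNOW and its version
    versions = {m.group(1): m.group(2)}

    # The bracketed, comma-separated components, e.g. "Bream 7.8.2"
    for part in m.group(3).split(","):
        name, version = part.split()
        versions[name] = version

    return versions


def basecaller_of(versions):
    """Return (name, version) for whichever basecaller is listed."""
    for name in ("Dorado", "Guppy"):
        if name in versions:
            return name, versions[name]
    return "unknown", "unknown"
```

With something like this, both old (Guppy) and new (Dorado) runs would go through the same code path, and the report label could be chosen from the returned name.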

Also "Guppy Config" does not reveal the model version, which in this case is:

[email protected]
^^^^^

So I need to get this from the POD5 (or from the FASTQ header even??).

OK.

Hmm. The "Guppy Config" item in the report is coming from:

res['BasecallConfig'] = context_tags.get('basecall_config_filename', 'unknown')

But this doesn't capture the model number. It doesn't seem to be in the POD5 metadata
at all. Or anywhere else! I think I'm going to need to get this from the actual FASTQ
file header. What a PITA!

OK, done. Let's incorporate this.
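
For reference, pulling the model out of a header line can be sketched as below. The header in the usage note is made up for illustration (read name, runid and model string are not from a real run), `basecall_model_from_header` is a hypothetical name, and the split on "@" assumes the version is the part after the "@" in the model string.

```python
def basecall_model_from_header(hline):
    """Pick basecall_model_version_id out of a FASTQ header line.

    Tags are assumed to be whitespace-separated key=value pairs.
    """
    tags = dict(p.split("=", 1) for p in hline.split() if "=" in p)
    full = tags.get("basecall_model_version_id", "unknown")

    # Assumption: the model string looks like <name>@<version>
    name, sep, version = full.partition("@")

    return dict(basecall_model=full,
                name=name,
                version=version if sep else "unknown")
```

Calling it on a made-up header such as `"@read-0001 runid=abc123 basecall_model_version_id=example_model@v1.2.3"` gives the full string plus the two halves, and a header with no model tag falls back to 'unknown'.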
@@ -0,0 +1,110 @@
#!/usr/bin/env python3

"""Given a (directory of) .fastq(.gz) file(s), extract some metadata from the first line:
    1) runid
    2) Start time of the run
    3) flow_cell_id
    4) barcode
    5) basecall_model (apparently we can't get this from elsewhere)
   Inputs:
    A directory where .fastq(.gz) files may be found, or a file
"""

import os, sys, re
import logging
import gzip
import shutil
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
from collections import OrderedDict
from contextlib import suppress

# For parsing of ISO/RFC format dates (note that newer Python has datetime.datetime.fromisoformat
# but we're using dateutil.parser.isoparse from python-dateutil 2.8)
# Actually, the time in the header lines here is per-read, and not useful to us anyway.
# The POD5 file knows the cell start time.
#from dateutil.parser import isoparse

from hesiod import dump_yaml, glob

def main(args):

    logging.basicConfig( level = logging.DEBUG if args.verbose else logging.INFO,
                         format = "{levelname}:{message}",
                         style = '{')

    if os.path.isdir(args.fastq):
        logging.debug(f"Scanning .fastq[.gz] files in {args.fastq!r}")
        md = md_from_fastq_dir(args.fastq)
    else:
        logging.debug(f"Reading from single file {args.fastq!r}")
        md = md_from_fastq_file(args.fastq)

    print(dump_yaml(md), end='')

def md_from_fastq_dir(fq_dir):
    """Read from the directory of fastq files and return a dict of metadata
       from the first header of the first file.
    """
    fq_files = glob(os.path.join(fq_dir, '*.fastq.gz'))
    if not fq_files:
        # Try unzipped...
        logging.debug("No .fastq.gz files, maybe .fastq?")
        fq_files = glob(os.path.join(fq_dir, "*.fastq"))

    logging.debug(f"Found {len(fq_files)} files")
    if not fq_files:
        raise RuntimeError("No fastq[.gz] files found.")

    # Use the first one
    return md_from_fastq_file(fq_files[0])

def md_from_fastq_file(fq_file):
    """Read from a specified fastq file and return a dict of metadata
    """
    _open = gzip.open if fq_file.endswith('.gz') else open
    with _open(fq_file, 'rt') as fh:
        # An empty file has no header line. Default to '' so that this is
        # reported by the check below rather than as a bare StopIteration.
        first_line = next(fh, '')

    if not first_line.startswith("@"):
        raise RuntimeError(f"Not a FASTQ header line:\n{first_line}")

    return md_from_header_line(first_line.rstrip("\n"))

def md_from_header_line(hline):
    """Extract the goodies from the header line.
    """
    hdict = dict([p.split("=", 1) for p in hline.split() if "=" in p])

    res = OrderedDict()

    for k, v in dict( runid = None,
                      flowcell = "flow_cell_id",
                      experiment = "protocol_group_id",
                      sample = "sample_id",
                      barcode = None,
                      basecall_model = "basecall_model_version_id" ).items():
        if hdict.get(v or k):
            res[k] = hdict.get(v or k)

    # Add this if missing
    for x in ['basecall_model']:
        res.setdefault(x, 'unknown')

    return res

def parse_args(*args):
    description = """Extract various bits of metadata from the first read in a FASTQ file."""

    parser = ArgumentParser( description = description,
                             formatter_class = ArgumentDefaultsHelpFormatter)

    parser.add_argument("fastq", default='.', nargs='?',
                        help="A file, or a directory to scan for .fastq[.gz] files")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print progress to stderr")

    return parser.parse_args(*args)

if __name__=="__main__":
    main(parse_args())
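
Since the notes for this commit mention that the pipeline makes a load of empty fastq.gz files, reading the first header of the first file may well hit a file with no reads in it. One possible way to cope, sketched here and not part of the script above, is to skip empty files until a header turns up. `first_header`, `first_header_in` and `demo` are hypothetical names, and the file names and header in `demo` are made up.

```python
import gzip
import os
import tempfile


def first_header(fq_file):
    """Return the first header line of a FASTQ file, or None if it has no reads."""
    _open = gzip.open if fq_file.endswith(".gz") else open
    with _open(fq_file, "rt") as fh:
        first_line = next(fh, None)

    if first_line is None:
        return None
    if not first_line.startswith("@"):
        raise RuntimeError(f"Not a FASTQ header line:\n{first_line}")
    return first_line.rstrip("\n")


def first_header_in(fq_files):
    """Return the header from the first non-empty file in the list."""
    for fq_file in fq_files:
        header = first_header(fq_file)
        if header is not None:
            return header
    raise RuntimeError("No reads found in any FASTQ file.")


def demo():
    """Write one empty and one non-empty .fastq.gz, then read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        empty = os.path.join(tmp, "a_empty.fastq.gz")
        full = os.path.join(tmp, "b_reads.fastq.gz")
        with gzip.open(empty, "wt"):
            pass  # valid gzip stream with no content
        with gzip.open(full, "wt") as fh:
            fh.write("@read1 runid=example\nACGT\n+\n!!!!\n")
        return first_header(empty), first_header_in([empty, full])
```

Here `demo()` returns None for the empty file and the header of the second file for the list, so the empty file is passed over rather than causing a failure.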