This is a script designed for Nanopore (ONT) data that generates a list of file paths for fast5 files that are contained within fastq, BAM, or SAM files of interest.
The output can be piped to another unix command to copy the fast5 files to a new directory or to save a list of file paths. This is great for situations when you only need to share some, but not all, reads of an ONT dataset.
This script should work with both Python 2 and Python 3.
To install the script, clone the repository from GitHub with git. Then change to the directory you cloned it into (cd) and make the script executable (chmod):
git clone https://github.com/mbhall88/fast5_in_ref.git
cd fast5_in_ref && chmod +x fast5_in_ref
It's pretty straightforward to use:
./fast5_in_ref -i <fast5_dir> -r <in.fastq|in.bam|in.sam> -o <out.txt>
The script will walk down into subdirectories as well, so you can just give it your directory containing all your files.
The script reads the <in.fastq|in.bam|in.sam> files and extracts the read id from each header. It then goes through all the fast5 files under <fast5_dir> and checks whether their read id is in the set of read ids from <in.fastq|in.bam|in.sam>. If it is, the path to the file is written on its own line in <out.txt>.
If no output (-o) is given, it will write the output to stdout.
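The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: it assumes 4-line fastq records, and get_read_id is a hypothetical callable that maps a fast5 path to its read id (the real script reads the id from inside the fast5 file with h5py).

```python
import os


def read_ids_from_fastq(fastq_path):
    """Collect read ids from the header line of every 4-line fastq record."""
    ids = set()
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            # Header lines are every fourth line and start with '@'.
            if line_number % 4 == 0 and line.startswith("@"):
                # The read id is the first whitespace-delimited token.
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def matching_fast5_paths(fast5_dir, wanted_ids, get_read_id):
    """Yield paths of .fast5 files under fast5_dir whose read id is wanted.

    get_read_id is a hypothetical callable (path -> read id string),
    standing in for the h5py lookup done by the real script.
    """
    # os.walk descends into subdirectories, as the script does.
    for root, _dirs, files in os.walk(fast5_dir):
        for name in sorted(files):
            if name.endswith(".fast5"):
                path = os.path.join(root, name)
                if get_read_id(path) in wanted_ids:
                    yield path
```

Keeping the read ids in a set makes each membership check fast on average, so the run time is dominated by opening the fast5 files rather than by the lookup.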
It is possible to use multiple directories/files as arguments. No need to merge bam|fastq|sam files.
./fast5_in_ref -i /myfast5/dir/1/ /other/fast5/dir/2/ -r reads.sorted.bam reads2.bam
For example, if all of your fast5 directories have the prefix myfast5_ and the reference file names end in .sorted.bam, you can use wildcards to find them all if they are in the same directory.
./fast5_in_ref -i myfast5_* -r *.sorted.bam
You can also mix reference file types in the arguments. For example, if you happen to have a sam file, a fastq file, and a bam file that contain reads you would like to find fast5 files for, they can all be processed simultaneously like so.
./fast5_in_ref -i myfast5_* -r mapped.sorted.bam mapped2.sam filtered_reads.fastq
Currently this program does not support gzipped fastq files. Fastq files can end with .fq or .fastq.
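If your reads are only available as a gzipped fastq, one possible workaround is to decompress a copy first and pass the plain file to -r. The sketch below uses only the Python standard library; the function name and the file names in the usage note are hypothetical.

```python
import gzip
import shutil


def gunzip_fastq(gz_path, out_path):
    """Decompress gz_path into a plain fastq file at out_path."""
    # Stream the data in chunks so large files are not loaded into memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, gunzip_fastq("reads.fastq.gz", "reads.fastq") would write reads.fastq, which can then be given to the script with -r reads.fastq.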
So if you wanted to pipe these paths into another program, you could do something like
mkdir subset_dir/
./fast5_in_ref -i </path/to/fast5s/> -r <in.fastq> | xargs cp -t subset_dir/
The above example would copy the fast5 files that are found in your fastq to subset_dir/.
However, because opening fast5 files is computationally intensive, we recommend that you first save the output of fast5_in_ref to a file for safekeeping, then proceed with analysis like so:
mkdir subset_dir/
./fast5_in_ref -i /path/to/fast5s/ -r in.fastq > mapped_reads.txt
cat mapped_reads.txt | xargs cp -t subset_dir/
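Note that cp -t is a GNU coreutils option, and xargs splits its input on whitespace by default, so the pipeline above may not work on systems without GNU cp or with paths that contain spaces. As a hedged alternative, the copy step could be done in Python along these lines (the function name copy_listed_files is hypothetical and not part of this repository):

```python
import os
import shutil


def copy_listed_files(list_path, dest_dir):
    """Copy every file listed (one path per line) in list_path into dest_dir."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    copied = 0
    with open(list_path) as handle:
        for line in handle:
            # Strip only the newline so paths containing spaces stay intact.
            path = line.rstrip("\n")
            if path:
                shutil.copy(path, dest_dir)
                copied += 1
    return copied
```

Calling copy_listed_files("mapped_reads.txt", "subset_dir/") would then copy each listed fast5 file and return the number of files copied.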
As a point of reference, it took 37 minutes to look for mapped reads in 1.87 million fast5 files on a single processor. The same search took 10 minutes using a parallelized version of the program on 90 cores, reading from spinning disk drives. Faster processing speeds are possible if your fast5 files are stored on solid state drives.
We are currently developing a parallelized version of the program to speed up the analysis. It is called fast5_in_ref_parallelized and uses the same arguments as fast5_in_ref. So far it is 60-70% faster than the single-processor version for large datasets.
You need to have h5py. You'll also need pysam if you're going to be using SAM or BAM files.
pip install h5py
pip install pysam
If there are any issues with the program, please open an issue on the GitHub repository.
Michael Hall @mbhall88 Darrin Schultz @conchoecia