GFF(3) stats

Given a genome and a corresponding GFF3 file, calculate various statistics on the coding regions (or extract them).

Current output is a TSV with BED-like first four columns (sequence ID, attribute Parent ID, start, end), followed by the statistics.

GC percent, GC skew, AT percent, and AT skew are calculated for each:

  • raw CDS (or spliced CDS) (GC)
  • four(six)-fold degenerate sites from the CDS (GC4)
  • third codon position for each codon in the CDS (GC3)
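As a rough illustration of what these statistics mean, the sketch below computes them for a single in-frame CDS string. The skew formulas used here, (G - C) / (G + C) and (A - T) / (A + T), are the conventional definitions and are assumed rather than taken from the gff-stats source; the function names `base_stats` and `third_positions` are hypothetical.

```python
def base_stats(seq):
    """Return GC/AT percentages and skews for a nucleotide string."""
    seq = seq.upper()
    g, c, a, t = (seq.count(b) for b in "GCAT")
    total = g + c + a + t  # ambiguous bases such as N are ignored (an assumption)
    gc, at = g + c, a + t
    return {
        "gc_percent": 100 * gc / total if total else 0.0,
        "at_percent": 100 * at / total if total else 0.0,
        "gc_skew": (g - c) / gc if gc else 0.0,  # assumed: (G - C) / (G + C)
        "at_skew": (a - t) / at if at else 0.0,  # assumed: (A - T) / (A + T)
    }


def third_positions(cds):
    """Third base of every complete codon in an in-frame CDS (input for GC3)."""
    usable = len(cds) - len(cds) % 3
    return cds[2:usable:3]


cds = "ATGGCCGGGTAA"
print(base_stats(cds))                   # statistics on the raw CDS (GC)
print(base_stats(third_positions(cds)))  # statistics on third positions (GC3)
```

Running the same `base_stats` function on a filtered subset of bases is what turns GC into GC3 (or GC4); only the site selection changes.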

gff-stats can also extract CDS or spliced CDS sequences, as nucleotide or protein strings, in fasta format (see the seq subcommand below). gffread provides the same functionality and may be faster, as it indexes the fasta for quick random access.

Note: gff-stats requires the length of coding sequences of a given transcript to add up to a value divisible by three. In case any transcripts violate this assumption, they can be filtered out with the following script before running gff-stats:
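Purely as an illustration of the check described in the note, and not as the author's script, the sketch below sums CDS lengths per Parent ID and reports the parents whose total is not a multiple of three. It assumes a standard nine-column, tab-separated GFF3 with a `Parent` attribute on each CDS row; the function name `out_of_frame_parents` is hypothetical.

```python
from collections import defaultdict


def out_of_frame_parents(gff_lines):
    """Return Parent IDs whose CDS lengths do not sum to a multiple of three."""
    totals = defaultdict(int)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])  # GFF3 is 1-based, inclusive
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                totals[parent] += end - start + 1
    return {parent for parent, length in totals.items() if length % 3 != 0}


gff = [
    "chr1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tsrc\tCDS\t20\t22\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tsrc\tCDS\t30\t33\t.\t+\t0\tID=cds2;Parent=tx2",
]
print(out_of_frame_parents(gff))  # tx1 sums to 12 bases, tx2 to 4
```

The returned IDs could then be used to drop the matching rows from the GFF before it is passed to gff-stats.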


Building requires Rust.

git clone
cd gff-stats
cargo build --release
# ./target/release/gff-stats is the executable
# or
cargo install --path .
# to put gff-stats in your path


gff-stats -h

GFF(3) stats 0.2.2
Max Brown <[email protected]>
Extract GFF3 regions from a reference fasta and compute statistics on them.

    gff-stats [SUBCOMMAND]

    -h, --help       Print help information
    -V, --version    Print version information

    help    Print this message or the help of the given subcommand(s)
    seq     Extract CDS regions to fasta format. Printed to stdout.
    stat    Compute statistics on CDS regions

gff-stats stat -h

gff-stats-stat 0.2.2
Compute statistics on CDS regions

    gff-stats stat [OPTIONS] --gff <gff> --fasta <fasta>

    -d, --degeneracy <degeneracy>    Calculate statistics on four-fold or six-fold (in addition to
                                     four-fold) degenerate codon sites. [default: fourfold]
                                     [possible values: fourfold, sixfold]
    -f, --fasta <fasta>              The reference fasta file.
    -g, --gff <gff>                  The input gff file.
    -h, --help                       Print help information
    -o, --output <output>            Output filename for the TSV (without extension). [default: gff-
    -p, --spliced                    Compute stats on spliced CDS sequences?
    -V, --version                    Print version information
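To make the four-fold option concrete, here is a minimal sketch of selecting four-fold degenerate third positions under the standard genetic code, where eight two-base prefixes encode the same amino acid whatever the third base is. How gff-stats itself selects sites, and how it treats the sixfold setting, is not described here, so this covers the four-fold case only; `FOURFOLD_PREFIXES`, `fourfold_sites` and `gc_percent` are hypothetical names.

```python
# First two bases of the eight codon families in which any third base
# encodes the same amino acid (standard genetic code):
# Leu (CT), Val (GT), Ser (TC), Pro (CC), Thr (AC), Ala (GC), Arg (CG), Gly (GG).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(cds):
    """Third bases of codons whose first two bases form a four-fold family."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(
        cds[i + 2]
        for i in range(0, usable, 3)
        if cds[i:i + 2] in FOURFOLD_PREFIXES
    )


def gc_percent(seq):
    """Percentage of G and C among all characters of seq."""
    gc = sum(seq.count(base) for base in "GC")
    return 100 * gc / len(seq) if seq else 0.0


sites = fourfold_sites("ATGGCAGTTCTGTAA")  # codons: ATG GCA GTT CTG TAA
print(sites, gc_percent(sites))
```

Only GCA, GTT and CTG contribute a site in this example; ATG and TAA are skipped because their prefixes are not in a four-fold family.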

gff-stats seq -h

gff-stats-seq 0.2.2
Extract CDS regions to fasta format. Printed to stdout.

    gff-stats seq [OPTIONS] --gff <gff> --fasta <fasta>

    -f, --fasta <fasta>      The reference fasta file.
    -g, --gff <gff>          The input gff file.
    -h, --help               Print help information
    -o, --output <output>    Output filename for the fasta (without extension). [default: gff-stat]
    -p, --protein            Save the extracted CDS fasta sequences as a translated protein?
    -s, --spliced            Save the spliced extracted CDS fasta sequences?
    -V, --version            Print version information

Cross testing with gffread:

# -x outputs spliced fastas
gffread -g ./tests/test_fasta.fna ./tests/test_gff.gff -x ./tests/test_gffread_x.fa
# equivalent to:
gff-stats seq -f ./tests/test_fasta.fna -g ./tests/test_gff.gff -s
# -y outputs spliced protein fastas
gffread -g ./tests/test_fasta.fna ./tests/test_gff.gff -y ./tests/test_gffread_y.fa
# equivalent to:
gff-stats seq -f ./tests/test_fasta.fna -g ./tests/test_gff.gff -sp
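For readers unfamiliar with what splicing and translating a CDS involve, the sketch below joins CDS intervals from a reference sequence, reverse-complements them for minus-strand features, and translates the result with the standard genetic code. It is an illustration under stated assumptions (1-based inclusive GFF3 coordinates, one strand per transcript, unknown codons rendered as X), not a description of how gff-stats or gffread implement it; all names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def spliced_cds(sequence, intervals, strand="+"):
    """Join CDS intervals (1-based, inclusive, as in GFF3) into one string."""
    joined = "".join(sequence[start - 1:end] for start, end in sorted(intervals))
    if strand == "-":
        joined = joined.translate(COMPLEMENT)[::-1]  # reverse complement
    return joined


def translate(cds):
    """Translate complete codons; codons with ambiguous bases become X."""
    cds = cds.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


genome = "ccATGGCCttttGGGTAAcc"
cds = spliced_cds(genome, [(3, 8), (13, 18)])
print(cds, translate(cds))
```

Here the two intervals are joined into ATGGCCGGGTAA, skipping the lowercase intron, before translation.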

The output of gff-stats can be used to calculate average GC3 percent in non-overlapping sliding windows of a user-defined size (e.g. 100 kb). While either mode of gff-stats stat can be used as the input to this script, using the 'spliced' option is the quickest. The script's options, as printed by -h:

  -h, --help            show this help message and exit
  -i INDEX, --index INDEX
                        Index file for the genome
  -s STATS, --stats STATS
                        Stats file generated by gff-stats stat
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Window size (in bases)
  -o OUTPUT, --output OUTPUT
                        Output filename for the TSV (without extension)
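The windowing idea can be sketched as follows. This is a minimal illustration, not the script whose options are listed above: it takes already-parsed records of (sequence ID, start, end, GC3 percent), since the full column layout of the stats TSV is not given here, and it assigns each CDS to a window by its midpoint, which is an assumption. The function name `windowed_mean` is hypothetical.

```python
from collections import defaultdict


def windowed_mean(records, window_size):
    """Average a per-CDS value in non-overlapping windows.

    records: iterable of (seq_id, start, end, value) tuples.
    Returns {(seq_id, window_start): mean value}, ordered by key.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for seq_id, start, end, value in records:
        midpoint = (start + end) // 2  # assumed: a CDS belongs to its midpoint's window
        window_start = (midpoint // window_size) * window_size
        acc = sums[(seq_id, window_start)]
        acc[0] += value
        acc[1] += 1
    return {key: total / n for key, (total, n) in sorted(sums.items())}


records = [
    ("chr1", 100, 400, 60.0),
    ("chr1", 500, 900, 40.0),
    ("chr1", 1200, 1500, 70.0),
    ("chr2", 10, 40, 55.0),
]
for (seq_id, window_start), mean_gc3 in windowed_mean(records, 1000).items():
    print(seq_id, window_start, window_start + 1000, mean_gc3)
```

A genome index giving sequence lengths would additionally allow the final window on each sequence to be clipped to the sequence end and empty windows to be reported; that step is left out of this sketch.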


API documentation