Skip to content

Commit 1a7dd94

Browse files
authored
Remove S5F contents and mark experimental features (#110)
* add mutability file excerpts to docs * make backward-compatible mutability loading * format and lint
1 parent 915228d commit 1a7dd94

9 files changed

+93
-2075
lines changed

Diff for: Makefile

+2
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,8 @@ lint:
1919
flake8 . --count --max-complexity=30 --max-line-length=127 --statistics
2020

2121
docs:
22+
wget -O HS5F_Mutability.csv https://bitbucket.org/kleinstein/shazam/raw/ba4b30fc6791e2cfd5712e9024803c53b136e664/data-raw/HS5F_Mutability.csv
23+
wget -O HS5F_Substitution.csv https://bitbucket.org/kleinstein/shazam/raw/ba4b30fc6791e2cfd5712e9024803c53b136e664/data-raw/HS5F_Substitution.csv
2224
rm -f docs/outfile docs/outtree # remove phylip dnapars output
2325
make -C docs html
2426
make -C docs html # need it again to get images from quickstart commands

Diff for: S5F/Mutability.csv

-1,025
This file was deleted.

Diff for: S5F/Substitution.csv

-1,025
This file was deleted.

Diff for: docs/quickstart.rst

+16-4
Original file line numberDiff line numberDiff line change
@@ -100,6 +100,17 @@ For example here is the top ranked tree ``gctree.out.inference.1.svg``:
100100

101101
You will also see Python pickle files ``gctree.out.inference.[1,2,...].p`` containing a :obj:`gctree.CollapsedTree` object for each optimal tree, which can be loaded and manipulated via the API (e.g. plotted in various ways using :meth:`gctree.CollapsedTree.render`).
102102

103+
All parsimony trees found by dnapars, as well as branching process parameters
104+
are saved in the file ``gctree.out.inference.parsimony_forest.p``, containing a :class:`gctree.CollapsedForest` object.
105+
This file may be manipulated using ``gctree infer``, instead of providing
106+
a dnapars ``outfile``.
107+
108+
.. note::
109+
Although described below, using mutability parsimony or isotype parsimony
110+
as ranking criteria is experimental, and has not yet been shown in a careful
111+
validation to improve tree inference. Only the default branching process
112+
likelihood is recommended for tree ranking!
113+
103114
Criteria other than branching process likelihoods can be used to break ties
104115
between trees. Providing arguments ``--isotype_mapfile`` and
105116
``--idmapfile`` will allow trees to be ranked by isotype parsimony. Providing
@@ -109,15 +120,16 @@ lexicographically, first maximizing likelihood, then minimizing isotype
109120
parsimony and mutabilities, if such information is provided.
110121
Ranking priorities can be adjusted using the argument ``--ranking_coeffs``.
111122

112-
All parsimony trees found by dnapars, as well as branching process parameters
113-
are saved in the file ``gctree.out.inference.parsimony_forest.p``, containing a :class:`gctree.CollapsedForest` object.
114-
This file may be manipulated using ``gctree infer``. For example, to find the optimal tree
123+
For example, to find the optimal tree
115124
according to a linear combination of likelihood, isotype parsimony,
116125
mutabilities, and alleles:
117126

118-
.. command-output:: gctree infer gctree.out.inference.parsimony_forest.p --frame 1 --idmap idmap.txt --isotype_mapfile ../example/isotypemap.txt --mutability ../S5F/Mutability.csv --substitution ../S5F/Substitution.csv --ranking_coeffs 1 0.1 0 --outbase newranking --summarize_forest --tree_stats --verbose
127+
.. command-output:: gctree infer gctree.out.inference.parsimony_forest.p --frame 1 --idmap idmap.txt --isotype_mapfile ../example/isotypemap.txt --mutability ../HS5F_Mutability.csv --substitution ../HS5F_Substitution.csv --ranking_coeffs 1 0.1 0 --outbase newranking --summarize_forest --tree_stats --verbose
119128
:shell:
120129

130+
The files ``HS5F_Mutability.csv`` and ``HS5F_Substitution.csv`` are a context
131+
sensitive mutation model which can be downloaded from the `Shazam Project <https://bitbucket.org/kleinstein/shazam/src/master/data-raw/`>_.
132+
121133
By default, only the files listed above will be generated, with the optional argument ``--outbase`` specifying how the output files should be named.
122134

123135
.. image:: newranking.inference.1.svg

Diff for: gctree/branching_processes.py

+16-12
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,6 @@
1-
r"""
2-
This module contains classes for simulation and inference for a binary
1+
r"""This module contains classes for simulation and inference for a binary
32
branching process with mutation in which the tree is collapsed to nodes that
4-
count the number of clonal leaves of each type.
5-
"""
3+
count the number of clonal leaves of each type."""
64

75
from __future__ import annotations
86

@@ -378,7 +376,8 @@ def ll(
378376
p: np.float64,
379377
q: np.float64,
380378
) -> Tuple[np.float64, np.ndarray]:
381-
r"""Log likelihood of branching process parameters :math:`(p, q)` given tree topology :math:`T` and genotype abundances :math:`A`.
379+
r"""Log likelihood of branching process parameters :math:`(p, q)` given
380+
tree topology :math:`T` and genotype abundances :math:`A`.
382381
383382
.. math::
384383
\ell(p, q; T, A) = \log\mathbb{P}(T, A \mid p, q)
@@ -833,7 +832,8 @@ def support(
833832
weights: Optional[List[np.float64]] = None,
834833
compatibility: bool = False,
835834
):
836-
r"""Compute support from a list of bootstrap :class:`CollapsedTree` objects, and add to tree attibute.
835+
r"""Compute support from a list of bootstrap :class:`CollapsedTree`
836+
objects, and add to tree attibute.
837837
838838
Args:
839839
bootstrap_trees_list: List of trees
@@ -867,10 +867,9 @@ def local_branching(
867867
self, tau=1, tau0=1, infinite_root_branch=True, nan_root_lbr=False
868868
):
869869
r"""Add local branching statistics (Neher et al. 2014) as tree node
870-
features to the ETE tree attribute.
871-
After execution, all nodes will have new features ``LBI``
872-
(local branching index) and ``LBR`` (local branching ratio, below Vs
873-
above the node)
870+
features to the ETE tree attribute. After execution, all nodes will
871+
have new features ``LBI`` (local branching index) and ``LBR`` (local
872+
branching ratio, below Vs above the node)
874873
875874
Args:
876875
tau: decay timescale for exponential filter
@@ -994,7 +993,8 @@ def __init__(
994993
self.is_isotyped = False
995994

996995
def simulate(self, p: np.float64, q: np.float64, n_trees: int):
997-
r"""Simulate a forest of collapsed trees. Overwrites existing forest attribute.
996+
r"""Simulate a forest of collapsed trees. Overwrites existing forest
997+
attribute.
998998
999999
Args:
10001000
p: branching probability
@@ -1017,7 +1017,10 @@ def ll(
10171017
q: np.float64,
10181018
marginal: bool = False,
10191019
) -> Tuple[np.float64, np.ndarray]:
1020-
r"""Log likelihood of branching process parameters :math:`(p, q)` given tree topologies :math:`T_1, \dots, T_n` and corresponding genotype abundances vectors :math:`A_1, \dots, A_n` for each of :math:`n` trees in the forest.
1020+
r"""Log likelihood of branching process parameters :math:`(p, q)` given
1021+
tree topologies :math:`T_1, \dots, T_n` and corresponding genotype
1022+
abundances vectors :math:`A_1, \dots, A_n` for each of :math:`n` trees
1023+
in the forest.
10211024
10221025
If ``marginal=False`` (the default), compute the joint log likelihood
10231026
@@ -1637,6 +1640,7 @@ def f(x):
16371640

16381641
def _lltree(cm_counts, p: np.float64, q: np.float64) -> Tuple[np.float64, np.ndarray]:
16391642
r"""Log likelihood of branching process parameters :math:`(p, q)`
1643+
16401644
.. math::
16411645
\ell(p, q; T, A) = \log\mathbb{P}(T, A \mid p, q)
16421646

Diff for: gctree/cli.py

+6-1
Original file line numberDiff line numberDiff line change
@@ -592,7 +592,9 @@ def get_parser():
592592
default=None,
593593
help=(
594594
"Path to mutability model file. If --mutability and --substitution are both provided, "
595-
"they will be used to rank trees after likelihood and isotype parsimony."
595+
"they will be used to rank trees after likelihood and isotype parsimony. This shall be a csv file"
596+
"with the first column containing fivemers, and the second column containing mutability scores."
597+
"See a file excerpt in the documentation for :meth:`mutation_model.MutationModel`."
596598
),
597599
)
598600
parser_infer.add_argument(
@@ -602,6 +604,9 @@ def get_parser():
602604
help=(
603605
"Path to substitution model file. If --mutability and --substitution are both provided, "
604606
"they will be used to rank trees after likelihood and isotype parsimony."
607+
"This shall be a csv file with the first column containing fivemers, and the next four"
608+
"columns containing targeting probabilities for bases A, C, G, and T, respectively."
609+
"See a file excerpt in the documentation for :meth:`mutation_model.MutationModel`."
605610
),
606611
)
607612
parser_infer.add_argument(

Diff for: gctree/mkconfig.py

-1
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,6 @@
1212
1313
$ mkconfig sequence.phy > dnapars.cfg
1414
$ dnapars < dnapars.cfg
15-
1615
"""
1716
import os
1817
import argparse

Diff for: gctree/mutation_model.py

+48-5
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,41 @@ class MutationModel:
2222
mutation_order: whether or not to mutate sequences using a context sensitive manner
2323
where mutation order matters
2424
with_replacement: allow the same position to mutate multiple times on a single branch
25+
26+
Notes:
27+
``mutability_file`` shall be a csv file with the first column containing fivemers,
28+
and the second column containing mutability scores.
29+
An example can be found at https://bitbucket.org/kleinstein/shazam/src/master/data-raw/HS5F_Mutability.csv
30+
31+
For example:
32+
33+
.. code-block:: text
34+
35+
Fivemer,Mutability,...
36+
TCGGG,0.03542,...
37+
GCCGG,0.02241675,...
38+
GCCGC,0.06789,...
39+
.
40+
.
41+
.
42+
43+
44+
``substitution_file`` shall be a csv file with the first column containing fivemers,
45+
and the next four columns containing targeting probabilities for bases A, C, G,
46+
and T, respectively.
47+
An example can be found at https://bitbucket.org/kleinstein/shazam/src/master/data-raw/HS5F_Substitution.csv
48+
49+
For example:
50+
51+
.. code-block:: text
52+
53+
Fivemer,A,C,G,T,...
54+
AAAAA,0,0.33,0.33,0.34,...
55+
AAAAC,0,0.5000,0.2500,0.2500,...
56+
AAAAG,0,0.65,0.15,0.20,...
57+
.
58+
.
59+
.
2560
"""
2661

2762
def __init__(
@@ -39,7 +74,10 @@ def __init__(
3974
# eat header
4075
f.readline()
4176
for line in f:
42-
motif, score = line.replace('"', "").split()[:2]
77+
if line[0] == '"':
78+
motif, score = line.replace('"', "").split()[:2]
79+
else:
80+
motif, score = line.replace(",", " ").split()[:2]
4381
self.context_model[motif] = float(score)
4482

4583
# kmer k
@@ -48,7 +86,10 @@ def __init__(
4886
# eat header
4987
f.readline()
5088
for line in f:
51-
fields = line.replace('"', "").split()
89+
if line[0] == '"':
90+
fields = line.replace('"', "").split()
91+
else:
92+
fields = line.replace(",", " ").split()
5293
motif = fields[0]
5394
if self.k is None:
5495
self.k = len(motif)
@@ -69,8 +110,9 @@ def _disambiguate(sequence):
69110
return _sequence_disambiguations(sequence)
70111

71112
def mutability(self, kmer: str) -> Tuple[np.float64, np.float64]:
72-
r"""Returns the mutability of a central base of :math:`k`-mer, along with
73-
nucleotide bias averages over ambiguous ``"N"`` nucleotide identities.
113+
r"""Returns the mutability of a central base of :math:`k`-mer, along
114+
with nucleotide bias averages over ambiguous ``"N"`` nucleotide
115+
identities.
74116
75117
Args:
76118
kmer: nucleotide :math:`k`-mer
@@ -213,7 +255,8 @@ def simulate(
213255
n: Optional[int] = None,
214256
verbose: bool = False,
215257
) -> TreeNode:
216-
r"""Simulate a neutral binary branching process with the mutation model, returning a :class:`ete3.Treenode` object.
258+
r"""Simulate a neutral binary branching process with the mutation model,
259+
returning a :class:`ete3.Treenode` object.
217260
218261
Args:
219262
sequence: root nucleotide sequence

Diff for: tests/smalltest.sh

+5-2
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,13 @@
11
#!/bin/bash
2+
set -eu
23

34
export QT_QPA_PLATFORM=offscreen
45
export XDG_RUNTIME_DIR=/tmp/runtime-runner
56
export MPLBACKEND=agg
67
mkdir -p tests/smalltest_output
8+
wget -O HS5F_Mutability.csv https://bitbucket.org/kleinstein/shazam/raw/ba4b30fc6791e2cfd5712e9024803c53b136e664/data-raw/HS5F_Mutability.csv
9+
wget -O HS5F_Substitution.csv https://bitbucket.org/kleinstein/shazam/raw/ba4b30fc6791e2cfd5712e9024803c53b136e664/data-raw/HS5F_Substitution.csv
710
gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel
811
gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt
9-
gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --mutability S5F/Mutability.csv --substitution S5F/Substitution.csv
10-
gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability S5F/Mutability.csv --substitution S5F/Substitution.csv
12+
gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv
13+
gctree infer tests/small_outfile tests/abundances.csv --outbase tests/smalltest_output/gctree.infer --root GL --frame 1 --verbose --idlabel --idmapfile tests/idmap.txt --isotype_mapfile tests/isotypemap.txt --mutability HS5F_Mutability.csv --substitution HS5F_Substitution.csv

0 commit comments

Comments
 (0)