Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpectedly High Proportion Genes Showing Relaxed Selection in HyPhy RELAX Analysis #59

Open
jdaron opened this issue Nov 20, 2024 · 3 comments

Comments

@jdaron
Copy link

jdaron commented Nov 20, 2024

Dear HyPhy developers,

I have recently been utilizing the HyPhy RELAX tool to analyze a set of ~2000 orthologous genes identified across various species of Drosophila and mosquitoes. In my analyses, I estimate the selection intensity parameter K at every node of the phylogenetic tree using the General Descriptive model.

Below is an example of a command line I used for launching the analysis:
hyphy RELAX --alignment $orthoGene.codonAlign.aln --tree m1.tree --test Foreground

Here is the input tree provided to RELAX:
(dmel,((albi,(aatro,(ansteph,angam)Node7)Node5)Node3,(urlow,(cuquin,((tosep,(sabcya,wysmi)Node17)Node15,(aekor,(arsuba,(aalb,aaegL5{Foreground})Node24)Node22)Node20)Node14)Node12)Node10));

In the bar graph below, I plotted the distribution of the exponent K for all tips and nodes. Surprisingly, most Drosophila melanogaster (dmel) genes have a K value less than 1 (K<1). In the plot, the dotted colored line represents the median value across the dataset, which I plotted alongside the tree for clearer visualization.

I have been replicating this analysis using the same gene set and alignment with CODEML, I obtained a dN/dS value of 0.002, which confirm that something is wrong in the way I am conducting the analysis with HyPhy RELAX and that Drosophila melanogaster could not have so many genes under relaxed selection.

I would greatly appreciate any suggestions or insights you could provide to help understand what might be going wrong in my analysis with HyPhy RELAX.

Best regards,
Josquin

relax.Kdistribution.pdf
relax.tree.pdf

@spond
Copy link
Member

spond commented Nov 20, 2024

Dear @jdaron,

Generally, when you test a single branch in RELAX you are going to get incredibly noisy estimates of K. My suggestions would be as follows

  1. Treat point estimates of K with caution. I would suggest focusing on the subset of genes where K is inferred to be singificantly different from 1, i.e. there's a suitable p-value for the test.
  2. Use --models Minimal as an additional command line argument.
  3. Confirm that your HyPhy and RELAX versions are recent.
  4. The comparison with mean dN/dS estimates are not particularly informative. K<1 may often arise if there's reduced positive selection component.

Could you share one or two genes with me so that I could run RELAX with a few different options and give you a clearer explanation?

Best,
Sergei

@jdaron
Copy link
Author

jdaron commented Nov 21, 2024

Dear Sergei,

Thanks for your quick answer. I am currently working with hyphy version HYPHY 2.5.2, but I will try updating it with more recent version.

I follow your advice and run RELAX using Drosophila melanogaster as foreground node, and I got 533 and 42 genes under relax or constrained selection with significant pvalue<0.05, which is consistant with my previous result but quiet unlikely from a biological perspective.

(dmel{Foreground},((albi,(aatro,(angam,ansteph))),(urlow,(cuquin,((tosep,(wysmi,sabcya)),(aekor,(arsuba,(aaegL5,aalb))))))));
hyphy RELAX --alignment $prefix.codonAlign.aln --tree m0.tree --test Foreground --models Minimal

However in this analysis I am more interested by the signal in the rest of the tree, especially some of the basal node, such as Node10 or Node3. If you don't recommend to use the General Descriptive model to infer the K at each node you will suggest to test each node individually and count the number of significative genes under relaxed and constrained selection?

Here is a folder of 20 genes (10 with significant K<1 or K>1) I am using if you want to have a look at them.
genes.tar.gz

Best,
Josquin

@spond
Copy link
Member

spond commented Nov 21, 2024

Dear @jdaron,

2.5.2 is well over 5 years old. I strongly recommend upgrading to a newer version. This will also include many improvements and enhancements to RELAX specifically.

Could you, perhaps, provide some more background on what the specific hypothesis you are interested in testing? This will help me tailor my response better.

A couple of things to note (I am using hyphy v2.5.63 here).

Results could be sensitive to run options.

Here are some examples (* = p-value < 0.01)

Vanilla default

hyphy relax --alignment $FILE --tree ~/Downloads/Issue59/tree.nwk --test Foreground

Filename K p-value
OG2664.aln.RELAX.json 1.00 1.0000
OG992.aln.RELAX.json 0.81 0.0197
OG8516.aln.RELAX.json 1.16 0.2829
OG10631.aln.RELAX.json 0.02 0.0171
OG13474.aln.RELAX.json 0.00 0.0000*
OG2135.aln.RELAX.json 0.49 0.0002*
OG14176.aln.RELAX.json 0.78 0.2333
OG11507.aln.RELAX.json 1.39 0.0506
OG5425.aln.RELAX.json 1.00 0.9979
OG13412.aln.RELAX.json 1.00 1.0000
OG5796.aln.RELAX.json 0.80 0.5770
OG6141.aln.RELAX.json 1.09 0.6559
OG16140.aln.RELAX.json 0.20 0.0562
OG18330.aln.RELAX.json 1.03 0.6307
OG6625.aln.RELAX.json 0.23 0.0000*
OG12982.aln.RELAX.json 1.26 0.2271
OG10154.aln.RELAX.json 0.00 0.0010*
OG8918.aln.RELAX.json 0.85 0.6352
OG2586.aln.RELAX.json 0.05 0.0000*

Do not run all models (but just the minimal models)

Depending on the dataset, and espcially for small test branch sets, this may improve initial condition selection for RELAX. The reason for it that the fully unconstrained model may be pulling the initial conditions to extreme values of K etc.

hyphy relax --alignment $FILE --tree ~/Downloads/Issue59/tree.nwk --test Foreground --models Minimal

Total 17 of which 2 are relaxed and 15 are intensified

Filename K p-value
OG2664.aln.RELAX.json 0.62 0.0121
OG992.aln.RELAX.json 0.23 0.0009*
OG8516.aln.RELAX.json 0.39 0.0064*
OG10631.aln.RELAX.json 0.07 0.0087*
OG13474.aln.RELAX.json 0.00 0.0000*
OG2135.aln.RELAX.json 0.00 0.0000*
OG14176.aln.RELAX.json 0.76 0.2145
OG11507.aln.RELAX.json 1.75 0.0220
OG5425.aln.RELAX.json 0.00 0.0055*
OG13412.aln.RELAX.json 0.00 0.0121
OG5796.aln.RELAX.json 1.24 0.4079
OG6141.aln.RELAX.json 0.29 0.0002*
OG14096.aln.RELAX.json 0.07 0.0000*
OG16140.aln.RELAX.json 0.30 0.0240
OG18330.aln.RELAX.json 2.45 0.0131
OG6625.aln.RELAX.json 0.23 0.0130
OG12982.aln.RELAX.json 0.05 0.0021*
OG10154.aln.RELAX.json 0.00 0.0010*
OG8918.aln.RELAX.json 0.88 0.6478
OG2586.aln.RELAX.json 0.00 0.0000*

Add synonymous rate variation, simplify distribution for small datasets, increase the number of starting points

Synonymous rate variation is a significant cofounder in rate selection tests. Having 3 dN/dS rates for small alignments is an overkill.

hyphy relax --alignment $FILE --tree ~/Downloads/Issue59/tree.nwk --test Foreground --models Minimal --srv Yes --rates 2 --syn-rates 2 --starting-points 5

Total 10 of which 0 are relaxed and 10 are intensified

Filename K p-value
OG2664.aln.RELAX.json 0.83 0.0907
OG992.aln.RELAX.json 0.68 0.0516
OG8516.aln.RELAX.json 0.64 0.0093*
OG10631.aln.RELAX.json 0.65 0.0344
OG13474.aln.RELAX.json 0.58 0.0001*
OG2135.aln.RELAX.json 0.64 0.0004*
OG14176.aln.RELAX.json 0.68 0.3406
OG11507.aln.RELAX.json 1.25 0.1315
OG5425.aln.RELAX.json 0.64 0.0006*
OG13412.aln.RELAX.json 0.83 0.2786
OG5796.aln.RELAX.json 1.08 0.5635
OG6141.aln.RELAX.json 0.82 0.0798
OG14096.aln.RELAX.json 0.40 0.0001*
OG16140.aln.RELAX.json 0.48 0.0311
OG18330.aln.RELAX.json 1.42 0.3591
OG6625.aln.RELAX.json 0.70 0.0029*
OG12982.aln.RELAX.json 0.60 0.0100
OG10154.aln.RELAX.json 0.79 0.1889
OG8918.aln.RELAX.json 1.10 0.5307
OG2586.aln.RELAX.json 0.67 0.0000*

The moral of the story is that the default settings are actually quite off, because the initial (parameter rich) model gets stuck in bad areas of the parameter space.

Sanity Checks

RELAX outputs "mean" ω estimates for foregound and background branches. Generally (but not always), lower mean ω is "more conserved", higher mean ω is "less conserved".

This is reported in json["fits"]["MG94xREV with separate rates for branch sets"]["Rate Distributions"].

Filename Reference Test
OG2664.aln.RELAX.json 0.058 0.042
OG992.aln.RELAX.json 0.063 0.030
OG8516.aln.RELAX.json 0.007 0.015
OG10631.aln.RELAX.json 0.188 0.077
OG13474.aln.RELAX.json 0.055 0.005
OG2135.aln.RELAX.json 0.115 0.032
OG14176.aln.RELAX.json 0.049 0.033
OG11507.aln.RELAX.json 0.015 0.063
OG5425.aln.RELAX.json 0.107 0.040
OG13412.aln.RELAX.json 0.044 0.010
OG5796.aln.RELAX.json 0.018 0.068
OG6141.aln.RELAX.json 0.044 0.016
OG14096.aln.RELAX.json 0.057 0.008
OG16140.aln.RELAX.json 0.023 0.007
OG18330.aln.RELAX.json 0.006 0.037
OG6625.aln.RELAX.json 0.104 0.053
OG12982.aln.RELAX.json 0.102 0.027
OG10154.aln.RELAX.json 0.099 0.059
OG8918.aln.RELAX.json 0.007 0.022
OG2586.aln.RELAX.json 0.165 0.074

Heterogeneity across branches.

One other sanity check to look for is this: are there "outlier" branches? For example, if you have a single "background" branch with a very high ω (e.g. a short branch with only non-synonymous substitutions), this could pull the entire reference distribution away from the test branch. One way to deal with this is to narrow your reference set, if that's appropriate.

For example, here I am using absrel to profile branch-heterogeneity on one particular example dataset, showing such variability.

hyphy ~/Development/hyphy-analyses/FitMG94/FitMG94.bf --alignment OG2135.aln --tree tree.nwk --type local --lrt Yes
### Testing selected branches for selection

|              Branch               |  Rates   |     Max. dN/dS     | Sites @ EBF>=100 |      Test LRT      |Uncorrected p-value |
|-----------------------------------|----------|--------------------|------------------|--------------------|--------------------|
|               urlow               |     2    |    0.34 ( 2.07%)   |       N/A        |        0.00        |       1.00000      |
|               dmel                |     3    |   12.62 ( 7.47%)   |         28       |        5.21        |       0.02661      |
|               tosep               |     2    |    5.40 ( 0.83%)   |          4       |        2.15        |       0.13101      |
|              ansteph              |     2    |    2.88 ( 0.22%)   |          2       |        0.48        |       0.33415      |
|              Node17               |     1    |   0.01 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|               aekor               |     2    |    0.48 ( 2.39%)   |       N/A        |        0.00        |       1.00000      |
|               albi                |     2    |    0.36 ( 7.62%)   |       N/A        |        0.00        |       1.00000      |
|              Node14               |     2    |    2.92 ( 0.16%)   |          1       |        0.23        |       0.39361      |
|              arsuba               |     1    |   0.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|               wysmi               |     1    |   0.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              cuquin               |     3    |   >1000 ( 0.72%)   |         14       |       10.32        |       0.00199      |
|              sabcya               |     1    |   0.01 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|               aatro               |     1    |   0.01 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|               Node3               |     2    |   >1000 ( 2.14%)   |          3       |       12.37        |       0.00071      |
|              aaegL5               |     1    |   0.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              Node20               |     2    |   14.64 ( 0.34%)   |          3       |        1.89        |       0.15006      |
|               aalb                |     1    |   0.01 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              Node10               |     2    |   16.70 ( 2.02%)   |          4       |        8.85        |       0.00417      |
|               angam               |     1    |   0.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              Node15               |     1    |   0.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|               Node7               |     1    |   0.04 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|               Node5               |     1    |   0.02 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              Node24               |     1    |   0.04 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              Node22               |     1    |   0.04 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              Node12               |     1    |  >1000 (100.00%)   |       N/A        |        0.02        |       0.47715      |
----
### Adaptive branch site random effects likelihood test 
Likelihood ratio test for episodic diversifying positive selection at Holm-Bonferroni corrected _p =   0.0500_ found **2** branches under selection among **25** tested.

* Node3, p-value =  0.01767
* cuquin, p-value =  0.04776
image

Best,
Sergei

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants