Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new SVA #12

Open
kfuku52 opened this issue Jul 7, 2020 · 46 comments
Open

new SVA #12

kfuku52 opened this issue Jul 7, 2020 · 46 comments
Assignees

Comments

@kfuku52
Copy link
Owner

kfuku52 commented Jul 7, 2020

No description provided.

@kfuku52
Copy link
Owner Author

kfuku52 commented Oct 8, 2020

Please check this paper out. We want to compare the performance of the SVA correction of log-transformed counts (current implementation), and newer methods designed specifically for non-log-transformed RNA-seq raw count data (ComBat-seq, SVA-seq, RUV-seq).
https://academic.oup.com/nargab/article/2/3/lqaa078/5909519

@Hego-CCTB
Copy link
Collaborator

Any suggestion as to what data set I should use for testing/benchmarking?

@kfuku52
Copy link
Owner Author

kfuku52 commented Oct 9, 2020

You would have to generate your own dataset. You can use Monodelphis domestica from my paper but it's more straight forward to use a plant species you study.

@Hego-CCTB
Copy link
Collaborator

I'm looking into the implementation of ComBat-seq now.

@Hego-CCTB
Copy link
Collaborator

In order to use CombatSeq I have to provide a "Batch" vector (I can include other parameters, such as tissues as a "group" parameter.).

What would constitute as the batch in SRA metadata? The Bioproject?

@kfuku52
Copy link
Owner Author

kfuku52 commented Dec 11, 2020

BioProject tends to be the strongest predictor of SVs, so let's try BioProject alone first.

@Hego-CCTB
Copy link
Collaborator

The data set I'm using is Zea mays (my largest dataset bot in terms of individual runs and total transcripts). Matrix size is 147x131585.

This is how SVA in its current form performs (final output):
Runtime for final SVA-adjustment: 139.065 seconds

Zea_mays.3.correlation_cutoff.sva.pdf

Here are some PCAs of CombatSeq Results:
Runtime for single adjustment by CombatSeq: 92.961 seconds.

RAW COUNT PCA
image
COMBAT SEQ ADJUSTED COUNTS
image
COMBAT SEQ ADJUSTED COUNTS AND SUBSEQUENT LOG-FPKM TRANSFORMATION
image

This is how it looks if you run CombatSeq together with iterative outlier removal (adjusted, but untransformed):

Zea_mays.4.correlation_cutoff.cbs.pdf

After applying log-FPKM transform on the adjusted counts, it actually looks like the dataset improved. However, SVA is much more powerful in terms of separation.

The big advantage of CombatSeq is, that it preserves the discrete nature of the counts, which some downstream analysis packages need (for example DEseq2).

@kfuku52
Copy link
Owner Author

kfuku52 commented Dec 18, 2020

Thanks! It looks like the program worked correctly, which is nice. We should keep this result for a side-by-side comparison in the eventual publication. Please try SVA-seq next. I hope it outperforms SVA as it should be for RNA-seq data.

@Hego-CCTB
Copy link
Collaborator

Yup, moving on to SVA-seq.

Another note:

I have tried to include other covariates as well (the same ones that SVA is currently adjusting for), but I get the error: "At least one covariate is confounded with batch! Please remove confounded covariates and rerun ComBat-Seq". I could not determine what this actually means and how to fix it.
Also CombatSeq can not handle cases, where there is only one sample (i.e. run) for a given batch (i.e. bioproject).

@kfuku52
Copy link
Owner Author

kfuku52 commented Dec 18, 2020

That could mean that two or more covariates happened to be identical combinations or too similar to each other, but let's get SVA-seq done first.

@Hego-CCTB
Copy link
Collaborator

OK, SVA-seq implementation is really easy, since it uses the exact same inputs as SVA, it's just tailored to raw counts instead of transformed ones. Only difference in code is literally just the function name.

Run time of SVA-seq: 131.88 seconds

image

Separation looks really nice and PC1 already accounts for almost all of the variance, meaning batch effect is almost completely sorted out.

Caveat of this is, that FPKM/TPM transformation will not be possible afterwards, because SVAseq can produce negative values. I also have read that Svaseq applies log transformation as a first step in the algorithm.

Iterative outlier removal looks a bit weird right now, not entirely sure what the issue is, but the above data suggests SVAseq might be a good alternative.
Zea_mays.4.correlation_cutoff.sva.pdf

@kfuku52
Copy link
Owner Author

kfuku52 commented Dec 18, 2020

Hmm... the boxplot looks weird indeed. Did you apply log2 transformation after the SVAseq correction before calculating Pearson's correlation coefficient for the iterative outlier removal? Results may be biased by highly expressed genes if not.

@Hego-CCTB
Copy link
Collaborator

Yeah, that might be the issue here.

I did not perform log transform, since I assumed SVAseq does that itself (it's the first thing mentioned in the manual). However, it looks like it undoes log transformation before returning the data. The parabolic shape of the PCA is an indicator that as well.

@Hego-CCTB
Copy link
Collaborator

RUVseq results:
RUVseq offers different approaches for variation elimination. The one used here is based on calculating empirical control genes first (housekeeping genes, basically), which are then used for RUVseq.
Runtime: 87.496 seconds
This PCA is AFTER FPKM+log transformation of the counts normalized by RUVseq:
image

Separation looks decent, however, total explained variation is a bit low across the two PCs. Also RUVseq is not compatible with Kallisto data, since Kallisto produces non-integer counts, which RUVseq does not seem to like (I just rounded the counts here to avoid having to make a new dataset, so results are technically falsified slightly).

@Hego-CCTB
Copy link
Collaborator

This is RUVseq estimating the factors of unwanted variation using residuals, also after FPKM+log transform:
image

We get a bit more variation explained this way, but separation doesn't look as strong.

@Hego-CCTB
Copy link
Collaborator

Just for completeness, here is the current SVA implementation:
Runtime 151.142 seconds
image

@kfuku52
Copy link
Owner Author

kfuku52 commented Jan 11, 2021

Thanks! Can you make a quantitative comparison between the methods using the boxplots like this?
image

@Hego-CCTB
Copy link
Collaborator

on it!

@Hego-CCTB
Copy link
Collaborator

So, I'm at a loss here. After calculating pearson's correlation, it turn's out that SVAseq produces massive negative correlations outside of same-tissue areas.

Current SVA:
SVA
SVA_heat

SVAseq:
SVAseq

All that's white here is negative:
SVAseq_heat
RUVseq, no log, no FPKM:
RUVseq
RUVseq_heat
RUVseq + log:
RUV-log
RUV-log_heat

CombatSeq:
CombatSeq
CombatSeq_heat

@kfuku52
Copy link
Owner Author

kfuku52 commented Jan 12, 2021

Are you comparing them in the same unit (i.e., log-FPKM)?

@Hego-CCTB
Copy link
Collaborator

SVA and CombatSeq have log-FPKM applied, SVAseq and RUVseq are incompatible with FPKM calculations, so they are only logarithmized.
There are two different RUV packages (different authors as RUVseq), which I haven't tried yet, but are supposedly able to handle log-FPKM as input.

@kfuku52
Copy link
Owner Author

kfuku52 commented Jan 12, 2021

Could you try to force log-FPKM-ish transformation for SVA-/RUV-seq? We can't compare apples and oranges so the values should be adjusted as much as possible when compared, even though the transformation procedure may not be perfect. I guess a big problem for FPKM would be the presence of negative values after the SVA-/RUV-seq correction. If negative values are seen in a minor fraction of genes (up to several hundred genes in a genome), you can round them to zero to enable an FPKM-like transformation.

@Hego-CCTB
Copy link
Collaborator

this worked for SVAseq, by setting all negative values to 0:
SVAseq_log_FPKM
SVAseq_Heat_log_fpkm

RUVseq, however, becomes weird. It pretty much equalizes the whole dataset.
RUVseq-log_fpkm
RUVseq_log_fpkm_heat

@kfuku52
Copy link
Owner Author

kfuku52 commented Jan 12, 2021

Thanks! I wounder if the majority of genes are negative in RUV-seq. Anyway, SVA and SVA-seq seem the two best methods so far. Could you compare SVA, SVA-seq, and no-correction side-by-side with the same Y axis?

@Hego-CCTB
Copy link
Collaborator

RUVseq doesn't produce negative values at all. But it might be, that it doesn't handle low expressed genes very well. I only remove unexpressed genes, but I leave lowly expressed ones in. Comparing the raw counts with the normalized ones, it looks like the data was centrized: genes with counts of 2 get bumped up to 30 and genes with counts of 700 go down to 100. If you then apply log-FPKM, everything gets into the range of one order of magnitude.

I suspect this is a combination of low expressed genes and the upper quartile normalization RUVseq applies. I'll try removing low expressed genes as well.

@kfuku52
Copy link
Owner Author

kfuku52 commented Mar 3, 2021

Are you still working on it? If not, please close this issue with a brief summary.

@kfuku52
Copy link
Owner Author

kfuku52 commented Sep 21, 2022

Batch correction paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8895431/

@kfuku52
Copy link
Owner Author

kfuku52 commented Sep 21, 2022

Here is another batch correction paper: https://www.nature.com/articles/s41587-022-01440-w

@kfuku52
Copy link
Owner Author

kfuku52 commented Mar 7, 2023

@Hego-CCTB I will take care of it if you don't have time.

@Hego-CCTB
Copy link
Collaborator

I'll keep working on this. Give me a week!

@Hego-CCTB
Copy link
Collaborator

OK. I have implemented svaseq, combatseq and ruvseq. I ran all different batch-effect-removal algorithms (including sva) on 29 different plant species (I'll run the animal set as well this weekend). Short explanation on how these were implemented and the logic behind it.
I'll show a case here, where all algorithms perform well, but this will not be the full picture, of course. All plots you see show log2-FPKM transformed values.

uncorrected values (just log2fpkm transformed)

Prunus_persica.2.correlation_cutoff.pdf

sva

Nothing has changed. This will be the correction everything will be compared against.

Prunus_persica.2.correlation_cutoff.sva.pdf

svaseq

Identical to sva, except it's supposed to take in counts, rather than log-transformed counts. The documentation states that the first thing svaseq does is to apply a log(x +c) transformation, where X is the count matrix and C is a constant (defaults to 1) . So the initial logic was:

  1. apply FPKM
  2. apply temporary log-transformation only for plotting
  3. run iterative outlier removal
  4. apply temprorary log-transformation before every check_within_tissue correlation, since pearson-correlation is made for linear space.
  5. apply log-transformation on the last iteration

These were the results:
Prunus_persica.2.correlation_cutoff.svaseq.pdf

HOWEVER: When I treat svaseq exactly the same way as sva. So log2FPKM BEFORE any calculations start, it performs almost identical to sva, sometimes even slightly outperforms it. I have no explanation as to why this happens.
Prunus_persica.2.correlation_cutoff.svaseq.log_before_sva.pdf

combatseq

combatseq relies on known sources of batch effects. Since from experience the biggest source is almost always bioproject (at least judging from sva output), that's what I tell combatseq to look for.
Additionally, you can supply a covariate which represents the biological signal (curate_group in our case). Combatseq will try to preserve this signal.
Also of note: combatseq can not handle cases, where a batch contains only one sample (i.e. cases where there is only 1 sample for a given bioproject), so I need to remove those bioprojects from the dataset.

The implementation logic is to run combatseq on untransformed raw-counts and apply log2FPKM for check_within_tissue correlation and for every plot. And keep the log2FPKM transformation when the algorithm is done.

  1. no FPKM OR log transformation
  2. apply temporary log2FPKM-transformation only for plotting
  3. run iterative outlier removal
  4. apply temprorary log2FPKM--transformation before every check_within_tissue correlation, since pearson-correlation is made for linear space.
  5. apply log2FPKM-transformation on the last iteration

Prunus_persica.2.correlation_cutoff.combatseq.pdf

RUVseq

RUVseq needs control genes (i.e. genes that don't change expression across conditions/treatments/tissues) to find unwanted variation. There are a lot of different ways to do this:

  • spiked-in genes. Requires specific experimental design and is thus not useful
  • Known set of housekeeping genes. We could use BUSCO derived single-copy orthogroup genes, which may be worth checking out.
  • Taking the genes with the least significant differential expression
  • the residuals (e.g., deviance residuals) from a first-pass GLM regression of the counts on the covariates of interest

The last one (residuals) is what I went with. The outlier-removal/tissue-correlation logic is the same as in Combatseq:run combatseq on untransformed raw-counts and apply log2FPKM for check_within_tissue correlation and for every plot. And keep the log2FPKM transformation when the algorithm is done.

  1. no FPKM OR log transformation
  2. apply temporary log2FPKM-transformation only for plotting
  3. run iterative outlier removal
  4. apply temprorary log2FPKM--transformation before every check_within_tissue correlation, since pearson-correlation is made for linear space.
  5. apply log2FPKM-transformation on the last iteration

Prunus_persica.2.correlation_cutoff.ruvseq.pdf

@kfuku52
Copy link
Owner Author

kfuku52 commented Apr 14, 2023

Great! CombatSeq seems to be a good choice when single-sample BioProjects can be excluded. We should probably use sva in curate by default and CombatSeq as an option.

Especially Combatseq performed exceptionally.

This would be great to discuss in the amalgkit paper.

@Hego-CCTB
Copy link
Collaborator

Hego-CCTB commented Apr 14, 2023

Yeah, sva (or svaseq) should remain default. They perform consistently well independent of dataset. What bugs me is why svaseq behaves so well when I feed it the supposedly wrong input (i.e. log-transformed values instead of counts). I feel like I'm missing or misunderstood something.

I haven't had a chance to look at all 29 species yet, but the pattern really seems to be the size of the dataset (i.e. the number of batch variables) that determines how well Combatseq and RUVseq perform. This could mean that combatseq may be outperforming sva for private/local fastq projects like my carnivore set. I'll have a look what happens when I run combatseq on that.

Combatseq can potentially be improved, as it is possible to add more covariates.
As for RUVseq, I'd like to try out the 'least significant differentially expressed genes' as control genes. Maybe it will improve performance on larger sets. What seems to be happening is over-correction. Maybe the GLM the residuals stem from isn't robust enough to handle many different bioprojects, but that is pure speculation.

@kfuku52
Copy link
Owner Author

kfuku52 commented May 31, 2023

Could you expose this functionality as a new option in curate?

@kfuku52
Copy link
Owner Author

kfuku52 commented Jul 7, 2023

The only difference between sva and svaseq appears to be whether the function automatically converts the input expression data with the logn(x+1) transformation.
https://rdrr.io/bioc/sva/src/R/sva.R
https://rdrr.io/bioc/sva/src/R/svaseq.R

amalgkit curate already supports various log transformations via --norm, so the use of svaseq does not make sense. I will remove --batch_effect_alg svaseq.

@kfuku52
Copy link
Owner Author

kfuku52 commented Jul 7, 2023

@Hego-CCTB In #12 (comment), I suspect that the file Oryza_sativa.5.correlation_cutoff.combatseq.pdf does not represent combatseq-corrected data. PRJNA404045 appears in the plot even though it only contains one sample. This sample should have been excluded if combatseq had been correctly applied. This issue may have arisen because the original, uncorrected data was returned when combatseq failed. You can see this at

amalgkit/amalgkit/curate.r

Lines 406 to 410 in 772c190

} else{
cat("Combatseq correction failed. Returning the original transcriptome data.\n")
tc = rbind(tc, tc_ne)
sva1= ''
}

RUVseq might encounter a similar issue. As a temporary measure, I will deactivate it so that amalgkit returns an error when batch correction fails.

@Hego-CCTB
Copy link
Collaborator

Hego-CCTB commented Oct 31, 2023

I'll investigate this. Something is happening to the matrix, but that could just be the log that's applied before plotting. Also just so we are on the same page: the sample (and BP) removal happens independently and before the batch-effect removal. So if the sample is there, it means something in check_within_curate_group_correlation() went wrong, rather than in batch_effect_subtraction().

EDIT:
PRJNA404045 is present in the sva corrected output as well.

@Hego-CCTB
Copy link
Collaborator

I would like to be able to discuss, or at least mention that RUV and ComBat are available for amalgkit in the paper, so I'll try to get this issue sorted this week.

@kfuku52
Copy link
Owner Author

kfuku52 commented Jul 8, 2024

Have you had a chance to investigate this issue?

@Hego-CCTB
Copy link
Collaborator

I can definitely say this has nothing to do with sva alternatives, since the project shows up in SVA output as well. Looking at the code for check_within_sample_group_correlation(), where the outlier exlusion happens, I am not convinced that we've implemented a check that removes single-sample bioprojects. And since that sample meets the correlation threshold condition, it doesn't get removed.
Can you remind me what the rationale is to remove such bioprojects? I can push an update to remove single sample projects. In which case, how should we handle a fringe case where that single sample bioproject is also the only sample for a sample group?

I want to get closer to "official" support for the SVA alternatives. To do so, we need to properly evaluate performance. In the paper manuscript we've talked about dPCC as a metric for performance. I'm currently working on implementing the dPCC plot and it will make things easier if I finish that first.

@kfuku52
Copy link
Owner Author

kfuku52 commented Sep 25, 2024

This isn't a rationale but rather your code. I believe CombatSeq doesn't support groups with only a single sample, so you may have needed to remove them.

https://github.com/kfuku52/amalgkit/blame/772c1901f14d2fb7e5571005ab7f940210d74eb2/amalgkit/curate.r#L397-L406

@Hego-CCTB
Copy link
Collaborator

Ah, I was looking in the wrong place! I see what the issue is. The tc table is never properly updated during combatseq. It's an easy fix, it'll be part of the next update.

@Hego-CCTB
Copy link
Collaborator

I fixed this and another issue with combatseq. Here are the plant dataset results to see overall performance:

SVA

image

combatseq

image

ruvseq

image


While all options positively impact delta PCC on the dataset, SVA still performs best. Combatseq output matrices can be used for DEG analysis though, which is nice to have.

@Hego-CCTB
Copy link
Collaborator

I should add that combatseq tends to yield worse results in large sets with many different BPs, but is doing well with low complexity sets:

low-complexity uncorrected

image

low complexity corrected

image

high complexity uncorrected

image

high complexity corrected

image

@kfuku52
Copy link
Owner Author

kfuku52 commented Nov 18, 2024

I should add that combatseq tends to yield worse results in large sets with many different BPs, but is doing well with low complexity sets

Thanks! This is a good point to discuss in the AMALGKIT paper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants