new SVA #12

kfuku52 · 2020-07-07T10:05:22Z

No description provided.

kfuku52 · 2020-10-08T20:41:04Z

Please check this paper out. We want to compare the performance of the SVA correction of log-transformed counts (current implementation), and newer methods designed specifically for non-log-transformed RNA-seq raw count data (ComBat-seq, SVA-seq, RUV-seq).
https://academic.oup.com/nargab/article/2/3/lqaa078/5909519

Hego-CCTB · 2020-10-09T07:34:38Z

Any suggestion as to what data set I should use for testing/benchmarking?

kfuku52 · 2020-10-09T07:50:45Z

You would have to generate your own dataset. You can use Monodelphis domestica from my paper but it's more straight forward to use a plant species you study.

Hego-CCTB · 2020-12-04T10:53:27Z

I'm looking into the implementation of ComBat-seq now.

Hego-CCTB · 2020-12-11T13:33:58Z

In order to use CombatSeq I have to provide a "Batch" vector (I can include other parameters, such as tissues as a "group" parameter.).

What would constitute as the batch in SRA metadata? The Bioproject?

kfuku52 · 2020-12-11T13:43:41Z

BioProject tends to be the strongest predictor of SVs, so let's try BioProject alone first.

Hego-CCTB · 2020-12-18T13:50:12Z

The data set I'm using is Zea mays (my largest dataset bot in terms of individual runs and total transcripts). Matrix size is 147x131585.

This is how SVA in its current form performs (final output):
Runtime for final SVA-adjustment: 139.065 seconds

Zea_mays.3.correlation_cutoff.sva.pdf

Here are some PCAs of CombatSeq Results:
Runtime for single adjustment by CombatSeq: 92.961 seconds.

RAW COUNT PCA

COMBAT SEQ ADJUSTED COUNTS

COMBAT SEQ ADJUSTED COUNTS AND SUBSEQUENT LOG-FPKM TRANSFORMATION

This is how it looks if you run CombatSeq together with iterative outlier removal (adjusted, but untransformed):

Zea_mays.4.correlation_cutoff.cbs.pdf

After applying log-FPKM transform on the adjusted counts, it actually looks like the dataset improved. However, SVA is much more powerful in terms of separation.

The big advantage of CombatSeq is, that it preserves the discrete nature of the counts, which some downstream analysis packages need (for example DEseq2).

kfuku52 · 2020-12-18T14:00:19Z

Thanks! It looks like the program worked correctly, which is nice. We should keep this result for a side-by-side comparison in the eventual publication. Please try SVA-seq next. I hope it outperforms SVA as it should be for RNA-seq data.

Hego-CCTB · 2020-12-18T14:10:34Z

Yup, moving on to SVA-seq.

Another note:

I have tried to include other covariates as well (the same ones that SVA is currently adjusting for), but I get the error: "At least one covariate is confounded with batch! Please remove confounded covariates and rerun ComBat-Seq". I could not determine what this actually means and how to fix it.
Also CombatSeq can not handle cases, where there is only one sample (i.e. run) for a given batch (i.e. bioproject).

kfuku52 · 2020-12-18T14:19:24Z

That could mean that two or more covariates happened to be identical combinations or too similar to each other, but let's get SVA-seq done first.

Hego-CCTB · 2020-12-18T15:30:16Z

OK, SVA-seq implementation is really easy, since it uses the exact same inputs as SVA, it's just tailored to raw counts instead of transformed ones. Only difference in code is literally just the function name.

Run time of SVA-seq: 131.88 seconds

Separation looks really nice and PC1 already accounts for almost all of the variance, meaning batch effect is almost completely sorted out.

Caveat of this is, that FPKM/TPM transformation will not be possible afterwards, because SVAseq can produce negative values. I also have read that Svaseq applies log transformation as a first step in the algorithm.

Iterative outlier removal looks a bit weird right now, not entirely sure what the issue is, but the above data suggests SVAseq might be a good alternative.
Zea_mays.4.correlation_cutoff.sva.pdf

kfuku52 · 2020-12-18T15:51:25Z

Hmm... the boxplot looks weird indeed. Did you apply log2 transformation after the SVAseq correction before calculating Pearson's correlation coefficient for the iterative outlier removal? Results may be biased by highly expressed genes if not.

Hego-CCTB · 2020-12-18T16:04:07Z

Yeah, that might be the issue here.

I did not perform log transform, since I assumed SVAseq does that itself (it's the first thing mentioned in the manual). However, it looks like it undoes log transformation before returning the data. The parabolic shape of the PCA is an indicator that as well.

Hego-CCTB · 2021-01-11T10:44:43Z

RUVseq results:
RUVseq offers different approaches for variation elimination. The one used here is based on calculating empirical control genes first (housekeeping genes, basically), which are then used for RUVseq.
Runtime: 87.496 seconds
This PCA is AFTER FPKM+log transformation of the counts normalized by RUVseq:

Separation looks decent, however, total explained variation is a bit low across the two PCs. Also RUVseq is not compatible with Kallisto data, since Kallisto produces non-integer counts, which RUVseq does not seem to like (I just rounded the counts here to avoid having to make a new dataset, so results are technically falsified slightly).

Hego-CCTB · 2021-01-11T10:53:18Z

This is RUVseq estimating the factors of unwanted variation using residuals, also after FPKM+log transform:

We get a bit more variation explained this way, but separation doesn't look as strong.

Hego-CCTB · 2021-01-11T11:36:58Z

Just for completeness, here is the current SVA implementation:
Runtime 151.142 seconds

kfuku52 · 2021-01-11T12:57:54Z

Thanks! Can you make a quantitative comparison between the methods using the boxplots like this?

Hego-CCTB · 2021-01-11T13:10:24Z

on it!

Hego-CCTB · 2021-01-12T09:17:32Z

So, I'm at a loss here. After calculating pearson's correlation, it turn's out that SVAseq produces massive negative correlations outside of same-tissue areas.

Current SVA:

SVAseq:

All that's white here is negative:

RUVseq, no log, no FPKM:

RUVseq + log:

CombatSeq:

kfuku52 · 2021-01-12T09:22:44Z

Are you comparing them in the same unit (i.e., log-FPKM)?

Hego-CCTB · 2021-01-12T09:59:19Z

SVA and CombatSeq have log-FPKM applied, SVAseq and RUVseq are incompatible with FPKM calculations, so they are only logarithmized.
There are two different RUV packages (different authors as RUVseq), which I haven't tried yet, but are supposedly able to handle log-FPKM as input.

kfuku52 · 2021-01-12T10:08:40Z

Could you try to force log-FPKM-ish transformation for SVA-/RUV-seq? We can't compare apples and oranges so the values should be adjusted as much as possible when compared, even though the transformation procedure may not be perfect. I guess a big problem for FPKM would be the presence of negative values after the SVA-/RUV-seq correction. If negative values are seen in a minor fraction of genes (up to several hundred genes in a genome), you can round them to zero to enable an FPKM-like transformation.

Hego-CCTB · 2021-01-12T11:16:25Z

this worked for SVAseq, by setting all negative values to 0:

RUVseq, however, becomes weird. It pretty much equalizes the whole dataset.

kfuku52 · 2021-01-12T11:19:31Z

Thanks! I wounder if the majority of genes are negative in RUV-seq. Anyway, SVA and SVA-seq seem the two best methods so far. Could you compare SVA, SVA-seq, and no-correction side-by-side with the same Y axis?

Hego-CCTB · 2021-01-12T11:36:50Z

RUVseq doesn't produce negative values at all. But it might be, that it doesn't handle low expressed genes very well. I only remove unexpressed genes, but I leave lowly expressed ones in. Comparing the raw counts with the normalized ones, it looks like the data was centrized: genes with counts of 2 get bumped up to 30 and genes with counts of 700 go down to 100. If you then apply log-FPKM, everything gets into the range of one order of magnitude.

I suspect this is a combination of low expressed genes and the upper quartile normalization RUVseq applies. I'll try removing low expressed genes as well.

kfuku52 · 2021-03-03T13:48:33Z

Are you still working on it? If not, please close this issue with a brief summary.

kfuku52 · 2022-09-21T12:11:57Z

Batch correction paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8895431/

kfuku52 · 2022-09-21T12:54:38Z

Here is another batch correction paper: https://www.nature.com/articles/s41587-022-01440-w

kfuku52 · 2023-03-07T08:31:52Z

@Hego-CCTB I will take care of it if you don't have time.

Hego-CCTB · 2023-03-29T09:53:40Z

I'll keep working on this. Give me a week!

Hego-CCTB · 2023-04-14T10:38:18Z

OK. I have implemented svaseq, combatseq and ruvseq. I ran all different batch-effect-removal algorithms (including sva) on 29 different plant species (I'll run the animal set as well this weekend). Short explanation on how these were implemented and the logic behind it.
I'll show a case here, where all algorithms perform well, but this will not be the full picture, of course. All plots you see show log2-FPKM transformed values.

uncorrected values (just log2fpkm transformed)

Prunus_persica.2.correlation_cutoff.pdf

sva

Nothing has changed. This will be the correction everything will be compared against.

Prunus_persica.2.correlation_cutoff.sva.pdf

svaseq

Identical to sva, except it's supposed to take in counts, rather than log-transformed counts. The documentation states that the first thing svaseq does is to apply a log(x +c) transformation, where X is the count matrix and C is a constant (defaults to 1) . So the initial logic was:

apply FPKM
apply temporary log-transformation only for plotting
run iterative outlier removal
apply temprorary log-transformation before every check_within_tissue correlation, since pearson-correlation is made for linear space.
apply log-transformation on the last iteration

These were the results:
Prunus_persica.2.correlation_cutoff.svaseq.pdf

HOWEVER: When I treat svaseq exactly the same way as sva. So log2FPKM BEFORE any calculations start, it performs almost identical to sva, sometimes even slightly outperforms it. I have no explanation as to why this happens.
Prunus_persica.2.correlation_cutoff.svaseq.log_before_sva.pdf

combatseq

combatseq relies on known sources of batch effects. Since from experience the biggest source is almost always bioproject (at least judging from sva output), that's what I tell combatseq to look for.
Additionally, you can supply a covariate which represents the biological signal (curate_group in our case). Combatseq will try to preserve this signal.
Also of note: combatseq can not handle cases, where a batch contains only one sample (i.e. cases where there is only 1 sample for a given bioproject), so I need to remove those bioprojects from the dataset.

The implementation logic is to run combatseq on untransformed raw-counts and apply log2FPKM for check_within_tissue correlation and for every plot. And keep the log2FPKM transformation when the algorithm is done.

no FPKM OR log transformation
apply temporary log2FPKM-transformation only for plotting
run iterative outlier removal
apply temprorary log2FPKM--transformation before every check_within_tissue correlation, since pearson-correlation is made for linear space.
apply log2FPKM-transformation on the last iteration

Prunus_persica.2.correlation_cutoff.combatseq.pdf

RUVseq

RUVseq needs control genes (i.e. genes that don't change expression across conditions/treatments/tissues) to find unwanted variation. There are a lot of different ways to do this:

spiked-in genes. Requires specific experimental design and is thus not useful
Known set of housekeeping genes. We could use BUSCO derived single-copy orthogroup genes, which may be worth checking out.
Taking the genes with the least significant differential expression
the residuals (e.g., deviance residuals) from a first-pass GLM regression of the counts on the covariates of interest

The last one (residuals) is what I went with. The outlier-removal/tissue-correlation logic is the same as in Combatseq:run combatseq on untransformed raw-counts and apply log2FPKM for check_within_tissue correlation and for every plot. And keep the log2FPKM transformation when the algorithm is done.

no FPKM OR log transformation
apply temporary log2FPKM-transformation only for plotting
run iterative outlier removal
apply temprorary log2FPKM--transformation before every check_within_tissue correlation, since pearson-correlation is made for linear space.
apply log2FPKM-transformation on the last iteration

Prunus_persica.2.correlation_cutoff.ruvseq.pdf

Hego-CCTB · 2023-04-14T10:49:34Z

Again, I want to stress that this is an ideal use case for these algorithms. Especially Combatseq performed exceptionally.
However, Combat and RUV often struggle in larger datasets:

Oryza sativa

Oryza_sativa.5.correlation_cutoff.pdf
Oryza_sativa.5.correlation_cutoff.ruvseq.pdf
Oryza_sativa.5.correlation_cutoff.sva.pdf
Oryza_sativa.5.correlation_cutoff.svaseq.log_before_sva.pdf
Oryza_sativa.5.correlation_cutoff.combatseq.pdf

Vitis vinifera

Vitis_vinifera.2.correlation_cutoff.svaseq.pdf
Vitis_vinifera.2.correlation_cutoff.combatseq.pdf
Vitis_vinifera.2.correlation_cutoff.pdf

Vitis_vinifera.2.correlation_cutoff.ruvseq.pdf
Vitis_vinifera.2.correlation_cutoff.sva.pdf

kfuku52 · 2023-04-14T10:52:03Z

Great! CombatSeq seems to be a good choice when single-sample BioProjects can be excluded. We should probably use sva in curate by default and CombatSeq as an option.

Especially Combatseq performed exceptionally.

This would be great to discuss in the amalgkit paper.

Hego-CCTB · 2023-04-14T11:29:01Z

Yeah, sva (or svaseq) should remain default. They perform consistently well independent of dataset. What bugs me is why svaseq behaves so well when I feed it the supposedly wrong input (i.e. log-transformed values instead of counts). I feel like I'm missing or misunderstood something.

I haven't had a chance to look at all 29 species yet, but the pattern really seems to be the size of the dataset (i.e. the number of batch variables) that determines how well Combatseq and RUVseq perform. This could mean that combatseq may be outperforming sva for private/local fastq projects like my carnivore set. I'll have a look what happens when I run combatseq on that.

Combatseq can potentially be improved, as it is possible to add more covariates.
As for RUVseq, I'd like to try out the 'least significant differentially expressed genes' as control genes. Maybe it will improve performance on larger sets. What seems to be happening is over-correction. Maybe the GLM the residuals stem from isn't robust enough to handle many different bioprojects, but that is pure speculation.

kfuku52 · 2023-05-31T14:30:41Z

Could you expose this functionality as a new option in curate?

kfuku52 · 2023-07-07T15:32:24Z

The only difference between sva and svaseq appears to be whether the function automatically converts the input expression data with the logn(x+1) transformation.
https://rdrr.io/bioc/sva/src/R/sva.R
https://rdrr.io/bioc/sva/src/R/svaseq.R

amalgkit curate already supports various log transformations via --norm, so the use of svaseq does not make sense. I will remove --batch_effect_alg svaseq.

kfuku52 · 2023-07-07T21:05:30Z

@Hego-CCTB In #12 (comment), I suspect that the file Oryza_sativa.5.correlation_cutoff.combatseq.pdf does not represent combatseq-corrected data. PRJNA404045 appears in the plot even though it only contains one sample. This sample should have been excluded if combatseq had been correctly applied. This issue may have arisen because the original, uncorrected data was returned when combatseq failed. You can see this at

amalgkit/amalgkit/curate.r

Lines 406 to 410 in 772c190

    
           } else{ 
        
               cat("Combatseq correction failed. Returning the original transcriptome data.\n") 
        
               tc = rbind(tc, tc_ne) 
        
               sva1= '' 
        
           }

RUVseq might encounter a similar issue. As a temporary measure, I will deactivate it so that amalgkit returns an error when batch correction fails.

Hego-CCTB · 2023-10-31T12:01:35Z

I'll investigate this. Something is happening to the matrix, but that could just be the log that's applied before plotting. Also just so we are on the same page: the sample (and BP) removal happens independently and before the batch-effect removal. So if the sample is there, it means something in check_within_curate_group_correlation() went wrong, rather than in batch_effect_subtraction().

EDIT:
PRJNA404045 is present in the sva corrected output as well.

Hego-CCTB · 2024-03-05T10:20:34Z

I would like to be able to discuss, or at least mention that RUV and ComBat are available for amalgkit in the paper, so I'll try to get this issue sorted this week.

kfuku52 · 2024-07-08T13:45:23Z

Have you had a chance to investigate this issue?

Hego-CCTB · 2024-09-24T18:24:06Z

I can definitely say this has nothing to do with sva alternatives, since the project shows up in SVA output as well. Looking at the code for check_within_sample_group_correlation(), where the outlier exlusion happens, I am not convinced that we've implemented a check that removes single-sample bioprojects. And since that sample meets the correlation threshold condition, it doesn't get removed.
Can you remind me what the rationale is to remove such bioprojects? I can push an update to remove single sample projects. In which case, how should we handle a fringe case where that single sample bioproject is also the only sample for a sample group?

I want to get closer to "official" support for the SVA alternatives. To do so, we need to properly evaluate performance. In the paper manuscript we've talked about dPCC as a metric for performance. I'm currently working on implementing the dPCC plot and it will make things easier if I finish that first.

kfuku52 · 2024-09-25T03:57:37Z

This isn't a rationale but rather your code. I believe CombatSeq doesn't support groups with only a single sample, so you may have needed to remove them.

https://github.com/kfuku52/amalgkit/blame/772c1901f14d2fb7e5571005ab7f940210d74eb2/amalgkit/curate.r#L397-L406

Hego-CCTB · 2024-10-02T00:06:49Z

Ah, I was looking in the wrong place! I see what the issue is. The tc table is never properly updated during combatseq. It's an easy fix, it'll be part of the next update.

Hego-CCTB · 2024-11-18T02:09:59Z

I fixed this and another issue with combatseq. Here are the plant dataset results to see overall performance:

SVA

combatseq

ruvseq

While all options positively impact delta PCC on the dataset, SVA still performs best. Combatseq output matrices can be used for DEG analysis though, which is nice to have.

Hego-CCTB · 2024-11-18T02:23:25Z

I should add that combatseq tends to yield worse results in large sets with many different BPs, but is doing well with low complexity sets:

low-complexity uncorrected

low complexity corrected

high complexity uncorrected

high complexity corrected

kfuku52 · 2024-11-18T05:00:37Z

I should add that combatseq tends to yield worse results in large sets with many different BPs, but is doing well with low complexity sets

Thanks! This is a good point to discuss in the AMALGKIT paper.

kfuku52 assigned Hego-CCTB Oct 8, 2020

kfuku52 added a commit that referenced this issue Jul 7, 2023

batch correction bug fixes, svaseq obsoletion, and refactoring #12

1fc6dd5

new SVA #12

new SVA #12

Comments

kfuku52 commented Jul 7, 2020

kfuku52 commented Oct 8, 2020

Hego-CCTB commented Oct 9, 2020

kfuku52 commented Oct 9, 2020

Hego-CCTB commented Dec 4, 2020

Hego-CCTB commented Dec 11, 2020

kfuku52 commented Dec 11, 2020

Hego-CCTB commented Dec 18, 2020

kfuku52 commented Dec 18, 2020

Hego-CCTB commented Dec 18, 2020

kfuku52 commented Dec 18, 2020

Hego-CCTB commented Dec 18, 2020

kfuku52 commented Dec 18, 2020

Hego-CCTB commented Dec 18, 2020

Hego-CCTB commented Jan 11, 2021

Hego-CCTB commented Jan 11, 2021

Hego-CCTB commented Jan 11, 2021

kfuku52 commented Jan 11, 2021

Hego-CCTB commented Jan 11, 2021

Hego-CCTB commented Jan 12, 2021

kfuku52 commented Jan 12, 2021

Hego-CCTB commented Jan 12, 2021

kfuku52 commented Jan 12, 2021

Hego-CCTB commented Jan 12, 2021

kfuku52 commented Jan 12, 2021

Hego-CCTB commented Jan 12, 2021

kfuku52 commented Mar 3, 2021

kfuku52 commented Sep 21, 2022

kfuku52 commented Sep 21, 2022

kfuku52 commented Mar 7, 2023

Hego-CCTB commented Mar 29, 2023

Hego-CCTB commented Apr 14, 2023

uncorrected values (just log2fpkm transformed)

sva

svaseq

combatseq

RUVseq

Hego-CCTB commented Apr 14, 2023

Oryza sativa

Vitis vinifera

kfuku52 commented Apr 14, 2023

Hego-CCTB commented Apr 14, 2023 • edited Loading

kfuku52 commented May 31, 2023

kfuku52 commented Jul 7, 2023

kfuku52 commented Jul 7, 2023

Hego-CCTB commented Oct 31, 2023 • edited Loading

Hego-CCTB commented Mar 5, 2024

kfuku52 commented Jul 8, 2024

Hego-CCTB commented Sep 24, 2024

kfuku52 commented Sep 25, 2024

Hego-CCTB commented Oct 2, 2024

Hego-CCTB commented Nov 18, 2024

SVA

combatseq

ruvseq

Hego-CCTB commented Nov 18, 2024

low-complexity uncorrected

low complexity corrected

high complexity uncorrected

high complexity corrected

kfuku52 commented Nov 18, 2024

Hego-CCTB commented Apr 14, 2023 •

edited

Loading

Hego-CCTB commented Oct 31, 2023 •

edited

Loading