new SVA #12
Comments
Please check this paper out. We want to compare the performance of the SVA correction of log-transformed counts (current implementation), and newer methods designed specifically for non-log-transformed RNA-seq raw count data (ComBat-seq, SVA-seq, RUV-seq). |
Any suggestion as to what data set I should use for testing/benchmarking? |
You would have to generate your own dataset. You can use Monodelphis domestica from my paper, but it's more straightforward to use a plant species you study. |
I'm looking into the implementation of ComBat-seq now. |
In order to use CombatSeq I have to provide a "Batch" vector (I can include other parameters, such as tissues, as a "group" parameter). What would constitute the batch in SRA metadata? The BioProject? |
BioProject tends to be the strongest predictor of SVs, so let's try BioProject alone first. |
The data set I'm using is Zea mays (my largest dataset both in terms of individual runs and total transcripts). Matrix size is 147x131585. This is how SVA in its current form performs (final output): Zea_mays.3.correlation_cutoff.sva.pdf Here are some PCAs of CombatSeq results: RAW COUNT PCA This is how it looks if you run CombatSeq together with iterative outlier removal (adjusted, but untransformed): Zea_mays.4.correlation_cutoff.cbs.pdf After applying the log-FPKM transform to the adjusted counts, it actually looks like the dataset improved. However, SVA is much more powerful in terms of separation. The big advantage of CombatSeq is that it preserves the discrete nature of the counts, which some downstream analysis packages need (for example DESeq2). |
Thanks! It looks like the program worked correctly, which is nice. We should keep this result for a side-by-side comparison in the eventual publication. Please try SVA-seq next. I hope it outperforms SVA, as it should for RNA-seq data. |
Yup, moving on to SVA-seq. Another note: I have tried to include other covariates as well (the same ones that SVA is currently adjusting for), but I get the error: "At least one covariate is confounded with batch! Please remove confounded covariates and rerun ComBat-Seq". I could not determine what this actually means and how to fix it. |
That could mean that two or more covariates happened to be identical combinations or too similar to each other, but let's get SVA-seq done first. |
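One way a covariate can end up confounded with batch is when its value is fully determined by the batch, for example when every BioProject contains samples from only one tissue. The covariate columns of the design matrix are then sums of batch indicator columns, so the design is rank deficient. The sketch below checks only this nesting condition in plain Python. The function name and the metadata values are hypothetical, and this is not the check ComBat-Seq itself performs.

```python
from collections import defaultdict


def is_nested_in_batch(batch, covariate):
    """Return True if every batch contains exactly one covariate level."""
    levels_per_batch = defaultdict(set)
    for b, c in zip(batch, covariate):
        levels_per_batch[b].add(c)
    return all(len(levels) == 1 for levels in levels_per_batch.values())


# Hypothetical metadata: three BioProjects with two samples each.
batch = ["PRJ_A", "PRJ_A", "PRJ_B", "PRJ_B", "PRJ_C", "PRJ_C"]
# Each BioProject sampled a single tissue: tissue is determined by batch.
tissue_nested = ["leaf", "leaf", "root", "root", "leaf", "leaf"]
# Each BioProject sampled both tissues: tissue varies within batch.
tissue_crossed = ["leaf", "root", "leaf", "root", "leaf", "root"]

print(is_nested_in_batch(batch, tissue_nested))
print(is_nested_in_batch(batch, tissue_crossed))
```

A covariate flagged this way cannot be separated from the batch effect and would have to be dropped. Covariates that pass may still be confounded in less obvious ways.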
OK, SVA-seq implementation is really easy, since it uses the exact same inputs as SVA; it's just tailored to raw counts instead of transformed ones. The only difference in the code is literally the function name. Run time of SVA-seq: 131.88 seconds. Separation looks really nice and PC1 already accounts for almost all of the variance, meaning the batch effect is almost completely sorted out. The caveat is that FPKM/TPM transformation will not be possible afterwards, because SVAseq can produce negative values. I have also read that SVAseq applies log transformation as a first step in the algorithm. Iterative outlier removal looks a bit weird right now, not entirely sure what the issue is, but the above data suggests SVAseq might be a good alternative. |
Hmm... the boxplot looks weird indeed. Did you apply log2 transformation after the SVAseq correction before calculating Pearson's correlation coefficient for the iterative outlier removal? Results may be biased by highly expressed genes if not. |
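The concern about highly expressed genes can be illustrated with a small sketch in plain Python. The expression values are hypothetical, the values are assumed to be non-negative, and `log2(x + 1)` is used here as one possible transform rather than the one amalgkit applies.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def log2p1(values):
    """Apply log2(x + 1); assumes non-negative input."""
    return [math.log2(v + 1.0) for v in values]


# Hypothetical samples: four lowly expressed genes that disagree between
# the samples, plus one highly expressed gene that agrees.
sample_a = [1.0, 8.0, 2.0, 9.0, 10000.0]
sample_b = [9.0, 1.0, 8.0, 2.0, 12000.0]

r_raw = pearson(sample_a, sample_b)
r_log = pearson(log2p1(sample_a), log2p1(sample_b))
print(f"raw: {r_raw:.3f}  log2(x+1): {r_log:.3f}")
```

On the raw scale the single highly expressed gene pushes the coefficient close to 1 even though the other genes disagree; after the log transform its influence is reduced.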
Yeah, that might be the issue here. I did not perform a log transform, since I assumed SVAseq does that itself (it's the first thing mentioned in the manual). However, it looks like it undoes the log transformation before returning the data. The parabolic shape of the PCA is an indicator of that as well. |
on it! |
Are you comparing them in the same unit (i.e., log-FPKM)? |
SVA and CombatSeq have log-FPKM applied, SVAseq and RUVseq are incompatible with FPKM calculations, so they are only logarithmized. |
Could you try to force log-FPKM-ish transformation for SVA-/RUV-seq? We can't compare apples and oranges so the values should be adjusted as much as possible when compared, even though the transformation procedure may not be perfect. I guess a big problem for FPKM would be the presence of negative values after the SVA-/RUV-seq correction. If negative values are seen in a minor fraction of genes (up to several hundred genes in a genome), you can round them to zero to enable an FPKM-like transformation. |
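A minimal sketch of the suggested workaround, in plain Python: report how many values are negative, round them to zero, then apply an FPKM-like transform. The function names, the pseudocount of 1, and the choice to compute the library size from the clipped values are assumptions for illustration, not amalgkit's implementation.

```python
import math


def fraction_negative(values):
    """Share of values below zero, to judge whether clipping is acceptable."""
    return sum(1 for v in values if v < 0) / len(values)


def log2_fpkm_like(values, gene_lengths_bp, pseudocount=1.0):
    """Clip negatives to zero, scale to FPKM, then take log2(FPKM + pseudocount)."""
    clipped = [max(v, 0.0) for v in values]
    library_size = sum(clipped)  # assumption: library size after clipping
    if library_size == 0:
        raise ValueError("sample has no positive values after clipping")
    result = []
    for v, length in zip(clipped, gene_lengths_bp):
        fpkm = v * 1e9 / (length * library_size)
        result.append(math.log2(fpkm + pseudocount))
    return result


# Hypothetical corrected values for one sample, with one negative entry.
corrected = [-3.0, 0.0, 500.0, 1500.0]
lengths = [1000, 2000, 1000, 3000]

print(fraction_negative(corrected))
print(log2_fpkm_like(corrected, lengths))
```

Checking the negative fraction first matters because clipping is only defensible when few genes are affected, as noted above.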
Thanks! I wonder if the majority of genes are negative in RUV-seq. Anyway, SVA and SVA-seq seem to be the two best methods so far. Could you compare SVA, SVA-seq, and no-correction side-by-side with the same Y axis? |
RUVseq doesn't produce negative values at all. But it might be that it doesn't handle lowly expressed genes very well. I only remove unexpressed genes, but I leave lowly expressed ones in. Comparing the raw counts with the normalized ones, it looks like the data was centered: genes with counts of 2 get bumped up to 30 and genes with counts of 700 go down to 100. If you then apply log-FPKM, everything ends up within one order of magnitude. I suspect this is a combination of lowly expressed genes and the upper-quartile normalization RUVseq applies. I'll try removing lowly expressed genes as well. |
Are you still working on it? If not, please close this issue with a brief summary. |
Batch correction paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8895431/ |
Here is another batch correction paper: https://www.nature.com/articles/s41587-022-01440-w |
@Hego-CCTB I will take care of it if you don't have time. |
I'll keep working on this. Give me a week! |
OK. I have implemented svaseq, combatseq and ruvseq. I ran all the batch-effect-removal algorithms (including sva) on 29 different plant species (I'll run the animal set as well this weekend). Here is a short explanation of how these were implemented and the logic behind it.

uncorrected values (just log2fpkm transformed)
Prunus_persica.2.correlation_cutoff.pdf

sva
Nothing has changed. This will be the correction everything will be compared against.
Prunus_persica.2.correlation_cutoff.sva.pdf

svaseq
Identical to sva, except it's supposed to take in counts rather than log-transformed counts. The documentation states that the first thing svaseq does is apply a log(x + c) transformation, where x is the count matrix and c is a constant (defaults to 1). So the initial logic was:
These were the results:

HOWEVER: When I treat svaseq exactly the same way as sva, i.e. apply log2FPKM BEFORE any calculations start, it performs almost identically to sva and sometimes even slightly outperforms it. I have no explanation as to why this happens.

combatseq
combatseq relies on known sources of batch effects. Since from experience the biggest source is almost always BioProject (at least judging from sva output), that's what I tell combatseq to look for. The implementation logic is to run combatseq on untransformed raw counts, apply log2FPKM for the check_within_tissue correlation and for every plot, and keep the log2FPKM transformation when the algorithm is done.
Prunus_persica.2.correlation_cutoff.combatseq.pdf

RUVseq
RUVseq needs control genes (i.e. genes that don't change expression across conditions/treatments/tissues) to find unwanted variation. There are a lot of different ways to do this:
The last one (residuals) is what I went with. The outlier-removal/tissue-correlation logic is the same as in Combatseq: run combatseq on untransformed raw counts, apply log2FPKM for the check_within_tissue correlation and for every plot, and keep the log2FPKM transformation when the algorithm is done.
|
Again, I want to stress that this is an ideal use case for these algorithms. Combatseq in particular performed exceptionally well.

Oryza sativa
Oryza_sativa.5.correlation_cutoff.pdf

Vitis vinifera
Vitis_vinifera.2.correlation_cutoff.svaseq.pdf Vitis_vinifera.2.correlation_cutoff.ruvseq.pdf |
Great! CombatSeq seems to be a good choice when single-sample BioProjects can be excluded. We should probably use sva in
This would be great to discuss in the amalgkit paper. |
Yeah, sva (or svaseq) should remain the default. They perform consistently well independent of dataset. What bugs me is why svaseq behaves so well when I feed it the supposedly wrong input (i.e. log-transformed values instead of counts). I feel like I'm missing or have misunderstood something. I haven't had a chance to look at all 29 species yet, but the pattern really seems to be that the size of the dataset (i.e. the number of batch variables) determines how well Combatseq and RUVseq perform. This could mean that combatseq may outperform sva for private/local fastq projects like my carnivore set. I'll have a look at what happens when I run combatseq on that. Combatseq can potentially be improved, as it is possible to add more covariates. |
Could you expose this functionality as a new option in |
The only difference between
|
@Hego-CCTB In #12 (comment), I suspect that the file Oryza_sativa.5.correlation_cutoff.combatseq.pdf does not represent combatseq-corrected data. PRJNA404045 appears in the plot even though it only contains one sample. This sample should have been excluded if combatseq had been correctly applied. This issue may have arisen because the original, uncorrected data was returned when combatseq failed. You can see this at Lines 406 to 410 in 772c190
RUVseq might encounter a similar issue. As a temporary measure, I will deactivate it so that amalgkit returns an error when batch correction fails. |
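The difference between silently returning the uncorrected matrix and raising an error can be sketched as follows. This is a generic Python illustration of the pattern, not amalgkit's code; `correct_or_fail` and `broken_correction` are hypothetical names.

```python
def correct_or_fail(matrix, correct):
    """Run a correction function and refuse to fall back to the input on failure."""
    try:
        corrected = correct(matrix)
    except Exception as exc:
        # Raising here prevents uncorrected data from being mistaken
        # for corrected output further downstream.
        raise RuntimeError(
            "batch correction failed; not returning uncorrected data"
        ) from exc
    return corrected


def broken_correction(matrix):
    """Hypothetical correction step that always fails."""
    raise ValueError("batch with a single sample")


counts = [[10, 20], [30, 40]]
try:
    correct_or_fail(counts, broken_correction)
except RuntimeError as err:
    print(err)
```

With a silent fallback, the output file would carry the name of the correction method while containing the original values, which is the situation suspected above.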
I'll investigate this. Something is happening to the matrix, but that could just be the log that's applied before plotting. Also just so we are on the same page: the sample (and BP) removal happens independently and before the batch-effect removal. So if the sample is there, it means something in EDIT: |
I would like to be able to discuss, or at least mention that RUV and ComBat are available for amalgkit in the paper, so I'll try to get this issue sorted this week. |
Have you had a chance to investigate this issue? |
I can definitely say this has nothing to do with sva alternatives, since the project shows up in SVA output as well. Looking at the code for I want to get closer to "official" support for the SVA alternatives. To do so, we need to properly evaluate performance. In the paper manuscript we've talked about dPCC as a metric for performance. I'm currently working on implementing the dPCC plot and it will make things easier if I finish that first. |
This isn't a rationale but rather your code. I believe CombatSeq doesn't support groups with only a single sample, so you may have needed to remove them. |
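If single-sample batches do need to be excluded before the correction, the filtering step might look like the sketch below. It is plain Python with hypothetical run and BioProject identifiers; the function name is an assumption. It returns the filtered run list and batch vector together so that the two stay in sync.

```python
from collections import Counter


def drop_single_sample_batches(run_ids, batches):
    """Remove runs whose batch has only one sample; report the dropped batches."""
    sizes = Counter(batches)
    kept = [(r, b) for r, b in zip(run_ids, batches) if sizes[b] > 1]
    dropped = sorted(b for b, n in sizes.items() if n == 1)
    kept_runs = [r for r, _ in kept]
    kept_batches = [b for _, b in kept]
    return kept_runs, kept_batches, dropped


# Hypothetical metadata: PRJ_B contributes a single run.
run_ids = ["run1", "run2", "run3", "run4", "run5"]
batches = ["PRJ_A", "PRJ_A", "PRJ_B", "PRJ_C", "PRJ_C"]

kept_runs, kept_batches, dropped = drop_single_sample_batches(run_ids, batches)
print(kept_runs, dropped)
```

Any table that describes the samples would need the same filter applied, otherwise removed runs can reappear in later plots.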
Ah, I was looking in the wrong place! I see what the issue is. The tc table is never properly updated during combatseq. It's an easy fix, it'll be part of the next update. |
I fixed this and another issue with combatseq. Here are the plant dataset results to see overall performance (plots for SVA, combatseq, and ruvseq). While all options positively impact delta PCC on the dataset, SVA still performs best. Combatseq output matrices can be used for DEG analysis though, which is nice to have. |
Thanks! This is a good point to discuss in the AMALGKIT paper. |