-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathintro.Rmd
455 lines (324 loc) · 16.2 KB
/
intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
# <img src="inst/logo-01.png" height="139px" width="120px"/>
`sccomp` tests differences in cell type proportions from single-cell data. It is robust against outliers, it models continuous and discrete factors, and capable of random-effect/intercept modelling.
## Characteristics
- Complex linear models with continuous and categorical covariates
- Multilevel modelling, with population fixed and random effects/intercept
- Modelling data from counts
- Testing differences in cell-type proportionality
- Testing differences in cell-type specific variability
- Cell-type information share for variability adaptive shrinkage
- Testing differential variability
- Probabilistic outlier identification
- Cross-dataset learning (hyperpriors).
# Installation
`sccomp` is based on `cmdstanr` which provides the latest version of `cmdstan` the Bayesian modelling tool. `cmdstanr` is not on CRAN, so we need to have 3 simple step process (that will be prompted to the user is forgot).
1. R installation of `sccomp`
2. R installation of `cmdstanr`
3. `cmdstanr` call to `cmdstan` installation
**Bioconductor**
```{r eval=FALSE}
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
# Step 1
BiocManager::install("sccomp")
# Step 2
install.packages("cmdstanr", repos = c("https://stan-dev.r-universe.dev/", getOption("repos")))
# Step 3
cmdstanr::check_cmdstan_toolchain(fix = TRUE) # Just checking system setting
cmdstanr::install_cmdstan()
```
**Github**
```{r eval=FALSE}
# Step 1
devtools::install_github("MangiolaLaboratory/sccomp")
# Step 2
install.packages("cmdstanr", repos = c("https://stan-dev.r-universe.dev/", getOption("repos")))
# Step 3
cmdstanr::check_cmdstan_toolchain(fix = TRUE) # Just checking system setting
cmdstanr::install_cmdstan()
```
Function | Description
------------ | -------------
`sccomp_estimate` | Fit the model onto the data, and estimate the coefficients
`sccomp_remove_outliers` | Identify outliers probabilistically based on the model fit, and exclude them from the estimation
`sccomp_test` | Calculate the probability that the coefficients are outside the H0 interval (i.e. test_composition_above_logit_fold_change)
`sccomp_replicate` | Simulate data from the model, or part of the model
`sccomp_predict` | Predicts proportions, based on the model, or part of the model
`sccomp_remove_unwanted_variation` | Removes the variability for unwanted factors
`plot` | Plots summary plots to asses significance
# Analysis
```{r message=FALSE, warning=FALSE}
library(dplyr)
library(sccomp)
library(ggplot2)
library(forcats)
library(tidyr)
data("seurat_obj")
data("sce_obj")
data("counts_obj")
```
`sccomp` can model changes in composition and variability. By default, the formula for variability is either `~1`, which assumes that the
cell-group variability is independent of any covariate or `~ factor_of_interest`, which assumes that the model is dependent on the
factor of interest only. The variability model must be a subset of the model for composition.
## Binary factor
Of the output table, the estimate columns start with the prefix `c_` indicate `composition`, or with `v_` indicate `variability` (when formula_variability is set).
### From Seurat, SingleCellExperiment, metadata objects
```{r eval=FALSE}
sccomp_result =
sce_obj |>
sccomp_estimate(
formula_composition = ~ type,
.sample = sample,
.cell_group = cell_group,
cores = 1
) |>
sccomp_test()
```
### From counts
```{r, message=FALSE, warning=FALSE, eval = instantiate::stan_cmdstan_exists()}
sccomp_result =
counts_obj |>
sccomp_estimate(
formula_composition = ~ type,
.sample = sample,
.cell_group = cell_group,
.count = count,
cores = 1, verbose = FALSE
) |>
sccomp_test()
```
Here you see the results of the fit, the effects of the factor on composition and variability. You also can see the uncertainty around those effects.
The output is a tibble containing the **Following columns**
- `cell_group` - The cell groups being tested.
- `parameter` - The parameter being estimated from the design matrix described by the input `formula_composition` and `formula_variability`.
- `factor` - The covariate factor in the formula, if applicable (e.g., not present for Intercept or contrasts).
- `c_lower` - Lower (2.5%) quantile of the posterior distribution for a composition (c) parameter.
- `c_effect` - Mean of the posterior distribution for a composition (c) parameter.
- `c_upper` - Upper (97.5%) quantile of the posterior distribution for a composition (c) parameter.
- `c_pH0` - Probability of the null hypothesis (no difference) for a composition (c). This is not a p-value.
- `c_FDR` - False-discovery rate of the null hypothesis for a composition (c).
- `v_lower` - Lower (2.5%) quantile of the posterior distribution for a variability (v) parameter.
- `v_effect` - Mean of the posterior distribution for a variability (v) parameter.
- `v_upper` - Upper (97.5%) quantile of the posterior distribution for a variability (v) parameter.
- `v_pH0` - Probability of the null hypothesis for a variability (v).
- `v_FDR` - False-discovery rate of the null hypothesis for a variability (v).
- `count_data` - Nested input count data.
```{r, eval = instantiate::stan_cmdstan_exists()}
sccomp_result
```
## Outlier identification
`sccomp` can identify outliers probabilistically and exclude them from the estimation.
```{r, message=FALSE, warning=FALSE, eval = instantiate::stan_cmdstan_exists()}
sccomp_result =
counts_obj |>
sccomp_estimate(
formula_composition = ~ type,
.sample = sample,
.cell_group = cell_group,
.count = count,
cores = 1, verbose = FALSE
) |>
sccomp_remove_outliers(cores = 1, verbose = FALSE) |> # Optional
sccomp_test()
```
## Summary plots
A plot of group proportions, faceted by groups. The blue boxplots represent the posterior predictive check. If the model is descriptively adequate for the data, the blue boxplots should roughly overlay the black boxplots, which represent the observed data. The outliers are coloured in red. A boxplot will be returned for every (discrete) covariate present in formula_composition. The colour coding represents the significant associations for composition and/or variability.
```{r, eval = instantiate::stan_cmdstan_exists()}
sccomp_result |>
sccomp_boxplot(factor = "type")
```
You can plot proportions adjusted for unwanted effects. This is helpful especially for complex models, where multiple factors can significantly impact the proportions.
```{r, eval = instantiate::stan_cmdstan_exists()}
sccomp_result |>
sccomp_boxplot(factor = "type", remove_unwanted_effects = TRUE)
```
A plot of estimates of differential composition (c_) on the x-axis and differential variability (v_) on the y-axis. The error bars represent 95% credible intervals. The dashed lines represent the minimal effect that the hypothesis test is based on. An effect is labelled as significant if it exceeds the minimal effect according to the 95% credible interval. Facets represent the covariates in the model.
```{r, eval = instantiate::stan_cmdstan_exists()}
sccomp_result |>
plot_1D_intervals()
```
We can plot the relationship between abundance and variability. As we can see below, they are positively correlated. sccomp models this relationship to obtain a shrinkage effect on the estimates of both the abundance and the variability. This shrinkage is adaptive as it is modelled jointly, thanks to Bayesian inference.
```{r, eval = instantiate::stan_cmdstan_exists()}
sccomp_result |>
plot_2D_intervals()
```
You can produce the series of plots calling the `plot` method.
```{r, out.height="200%", eval=FALSE}
sccomp_result |> plot()
```
## Model proportions directly (e.g. from deconvolution)
**Note:** If counts are available, we strongly discourage the use of proportions, as an important source of uncertainty (i.e., for rare groups/cell types) is not modeled.
The use of proportions is better suited for modelling deconvolution results (e.g., of bulk RNA data), in which case counts are not available.
Proportions should be greater than 0. Assuming that zeros derive from a precision threshold (e.g., deconvolution), zeros are converted to the smallest non-zero value.
```{r, message=FALSE, warning=FALSE, eval = instantiate::stan_cmdstan_exists()}
sccomp_result =
counts_obj |>
sccomp_estimate(
formula_composition = ~ type,
.sample = sample,
.cell_group = cell_group,
.count = proportion,
cores = 1, verbose = FALSE
) |>
sccomp_remove_outliers(cores = 1, verbose = FALSE) |> # Optional
sccomp_test()
```
## Continuous factor
`sccomp` is able to fit erbitrary complex models. In this example we have a continuous and binary covariate.
```{r, eval = instantiate::stan_cmdstan_exists()}
res =
seurat_obj |>
sccomp_estimate(
formula_composition = ~ type + continuous_covariate,
.sample = sample, .cell_group = cell_group,
cores = 1, verbose=FALSE
)
res
```
## Random Effect Modeling
`sccomp` supports multilevel modeling by allowing the inclusion of random effects in the compositional and variability formulas. This is particularly useful when your data has hierarchical or grouped structures, such as measurements nested within subjects, batches, or experimental units. By incorporating random effects, sccomp can account for variability at different levels of your data, improving model fit and inference accuracy.
### Random Intercept Model
In this example, we demonstrate how to fit a random intercept model using sccomp. We’ll model the cell-type proportions with both fixed effects (e.g., treatment) and random effects (e.g., subject-specific variability).
Here is the input data
```{r}
seurat_obj[[]] |> as_tibble()
```
```{r, eval = instantiate::stan_cmdstan_exists()}
res =
seurat_obj |>
sccomp_estimate(
formula_composition = ~ type + (1 | group__),
.sample = sample,
.cell_group = cell_group,
bimodal_mean_variability_association = TRUE,
cores = 1, verbose = FALSE
)
res
```
### Random Effect Model (random slopes)
`sccomp` can model random slopes. We providean example below.
```{r, eval = instantiate::stan_cmdstan_exists()}
res =
seurat_obj |>
sccomp_estimate(
formula_composition = ~ type + (type | group__),
.sample = sample,
.cell_group = cell_group,
bimodal_mean_variability_association = TRUE,
cores = 1, verbose = FALSE
)
res
```
### Nested Random Effects
If you have a more complex hierarchy, such as measurements nested within subjects and subjects nested within batches, you can include multiple grouping variables. Here `group2__` is nested within `group__`.
```{r, eval = instantiate::stan_cmdstan_exists()}
res =
seurat_obj |>
sccomp_estimate(
formula_composition = ~ type + (type | group__) + (1 | group2__),
.sample = sample,
.cell_group = cell_group,
bimodal_mean_variability_association = TRUE,
cores = 1, verbose = FALSE
)
res
```
## An aid to result interpretation and communication
The estimated effects are expressed in the unconstrained space of the parameters, similar to differential expression analysis that expresses changes in terms of log fold change. However, for differences in proportion, logit fold change must be used, which is harder to interpret and understand.
Therefore, we provide a more intuitive proportional fold change that can be more easily understood. However, these cannot be used to infer significance (use sccomp_test() instead), and a lot of care must be taken given the nonlinearity of these measures (a 1-fold increase from 0.0001 to 0.0002 carries a different weight than a 1-fold increase from 0.4 to 0.8).
From your estimates, you can specify which effects you are interested in (this can be a subset of the full model if you wish to exclude unwanted effects), and the two points you would like to compare.
In the case of a categorical variable, the starting and ending points are categories.
```{r, eval = instantiate::stan_cmdstan_exists()}
sccomp_result |>
sccomp_proportional_fold_change(
formula_composition = ~ type,
from = "benign",
to = "cancer"
) |>
select(cell_group, statement)
```
## Contrasts
```{r, message=FALSE, warning=FALSE, eval = instantiate::stan_cmdstan_exists()}
seurat_obj |>
sccomp_estimate(
formula_composition = ~ 0 + type,
.sample = sample,
.cell_group = cell_group,
cores = 1, verbose = FALSE
) |>
sccomp_test( contrasts = c("typecancer - typehealthy", "typehealthy - typecancer"))
```
## Categorical factor (e.g. Bayesian ANOVA)
This is achieved through model comparison with `loo`. In the following example, the model with association with factors better fits the data compared to the baseline model with no factor association. For model comparisons `sccomp_remove_outliers()` must not be executed as the leave-one-out must work with the same amount of data, while outlier elimination does not guarantee it.
If `elpd_diff` is away from zero of \> 5 `se_diff` difference of 5, we are confident that a model is better than the other [reference](https://discourse.mc-stan.org/t/interpreting-elpd-diff-loo-package/1628/2?u=stemangiola).
In this case, -79.9 / 11.5 = -6.9, therefore we can conclude that model one, the one with factor association, is better than model two.
```{r, message=FALSE, warning=FALSE, eval = instantiate::stan_cmdstan_exists()}
library(loo)
# Fit first model
model_with_factor_association =
seurat_obj |>
sccomp_estimate(
formula_composition = ~ type,
.sample = sample,
.cell_group = cell_group,
inference_method = "hmc",
enable_loo = TRUE
)
# Fit second model
model_without_association =
seurat_obj |>
sccomp_estimate(
formula_composition = ~ 1,
.sample = sample,
.cell_group = cell_group,
inference_method = "hmc",
enable_loo = TRUE
)
# Compare models
loo_compare(
attr(model_with_factor_association, "fit")$loo(),
attr(model_without_association, "fit")$loo()
)
```
## Differential variability, binary factor
We can model the cell-group variability also dependent on the type, and so test differences in variability
```{r, message=FALSE, warning=FALSE, eval = instantiate::stan_cmdstan_exists()}
res =
seurat_obj |>
sccomp_estimate(
formula_composition = ~ type,
formula_variability = ~ type,
.sample = sample,
.cell_group = cell_group,
cores = 1, verbose = FALSE
)
res
```
**Plot 1D significance plot**
```{r, eval = instantiate::stan_cmdstan_exists()}
plots = res |> sccomp_test() |> plot()
plots$credible_intervals_1D
```
**Plot 2D significance plot** Data points are cell groups. Error bars are the 95% credible interval. The dashed lines represent the default threshold fold change for which the probabilities (c_pH0, v_pH0) are calculated. pH0 of 0 represent the rejection of the null hypothesis that no effect is observed.
This plot is provided only if differential variability has been tested. The differential variability estimates are reliable only if the linear association between mean and variability for `(intercept)` (left-hand side facet) is satisfied. A scatterplot (besides the Intercept) is provided for each category of interest. For each category of interest, the composition and variability effects should be generally uncorrelated.
```{r, eval = instantiate::stan_cmdstan_exists()}
plots$credible_intervals_2D
```
# Suggested settings
## For single-cell RNA sequencing
We recommend setting `bimodal_mean_variability_association = TRUE`. The bimodality of the mean-variability association can be confirmed from the plots\$credible_intervals_2D (see below).
## For CyTOF and microbiome data
We recommend setting `bimodal_mean_variability_association = FALSE` (Default).
## Visualisation of the MCMC chains from the posterior distribution
It is possible to directly evaluate the posterior distribution. In this example, we plot the Monte Carlo chain for the slope parameter of the
first cell type. We can see that it has converged and is negative with probability 1.
```{r, eval = instantiate::stan_cmdstan_exists()}
library(cmdstanr)
library(posterior)
library(bayesplot)
# Assuming res contains the fit object from cmdstanr
fit <- res |> attr("fit")
# Extract draws for 'beta[2,1]'
draws <- as_draws_array(fit$draws("beta[2,1]"))
# Create a traceplot for 'beta[2,1]'
mcmc_trace(draws, pars = "beta[2,1]")
```