The goal of microflowseq is to assign probabilities to bacteria that are sorted in mFLOW-Seq experiments.
You can install the development version of microflowseq like so:
devtools::install_github("PennChopMicrobiomeProgram/microflowseq")
library(microflowseq)
Imagine an experiment where we have a specimen containing organisms from two species, Escherichia coli and Streptococcus mitis. The organisms are sorted into two fractions, IgA+ and IgA-. Let’s create a matrix giving the actual number of organisms. Here, the species are represented in the rows of the matrix and the fractions are represented in columns.
dual_cts <- matrix(
c(5, 7, 19, 13), nrow = 2,
dimnames = list(c("e.coli", "s.mitis"), c("IgA+", "IgA-")))
dual_cts
#> IgA+ IgA-
#> e.coli 5 19
#> s.mitis 7 13
For each species, we are interested in the probability of finding an
organism in the IgA+ or IgA- fraction. This is straightforward to
calculate if we know the absolute number of organisms: we simply
calculate a probability across each row.1 The microflowseq
package
provides a function, normalize_sum()
to perform the common task of
converting to probability or relative abundance.
t(apply(dual_cts, 1, normalize_sum))
#> IgA+ IgA-
#> e.coli 0.2083333 0.7916667
#> s.mitis 0.3500000 0.6500000
However, we don’t measure the absolute number of organisms directly in a real experiment, so we can’t use the method above to calculate the probabilities. In an mFLOW-Seq experiment, we measure two things: the abundance of each fraction, and, within each fraction, the relative abundance of the bacteria.
First, we compute the abundance of each fraction by summing down the columns. To show that our approach does not depend on the scale of the fraction abundance, we multiply the absolute fraction abundances by a large number.
dual_fracs <- 35 * colSums(dual_cts)
dual_fracs
#> IgA+ IgA-
#> 420 1120
Next, we compute the relative abundance of each species within each fraction.
dual_props <- apply(dual_cts, 2, normalize_sum)
dual_props
#> IgA+ IgA-
#> e.coli 0.4166667 0.59375
#> s.mitis 0.5833333 0.40625
We now have our example data set: a vector of fraction abundances
(dual_fracs
) and a matrix of bacteria relative abundances within each
fraction (dual_props
). As before, each row corresponds to a bacterial
species and each column corresponds to a fraction.
We can use mflow_apply()
to solve for the probabilities. The answer
here matches what we computed when we knew the exact number of organisms
in each fraction. Success!
mflow_apply(dual_props, dual_fracs)
#> IgA+ IgA-
#> e.coli 0.2083333 0.7916667
#> s.mitis 0.3500000 0.6500000
Let’s imagine that the bacteria were sorted into four fractions, according to whether they were tagged by IgA, IgG, both, or neither.
multi_cts <- matrix(
c(5, 7, 19, 13, 2, 11, 23, 29), nrow=2,
dimnames = list(
c("e.coli", "s.mitis"),
c("IgA+IgG+", "IgA+IgG-", "IgA-IgG+", "IgA-IgG-")))
multi_cts
#> IgA+IgG+ IgA+IgG- IgA-IgG+ IgA-IgG-
#> e.coli 5 19 2 23
#> s.mitis 7 13 11 29
As before, we want to work one species at a time and calculate how organisms of the species are distributed among the fractions. Both organisms are found in the double negative fraction about half the time. E. coli is almost never found in the IgA-IgG+ fraction, whereas that fraction accounts for about 18% of the S. mitis organisms.
t(apply(multi_cts, 1, normalize_sum))
#> IgA+IgG+ IgA+IgG- IgA-IgG+ IgA-IgG-
#> e.coli 0.1020408 0.3877551 0.04081633 0.4693878
#> s.mitis 0.1166667 0.2166667 0.18333333 0.4833333
Here is what we measure in the experiment. As before, I multiply the fraction abundances by some number to show that the absolute scale doesn’t matter.
multi_fracs <- 0.2 * colSums(multi_cts)
multi_fracs
#> IgA+IgG+ IgA+IgG- IgA-IgG+ IgA-IgG-
#> 2.4 6.4 2.6 10.4
multi_props <- apply(multi_cts, 2, normalize_sum)
multi_props
#> IgA+IgG+ IgA+IgG- IgA-IgG+ IgA-IgG-
#> e.coli 0.4166667 0.59375 0.1538462 0.4423077
#> s.mitis 0.5833333 0.40625 0.8461538 0.5576923
With the data in hand, mflow_apply()
works just the same as it did for
the previous example. Here’s the answer we wanted.
mflow_apply(multi_props, multi_fracs)
#> IgA+IgG+ IgA+IgG- IgA-IgG+ IgA-IgG-
#> e.coli 0.1020408 0.3877551 0.04081633 0.4693878
#> s.mitis 0.1166667 0.2166667 0.18333333 0.4833333
With the advent of the tidyverse, it’s no longer common to work in matrix format in the R programming language. This is probably a good thing overall; it encourages people to work in a long format, where each data point is represented in one row of a table.
Let’s go back to our first example and convert the data to long format,
so we can show how to use the microflowseq
library in that context.
First, we’ll load the collection of packages in the tidyverse.
library(tidyverse)
It’s actually a bit of a pain to convert our data from the matrix format. We won’t explain the process line-by-line, but we’ll show the answer so you can check that it matches the example above.
Here are the fraction abundances. We add a column to indicate the specimen from which the bacteria were sorted. In a typical sequencing experiment, you’d have many specimens.
dual_fracs_df <- dual_fracs |>
enframe("fraction", "fraction_abundance") |>
mutate(specimen_id = "Specimen1") |>
select(specimen_id, everything())
dual_fracs_df
#> # A tibble: 2 × 3
#> specimen_id fraction fraction_abundance
#> <chr> <chr> <dbl>
#> 1 Specimen1 IgA+ 420
#> 2 Specimen1 IgA- 1120
Here are the bacteria relative abundances. Again, we add a specimen ID because this is important to the demonstration of data in long format. Each fraction derived from a specimen would appear as a different sample in the sequencing data. To keep track of which samples were derived from each specimen, you would use a separate column for the specimen ID, just as we’ve done here.
Also, we’re going to name the species column “asv”, for amplicon sequence variant. In a real sequencing experiment, we often measure sequence variants rather than the actual species per se. Hopefully, this makes our data table look more similar to what you’d see in a real study.
dual_props_df <- dual_props |>
as.data.frame() |>
rownames_to_column("asv") |>
pivot_longer(-asv, names_to = "fraction", values_to = "asv_abundance") |>
mutate(specimen_id = "Specimen1") |>
select(specimen_id, fraction, everything())
dual_props_df
#> # A tibble: 4 × 4
#> specimen_id fraction asv asv_abundance
#> <chr> <chr> <chr> <dbl>
#> 1 Specimen1 IgA+ e.coli 0.417
#> 2 Specimen1 IgA- e.coli 0.594
#> 3 Specimen1 IgA+ s.mitis 0.583
#> 4 Specimen1 IgA- s.mitis 0.406
Using data in long format, we compute the probabilities in three steps:
- Join the fraction abundance table to the bacteria relative abundance table.
- Group by specimen ID and species (ASV).
- Use
mflow_probability()
to compute the probabilities.
The mflow_probability()
function works on one species/ASV at a time.
Under the hood, this function is used by mflow_apply()
when we work in
matrix format. Here, we no longer have to think about rows and columns,
and our result is ready to feed into ggplot2
for visualization.
dual_props_df |>
left_join(dual_fracs_df, by = c("specimen_id", "fraction")) |>
group_by(specimen_id, asv) |>
mutate(fraction_prob = mflow_probability(asv_abundance, fraction_abundance))
#> # A tibble: 4 × 6
#> # Groups: specimen_id, asv [2]
#> specimen_id fraction asv asv_abundance fraction_abundance fraction_prob
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Specimen1 IgA+ e.coli 0.417 420 0.208
#> 2 Specimen1 IgA- e.coli 0.594 1120 0.792
#> 3 Specimen1 IgA+ s.mitis 0.583 420 0.35
#> 4 Specimen1 IgA- s.mitis 0.406 1120 0.65
Footnotes
-
Unfortunately, I have to transpose the result after I apply a function across the rows of a matrix. It makes the code messier, but it’s necessary because R will flip the rows and columns otherwise. ↩