Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

threshold for low-occupancy data #4

Open
haiyangzhang798 opened this issue Jul 31, 2022 · 3 comments
Open

threshold for low-occupancy data #4

haiyangzhang798 opened this issue Jul 31, 2022 · 3 comments

Comments

@haiyangzhang798
Copy link

Many thanks for the interesting work. I wonder how to decide the threshold number for the low-occupancy data in their own dataset? For example, you used 10 (i.e., occupying less than 10 samples) in the endophyte dataset, but will that number apply to one's own microbial data? Any code to run a sensitivity analysis with one's own data to determine the threshold?
Thanks.

@darcyj
Copy link
Owner

darcyj commented Aug 2, 2022

This is a great question. In the paper, that occypancy>10 number was calculated via simulation, i.e. by down-sampling real data. Creating a function to do this for user data would be quite easy, and would be a good addition to the package. I'll prototype such a function and add the code to this issue, and eventually add it to the main package.

@darcyj
Copy link
Owner

darcyj commented Aug 2, 2022

Try this out. See the example for a how-to.

#' occ_power
#'
#' Downsamples a feature vector x to lower occupancies, and calculates Spec
#' on each resulting vector. Useful for power analysis. phy_or_env_spec is
#' run on simulated (downsampled) vectors. It is run with denom_type =
#' "sim_center" to speed up computation. 
#'
#' @author John L. Darcy
#'
#' @param x numeric vector. A feature vector, i.e. a column from the abunds_mat
#'   argument to phy_or_env_spec()
#' @param reps_per_core. integer. Number of simulated vectors to create from x
#'   for each occupancy level (default: 10)
#' @param min_occ integer. Smallest occupancy level to simulate (default: 5)
#' @param max_occ integer. Largest occupancy level to simulate. If NULL, will
#'   be set to length(x) (default: NULL)
#' @param ... Arguments to be passed to phy_or_env_spec(). Must include env or 
#'   hosts and hosts_phylo. Suggest including n_cores and n_sim. Cannot include
#'   denom_type.
#'
#' @return data.frame object with following columns:
#'   \describe{
#'     \item{Pval:}{
#'       P-value from phy_or_env_spec()
#'     }
#'     \item{Spec:}{
#'       Spec from phy_or_env_spec(); NOTE: values >0 are not corrected.
#'     }
#'     \item{occupancy:}{
#'       occupancy of simulated vector
#'     }
#'   }
#'
#' @examples
	# library(specificity)
	# data(endophyte)
	# env <- endophyte$metadata$Elevation
	# # find an empirical feature vector with high occupancy and strong specificity.
	# abunds_mat <- occ_threshold(prop_abund(endophyte$otutable), 100)
	# empirical_specs <- phy_or_env_spec(abunds_mat, env, denom_type="sim_center", n_cores=10)
	# plot(empirical_specs$Spec ~ colSums(abunds_mat > 0))
	# x <- abunds_mat[, which.min(empirical_specs$Spec)]
	# # run this function. speed up by lowering reps_per_occ.
	# sim_specs <- occ_power(x, env=env, n_cores=10, chunksize=250)
	# # plot
	# plot(sim_specs$Pval ~ sim_specs$occupancy)
	# # calculate false negative rate (FNR) for each occupancy
	# fnrs <- aggregate(Pval~occupancy, sim_specs, FUN=function(x){ sum(x >= 0.05) / length(x) })
	# colnames(fnrs) <- c("occupancy", "FNR")
	# # add spline fit
	# fnrs$FNR_spline <- smooth.spline(x=fnrs$occupancy,y=fnrs$FNR)$y
	# # make a plot
	# plot(FNR~occupancy, data=fnrs)
	# points(FNR_spline ~ occupancy, data=fnrs, type="l")
#'
#' @export
occ_power <- function(x, reps_per_occ=10, min_occ=5, max_occ=NULL,  ...){
	calc_occ <- function(x){sum(x > 0)}

	if(is.null(max_occ)){ max_occ <- calc_occ(x) }
	occs <- rep(seq(from=min_occ, to=max_occ), reps_per_occ)

	xpositions <- seq_along(x)
	abunds_mat <- do.call("cbind", lapply(occs, function(occ){
		xi <- x
		while(calc_occ(xi) > occ){
			xi[sample(xpositions, 1)] <- 0
		}
		return(xi)
	}))
	out <- phy_or_env_spec(abunds_mat=abunds_mat, denom_type="sim_center", ...)
	out$occupancy <- colSums(abunds_mat > 0) # recalculate empirical occupancy
	return(out)
}

@haiyangzhang798
Copy link
Author

Tested already and it works!! Super! Many thanks, John.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants