
Add codes to extract SoilGrids soil texture data and derive ensemble … #3406

Status: Open. Wants to merge 5 commits into base branch develop.

Conversation

Qianyuxuan (Collaborator):
Add codes to extract SoilGrids soil texture data and derive ensemble soil parameter files

Description

  1. Add a script "soilgrids_texture_extraction.R" that extracts and saves three types of soil texture data from SoilGrids250m data, in parallel, for a single lat/long location or a group of user-defined site locations (see the usage sketch after this list).
  2. Add a script "soil_params_ensemble.R" that estimates soil parameters from the soil texture data and writes the ensemble parameter file paths into the settings.
  3. Modify "soil_utils.R" and "extract_soil_nc.R" to fix bugs in generating soil parameter files.
  4. Modify "write.config.SIPNET.R" to read soil water holding capacity and write it into the parameter settings.
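
A minimal usage sketch of the extraction step (the signature is the one quoted later in this review; the site IDs, names, and coordinates below are placeholders, and the return behaviour is assumed rather than confirmed):

```r
# Hypothetical site list in the documented (site_id, site_name, lat, lon) layout
site_info <- data.frame(
  site_id   = c(1000000001, 1000000002),   # placeholder BETYdb IDs
  site_name = c("Site_A", "Site_B"),       # placeholder names
  lat       = c(42.54, 45.81),
  lon       = c(-72.17, -90.08)
)

# Extract sand/clay/silt data from SoilGrids250m for these sites and save them under outdir
soilgrids_texture_extraction(site_info, outdir = "soilgrids_texture", verbose = TRUE)
```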

Motivation and Context

Provide ensemble soil parameter files based on SoilGrids soil texture data for SDA.

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

#working on reading soil file (only working for soilWHC)
if (length(settings$run$inputs$soilinitcond$path) > 0) {
  soil_ICs_num <- length(settings$run$inputs$soilinitcond$path)
  soil_ICs_path <- settings$run$inputs$soilinitcond$path[[sample(1:soil_ICs_num, 1)]]
Member:

I'm guessing the logic of this bit is that if an ensemble input is passed in then you choose one at random? Personally, I'd prefer to throw an error at this point, as by this point in the workflow ensemble inputs should have already been converted to specific inputs for each run. If they're still ensemble at this point it's an error and we should stop. Proceeding on with a random choice means (a) we don't catch an error and (b) we don't know/record which ensemble member mapped to which run.
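
A minimal sketch of the error-first behaviour suggested here, assuming the usual PEcAn.logger conventions (the message text is illustrative):

```r
soil_paths <- settings$run$inputs$soilinitcond$path
if (length(soil_paths) > 1) {
  # By this point each run should already carry a single ensemble member;
  # multiple paths indicate an upstream sampling problem, so stop rather than resample.
  PEcAn.logger::logger.severe(
    "Expected one soilinitcond path per run but received", length(soil_paths))
}
soil_ICs_path <- soil_paths[[1]]
```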

Collaborator Author:

Yeah, I followed the logic of choosing one from the ensemble initial conditions here: https://github.com/PecanProject/pecan/blob/develop/models/sipnet/R/write.configs.SIPNET.R#L546. So are you suggesting we should just randomly choose and write one ensemble member in the script "soil_params_ensemble.R" and read that specific path in "write.configs.SIPNET.R"?

Member:

So your soil_params_ensemble should generate a bunch of ensembles and load them ALL into the overall settings object. What I'm saying is that write.configs.SIPNET should only end up seeing ONE of those ensembles because run.write.configs should do the joint sampling of all the different types of ensembles (met, soils, IC, etc), record which ensemble member was assigned which inputs, and then ensure that the settings object reaching write.configs.SIPNET is only for a specific ensemble member. If you're finding that write.configs.SIPNET is being handed all of the soil ensemble members then there's something wrong upstream in the pecan.xml (e.g. not including soils in the section) or in run.write.configs

Collaborator Author:

@mdietze It seems that "run.write.configs" does not play a role in our current SDA workflow, but I did find a way to sample soil ensemble members through the code in "ensemble.R" by providing a "soilinitcond" tag under "ensemble" in the XML file: https://github.com/PecanProject/pecan/blob/develop/modules/uncertainty/R/ensemble.R#L401-L406. However, this only works for a fresh run and does not seem applicable to a restart. The current restart components only include met inputs and parameters, as in https://github.com/PecanProject/pecan/blob/develop/modules/uncertainty/R/ensemble.R#L422C5-L424. What is your suggestion on that?

Member:

  1. write.ensemble.configs is what's called by run.write.configs, so if that's what SDA is calling too we should be fine. But as noted earlier, it's important for the code to confirm that it's only receiving one ensemble member for each input, and to throw an error if it's not, rather than resampling.
  2. Restart is grabbing inputs, and soils should be part of inputs (as should phenology, initial conditions, etc.), not just met. You should verify this is working correctly, as it is critical that ensemble sampling is preserved (not repeated) both iteratively within an SDA run and across SDA runs (e.g., when running a forecast today that starts from yesterday's forecast). Resampling things (params, inputs, IC, etc.) that have already been sampled is going to seriously mess up the covariance structures in our products.

Collaborator Author:

@mdietze Based on the current argument passing for restart in "sda.enkf_MultiSite.R": https://github.com/PecanProject/pecan/blob/develop/modules/assim.sequential/R/sda.enkf_MultiSite.R#L358-L368, it seems we currently don't grab soil, phenology, or IC inputs, only met (note that only "met" is specified in "input.ens.gen"). There is also a comment: "#TODO: incorporate Phyllis's restart work. #sample all inputs specified in the settings$ensemble not just met". So I guess that is what we should add here?

Collaborator Author:

I added soil parameter files to the inputs in the restart list. Phenology parameters are sampled within parameter sampling and initial conditions don't apply for the t>1 situation, so I guess the current solution seems fine.

Member:

> Phenology parameters are sampled within parameter sampling

Not sure what you mean here; phenology files are separate from the parameter posterior files.

Review threads on files (resolved/outdated): models/sipnet/R/write.configs.SIPNET.R, modules/data.land/R/extract_soil_nc.R, modules/data.land/R/soil_params_ensemble.R, modules/data.land/R/soil_utils.R
# Prevent silt from being negative due to numerical precision issue
silt <- c()
for (i in seq_along(sand)) {
  if (sand[i] + clay[i] >= 100.) {
Member:

If sand + clay > 100, shouldn't it be renormalized?

if (sand + clay > 100) {
  total <- sand + clay
  sand  <- sand / total * 100
  clay  <- clay / total * 100
}

and is this an error in SoilGrids that warrants contacting the creators?

Member:

Actually, now I see this normalizing is done in the rescale_sum_to_one function; would it make sense to add this silt correction to that function?

Collaborator Author (@Qianyuxuan, Jan 16, 2025):

The normalizing in rescale_sum_to_one mainly addresses a SoilGrids data issue (the sum of the means does not equal 1). The code in soil_utils.R only handles R's numeric rounding, because the texture percentages used here come from Dirichlet distribution sampling and their sums strictly equal one. So I guess not normalizing here seems fine.
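
For reference, a vectorized version of the rounding guard being discussed might look like this (a sketch, not the PR's exact code):

```r
# sand and clay come from Dirichlet samples, so sand + clay + silt should equal
# 100 exactly; pmax() only absorbs tiny negative values from floating-point rounding.
silt <- pmax(0, 100 - sand - clay)
```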

Contributor (@DongchenZ) left a comment:

So the general script is well established, following a similar style to the previous SOC extraction function. The questions I raised here are about: 1) the parallel design over the site locations and data products, and 2) the use of downloaded local raster files (if they exist).

future::plan(future::multisession, workers = 3)
}

sand_data_url <-
Contributor:

In a real situation where we might already have the SoilGrids data downloaded on the server, we might want to replace the VRT URL with the actual physical path to the data. To do this, we might need to set those VRT paths as the function arguments (e.g., list(sand_data_url = "/vsicurl?max_retry=30&retry_delay=60&list_dir=no&url=https://files.isric.org/soilgrids/latest/data/sand/sand_")). The downloaded file path, for example on the SCC, can be /projectnb/dietzelab/dongchen/anchorSites/NA_runs/soil_nc/soilgrids_250m/sand/sand_0-5cm_mean/sand/sand_0-5cm_mean.tif.

}


if (future::supportsMulticore()) {
Contributor:

It's not recommended to have two layers of parallelism (e.g., the one parallelizing over the site locations and this one parallelizing over the products). In practice, because of the limitations of the SoilGrids API, the parallelization over the site locations can be turned off.

name_tag <- expand.grid(depths, data_tag, stringsAsFactors = F) # find the combinations between data and depth tags
L <- split(as.data.frame(name_tag), seq(nrow(as.data.frame(name_tag)))) # convert tags into lists

if ("try-error" %in% class(try(soilt_real <- L %>% furrr::future_map(function(l){
Contributor:

When the site number goes up, this parallel extraction will be more likely to produce errors due to the limitations of the SoilGrids server. The extraction speed was similar with or without parallelization in my previous 6400-site runs. Thus, it's recommended to just delete the parallel extraction part from the script.
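
A sketch of the sequential alternative being recommended (the per-tag extraction body is the one already in the script, elided here):

```r
# One worker is enough given the SoilGrids server limits, so drop the future plan
# and loop over the data/depth tag combinations sequentially.
soilt_real <- purrr::map(L, function(l) {
  # ... per-tag extraction, unchanged from the existing future_map() body ...
})
```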

}

# Prepare site info for extraction
internal_site_info <- data.frame(site_info$site_id, site_info$site_name, site_info$lat, site_info$lon)
Contributor:

I think the site_id is enough to describe each site rather than combining site_id and site_name. Thus, you might want to just remove the site_name from the list to make things simpler.

# Reproject locations to soilgrids projection
# SoilGrids data uses the Homolosine projection: https://www.isric.org/explore/soilgrids/faq-soilgrids
p <- terra::vect(lonlat, crs = "+proj=longlat +datum=WGS84") # Users need to provide lon/lat
newcrs <- "+proj=igh +datum=WGS84 +no_defs +towgs84=0,0,0"
Contributor:

The projection to this specific CRS is only needed if you extract values through the SoilGrids VRT URL. If you are using local files, you should use the CRS from the target raster files (see my comments below).
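
A sketch of what using a local raster's own CRS might look like with terra (the local file path is hypothetical):

```r
# When reading a locally downloaded GeoTIFF instead of the VRT URL,
# project the points to the raster's own CRS rather than hard-coding Homolosine.
r      <- terra::rast("/path/to/sand_0-5cm_mean.tif")   # hypothetical local path
p      <- terra::vect(lonlat, crs = "+proj=longlat +datum=WGS84")
p_proj <- terra::project(p, terra::crs(r))
vals   <- terra::extract(r, p_proj)
```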

# A function to extract and save one type of soil texture data
download_and_extraction <- function(base_data_url, site_info, outdir) {

if (future::supportsMulticore()) {
Contributor:

Given the recommendation to remove the parallelization over the site locations, these parallel settings might also be removed from the script, if you don't mind.

##'
##' @param site_info A data frame of site info containing the BETYdb site ID,
##' site name, latitude, and longitude, e.g.
##' (site_id, site_name, lat, lon)
Contributor:

The site_name is not necessary for setting up the site lists (see my comments below).

##' @author Qianyu Li
##' @importFrom magrittr %>%

soilgrids_texture_extraction <- function(site_info, outdir=NULL, verbose=TRUE){
Contributor:

It's recommended to add an extra argument data_paths with the default data_paths = list(sand_data_url = URL1, clay_data_url = URL2, silt_data_url = URL3), so the URLs provided here can be either the SoilGrids VRT URLs or physical paths to the local raster files (see my comments below for more details).
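
A sketch of the suggested signature; the sand URL is the one quoted above, while the clay and silt URLs are assumed to follow the same pattern:

```r
soilgrids_texture_extraction <- function(site_info, outdir = NULL, verbose = TRUE,
                                         data_paths = list(
                                           # Defaults point at the SoilGrids VRT URLs; callers can swap in
                                           # physical paths to locally downloaded rasters instead.
                                           sand_data_url = "/vsicurl?max_retry=30&retry_delay=60&list_dir=no&url=https://files.isric.org/soilgrids/latest/data/sand/sand_",
                                           clay_data_url = "/vsicurl?max_retry=30&retry_delay=60&list_dir=no&url=https://files.isric.org/soilgrids/latest/data/clay/clay_",
                                           silt_data_url = "/vsicurl?max_retry=30&retry_delay=60&list_dir=no&url=https://files.isric.org/soilgrids/latest/data/silt/silt_"
                                         )) {
  # ... extraction body unchanged, reading URLs/paths from data_paths ...
}
```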

#LitterWHC
#param[which(param[, 1] == "litterWHC"), 2] <- unlist(soil_IC_list$vals["volume_fraction_of_water_in_soil_at_saturation"])[1]*100
#working on reading soil file
if (length(settings$run$inputs$soilinitcond$path) > 0) {
Member:

Still not clear how you're handling the case where >1 file is passed in.

soil_IC_list <- PEcAn.data.land::pool_ic_netcdf2list(template.soilinit)
# Calculate the thickness of soil layers, assuming the depth values are layer bottoms and the first layer's top is at 0
if ("depth" %in% names(soil_IC_list$dims)) {
  thickness <- c(soil_IC_list$dims$depth[1], diff(soil_IC_list$dims$depth))
Member:

Does this line still work if the product only has one layer?

Member:

Also, should/could there be an "else" here? A bunch of code below continues to check for "depth" being defined, but doesn't ever use depth again -- instead you're just using this to ensure thickness is defined.
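
A sketch of the thickness calculation with an explicit else branch (the fallback behaviour is an assumption, not something decided in this thread):

```r
if ("depth" %in% names(soil_IC_list$dims)) {
  depth <- soil_IC_list$dims$depth
  # Also works for a single layer: diff() of a length-one vector is empty,
  # so thickness reduces to depth[1] (the layer bottom, with the top at 0).
  thickness <- c(depth[1], diff(depth))
} else {
  # Hypothetical fallback: without a depth dimension, assume a single layer so
  # that the code below can always rely on thickness being defined.
  thickness <- 1
  PEcAn.logger::logger.warn("No depth dimension found; assuming one soil layer of unit thickness.")
}
```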

soilWHC_total <- sum(unlist(soil_IC_list$vals["volume_fraction_of_water_in_soil_at_saturation"])*thickness)
param[which(param[, 1] == "soilWHC"), 2] <- soilWHC_total
#LitterWHC
param[which(param[, 1] == "litterWHC"), 2] <- soilWHC_total
Member:

This is incorrect. If there's a litter layer it would have a much much smaller WHC than the entire soil pool.

#litwaterDrainrate
param[which(param[, 1] == "litWaterDrainRate"), 2] <- unlist(soil_IC_list$vals["soil_hydraulic_conductivity_at_saturation"])[1]*100/(3600*24)
param[which(param[, 1] == "litWaterDrainRate"), 2] <- soilHC_total
Member:

First, I'm not sure why you're using the average hydraulic conductivity of the entire soil column to control the rate at which moisture leaves the litter and enters the first soil layer. Second, we should check that the way SIPNET uses the conductivity (as a proportional decay rate, see https://github.com/PecanProject/sipnet/blob/becec4d2d6d857fa25dc47f974c48621a0b9044b/sipnet.c#L1516) is compatible with the way the parameter is defined in the soils file (Darcy's law) or whether additional assumptions or unit conversions are needed

#LitterWHC
param[which(param[, 1] == "litterWHC"), 2] <- soilWHC_total
}
if ("depth" %in% names(soil_IC_list$dims) && "soil_hydraulic_conductivity_at_saturation" %in% names(soil_IC_list$vals)) {
Member:

Per the comment below about this being for surface litter, I'm not sure you need "depth" here.


@@ -32,17 +32,18 @@ metSplit <- function(conf.settings, inputs, settings, model, no_split = FALSE, o
# Loading the model package - this is required bc of the furrr
library(paste0("PEcAn.",model), character.only = TRUE)

inputs.split <- list()
# Initialize inputs.split with inputs to keep the sub-list of soil parameters "soilinitcond"
Member:

I'm not following the logic here. The metSplit code should only ever be touching the met. The comment here about soil parameters is confusing, as metSplit shouldn't be touching ANY of the other inputs, not just soils. I'm having a hard time telling if the code below is incorrect, or if it's fixing a long standing bug, or if the larger issue is that metSplit shouldn't ever have been passed the entirety of inputs to begin with.

@@ -355,17 +355,28 @@ sda.enkf.multisite <- function(settings,
#reformatting params
new.params <- sda_matchparam(settings, ensemble.samples, site.ids, nens)
}
#sample met ensemble members
#sample met and soil parameter ensemble members
Member:

should be sampling ALL input ensemble members, not just met, not just (met + soils)

Member:

# TODO: incorporate Phyllis's restart work
#       sample all inputs specified in the settings$ensemble, not just met

Member:

FYI Phyllis's work lives on the BU server at /projectnb/dietzelab/yixuanli and still needs going through to make sure (a) write.ensemble.configs is saving ALL the info it needs about ensembles, not just params and (b) that you can have the option to pass in a specific already-sampled ensemble of params, met, soil, IC, etc. without having the code resample it. This is important in the general case to allow for sensitivity analysis designs. It's also important for the SDA case as the sample from t==1 needs to be the same sample that's used for t>1.

parent_ids = NULL
)
})
input_name <- c("met","soilinitcond")
Member:

This should not be hard-coded. The set of inputs that are ensemble-based needs to be detected from the settings.
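
A sketch of detecting the ensemble-based inputs from the settings instead of hard-coding them (the samplingspace layout follows the structure shown further down; treat the exact tag handling as an assumption to verify against write.ensemble.configs):

```r
# Every samplingspace entry except "parameters" describes an input ensemble
# (met, soilinitcond, ...), so derive the input names from the settings.
input_name <- setdiff(names(settings$ensemble$samplingspace), "parameters")
```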

inputs <- list()
new_inputs <- list()
for (i_input in seq_along(input_name)) {
  inputs[[input_name[i_input]]] <- PEcAn.settings::papply(conf.settings, function(setting) {
Member:

could you explain what the papply is doing here?

input_name <- c("met", "soilinitcond")
inputs <- list()
new_inputs <- list()
for (i_input in seq_along(input_name)) {
Member:

Inputs can't be sampled in an arbitrary order. See the logic in write.ensemble.configs. Indeed, if the SDA can't call write.ensemble.configs directly, then there's a lot of code there that needs to be refactored into a function that can be called by both forward and SDA code, since maintaining two separate versions of the sampling code is clearly resulting in code divergence.

@@ -225,6 +225,7 @@ template <- PEcAn.settings::Settings(list(
ensemble = structure(list(size = 25, variable = "NPP",
samplingspace = structure(list(
parameters = structure(list(method = "lhc")),
soilinitcond = structure(list(method = "sampling")),
Member:

Check to make sure there aren't other inputs that are being left out

Member:

Also, is the operational SDA using a LHC sampling of the posterior parameters? That's not random.
