
Conversation

@divine7022
Collaborator

Description

Refactored extract_soil_gssurgo() to replace point-based WFS queries with a raster-based Web Coverage Service (WCS) approach, enabling accurate area-weighted sampling and eliminating spatial coverage gaps.

  • Implemented a bounding-box raster query using soilDB::mukey.wcs() to extract map unit keys with complete spatial coverage at 30 m resolution. This provides accurate area weighting through pixel counts and eliminates spatial coverage gaps (see the sketch after this list).
  • Used soilDB::get_SDA_property() with the "Weighted Average" aggregation method. This retrieves soil properties (sand, silt, clay, organic matter, bulk density) integrated across the specified depth ranges, with component weighting, in a single batch query.
  • Integrated soilDB::fetchSDA() to obtain complete rock fragment data (fragvol_r), representing total volume across all size classes: 2-75 mm (pebbles), 75-250 mm (cobbles), 250-600 mm (stones), and >600 mm (boulders). Applied proper depth and component weighting to the fragments.
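For reference, a minimal sketch of this query flow (a sketch only: it assumes soilDB and terra are installed, and the point, radius, and depth values below are illustrative rather than the PR's defaults):

  # 1. Map unit key raster over an area of interest at 30 m resolution
  aoi <- data.frame(lon = -88.2, lat = 40.1) |>
    terra::vect(crs = "epsg:4326") |>
    terra::buffer(width = 500)
  mu_raster <- soilDB::mukey.wcs(aoi, db = "gssurgo", res = 30)

  # 2. Pixel counts per mukey supply the area weights
  mukey_counts <- table(terra::values(mu_raster))
  mukeys <- as.integer(names(mukey_counts))

  # 3. Depth- and component-weighted properties in one batch query
  props <- soilDB::get_SDA_property(
    property = c("sandtotal_r", "silttotal_r", "claytotal_r", "om_r", "dbthirdbar_r"),
    method = "Weighted Average",
    mukeys = mukeys,
    top_depth = 0, bottom_depth = 15
  )

  # 4. Horizon/component detail (e.g. fragvol_r) for the same map units
  sda_data <- soilDB::fetchSDA(
    WHERE = paste0("mukey IN (", paste(mukeys, collapse = ", "), ")")
  )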

Motivation and Context

Fixes #3609

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@infotroph changed the title from "Fix spatial sampling and improve data aggregation accuracy" to "gSSURGO: Fix spatial sampling and improve data aggregation accuracy" on Oct 10, 2025
# Keep other soil data, mark fragments as explicitly missing
# complete.cases() will filter these out later
PEcAn.logger::logger.info(
paste("Fragment data unavailable for depth", top_depth, "-", bottom_depth,
Member

Note that the logger has its own internal pasting approach. See output of PEcAn.logger:::logger.message.
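A minimal sketch of the difference, assuming the PEcAn.logger functions collapse their ... arguments internally so the explicit paste() is redundant:

  # with an explicit paste():
  PEcAn.logger::logger.info(
    paste("Fragment data unavailable for depth", top_depth, "-", bottom_depth))

  # letting the logger do the pasting:
  PEcAn.logger::logger.info(
    "Fragment data unavailable for depth", top_depth, "-", bottom_depth)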

Collaborator Author

thanks for the info!

@dlebauer (Member) left a comment

This PR implements the features specified in #3609. Please check the comments and suggestions, then it will be ready to merge. Thanks!

#' @author Hamze Dokoohaki, Akash
#' @export
#'
extract_soil_gssurgo <- function(outdir, lat, lon, size=1, grid_size=3, grid_spacing=100, depths=c(0.15,0.30,0.60)){
Member

With the change back to a set distance instead of a grid, I think it would also make sense to revert the grid_size and grid_spacing parameters back to something more like the radius that was here before #3534.

Member

This can also greatly simplify aoi setup -- lines 30 through 42 below could reduce to

aoi <- data.frame(lon = lon, lat = lat) |>
  terra::vect(crs = "epsg:4326") |>
  terra::buffer(radius)

@infotroph (Member) commented on Nov 4, 2025

Follow-up thought: I'd also love to have support for passing in a custom area of interest so that I could retrieve e.g. soil properties for one whole field instead of a fixed radius/grid that approximates it. I think the change above would make that pretty easy -- if the user provides a lat/lon then construct the AOI as shown here; if they provide a polygon then pass that straight into mukey.wcs.

Does the following (!untested!) design make sense to you?

Suggested change

  - extract_soil_gssurgo <- function(outdir, lat, lon, size=1, grid_size=3, grid_spacing=100, depths=c(0.15,0.30,0.60)){
  + extract_soil_gssurgo <- function(outdir, lat, lon, aoi = NULL, size = 1, radius = 500, depths=c(0.15, 0.30, 0.60)) {
  +   if (missing(aoi)) {
  +     aoi <- data.frame(lon = lon, lat = lat) |>
  +       terra::vect(crs = "epsg:4326") |>
  +       terra::buffer(radius)
  +   }

... along with deletion of current lines 30 to 42 below?

Collaborator Author

I don't really have a strong preference here between radius and grid. Switching to radius makes more sense since we are no longer doing grid sampling (it's a single fetchSDA call now).
Implemented the changes, thanks!

# all size classes: 2-75mm (pebbles), 75-250mm (cobbles), 250-600mm (stones), and >600mm (boulders).
# plus component weighting needed for aggregation
sda_data <- tryCatch({
soilDB::fetchSDA(
Member

Every call to fetchSDA returns all horizons from all components in the requested map units, so this does not need to be inside the loop over depths.

Also, in my local testing it seems all(c("sandtotal_r", "silttotal_r", "claytotal_r", "om_r", "dbthirdbar_r") %in% names(sda_data)) is TRUE, so it may make sense to skip the get_SDA_property call entirely and work just from this result.

Collaborator Author

I have implemented both suggestions in the recent commit: fetchSDA now retrieves all horizons once, and the loop then filters by depth - much more efficient for this complex query.
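A minimal sketch of that pattern (the depth handling and column names are illustrative assumptions, not necessarily the exact code in the commit):

  # fetch all horizons for all components in the requested map units, once
  sda_data <- soilDB::fetchSDA(
    WHERE = paste0("mukey IN (", paste(mukeys, collapse = ", "), ")")
  )
  horizons_all <- aqp::horizons(sda_data)

  # then, inside the loop over depth intervals, just subset by depth
  for (i in seq_along(depths)) {
    top_depth <- if (i == 1) 0 else depths[i - 1] * 100   # m to cm
    bottom_depth <- depths[i] * 100
    hz <- horizons_all[horizons_all$hzdept_r < bottom_depth &
                         horizons_all$hzdepb_r > top_depth, ]
    # ... aggregate hz with depth and component weighting for this interval
  }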

soc_sd <- stats::sd(soc_mean, na.rm = TRUE)
n_depths <- length(soc_mean)

if (n_depths == 1 || is.na(soc_sd) || soc_sd == 0) {
Member

Why NA when n_depths == 1? I would have expected that 1 layer still means one datapoint to be had

Collaborator Author

My thinking back then was that with only one record the sd obviously becomes NA. But good catch! Returning NA for a single observation discards valid data unnecessarily.
I have updated the code to:

if (is.na(soc_sd) || soc_sd == 0) {
  # No variability - use measured value (all ensemble members identical)
  simulated_soc <- rep(soc_mean, size)
} else {
  # Has variability - sample from gamma distribution
  shape <- (soc_mean^2) / (soc_sd^2)
  rate <- soc_mean / (soc_sd^2)
  simulated_soc <- stats::rgamma(size, shape = shape, rate = rate)
}

With a single component (n = 1), sd is NA, so all ensemble members get the measured value. With multiple distinct components (sd > 0), we sample from the distribution.
Does that make sense now?

}

# calling the query function sending the mapunit keys
soilprop <- gSSURGO.Query(
Member

Does the approach used here completely replace gSSURGO.Query? If so we might be able to deprecate it now for removal in a future release -- as far as I know it is mostly (only?) used inside extract_soil_nc.

Collaborator Author

Yes, the new approach completely replaces gSSURGO.Query and uses fetchSDA instead. As far as I can tell, gSSURGO.Query was only called within extract_soil_nc.R (that call is now removed), so user impact should be minimal.
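If deprecation is the route taken, a minimal sketch (assuming the function lives in PEcAn.data.land; the message wording is illustrative):

  gSSURGO.Query <- function(...) {
    .Deprecated(
      new = "extract_soil_gssurgo",
      package = "PEcAn.data.land",
      msg = paste(
        "gSSURGO.Query is deprecated and will be removed in a future release;",
        "soil extraction now uses soilDB::fetchSDA via extract_soil_gssurgo()."))
    # ... existing query code retained until removal ...
  }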


# let's fit dirichlet for each depth level separately
simulated.soil.props<-soilprop.new.grouped %>%
split(list(soilprop.new.grouped$DepthL, soilprop.new.grouped$mukey)) %>%
Member

@divine7022 can you explain more of the intention here? It looks like removing the split by depth switches this from fitting a texture for each depth layer to fitting one texture across the entire depth profile, which is not what we want.

Collaborator Author

Good question! I suspect this is the point where I got stuck on that decision.

I have reverted back to splitting by both depth AND mukey.
The initial change was an experiment to increase the sample size for Dirichlet fitting and to capture vertical variability.

Context for why I tried it:
The old code used get_SDA_property, which returned one pre-aggregated value per (mukey, depth) - essentially one observation per group; with n = 1, Dirichlet parameter estimation is unstable (it needs roughly 3-5 samples for reasonable MLE convergence).
The new code uses a single fetchSDA call with component-level aggregation, which preserves multiple components per (mukey, depth).
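A minimal sketch of the split-then-fit pattern being discussed; the column names and the use of sirt::dirichlet.mle / sirt::dirichlet.simul are illustrative assumptions, not necessarily what the PR uses:

  # one Dirichlet fit per (depth layer, map unit) group, so each depth keeps
  # its own texture distribution instead of one fit across the whole profile
  simulated.soil.props <- soilprop.new.grouped %>%
    split(list(soilprop.new.grouped$DepthL, soilprop.new.grouped$mukey)) %>%
    purrr::keep(~ nrow(.x) > 0) %>%
    purrr::map(function(grp) {
      # sand/silt/clay proportions for the components in this group
      tex <- as.matrix(grp[, c("sand", "silt", "clay")]) / 100
      fit <- sirt::dirichlet.mle(tex)                    # MLE of alpha
      sirt::dirichlet.simul(                             # draw ensemble members
        matrix(fit$alpha, nrow = size, ncol = 3, byrow = TRUE))
    })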

Member

Thanks for reverting. For future work, changes that fundamentally alter a function's behavior ought to be called out in the PR description, especially if they're experimental -- I didn't notice this until I tried a local test run and wondered why it produced files with all layers identical.

@dlebauer (Member) left a comment

This looks good to me. Thanks for all of your work on this!

@infotroph it looks like @divine7022 has addressed your suggestions and answered your questions. This seems to be a substantial improvement. Are there any errors that would prevent this from being merged?

@infotroph (Member) left a comment

A few more comments here. I haven't finished my re-review yet, but I can already say that given the magnitude of the changes it would be very helpful to include some test cases so we can verify the function is behaving as intended.
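A hypothetical shape for such a test, assuming network access and assuming extract_soil_gssurgo() writes one netCDF file per ensemble member into outdir (neither is confirmed here):

  test_that("extract_soil_gssurgo produces soil files for a CONUS point", {
    testthat::skip_if_offline()
    testthat::skip_on_cran()
    outdir <- withr::local_tempdir()
    extract_soil_gssurgo(outdir, lat = 40.1, lon = -88.2, size = 2)
    nc_files <- list.files(outdir, pattern = "\\.nc$", full.names = TRUE)
    expect_gt(length(nc_files), 0)
  })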

Comment on lines +39 to +41
terra::project("epsg:5070") %>%
terra::buffer(width = radius) %>%
terra::project("epsg:4326")
Member

Reprojecting introduces distortion that can be severe, and isn't needed because terra knows how to accurately buffer in meters from a geographic CRS. Better to just pass the latlongs and let it buffer from those.

Suggested change

  - terra::project("epsg:5070") %>%
  -   terra::buffer(width = radius) %>%
  -   terra::project("epsg:4326")
  + terra::buffer(width = radius)

For a visual demo of the distortion issue, compare the circular result of

  data.frame(lat = 70, lon = 0) |> vect(crs = "epsg:4326") |> buffer(1000) |> plot()

to the very elongated result of

  data.frame(lat = 70, lon = 0) |> vect(crs = "epsg:4326") |> project("epsg:5070") |> buffer(1000) |> project("epsg:4326") |> plot()


Comment on lines +106 to +108
sandtotal_r = stats::weighted.mean(sandtotal_r, hz_thickness, na.rm = TRUE),
silttotal_r = stats::weighted.mean(silttotal_r, hz_thickness, na.rm = TRUE),
claytotal_r = stats::weighted.mean(claytotal_r, hz_thickness, na.rm = TRUE),
Member

Could treating these as three independent averages cause any problems for Dirichlet fitting below in cases where the horizons we're averaging together have strong discontinuities in texture? I haven't worked through the math yet, but maybe you did already.

(By "discontinuity" I mean cases where, say, a horizon with 30% sand and a horizon with 70% sand get combined into a weighted average of 50%. Since soil mappers very often assign horizon boundaries on the basis of a change in texture, this will be a pretty common occurrence in the database.)


Development

Successfully merging this pull request may close these issues.

Restore Area-Weighted Sampling for SSURGO

3 participants