[WIP] Proposed vector data format and application to LandIQ data #3423
+206
−0
Description
This PR introduces two new functions to the PEcAn.data.land package:
Motivation and Context
The goal here is to propose a new format for handling geospatial data with large tables.
This is motivated by the CCMMF project, and the use of the LandIQ crop datasets in particular, but should be generalizable to other workflows that use vector geospatial data.
It is also motivated by the desire to decouple workflows from BETYdb and its associated dependence on Postgres+PostGIS, which has often been more of a barrier than originally envisioned.
Other options:
Linking CSVs and GPKG
Spatial joins can be slow and we don't want to store geometries in the CSV (they are large as text and that would be redundant).
As proposed, the tables are linked by an id generated as a hash of the geometry by the digest::digest() function. This adds a new dependency (though it is already used in the api).
An alternative to using the hash as an id would be to store lat+lon of the centroid in both the GPKG and all associated CSVs. Then joins could be on lat+lon.
This would have the advantage of allowing some (many) uses of the CSVs independent of spatial files and libraries.
There is a nonzero chance that distinct geometries could have the same centroid (e.g., two cells from rasters with different resolutions, and perhaps other edge cases).
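To make the trade-off concrete, here is a minimal sketch of both linking strategies. It is not the code in this PR: the geometries are hypothetical WKT strings rather than real GPKG features, the md5 algorithm and the choice to hash WKT text are assumptions, and the column names and crop values are made up for illustration.

```r
library(digest)

# Hypothetical geometries as WKT: two square cells that share a centre
# but come from grids with different resolutions.
geoms <- data.frame(
  name = c("fine_cell", "coarse_cell"),
  wkt = c(
    "POLYGON ((1 1, 2 1, 2 2, 1 2, 1 1))",
    "POLYGON ((0 0, 3 0, 3 3, 0 3, 0 0))"
  ),
  stringsAsFactors = FALSE
)

# Option 1: id = hash of the geometry.
# serialize = FALSE hashes the WKT text itself (assumed representation).
geoms$id <- vapply(
  geoms$wkt,
  function(w) digest::digest(w, algo = "md5", serialize = FALSE),
  character(1),
  USE.NAMES = FALSE
)

# Attribute table as it might be stored in a CSV: no geometry, only the id.
# Rows are deliberately in a different order than the geometry table.
attrs <- data.frame(
  id = rev(geoms$id),
  crop = c("almonds", "rice"),  # made-up values
  stringsAsFactors = FALSE
)

# A plain, non-spatial join on the hashed id.
joined <- merge(geoms[, c("id", "name")], attrs, by = "id")

# Option 2: id = centroid. For an axis-aligned square the centroid is the
# midpoint of its bounding box.
square_centroid <- function(xmin, ymin, xmax, ymax) {
  c(x = (xmin + xmax) / 2, y = (ymin + ymax) / 2)
}
c_fine <- square_centroid(1, 1, 2, 2)
c_coarse <- square_centroid(0, 0, 3, 3)

# The two distinct cells collide on centroid but not on hash.
same_centroid <- identical(c_fine, c_coarse)
same_hash <- geoms$id[1] == geoms$id[2]
```

In this sketch the two cells receive different hashed ids while sharing the centroid (1.5, 1.5), which is the collision case described above. One caveat of hashing a text representation is that the id changes if the same shape is written with a different vertex order or coordinate precision, so whatever representation is hashed would need to be produced consistently.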
Types of changes
Checklist: