
memory-safe alternative to fasterizeDT? #5

Closed

mikoontz-vp opened this issue Mar 23, 2023 · 7 comments

@mikoontz-vp

Hello!

I'm wondering how feasible it might be to implement rasterizeDT in addition to fasterizeDT? Both fasterize and fasterizeDT don't work on very large template rasters due to memory issues, and I wonder if this would be a good use case for something more like rasterize that can be memory safe, but with a data.table back end?

Thanks so much for your continued development on this package!

@JoshOBrien
Owner

Hi @mikoontz-vp ,

I don't think I'll add a rasterizeDT() to this package, mostly because I don't see an easy way to speed up rasterization using data.table.

That said, you might want to test out this function that I put together a few years ago. At the time, it was about 6 times faster than raster::rasterize(). Under the hood, it uses the gdal_rasterize utility, wrapped by sf::gdal_utils(). The GDAL utilities operate directly on files (not on raster objects held in memory), so, AFAIK, you shouldn't run into any size limitations.
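For a rough idea of what file-based rasterization through sf::gdal_utils() looks like (this is a minimal sketch, not the gist's actual code; the file names and the "DN" attribute are placeholders):

```r
## Burn polygon attribute "DN" into a new 30 m raster, entirely on disk.
## gdal_rasterize creates the output itself when given -tr (or -ts), so
## nothing larger than a processing block is held in R's memory.
library(sf)

gdal_utils(
    util        = "rasterize",
    source      = "polygons.shp",    # vector layer on disk (placeholder)
    destination = "rasterized.tif",  # output raster (placeholder)
    options     = c("-a", "DN",      # burn values from attribute "DN"
                    "-tr", "30", "30")
)
```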

It's been a while since I looked into the relative speed of the R world's various rasterization functions, so if I were you I might also have a look at stars::st_rasterize() and terra::rasterize().

And if either of those work faster/better than the function at the gist I pointed you to, please do let me know what you've learned about them!

@mikoontz-vp
Author

Okay, fair enough!

My hack (the only thing I've gotten to work so far, and what made me think a rasterizeDT() might be helpful!) is: use exactextractr::exact_extract() to get cell coverage while retaining the cell IDs (pretty fast); write a raster to disk the same size as the (very large) template, with each cell's value being that cell's index; then use a data.table of the "covered" cell numbers from the exact_extract() call as the dict argument to subsDT(), with the cell-index raster as the template.
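The workaround described above might look roughly like this (a sketch only; file names are placeholders, and the exact column handling of subsDT()'s dict argument may differ slightly):

```r
## Sketch of the exact_extract() + subsDT() workaround.
library(raster)
library(exactextractr)
library(data.table)
library(rasterDT)

template <- raster("big_template.tif")   # very large template (placeholder)
polys    <- sf::st_read("polygons.shp")  # polygons to rasterize (placeholder)

## 1. Write a raster of the same size whose values are the cell indices;
##    raster::init() with fun = "cell" does this block-wise on disk.
idx <- init(template, fun = "cell", filename = "cell_index.tif")

## 2. Get covered cell numbers (and coverage fractions) per polygon.
cov <- rbindlist(
    exact_extract(template, polys, include_cell = TRUE),
    idcol = "poly_id"
)

## 3. Use the covered cells as a from/to dictionary: cell index ->
##    polygon id; uncovered cells become NA.
dict <- cov[, .(from = cell, to = poly_id)]
out  <- subsDT(idx, dict, filename = "rasterized.tif")
```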

I had already left issues for {stars}, {sf}, and {terra} authors about how they wouldn't work for very large rasters:
stars::st_rasterize: r-spatial/stars#621 and r-spatial/sf#2130
terra::rasterize(): rspatial/terra#1072. Robert may have fixed this issue but I can't test as I'm on an Apple Silicon mac and installing anything geospatial from source is a nightmare.

I had forgotten about your gist! I'll try that next. Thanks so much!

@JoshOBrien
Owner

JoshOBrien commented Mar 24, 2023

Wow, quite a saga!

If you want to test Robert's fix without recompiling the whole package, you could use this:

library(terra)
trace("rasterize", signature=c(x="SpatVector", y="SpatRaster"), edit=TRUE)

Add Robert's one-line fix at the appropriate place and then (if using RStudio) press the Save button to return to the command line. After that, calls to rasterize() will dispatch to the modified function. (You can't use this trick if you want to add or change any of a function's arguments, but in this case that's not an issue.)
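One small addition: once you're done testing, the temporary in-place patch can be removed with untrace(), using the same signature:

```r
library(terra)
## Remove the temporary patch applied with trace(..., edit = TRUE), so
## rasterize() dispatches to the original method again.
untrace("rasterize", signature = c(x = "SpatVector", y = "SpatRaster"))
```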

@kadyb

kadyb commented Mar 24, 2023

@JoshOBrien, I recently did a benchmark of functions for rasterizing polygons, and the fastest is {fasterize} (it uses a completely different algorithm than GDAL). {stars} and {terra} are similar, as they both use GDAL. The slowest is {raster} (several times slower than the others). Here is the code: kadyb/raster-benchmark#15

@mikoontz-vp, maybe controlledburn from @mdsumner will be helpful in some way here (but I'm not sure)?

And a little off topic. @mikoontz-vp, maybe you should divide this 70 GB raster into smaller blocks beforehand and process them in a loop? There is a makeTiles() function in terra and st_tile() in stars.
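The tile-and-loop idea could be sketched like this with terra (file names, the tile size, and the "DN" field are placeholders; this assumes a recent terra where makeTiles() accepts tile dimensions in cells):

```r
## Split a huge template into on-disk tiles and rasterize each separately,
## so memory use stays bounded by the tile size.
library(terra)

r <- rast("big_template.tif")   # large template (placeholder)
v <- vect("polygons.shp")       # polygons to burn in (placeholder)

## Write ~2048 x 2048-cell tiles to disk.
tiles <- makeTiles(r, c(2048, 2048), filename = "tile_.tif")

## Rasterize into each tile in turn.
for (f in tiles) {
    tile <- rast(f)
    rasterize(v, tile, field = "DN",
              filename = sub("tile_", "out_", f))
}
```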

@mikoontz-vp
Author

All of this is super helpful; thank you both so much (love the edit from a "." to a "!", Josh, in describing the saga! 😭)

I'm going to close this issue now, as the initial ask (a rasterizeDT() implementation based on the gdal algorithm rather than the in-memory fasterize algorithm) is out of scope for {rasterDT}.

@JoshOBrien
Owner

Thanks to both of you. Not unexpectedly, the function in the gist I linked to (with a bit of additional cleaning up) is comparable in speed to the stars, terra, and GDAL approaches in @kadyb 's benchmark.

@mdsumner

The crux is definitely memory management. You can use terra or stars now to get tiles, but I prefer doing the parts of workflows as steps in the process. FWIW, hypertidy/grout will give the tiling based on simple inputs; it's just important to find out what the tiling is (if there is one) vs. what you define and use.

FWIW, controlledburn is a way to get the index part of fasterize; you'd still need an efficient way to write out the pixel values per tile (again, both stars and terra AFAIK can instantiate a raster and write-update to it now). fasterize is insanely fast, but it has no way to avoid that huge matrix being in memory. I plan to just put its algorithm into GDAL rather than various R packages, but there are other compelling aspects to this taking my attention 😆 Apologies to go on, but it seemed worth contextualizing a bit 🙏
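The "instantiate a raster and write-update to it" idea can be sketched with terra's block-wise writing API (the values written here are dummy zeros; a real workflow would compute per-block burn values, e.g. from a polygon-to-cell index like the one controlledburn produces):

```r
## Create a large output raster and fill it block by block, so only one
## block of values is ever in memory at a time.
library(terra)

template <- rast(nrows = 10000, ncols = 10000)  # big grid, no values in RAM
out <- rast(template)

b <- writeStart(out, "burned.tif", overwrite = TRUE)
for (i in seq_len(b$n)) {
    nvals <- b$nrows[i] * ncol(out)
    ## Placeholder: compute the real values for rows b$row[i] onward here.
    out <- writeValues(out, numeric(nvals), b$row[i], b$nrows[i])
}
out <- writeStop(out)
```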
