Add STACAPI dataset #412
Conversation
instead of what is currently in
Generally, we want to minimize direct dependencies
I have previously benchmarked different ways of transforming coordinates (https://gist.github.com/calebrob6/334af85deb4c862689020ae4b3344bfc) and pyproj was by far the fastest. The constructor may be sped up if you replace the
I think we have to reproject to a single resolution. I think it makes sense to choose the highest resolution if the user doesn't specify (what stackstac will do by default).
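The "highest resolution unless the user specifies one" rule from the comment above could look something like this. A sketch only; `choose_resolution` and the per-item resolution list are hypothetical names, not part of this PR:

```python
from typing import Optional, Sequence


def choose_resolution(
    item_resolutions: Sequence[float], res: Optional[float] = None
) -> float:
    """Pick a single target resolution for reprojecting a stack of items.

    "Highest resolution" means the smallest ground sample distance, so
    when the user does not specify one, default to the minimum of the
    per-item resolutions (mirroring stackstac's default behaviour).
    """
    if res is not None:
        return res
    return min(item_resolutions)
```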
I think this makes sense -- what do you think @adamjstewart?
I think we can't have a plot for this class as it is too generic (e.g. this could be used to grab anything on the Planetary Computer).

Another thought -- for the Planetary Computer specifically, we will need to sign STAC Items before we can load data from them. Perhaps there should be a Callable "item_transformer" argument that takes an Item as input and returns an Item, and is an identity transform by default? For the Planetary Computer case we can make a transform that signs the item. @TomAugspurger, any thoughts in general here?
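The identity-by-default hook described above might be sketched like this. The names (`ItemTransform`, `identity`, `_load`) are hypothetical, and the Planetary Computer signing function is only referenced in a comment:

```python
from typing import Any, Callable

# A per-item hook: takes a STAC Item, returns a (possibly modified) Item
ItemTransform = Callable[[Any], Any]


def identity(item: Any) -> Any:
    """Default transform: return the item unchanged."""
    return item


class STACAPIDataset:
    """Sketch of the proposed constructor, not the actual PR code."""

    def __init__(self, item_transform: ItemTransform = identity) -> None:
        self.item_transform = item_transform

    def _load(self, item: Any) -> Any:
        # Apply the hook before fetching assets; for the Planetary
        # Computer one would pass item_transform=planetary_computer.sign
        return self.item_transform(item)
```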
For me, the biggest remaining question is how to integrate this with
What do you mean here? (e.g. I would consider this integrated as it extends RasterDataset)
No, I don't think so. Users could easily use it with the Landsat8 collection on the Planetary Computer, but would need to do more work if they wanted the features that come with the Landsat dataset.
torchgeo/datasets/stacapidataset.py
Outdated
```python
# from torchgeo.samplers import RandomGeoSampler


class STACAPIDataset(RasterDataset, abc.ABC):
```
```diff
- class STACAPIDataset(RasterDataset, abc.ABC):
+ class STACAPIDataset(GeoDataset):
```
This is neither an ABC (there are no abstract methods), nor does it need to be a subclass of `RasterDataset` (it doesn't re-use any of the logic from `RasterDataset`). I would consider this to be just another possible base class that subclasses `GeoDataset`.
@calebrob6 I guess we're not on the same page then, I've always thought of STAC integration as no different than finding files on your filesystem. My hope was that we could have a single dataset
I see what you mean now -- I think we are only not on the same page simply because you are 1 chapter ahead of me ;)
I like this! Some thoughts:
@calebrob6 I have been trying out your suggestion with different bounds and stacks, but here the
Hi @nilsleh, thanks for opening up this PR, can't wait to see torchgeo be able to use STAC datasets directly! I've got a few suggested changes, including one that should hopefully fix the empty array problem you found.
At the moment this STACAPIDataset implementation seems to be specific to just Sentinel-2, and it also assumes the use of Planetary Computer. Maybe we should have a think about how to make this a bit more generic; the easiest way I can think of is to use some try...except or if...then statements to see if the api_endpoint contains `planetarycomputer`, but I will need to think this through a bit more.
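The endpoint check could be as simple as a substring test that selects a default per-item transform. This is only a sketch of the idea under discussion, not code from this PR; `pick_item_transform` and `sign_fn` are hypothetical names, and the actual signing function (e.g. `planetary_computer.sign`) is left as a caller-supplied argument:

```python
from typing import Any, Callable, Optional

ItemTransform = Callable[[Any], Any]


def pick_item_transform(
    api_endpoint: str, sign_fn: Optional[ItemTransform] = None
) -> ItemTransform:
    """Pick a default per-item transform based on the STAC API endpoint.

    If the endpoint looks like the Planetary Computer and a signing
    function was provided, use it; otherwise fall back to the identity.
    """
    if "planetarycomputer" in api_endpoint and sign_fn is not None:
        return sign_fn
    return lambda item: item
```

A fragile heuristic (any URL containing the substring matches), but it keeps the dataset class itself free of Planetary Computer-specific code.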
torchgeo/datasets/stacapidataset.py
Outdated
```python
if not items:
    raise RuntimeError(
        f"Your search criteria off {query_parameters} did not return any items"
    )
```
```diff
- f"Your search criteria off {query_parameters} did not return any items"
+ f"No items returned from search criteria: {query_parameters}"
```
torchgeo/datasets/stacapidataset.py
Outdated
```python
crs_dict = {"init": "epsg:{}".format(epsg)}
src_crs = CRS.from_dict(crs_dict)
if crs is None:
    crs = CRS.from_dict(crs_dict)
```
Should be ok to just construct the CRS from the EPSG code directly.
```diff
- crs_dict = {"init": "epsg:{}".format(epsg)}
- src_crs = CRS.from_dict(crs_dict)
- if crs is None:
-     crs = CRS.from_dict(crs_dict)
+ src_crs = CRS.from_epsg(epsg)
+ if crs is None:
+     crs = CRS.from_epsg(epsg)
```
torchgeo/datasets/stacapidataset.py
Outdated
```python
self.index.insert(i, coords, item)

self._crs = crs
self.res = 10
```
Probably shouldn't hardcode the resolution to 10 here, not all STAC datasets have a 10m spatial resolution.
```diff
- self.res = 10
+ self.res = res
```
Yes, sorry, this was a mypy workaround. I think because `res` in `GeoDataset` is expecting a float, but `STACAPIDataset` takes `res: Optional[float]` as an argument, mypy is failing.
Not sure if it helps, but I see other parts of torchgeo using `self.res = typing.cast(float, res)`, e.g. at torchgeo/datasets/geo.py line 379 in 9f96cdd:

```python
self.res = cast(float, res)
```
```python
aoi = stack.loc[
    ..., query.maxy : query.miny, query.minx : query.maxx  # type: ignore[misc]
]
```
> @calebrob6 I have been trying out your suggestion with different bounds and stacks, but here the `.loc[]` indexing always returns an array of shape [num_items, channels, 0, 0] so no height or width, and I can't seem to figure out what is causing that.
This doesn't work because the `stack` DataArray has coordinates in a UTM projection, but the `query` was using longitude/latitude coordinates. Need to use the same coordinate reference system in both for this to work. See my suggestion at L113 (#412 (comment)) that should fix this.
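To spell out the fix: the query bounds have to be reprojected into the stack's CRS before the `.loc[]` indexing. A minimal sketch with pyproj, using the Alaska example coordinates from this PR and assuming the stack is in EPSG:32606 (UTM zone 6N):

```python
from pyproj import Transformer

# Query bounds in lon/lat (EPSG:4326), matching the example in this PR
minx, maxx = -148.46876, -148.31072
miny, maxy = 61.0491, 61.12567489536982

# Transformer into the stack's UTM CRS; always_xy=True keeps
# (x, y) = (lon/easting, lat/northing) ordering on both sides
transformer = Transformer.from_crs(4326, 32606, always_xy=True)

# transform_bounds computes the envelope of the reprojected rectangle
# (a lon/lat rectangle is not a rectangle in UTM, so corners alone
# would not be enough)
utm_minx, utm_miny, utm_maxx, utm_maxy = transformer.transform_bounds(
    minx, miny, maxx, maxy
)
```

The resulting values should be close to the UTM bounds suggested further down in this review.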
torchgeo/datasets/stacapidataset.py
Outdated
```python
minx, miny, maxx, maxy = item.bbox

transformer = Transformer.from_crs(src_crs.to_epsg(), crs.to_epsg())
```
The output from `item.bbox` appears to be longitude/latitude coordinates, i.e. EPSG:4326. I don't know if this is the case for all STAC datasets, but assuming that `item.bbox` is always EPSG:4326, then the pyproj Transformer should be doing the coordinate transform like so. Oh, and it's usually a good idea to use `always_xy=True`.
```diff
  minx, miny, maxx, maxy = item.bbox
- transformer = Transformer.from_crs(src_crs.to_epsg(), crs.to_epsg())
+ transformer = Transformer.from_crs(4326, crs.to_epsg(), always_xy=True)
```
torchgeo/datasets/stacapidataset.py
Outdated
```python
minx = -148.46876
maxx = -148.31072
miny = 61.0491
maxy = 61.12567489536982
```
Torchgeo's GeoSampler does the bounding box query in the Rtree spatial index's CRS, which would be a UTM projection in this case.
```diff
- minx = -148.46876
- maxx = -148.31072
- miny = 61.0491
- maxy = 61.12567489536982
+ minx = 420688.14962388354
+ maxx = 429392.15007465985
+ miny = 6769145.954634559
+ maxy = 6777492.989499866
```
torchgeo/datasets/stacapidataset.py
Outdated
```python
aoi = stackstac.stack(
    signed_items,
    assets=self.bands,
    bounds_latlon=bounds,
```
Bounds are not in EPSG:4326 now but the STAC dataset's native CRS (i.e. some UTM projection).
```diff
- bounds_latlon=bounds,
+ bounds=bounds,
```
Thank you for reviewing and finding the error, I now understand what I was doing wrong. Mainly I thought to make a draft and self-contained example of how such a dataset class could work, so that is why it is so specific at the moment.
> At the moment this STACAPIDataset implementation seems to be specific to just Sentinel-2, and it also assumes the use of Planetary Computer. Maybe we should have a think about how to make this a bit more generic, the easiest way I can think of is to use some try...except or if...then statements to see if the api_endpoint contains `planetarycomputer`, but will need to think this through a bit more.

> Thank you for reviewing and finding the error, I now understand what I was doing wrong. Mainly I thought to make a draft and self-contained example of how such a dataset class could work, so that is why it is so specific at the moment.
No worries, we all need to start somewhere and Sentinel-2 is a pretty common dataset so it makes sense to use this as a test case 😄
```python
key = "image" if self.is_image else "mask"
sample = {key: image, "crs": self.crs, "bbox": query}
```
Do you think `key` could be turned into a parameter set by the user, so people can name the dataset directly? I'm thinking of cases e.g. where there's a Sentinel-2 input, a Landsat input, or a DEM input, and people might want to keep them separate when merging (using IntersectionDataset).
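A sketch of what that parameter could look like. This is an illustration only, not code from the PR; the `key` argument and `build_sample` helper are hypothetical names:

```python
from typing import Any, Optional


class STACAPIDataset:
    """Sketch: let users override the sample key used when building samples."""

    is_image = True

    def __init__(self, key: Optional[str] = None) -> None:
        # Fall back to the current "image"/"mask" behaviour when no key is given
        self.key = key or ("image" if self.is_image else "mask")

    def build_sample(self, image: Any, crs: Any, query: Any) -> dict:
        # e.g. key="sentinel2" keeps inputs distinguishable after merging
        # two datasets with an IntersectionDataset
        return {self.key: image, "crs": crs, "bbox": query}
```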
Does this need any support with development? I'd like to contribute my experience with STAC if that would help, in particular to generalize the approach for a range of different STACs. Let me know the best way to do that if so (PR to your fork, @nilsleh?). I'm aware potential changes to
If we're going for
Yes, in the long run, the plan is to move towards TorchData, although I don't know when that will happen. I'm fine with having a
@weiji14 @jamesvrt having looked through the code for this current

Reason I ask is because I'm going to create a new custom base Dataset class for working with datasets downloaded from Radiant MLHub using our
We can keep them separate at the start, but I imagine a future in which
Currently yes, this PR does use
At the very least, we should generalize this PR's implementation to work for any Planetary Computer dataset on https://planetarycomputer.microsoft.com/catalog. Ideally though, it should work for any API endpoint supported by
Torchdata (see https://pytorch.org/blog/pytorch-1.11-released/#introducing-torchdata) would decompose the current implementation into multiple parts as it follows a composition over inheritance philosophy.
Now, I'm not familiar with
@adamjstewart @weiji14 thanks for the quick responses! It seems there is possibly a need for an even more generic base class STACDataset based on STAC specifications. Yet the way the catalog would reference STAC objects might differ depending on whether it's a local directory of Catalog, Collection and Item objects in JSON files, vs. an API; the key distinction I can think of off the top of my head being that the former has access to traversing the entire catalog on disk, vs. Items that are returned by the API with pagination. I was not intending to use the PySTAC library for the

@weiji14 the
@KennSmithDS didn't realize GitHub supports LaTeX now, you've just made my day!
If I understand all this correctly, there are two main aspects to consider:
Given these, I think it makes sense to use

To get a workable first pass at this we might focus first on a

What do you think?
Yep, totally agree on going straight for
@weiji14 thanks for sharing the development on weiji14/zen3geo#46, this is great! This raises a few questions in my mind looking at the zen3geo readthedocs:
Yes for
We could have a
Yep, and there's probably a few ways to go from many STAC items to an
@weiji14 just to make sure I understand, is this something that TorchGeo users will need to do, or something that TorchGeo developers will need to do? I'm still hoping to keep a single
@KennSmithDS @weiji14 Maybe I confused the issue by calling the proposed class
Been following this discussion and very interested in the solutions being discussed! I think there might be some solutions for iterating over xarray DataArrays that TorchGeo could standardize on, rather than leaving it to a data engineer to decide whether to represent an imagery dataset as either a DataArray or Dataset.
After creating the xarray.DataArray, to then make an iterable and performant xarray object for dataloading, this sounds like a possible job for xbatcher. I haven't used xbatcher yet, but on the face of it, it seems like this might be the library to consolidate performance improvements to loading remotely sensed imagery from xarray objects, and maybe there could be a TorchDataPipe built around xbatcher. Performance of data loading for ML is an open issue (xarray-contrib/xbatcher#37).
Before we get too far down the line with xarray discussions, note that TorchGeo does not currently use xarray for anything. We currently rely on rasterio/fiona (aka GDAL) for all file I/O since it can handle reprojection/resampling. Ultimately, all arrays will end up as PyTorch Tensors, so it doesn't really matter what the intermediate step is. But we need something that can support reprojection/resampling, which numpy/torch/xarray cannot. Maybe something like rioxarray is needed.
Done at weiji14/zen3geo#22, see tutorial at https://zen3geo.readthedocs.io/en/v0.3.0/chipping.html. (Disclaimer: I'm actually an xbatcher dev).
Done at weiji14/zen3geo#6, see tutorial at https://zen3geo.readthedocs.io/en/v0.3.0/walkthrough.html#read-using-rioxarrayreader

Anyways, this DataPipe discussion really should be in #576, let's focus on the STAC discussion 🙂
@adamjstewart Xarray comes as a requirement of

More STAC related... shall I make a start on the
Apologies, I have been busy finishing my master's thesis until a couple of days ago. I was planning to pick this up again, but it seems like there is a lot of interest and expertise around this already. When I started this it was merely an exploration of trying an approach. I don't have a preference for how you proceed, so feel free to take this over @jamesvrt @weiji14 @KennSmithDS
Sorry @nilsleh, I didn't mean to sidetrack from your work here. Congratulations on finishing your master's. Happy to move this discussion elsewhere if that makes more sense to the organisers of this project.

I made a start at writing a

If I'm correct about the above, maybe it makes more sense for the user to retrieve all of their desired STAC Items up front using their own method (probably

```python
from pystac_client import Client
from torchgeo.datapipes import STACItemReader

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(
    collections=["landsat-c2-l2"],
    bbox=bbox_of_interest,
    datetime=time_of_interest,
    query={"eo:cloud_cover": {"lt": 10}},
)
items = list(search.get_items())

dp = STACItemReader(iterable=items)
```
|
You're on the right track 😄 This entire thing sounds more like a

While it's possible to create a

P.S. I've started a page at weiji14/zen3geo#48 summarizing some of the ideas in this thread, anyone feel free to continue the discussion there!

P.P.S. Sorry @nilsleh for going off track from your PR, I'll keep quiet now 🙂
This PR closes #403 by adding a STACAPI dataset. I will try to explain my thinking for this proposal:
Here are some explanations for choices I made:
When the `__getitem__` method receives a bbox query:
Questions / Left to do:
- testing with sampler and dataloader