feat(rust/sedona-pointcloud) Initial LAZ format support #471
Conversation
Looking forward to this! Just a note that the …

Sorry for being slow here...I spent today catching up on reviews that had accumulated over the holidays and will take a look tomorrow!
paleolimbot left a comment
This is awesome! I left some comments inline but most of them can be follow-ups.
The main thing this needs before it can be merged is test coverage. I didn't spot any tests but feel free to let me know if I missed them! I would suggest:
- Adding .laz file(s) that cover the range of input options you expose here to apache/sedona-testing, maybe with scripts to generate them if that's possible
- Reading the test files and checking the output with assert_batches_eq!() (a rough sketch of this shape follows the list). There are some tests in sedona-geoparquet that do this kind of thing, except yours can be easier because you can use your Plain geometry output to avoid having to check WKB.
sedona-db/rust/sedona-geoparquet/src/format.rs
Lines 564 to 573 in 68970b0
fn setup_context() -> SessionContext {
    let mut state = SessionStateBuilder::new().build();
    state
        .register_file_format(Arc::new(GeoParquetFormatFactory::new()), true)
        .unwrap();
    SessionContext::new_with_state(state).enable_url_table()
}

#[tokio::test]
async fn format_from_url_table() {
- For some lower-level components you could also hard-code some byte ranges (e.g., the header bytes or the bytes for a record). For the builder you could load the builders with some data and check finish().
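To make that concrete, here is a minimal sketch of what such a test could look like, assuming a hypothetical LazFormatFactory and fixture path (the expected rows below are placeholders, not real data):

```rust
use std::sync::Arc;

use datafusion::assert_batches_eq;
use datafusion::execution::session_state::SessionStateBuilder;
use datafusion::prelude::SessionContext;

#[tokio::test]
async fn reads_laz_fixture() {
    // Register the (hypothetical) LAZ format factory, mirroring the
    // sedona-geoparquet setup_context() quoted above.
    let mut state = SessionStateBuilder::new().build();
    state
        .register_file_format(Arc::new(LazFormatFactory::new()), true)
        .unwrap();
    let ctx = SessionContext::new_with_state(state).enable_url_table();

    // Hypothetical fixture path; the real file would live with the test data.
    let batches = ctx
        .sql("SELECT x, y, z FROM 'testing/simple.laz'")
        .await
        .unwrap()
        .collect()
        .await
        .unwrap();

    // Placeholder expected output; fill in whatever the fixture contains.
    assert_batches_eq!(
        [
            "+-----+-----+-----+",
            "| x   | y   | z   |",
            "+-----+-----+-----+",
            "| 1.0 | 2.0 | 3.0 |",
            "+-----+-----+-----+",
        ],
        &batches
    );
}
```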
paleolimbot left a comment
Thank you for this (and for handling the Arrow/DataFusion update!)
I left some inline comments about ways to test these files. I am mostly worried that a future contributor or an LLM acting on their behalf will roll through and break something and there won't be a failing test to detect a regression for (e.g.) Int8s with a nodata value.
I think probably a golden file that exercises the full matrix of extra attribute data types / with or without offset / with or without scale / with or without nodata value + assertions that we read the file's content correctly would be sufficient. If it's easy to build that file using the Builder on demand we can do that too.
    .await
    .unwrap();

assert_eq!(batch.num_rows(), 1);
This level of granularity is OK if all the lower-level pieces are tested, although in this case they are mostly not.
There are now tests for every point format and for single-value extra attributes: plain, scaled, and with a nodata value. Does this suffice?

@paleolimbot Thank you for the review! Sorry for being slow to address it. I was wondering where the test data/generation should live. Is it in …

No problem! Sorry for being so slow to review the edits! I am hoping to get here later today.
paleolimbot left a comment
Thank you for working on this!
Mostly nits...I think rust/sedona-pointcloud/testing would work for any example files/scripts to generate them for now, to unblock this PR!
I have one more suggestion for testing the final bit of untested code here, which is the file format...I think you can copy any relevant tests from rust/sedona-geoparquet/src, except they should be easier since you can use assert_batches_eq!() (I couldn't there because I had geometry column output, and your output should just be xyz). Among other things, this makes sure that filters, projections, and options work and are all piped through all the steps.
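A rough sketch of that kind of test, reusing the setup_context() pattern quoted above but with the (hypothetical) LazFormatFactory registered; the filter, column names, and expected rows are illustrative:

```rust
use datafusion::assert_batches_eq;

#[tokio::test]
async fn filter_and_projection_are_piped_through() {
    // setup_context() here is the hypothetical helper from the earlier sketch,
    // registering the LAZ format factory and enabling URL tables.
    let ctx = setup_context();

    // Project a subset of columns and apply a filter; both should flow
    // through the format, the source, and the options plumbing.
    let batches = ctx
        .sql("SELECT x, intensity FROM 'testing/simple.laz' WHERE z > 10.0")
        .await
        .unwrap()
        .collect()
        .await
        .unwrap();

    // Placeholder expected output for whatever the fixture contains.
    assert_batches_eq!(
        [
            "+------+-----------+",
            "| x    | intensity |",
            "+------+-----------+",
            "| 12.5 | 42        |",
            "+------+-----------+",
        ],
        &batches
    );
}
```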
Does this submodule need to be reverted or updated?
Not sure how this ended up here, so it probably needs to be reverted.
impl ConfigExtension for LazTableOptions {
    const PREFIX: &'static str = "laz";
}
Should this be pointcloud? (e.g., will these apply to las files as well as laz files?)
I refactored the config in 43803c3 to reflect the potential scope of future usage.
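For illustration only, the broadened prefix could look something like this (the struct name and scope are assumptions, not necessarily what 43803c3 contains):

```rust
use datafusion::config::ConfigExtension;

// Hypothetical: an options struct shared by the LAS and LAZ readers; the
// prefix is the point here. Options would then be addressed as pointcloud.<key>.
impl ConfigExtension for PointCloudTableOptions {
    const PREFIX: &'static str = "pointcloud";
}
```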
fn file_source(&self) -> Arc<dyn FileSource> {
    Arc::new(LazSource::default().with_options(self.options.clone()))
}
}
In sedona-geoparquet/src/format.rs you'll see a few tests for the Parquet format that verify a few basic properties of the format in the spirit of "is it plugged in". I think this file would benefit from having those (both to make sure everything here is plugged in and to ensure future contributors to this file have a place to test bug fixes).
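A sketch of that kind of smoke test, assuming a hypothetical LazFormatFactory; the extension string is the main thing being asserted:

```rust
use datafusion::common::GetExt;
use datafusion::datasource::file_format::{FileFormat, FileFormatFactory};

#[test]
fn format_factory_is_plugged_in() {
    // Hypothetical factory name; the extension is what the listing table
    // uses to match files on disk.
    let factory = LazFormatFactory::new();
    assert_eq!(factory.get_ext(), "laz");

    // The factory should also hand back a usable default FileFormat.
    let format = factory.default();
    assert_eq!(format.get_ext(), "laz");
}
```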
Ported some tests.
paleolimbot left a comment
Thank you!
I will try to follow up with Python bindings for this in the next little bit...I think that DataFusion 52 or 53 has FFI-able file formats, or we can expose the listing table wrapper (e.g., sedona-geoparquet/src/provider.rs) as an FFI-able TableProvider, although I am not sure to what degree that works with predicate pushdown + custom functions at the moment.
This provides basic listing table provider functionality for LAZ files, featuring partitioned/parallel reads with a low memory footprint.