Skip to content

Section on reading from blobs with Python #71

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
May 19, 2025
Merged

Section on reading from blobs with Python #71

merged 31 commits into from
May 19, 2025

Conversation

CPBridge
Copy link
Contributor

@CPBridge CPBridge commented Mar 4, 2025

@fedorov here is a new page on using python tools to read directly from blob objects

This is the first time I've used gitbook, so not sure if there's anything else I need to do to make sure the page is included etc?

@CPBridge CPBridge added the documentation Improvements or additions to documentation label Mar 4, 2025
@CPBridge CPBridge requested a review from fedorov March 4, 2025 21:35
@CPBridge
Copy link
Contributor Author

CPBridge commented Mar 4, 2025

Ok I figured it out, it appears to be showing correctly in the preview

@CPBridge
Copy link
Contributor Author

CPBridge commented Mar 4, 2025

I'm getting a load of errors in the CI pipeline about incorrect URLs, but these seem unrelated to my changes

@fedorov
Copy link
Member

fedorov commented Mar 6, 2025

@CPBridge thank you for the PR! I would like to first fix ImagingDataCommons/idc-index#102, since otherwise it will be difficult to explain where users can get those file paths.

Sorry about the false alarms with the failing checks.

@fedorov
Copy link
Member

fedorov commented Mar 27, 2025

@CPBridge I finally implemented support for getting URLs from GCS, and while testing it, I discovered that it seems that the current sample code at least requires the user to authenticate and provide credentials. Is this right? Do you know if it is possible to somehow access the bucket without signing in - since the content is public. idc-index can download the entire file without login.

You can see the error and the updated part getting file URLs via idc-index in this notebook: https://colab.research.google.com/drive/1_5cc_kb-FgUl9r0jePX_QbSvwBFEluMg?usp=sharing

@CPBridge
Copy link
Contributor Author

@fedorov this can be avoided by using an anonymous client. I have updated the examples in the PR to demonstrate this

@fedorov
Copy link
Member

fedorov commented Apr 7, 2025

@CPBridge FYI, the below takes ~30 sec on Colab (and about 47 sec when I ran it first). Is this expected?

image

@CPBridge
Copy link
Contributor Author

CPBridge commented Apr 18, 2025

@fedorov I have updated the recommendations for S3 in most cases to avoid the slow behavior you observed.

I tracked it down to the way smart_open buffers data. It maintains a buffer of data it pulls down but it re-populates this whenever you do a seek operation, even if the new seek position exists within the current buffer. This is problematic because pydicom does a lot of little seek operations when reading in order to jump around through tags. There is some discussion on this topic on the smart_open repo here. I just made some suggestions, we'll see if they are interested in improving this.

@CPBridge
Copy link
Contributor Author

@fedorov reminder about this

@fedorov
Copy link
Member

fedorov commented May 15, 2025

@CPBridge I updated, please review and let me know if you have any concerns!

* added section on getting file URLs using idc-index
* updated all examples with the details on getting bucket file URLs using idc-index
* replaced direct use of tag numbers with keyword lookup
* updated importance of offset tables to clarify it is about SM

This notebook contains all of the code samples for convenient testing - I plan to
update this notebook and add it to IDC-Tutorials.

https://colab.research.google.com/drive/1_5cc_kb-FgUl9r0jePX_QbSvwBFEluMg?usp=sharing
@fedorov fedorov force-pushed the reading-from-blobs branch from 59166d2 to c5bb50e Compare May 15, 2025 16:29
@CPBridge
Copy link
Contributor Author

CPBridge commented May 15, 2025

Thanks @fedorov. I found a few bugs in your examples (some lines looked like they were out of date or spliced from the wrong example). So I fixed them and made some minor cosmetic changes at the same time.

One small concern I have is while the method used to find the largest image in the pyramid nicely demonstrates the some of the capabilities listed on the page, it is also quite inefficient and not how I would recommend people do this. I would recommend doing this either with bigquery or even idc-index, which if memory serves has an instance-level table for SM images that would contain this information. What do you think? I think at the very least we should point out the availability of better options

@CPBridge
Copy link
Contributor Author

P.s. I also added a small snippet to figure out whether a given image has an EOT/BOT, which I figured out a few weeks ago for something else but it might be useful on this page

@fedorov
Copy link
Member

fedorov commented May 16, 2025

I found a few bugs in your examples

Thank you for fixing them, sorry!

I would recommend doing this either with bigquery or even idc-index, which if memory serves has an instance-level table for SM images that would contain this information. What do you think?

It took me a little bit more than I hoped, but I added a helper function (otherwise it would be ugly to build AWS/GCP-specific URLs) in idc-index 0.8.7, and update the relevant code snippet!

If this looks good, I am going to merge - let me know!

@CPBridge
Copy link
Contributor Author

@fedorov looks good to me!

@fedorov fedorov merged commit f719411 into prod May 19, 2025
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants