-
Notifications
You must be signed in to change notification settings - Fork 3
Section on reading from blobs with Python #71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Ok I figured it out, it appears to be showing correctly in the preview |
I'm getting a load of errors in the CI pipeline about incorrect URLs, but these seem unrelated to my changes |
@CPBridge thank you for the PR! I would like to first fix ImagingDataCommons/idc-index#102, since otherwise it will be difficult to explain where users can get those file paths. Sorry about the false alarms with the failing checks. |
@CPBridge I finally implemented support for getting URLs from GCS, and while testing it, I discovered that it seems that the current sample code at least requires the user to authenticate and provide credentials. Is this right? Do you know if it is possible to somehow access the bucket without signing in - since the content is public. idc-index can download the entire file without login. You can see the error and the updated part getting file URLs via idc-index in this notebook: https://colab.research.google.com/drive/1_5cc_kb-FgUl9r0jePX_QbSvwBFEluMg?usp=sharing |
@fedorov this can be avoided by using an anonymous client. I have updated the examples in the PR to demonstrate this |
@CPBridge FYI, the below takes ~30 sec on Colab (and about 47 sec when I ran it first). Is this expected? |
@fedorov I have updated the recommendations for S3 in most cases to avoid the slow behavior you observed. I tracked it down to the way smart_open buffers data. It maintains a buffer of data it pulls down but it re-populates this whenever you do a |
@fedorov reminder about this |
@CPBridge I updated, please review and let me know if you have any concerns! |
* added section on getting file URLs using idc-index * updated all examples with the details on getting bucket file URLs using idc-index * replaced direct use of tag numbers with keyword lookup * updated importance of offset tables to clarify it is about SM This notebook contains all of the code samples for convenient testing - I plan to update this notebook and add it to IDC-Tutorials. https://colab.research.google.com/drive/1_5cc_kb-FgUl9r0jePX_QbSvwBFEluMg?usp=sharing
59166d2
to
c5bb50e
Compare
Thanks @fedorov. I found a few bugs in your examples (some lines looked like they were out of date or spliced from the wrong example). So I fixed them and made some minor cosmetic changes at the same time. One small concern I have is while the method used to find the largest image in the pyramid nicely demonstrates the some of the capabilities listed on the page, it is also quite inefficient and not how I would recommend people do this. I would recommend doing this either with bigquery or even idc-index, which if memory serves has an instance-level table for SM images that would contain this information. What do you think? I think at the very least we should point out the availability of better options |
P.s. I also added a small snippet to figure out whether a given image has an EOT/BOT, which I figured out a few weeks ago for something else but it might be useful on this page |
Thank you for fixing them, sorry!
It took me a little bit more than I hoped, but I added a helper function (otherwise it would be ugly to build AWS/GCP-specific URLs) in idc-index 0.8.7, and update the relevant code snippet! If this looks good, I am going to merge - let me know! |
@fedorov looks good to me! |
@fedorov here is a new page on using python tools to read directly from blob objects
This is the first time I've used gitbook, so not sure if there's anything else I need to do to make sure the page is included etc?