Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return paths of downloaded files? #151

Open
CPBridge opened this issue Feb 27, 2025 · 4 comments
Open

Return paths of downloaded files? #151

CPBridge opened this issue Feb 27, 2025 · 4 comments

Comments

@CPBridge
Copy link

CPBridge commented Feb 27, 2025

I have been working with idc-index a bit recently, and there's one thing I find a bit frustrating.

I often have a workflow where I want to download an instance or series and then immediately read it in using pydicom in the following line. However, because the file gets downloaded into a tree-like structure (that I understand is controlled by the dirTemplate), I often have to go globbing or iterating through directories to find the files I just downloaded.

I think it would be a nice feature if the "download" methods returned a list of filepaths (as strs or as pathlibs.Paths, I prefer Paths personally) so that you could instantly load in the file(s).

That would allow me to replace code like this:

from pathlib import Path
import highdicom as hd

from idc_index import IDCClient

segmentation_series_uid = "1.2.276.0.7230010.3.1.3.313263360.31993.1706319455.429793"

# Temporary download directory for downloads and new files
download_dir = Path('downloads')
download_dir.mkdir(exist_ok=True)

# Download a segmentation and load it
client = IDCClient()
client.download_dicom_series(
    segmentation_series_uid,
    downloadDir=download_dir,
    dirTemplate='%SeriesInstanceUID'
)

# This is a bit ugly in my opinion
seg_file = list((download_dir / segmentation_series_uid).glob('*.dcm'))[0]
seg = hd.seg.segread(seg_file)

with this

from pathlib import Path
import highdicom as hd

from idc_index import IDCClient

segmentation_series_uid = "1.2.276.0.7230010.3.1.3.313263360.31993.1706319455.429793"

# Temporary download directory for downloads and new files
download_dir = Path('downloads')
download_dir.mkdir(exist_ok=True)

# Download a segmentation and load it
client = IDCClient()
seg_files = client.download_dicom_series(
    segmentation_series_uid,
    downloadDir=download_dir,
    dirTemplate='%SeriesInstanceUID'
)

seg = hd.seg.segread(seg_files[0])

Alternatively/additionally we could even add methods that download files into a temporary directory and load them into memory immediately, though this would require adding pydicom as a dependency.

What do you think @fedorov ?

@fedorov
Copy link
Member

fedorov commented Feb 27, 2025

@CPBridge this is an interesting idea.

As always, it is the question of whether improving convenience for one use case would not make things worse for other use cases.

What if you download a million of series? What would you expect if the series you downloaded is multi-instance?

@CPBridge
Copy link
Author

When I made this suggestion I was thinking of adding the return values just to IDCClient.download_dicom_studies(), IDCClient.download_dicom_series(), and IDCClient.download_dicom_instance() assuming that they only downloaded a single study/series/instance for each invocation (as I have been using them) and they could therefore return a manageable list of paths (for series or study) or a single path (for instance). But upon closer inspection I see that in fact you can pass lists of study/series/instance UIDs to these methods for "bulk" downloads, so perhaps my suggestions makes less sense.

Still, even in the case where you are downloading a long list of studies/series/instances, you could just ignore the returned file paths if you don't want them and the time/memory taken to return them list would be pretty negligible. I do not think it would make sense to add return values for the methods that are clearly intended for "bulk" download, such as download_from_selection, download_from_manifest.

Just a thought. Another option would be to add a function to get the paths for a given previously-downloaded study/series/instance given the downloadDir and the dirTemplate pattern used by the library.

@fedorov
Copy link
Member

fedorov commented Mar 3, 2025

I think the challenge is that there is no easy way to say whether a certain API is used in bulk mode or not. I guess we could have a parameter that would return that list of the number of series is below some threshold.

Convenience function would be perhaps less confusing, but then the user would need to maintain reference of the dirTemplate used, or that helper could simply traverse to the lowest-level folder and return those as a list.

Note that there is nothing to prevent the user from using dirTemplate that does not have SeriesInstanceUID present, so this is another complexity. What would one expect to be returned then?

I have to say, I feel like it is easier to leave it to the user to traverse the folder hierarchy. Or maybe handle in a new or revised API endpoint somehow.

@CPBridge
Copy link
Author

CPBridge commented Mar 3, 2025

Yes, good points. Perhaps if we do anything, it would be easier to add new methods, as you suggest. At the risk of creating a confusing/bloated API of course

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants