Return paths of downloaded files? #151

CPBridge · 2025-02-27T13:26:53Z

I have been working with idc-index a bit recently, and there's one thing I find a bit frustrating.

I often have a workflow where I want to download an instance or series and then immediately read it in using pydicom in the following line. However, because the file gets downloaded into a tree-like structure (that I understand is controlled by the dirTemplate), I often have to go globbing or iterating through directories to find the files I just downloaded.

I think it would be a nice feature if the "download" methods returned a list of filepaths (as strs or as pathlibs.Paths, I prefer Paths personally) so that you could instantly load in the file(s).

That would allow me to replace code like this:

from pathlib import Path
import highdicom as hd

from idc_index import IDCClient

segmentation_series_uid = "1.2.276.0.7230010.3.1.3.313263360.31993.1706319455.429793"

# Temporary download directory for downloads and new files
download_dir = Path('downloads')
download_dir.mkdir(exist_ok=True)

# Download a segmentation and load it
client = IDCClient()
client.download_dicom_series(
    segmentation_series_uid,
    downloadDir=download_dir,
    dirTemplate='%SeriesInstanceUID'
)

# This is a bit ugly in my opinion
seg_file = list((download_dir / segmentation_series_uid).glob('*.dcm'))[0]
seg = hd.seg.segread(seg_file)

with this

from pathlib import Path
import highdicom as hd

from idc_index import IDCClient

segmentation_series_uid = "1.2.276.0.7230010.3.1.3.313263360.31993.1706319455.429793"

# Temporary download directory for downloads and new files
download_dir = Path('downloads')
download_dir.mkdir(exist_ok=True)

# Download a segmentation and load it
client = IDCClient()
seg_files = client.download_dicom_series(
    segmentation_series_uid,
    downloadDir=download_dir,
    dirTemplate='%SeriesInstanceUID'
)

seg = hd.seg.segread(seg_files[0])

Alternatively/additionally we could even add methods that download files into a temporary directory and load them into memory immediately, though this would require adding pydicom as a dependency.

What do you think @fedorov ?

The text was updated successfully, but these errors were encountered:

fedorov · 2025-02-27T17:43:21Z

@CPBridge this is an interesting idea.

As always, it is the question of whether improving convenience for one use case would not make things worse for other use cases.

What if you download a million of series? What would you expect if the series you downloaded is multi-instance?

CPBridge · 2025-02-28T14:02:55Z

When I made this suggestion I was thinking of adding the return values just to IDCClient.download_dicom_studies(), IDCClient.download_dicom_series(), and IDCClient.download_dicom_instance() assuming that they only downloaded a single study/series/instance for each invocation (as I have been using them) and they could therefore return a manageable list of paths (for series or study) or a single path (for instance). But upon closer inspection I see that in fact you can pass lists of study/series/instance UIDs to these methods for "bulk" downloads, so perhaps my suggestions makes less sense.

Still, even in the case where you are downloading a long list of studies/series/instances, you could just ignore the returned file paths if you don't want them and the time/memory taken to return them list would be pretty negligible. I do not think it would make sense to add return values for the methods that are clearly intended for "bulk" download, such as download_from_selection, download_from_manifest.

Just a thought. Another option would be to add a function to get the paths for a given previously-downloaded study/series/instance given the downloadDir and the dirTemplate pattern used by the library.

fedorov · 2025-03-03T20:19:43Z

I think the challenge is that there is no easy way to say whether a certain API is used in bulk mode or not. I guess we could have a parameter that would return that list of the number of series is below some threshold.

Convenience function would be perhaps less confusing, but then the user would need to maintain reference of the dirTemplate used, or that helper could simply traverse to the lowest-level folder and return those as a list.

Note that there is nothing to prevent the user from using dirTemplate that does not have SeriesInstanceUID present, so this is another complexity. What would one expect to be returned then?

I have to say, I feel like it is easier to leave it to the user to traverse the folder hierarchy. Or maybe handle in a new or revised API endpoint somehow.

CPBridge · 2025-03-03T20:52:14Z

Yes, good points. Perhaps if we do anything, it would be easier to add new methods, as you suggest. At the risk of creating a confusing/bloated API of course

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return paths of downloaded files? #151

Return paths of downloaded files? #151

CPBridge commented Feb 27, 2025 •

edited

Loading

fedorov commented Feb 27, 2025

CPBridge commented Feb 28, 2025

fedorov commented Mar 3, 2025

CPBridge commented Mar 3, 2025

Return paths of downloaded files? #151

Return paths of downloaded files? #151

Comments

CPBridge commented Feb 27, 2025 • edited Loading

fedorov commented Feb 27, 2025

CPBridge commented Feb 28, 2025

fedorov commented Mar 3, 2025

CPBridge commented Mar 3, 2025

CPBridge commented Feb 27, 2025 •

edited

Loading