# Downloading data with `s5cmd`

Downloading data from IDC is a two-step process covered on this page:

* **Step 1:** create a manifest - a list of the storage bucket URLs of the files to be downloaded. If you want to download the content of a cohort defined in the IDC Portal, [export the `s5cmd` manifest first](../../portal/cohort-manifests.md) and proceed to Step 2. Alternatively, you can use BigQuery SQL, as discussed below, to generate the manifest.
* **Step 2:** given the manifest, download the files to your computer or to a cloud VM using the `s5cmd` command line tool.

To learn more about using Google BigQuery SQL with IDC, check out part 3 of our ["Getting started" tutorial series](https://github.com/ImagingDataCommons/IDC-Tutorials/tree/master/notebooks/getting_started), which demonstrates how to query and download IDC data!

### Step 1: Create the manifest

{% hint style="info" %}
You will need to complete the prerequisites described in [getting-started-with-gcp.md](../../introduction/google-cloud-platform/getting-started-with-gcp.md "mention") in order to execute the manifest generation queries below!
{% endhint %}
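
Before proceeding, you can confirm that the Google Cloud SDK prerequisites are in place with a minimal sanity check (a sketch, assuming `gcloud` and `bq` are on your `PATH` and you have authenticated and selected a default project):

{% code overflow="wrap" %}
```shell
# confirm that you are authenticated and that a default project is set
gcloud config list

# run a trivial query; if this prints a result, bq is ready to run the
# manifest generation queries below
bq query --use_legacy_sql=false 'SELECT 1'
```
{% endcode %}
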
A download manifest can be created either using the IDC Portal or by executing a BigQuery query. **If you have generated a manifest using the IDC Portal, as discussed** [**here**](../../portal/cohort-manifests.md)**, proceed to Step 2!** In the remainder of this section we describe creating a manifest from a BigQuery query.

The [`dicom_all`](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=idc_current&t=dicom_all&page=table) BigQuery table discussed in [this documentation article](https://learn.canceridc.dev/data/organization-of-data/files-and-metadata#bigquery-tables) can be used to subset the files you need based on DICOM metadata attributes, utilizing the SQL query interface. The `gcs_url` and `aws_url` columns contain Google Cloud Storage and AWS S3 URLs, respectively, that can be used to retrieve the files.

Start with the query templates provided below, modify them based on your needs, and save the result in a file `query.txt`. The specific values for `PatientID`, `SeriesInstanceUID`, and `StudyInstanceUID` are chosen to serve as examples.

You can use the IDC Portal to identify items of interest, or you can use SQL queries to subset your data using any of the DICOM attributes. You are encouraged to use the [BigQuery console](https://console.cloud.google.com/bigquery) to test your queries and explore the data first!

The queries below demonstrate how to get the Google Cloud Storage URLs to download cohort files.

{% code overflow="wrap" %}
```sql
# Select all files from GCS for a given PatientID
SELECT DISTINCT(CONCAT("cp s3://", SPLIT(gcs_url,"/")[SAFE_OFFSET(2)], "/", crdc_series_uuid, "/* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE PatientID = "LUNG1-001"
```
{% endcode %}

{% code overflow="wrap" %}
```sql
# Select all files from GCS for a given collection
SELECT DISTINCT(CONCAT("cp s3://", SPLIT(gcs_url,"/")[SAFE_OFFSET(2)], "/", crdc_series_uuid, "/* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = "nsclc_radiomics"
```
{% endcode %}

{% code overflow="wrap" %}
```sql
# Select all files from GCS for a given DICOM series
SELECT DISTINCT(CONCAT("cp s3://", SPLIT(gcs_url,"/")[SAFE_OFFSET(2)], "/", crdc_series_uuid, "/* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE SeriesInstanceUID = "1.3.6.1.4.1.32722.99.99.298991776521342375010861296712563382046"
```
{% endcode %}

{% code overflow="wrap" %}
```sql
# Select all files from GCS for a given DICOM study
SELECT DISTINCT(CONCAT("cp s3://", SPLIT(gcs_url,"/")[SAFE_OFFSET(2)], "/", crdc_series_uuid, "/* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE StudyInstanceUID = "1.3.6.1.4.1.32722.99.99.239341353911714368772597187099978969331"
```
{% endcode %}
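
The templates above can be narrowed with any other DICOM attribute available in `dicom_all`. As an illustration (not one of the original templates; the `Modality = "SEG"` filter is an assumption chosen for this example), the following writes a query to `query.txt` that selects only segmentation series from the collection:

{% code overflow="wrap" %}
```shell
# hypothetical variation on the collection template: restrict the
# manifest to DICOM segmentation (SEG) series within the collection
cat > query.txt <<'EOF'
SELECT DISTINCT(CONCAT("cp s3://", SPLIT(gcs_url,"/")[SAFE_OFFSET(2)], "/", crdc_series_uuid, "/* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = "nsclc_radiomics" AND Modality = "SEG"
EOF
```
{% endcode %}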

If you want to download the files corresponding to the cohort from AWS instead of GCP, substitute `aws_url` for `gcs_url` in the `SELECT` statement of the query, as in the following `SELECT` clause:

{% code overflow="wrap" %}
```sql
SELECT DISTINCT(CONCAT("cp s3://", SPLIT(aws_url,"/")[SAFE_OFFSET(2)], "/", crdc_series_uuid, "/* ."))
```
{% endcode %}
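
For example, the complete AWS variant of the collection query, written to `query.txt`, would look like this (a sketch; only the `aws_url` column differs from the GCS template):

{% code overflow="wrap" %}
```shell
# full AWS variant of the collection query, saved for the next step
cat > query.txt <<'EOF'
SELECT DISTINCT(CONCAT("cp s3://", SPLIT(aws_url,"/")[SAFE_OFFSET(2)], "/", crdc_series_uuid, "/* ."))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = "nsclc_radiomics"
EOF
```
{% endcode %}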

Next, use the Google Cloud SDK `bq query` command to run the query from the command line and save the result into a manifest file - the list of copy commands that `s5cmd` will use to download the data:

{% code overflow="wrap" %}
```shell
bq query --use_legacy_sql=false --format=csv --max_rows=20000000 < query.txt > manifest.txt
```
{% endcode %}
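
It is worth inspecting the manifest before downloading. A minimal check (a sketch; depending on your `bq` version, the CSV output may include a column header line, which `s5cmd run` would not understand):

{% code overflow="wrap" %}
```shell
# look at the first few copy commands and count the rows
head manifest.txt
wc -l manifest.txt

# if the first line is a CSV column header rather than a "cp" command,
# strip it before passing the manifest to s5cmd
tail -n +2 manifest.txt > manifest_clean.txt
```
{% endcode %}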

{% hint style="danger" %}
Make sure you adjust the `--max_rows` parameter in the command above so that it equals or exceeds the number of rows in the query result, otherwise your list will be truncated!
{% endhint %}

For any of the queries, you can get the count of rows to confirm that the `--max_rows` parameter is sufficiently large (use the [BigQuery console](https://console.cloud.google.com/bigquery) to run these queries):

```sql
# count the number of rows
SELECT COUNT(DISTINCT(crdc_series_uuid))
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = "nsclc_radiomics"
```

You can also get the total disk space needed for the files you will be downloading:

```sql
# calculate the disk size in GB needed for the files to be downloaded
SELECT ROUND(SUM(instance_size)/POW(1024,3),2) AS size_GB
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = "nsclc_radiomics"
```
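
Both checks can also be run from the command line with `bq`, in the same way as the manifest query (a sketch using the collection example above):

{% code overflow="wrap" %}
```shell
# row count: should not exceed the --max_rows value used earlier
bq query --use_legacy_sql=false 'SELECT COUNT(DISTINCT(crdc_series_uuid)) FROM `bigquery-public-data.idc_current.dicom_all` WHERE collection_id = "nsclc_radiomics"'

# total download size in GB
bq query --use_legacy_sql=false 'SELECT ROUND(SUM(instance_size)/POW(1024,3),2) AS size_GB FROM `bigquery-public-data.idc_current.dicom_all` WHERE collection_id = "nsclc_radiomics"'
```
{% endcode %}
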
### Step 2: Download the files defined by the manifest

[`s5cmd`](https://github.com/peak/s5cmd) is a very fast S3 and local filesystem execution tool that can be used for accessing IDC buckets and downloading files from both GCS and AWS.

Install `s5cmd` following the instructions in [https://github.com/peak/s5cmd#installation](https://github.com/peak/s5cmd#installation).

You can verify that your setup was successful by running the following command, which should download a single file from IDC:

{% code overflow="wrap" %}
```shell
s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com cp s3://public-datasets-idc/cdac3f73-4fc9-4e0d-913b-b64aa3100977/902b4588-6f10-4342-9c80-f1054e67ee83.dcm .
```
{% endcode %}
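
If the command succeeded, the file should now be in your working directory:

{% code overflow="wrap" %}
```shell
# the verification command above should have produced this DICOM file
ls -lh 902b4588-6f10-4342-9c80-f1054e67ee83.dcm
```
{% endcode %}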

Once `s5cmd` is installed, you can use the `s5cmd run` command to download the files corresponding to the manifest.

If your manifest references GCS buckets, use the GCS endpoint:

{% code overflow="wrap" %}
```bash
s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run manifest_file_name
```
{% endcode %}

If your manifest references AWS buckets, use the AWS endpoint:

{% code overflow="wrap" %}
```bash
s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com run manifest_file_name
```
{% endcode %}
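
Because each copy command in the manifest ends with `.` (the current directory), the files land wherever `s5cmd` is run. One way to keep a download contained is to run it from a dedicated directory; the `--numworkers` flag can be used to tune parallelism (a sketch; the directory name and worker count are arbitrary choices):

{% code overflow="wrap" %}
```shell
# download into a dedicated directory; each manifest command copies to "."
mkdir -p idc_download
cd idc_download
s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com --numworkers 16 run ../manifest_file_name
```
{% endcode %}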

{% hint style="info" %}
If you created the manifest using the IDC Portal, its header will contain the instructions to install `s5cmd` and the exact command to download the manifest content, which will look like this:

{% code overflow="wrap" %}
```
# To download the files in this manifest, first install s5cmd (https://github.com/peak/s5cmd),
# then run the following command:
# s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com run cohorts_996_20230505_72608_aws.s5cmd
```
{% endcode %}
{% endhint %}
