Skip to content

Commit fb10b15

Browse files
fedorovDanielaSchacherer
authored andcommitted
ENH: Replaced dynamic Github release asset access with manually maintained table.
enh: switch to dict for index overview and fix tests bug fixed test to fix bug on macOS test next test enh: add contributing guidelines Based on https://github.com/Slicer/Slicer/blob/main/CONTRIBUTING.md ENH: improve error reporting for unrecognized items from the manifest Re ImagingDataCommons#100 Whenever a crdc_series_uuid from the manifest is not matched to those known to the index, provide error message informing the user of what could be the reasons. Fixed error checking for unrecognized items in the validation function. Report unrecognized items independently of whether validation is requested or not. ENH: add test of the manifest that has mismatches replaced github release access with hardcoded indices overview wip wip
1 parent c52a2a1 commit fb10b15

File tree

5 files changed

+169
-556
lines changed

5 files changed

+169
-556
lines changed

CONTRIBUTING.md

+122
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,122 @@
1+
# Contributing to idc-index
2+
3+
There are many ways to contribute to idc-index, with varying levels of effort.
4+
Do try to look through the [documentation](idc-index-docs) first if something is
5+
unclear, and let us know how we can do better.
6+
7+
- Ask a question on the [IDC forum][idc-forum]
8+
- Use [idc-index issues][idc-index-issues] to submit a feature request or bug,
9+
or add to the discussion on an existing issue
10+
- Submit a [Pull Request](https://github.com/ImagingDataCommons/idc-index/pulls)
11+
to improve idc-index or its documentation
12+
13+
We encourage a range of Pull Requests, from patches that include passing tests
14+
and documentation, all the way down to half-baked ideas that launch discussions.
15+
16+
## The PR Process, Circle CI, and Related Gotchas
17+
18+
### How to submit a PR ?
19+
20+
If you are new to idc-index development and you don't have push access to the
21+
repository, here are the steps:
22+
23+
1. [Fork and clone](https://docs.github.com/get-started/quickstart/fork-a-repo)
24+
the repository.
25+
2. Create a branch dedicated to the feature/bugfix you plan to implement (do not
26+
use `main` branch - this will complicate further development and
27+
collaboration)
28+
3. [Push](https://docs.github.com/get-started/using-git/pushing-commits-to-a-remote-repository)
29+
the branch to your GitHub fork.
30+
4. Create a
31+
[Pull Request](https://github.com/ImagingDataCommons/idc-index/pulls).
32+
33+
This corresponds to the `Fork & Pull Model` described in the
34+
[GitHub collaborative development](https://docs.github.com/pull-requests/collaborating-with-pull-requests/getting-started/about-collaborative-development-models)
35+
documentation.
36+
37+
When submitting a PR, the developers following the project will be notified.
38+
That said, to engage specific developers, you can add `Cc: @<username>` comment
39+
to notify them of your awesome contributions. Based on the comments posted by
40+
the reviewers, you may have to revisit your patches.
41+
42+
### How to efficiently contribute ?
43+
44+
We encourage all developers to:
45+
46+
- set up pre-commit hooks so that you can remedy various formatting and other
47+
issues early, without waiting for the continuous integration (CI) checks to
48+
complete: `pre-commit install`
49+
50+
- add or update tests. You can see current tests
51+
[here](https://github.com/ImagingDataCommons/idc-index/tree/main/tests). If
52+
you contribute new functionality, adding test(s) covering it is mandatory!
53+
54+
- you can run individual tests from the root repository using the following
55+
command: `python -m unittest -vv tests.idcindex.TestIDCClient.<test_name>`
56+
57+
### How to write commit messages ?
58+
59+
Write your commit messages using the standard prefixes for commit messages:
60+
61+
- `BUG:` Fix for runtime crash or incorrect result
62+
- `COMP:` Compiler error or warning fix
63+
- `DOC:` Documentation change
64+
- `ENH:` New functionality
65+
- `PERF:` Performance improvement
66+
- `STYLE:` No logic impact (indentation, comments)
67+
- `WIP:` Work In Progress not ready for merge
68+
69+
The body of the message should clearly describe the motivation of the commit
70+
(**what**, **why**, and **how**). In order to ease the task of reviewing
71+
commits, the message body should follow the following guidelines:
72+
73+
1. Leave a blank line between the subject and the body. This helps `git log` and
74+
`git rebase` work nicely, and allows to smooth generation of release notes.
75+
2. Try to keep the subject line below 72 characters, ideally 50.
76+
3. Capitalize the subject line.
77+
4. Do not end the subject line with a period.
78+
5. Use the imperative mood in the subject line (e.g.
79+
`BUG: Fix spacing not being considered.`).
80+
6. Wrap the body at 80 characters.
81+
7. Use semantic line feeds to separate different ideas, which improves the
82+
readability.
83+
8. Be concise, but honor the change: if significant alternative solutions were
84+
available, explain why they were discarded.
85+
9. If the commit refers to a topic discussed on the [IDC forum][idc-forum], or
86+
fixes a regression test, provide the link. If it fixes a compiler error,
87+
provide a minimal verbatim message of the compiler error. If the commit
88+
closes an issue, use the
89+
[GitHub issue closing keywords](https://docs.github.com/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue).
90+
91+
Keep in mind that the significant time is invested in reviewing commits and
92+
_pull requests_, so following these guidelines will greatly help the people
93+
doing reviews.
94+
95+
These guidelines are largely inspired by Chris Beam's
96+
[How to Write a Commit Message](https://chris.beams.io/posts/git-commit/) post.
97+
98+
### How to integrate a PR ?
99+
100+
Getting your contributions integrated is relatively straightforward, here is the
101+
checklist:
102+
103+
- All tests pass
104+
- Consensus is reached. This usually means that at least two reviewers approved
105+
the changes (or added a `LGTM` comment) and at least one business day passed
106+
without anyone objecting. `LGTM` is an acronym for _Looks Good to Me_.
107+
- To accommodate developers explicitly asking for more time to test the proposed
108+
changes, integration time can be delayed by few more days.
109+
- If you do NOT have push access, a core developer will integrate your PR. If
110+
you would like to speed up the integration, do not hesitate to add a reminder
111+
comment to the PR
112+
113+
### Automatic testing of pull requests
114+
115+
Every pull request is tested automatically using GitHub Actions each time you
116+
push a commit to it. The GitHub UI will restrict users from merging pull
117+
requests until the CI build has returned with a successful result indicating
118+
that all tests have passed.
119+
120+
[idc-forum]: https://discourse.canceridc.dev
121+
[idc-index-issues]: https://github.com/ImagingDataCommons/idc-index/issues
122+
[idc-index-docs]: https://idc-index.readthedocs.io/

idc_index/index.py

+41-80
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@
2121

2222
aws_endpoint_url = "https://s3.amazonaws.com"
2323
gcp_endpoint_url = "https://storage.googleapis.com"
24+
asset_endpoint_url = f"https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/{idc_index_data.__version__}"
2425

2526
logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
2627
logger = logging.getLogger(__name__)
@@ -67,7 +68,24 @@ def __init__(self):
6768
self.collection_summary = self.index.groupby("collection_id").agg(
6869
{"Modality": pd.Series.unique, "series_size_MB": "sum"}
6970
)
70-
self.indices_overview = self.list_indices()
71+
72+
self.indices_overview = pd.DataFrame(
73+
{
74+
"index": {"description": None, "installed": True, "url": None},
75+
"sm_index": {
76+
"description": None,
77+
"installed": True,
78+
"url": os.path.join(asset_endpoint_url, "sm_index.parquet"),
79+
},
80+
"sm_instance_index": {
81+
"description": None,
82+
"installed": True,
83+
"url": os.path.join(
84+
asset_endpoint_url, "sm_instance_index.parquet"
85+
),
86+
},
87+
}
88+
)
7189

7290
# Lookup s5cmd
7391
self.s5cmdPath = shutil.which("s5cmd")
@@ -172,33 +190,6 @@ def get_idc_version():
172190
idc_version = Version(idc_index_data.__version__).major
173191
return f"v{idc_version}"
174192

175-
@staticmethod
176-
def _get_latest_idc_index_data_release_assets():
177-
"""
178-
Retrieves a list of the latest idc-index-data release assets.
179-
180-
Returns:
181-
release_assets (list): List of tuples (asset_name, asset_url).
182-
"""
183-
release_assets = []
184-
url = f"https://api.github.com/repos/ImagingDataCommons/idc-index-data/releases/tags/{idc_index_data.__version__}"
185-
try:
186-
response = requests.get(url, timeout=30)
187-
if response.status_code == 200:
188-
release_data = response.json()
189-
assets = release_data.get("assets", [])
190-
for asset in assets:
191-
release_assets.append(
192-
(asset["name"], asset["browser_download_url"])
193-
)
194-
else:
195-
logger.error(f"Failed to fetch releases: {response.status_code}")
196-
197-
except FileNotFoundError:
198-
logger.error(f"Failed to fetch releases: {response.status_code}")
199-
200-
return release_assets
201-
202193
def list_indices(self):
203194
"""
204195
Lists all available indices including their installation status.
@@ -207,40 +198,6 @@ def list_indices(self):
207198
indices_overview (pd.DataFrame): DataFrame containing information per index.
208199
"""
209200

210-
if "indices_overview" not in locals():
211-
indices_overview = {}
212-
# Find installed indices
213-
for file in distribution("idc-index-data").files:
214-
if str(file).endswith("index.parquet"):
215-
index_name = os.path.splitext(
216-
str(file).rsplit("/", maxsplit=1)[-1]
217-
)[0]
218-
219-
indices_overview[index_name] = {
220-
"description": None,
221-
"installed": True,
222-
"local_path": os.path.join(
223-
idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
224-
f"{index_name}.parquet",
225-
),
226-
}
227-
228-
# Find available indices from idc-index-data
229-
release_assets = self._get_latest_idc_index_data_release_assets()
230-
for asset_name, asset_url in release_assets:
231-
if asset_name.endswith(".parquet"):
232-
asset_name = os.path.splitext(asset_name)[0]
233-
if asset_name not in indices_overview:
234-
indices_overview[asset_name] = {
235-
"description": None,
236-
"installed": False,
237-
"url": asset_url,
238-
}
239-
240-
self.indices_overview = pd.DataFrame.from_dict(
241-
indices_overview, orient="index"
242-
)
243-
244201
return self.indices_overview
245202

246203
def fetch_index(self, index) -> None:
@@ -251,23 +208,22 @@ def fetch_index(self, index) -> None:
251208
index (str): Name of the index to be downloaded.
252209
"""
253210

254-
if index not in self.indices_overview.index.tolist():
211+
if index not in self.indices_overview.keys():
255212
logger.error(f"Index {index} is not available and can not be fetched.")
256-
elif self.indices_overview.loc[index, "installed"]:
213+
elif self.indices_overview[index]["installed"]:
257214
logger.warning(
258215
f"Index {index} already installed and will not be fetched again."
259216
)
260217
else:
261-
response = requests.get(self.indices_overview.loc[index, "url"], timeout=30)
218+
response = requests.get(self.indices_overview[index]["url"], timeout=30)
262219
if response.status_code == 200:
263220
filepath = os.path.join(
264221
idc_index_data.IDC_INDEX_PARQUET_FILEPATH.parents[0],
265222
f"{index}.parquet",
266223
)
267224
with open(filepath, mode="wb") as file:
268225
file.write(response.content)
269-
self.indices_overview.loc[index, "installed"] = True
270-
self.indices_overview.loc[index, "local_path"] = filepath
226+
self.indices_overview[index]["installed"] = True
271227
else:
272228
logger.error(f"Failed to fetch index: {response.status_code}")
273229

@@ -668,8 +624,8 @@ def _validate_update_manifest_and_get_download_size(
668624
# create a copy of the index
669625
index_df_copy = self.index
670626

671-
# Extract s3 url and crdc_instance_uuid from the manifest copy commands
672-
# Next, extract crdc_instance_uuid from aws_series_url in the index and
627+
# Extract s3 url and crdc_series_uuid from the manifest copy commands
628+
# Next, extract crdc_series_uuid from aws_series_url in the index and
673629
# try to verify if every series in the manifest is present in the index
674630

675631
# TODO: need to remove the assumption that manifest commands will have 'cp'
@@ -697,8 +653,9 @@ def _validate_update_manifest_and_get_download_size(
697653
seriesInstanceuid,
698654
s3_url,
699655
series_size_MB,
700-
index_crdc_series_uuid==manifest_crdc_series_uuid AS crdc_series_uuid_match,
656+
index_crdc_series_uuid is not NULL as crdc_series_uuid_match,
701657
s3_url==series_aws_url AS s3_url_match,
658+
manifest_temp.manifest_cp_cmd,
702659
CASE
703660
WHEN s3_url==series_aws_url THEN 'aws'
704661
ELSE
@@ -717,19 +674,23 @@ def _validate_update_manifest_and_get_download_size(
717674

718675
endpoint_to_use = None
719676

720-
if validate_manifest:
721-
# Check if crdc_instance_uuid is found in the index
722-
if not all(merged_df["crdc_series_uuid_match"]):
723-
missing_manifest_cp_cmds = merged_df.loc[
724-
~merged_df["crdc_series_uuid_match"], "manifest_cp_cmd"
725-
]
726-
missing_manifest_cp_cmds_str = f"The following manifest copy commands do not have any associated series in the index: {missing_manifest_cp_cmds.tolist()}"
727-
raise ValueError(missing_manifest_cp_cmds_str)
677+
# Check if any crdc_series_uuid are not found in the index
678+
if not all(merged_df["crdc_series_uuid_match"]):
679+
missing_manifest_cp_cmds = merged_df.loc[
680+
~merged_df["crdc_series_uuid_match"], "manifest_cp_cmd"
681+
]
682+
logger.error(
683+
"The following manifest copy commands are not recognized as referencing any associated series in the index.\n"
684+
"This means either these commands are invalid, or they may correspond to files available in a release of IDC\n"
685+
f"different from {self.get_idc_version()} used in this version of idc-index. The corresponding files will not be downloaded.\n"
686+
)
687+
logger.error("\n" + "\n".join(missing_manifest_cp_cmds.tolist()))
728688

729-
# Check if there are more than one endpoints
689+
if validate_manifest:
690+
# Check if there is more than one endpoint
730691
if len(merged_df["endpoint"].unique()) > 1:
731692
raise ValueError(
732-
"Either GCS bucket path is invalid or manifest has a mix of GCS and AWS urls. If so, please use urls from one provider only"
693+
"Either GCS bucket path is invalid or manifest has a mix of GCS and AWS urls. "
733694
)
734695

735696
if (

pyproject.toml

+1
Original file line numberDiff line numberDiff line change
@@ -125,6 +125,7 @@ disallow_incomplete_defs = true
125125

126126
[tool.ruff]
127127
src = ["idc_index"]
128+
extend-exclude = ["./CONTRIBUTING.md"]
128129

129130
[tool.ruff.lint]
130131
extend-select = [

0 commit comments

Comments
 (0)