-
Notifications
You must be signed in to change notification settings - Fork 28
Implements Metadata Standard & Metadata Search #154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements Metadata Standard & Metadata Search #154
Conversation
8a5fa71
to
bdac1fd
Compare
Hey @sarudak , fixed formatting issue and all appears functioning, please check when you have time! |
4d2ce31
to
9522bad
Compare
- Implement DNA Methylation Array Data Standard (0=female, 1=male, NaN=unknown) - Move search_metadata to GeoData.search() static method for data scientists - Fix test assertions for new sex encoding - Bump cache version to v2 for data integrity - Remove CLI dependencies, focus on Jupyter notebook usage Fixes bio-learn#122, addresses bio-learn#44
- Fix metadata_standard.rst to show NaN instead of -1 for unknown sex - Remove CLI example that no longer exists - Update Python example to use GeoData.search() - Remove -1 mapping in search method for consistency
- Remove useless CACHE_VERSION constant from cache.py - Update DataSource.CACHE_VERSION from v1 to v2 for sex standardization - Follow proper cache system from PR bio-learn#153 using (key, category, version) - Ensure old cached data with mixed sex encodings gets invalidated
- Fix search method to not populate sex metadata from parser configs - Keep sex standardization for actual data processing (0=female, 1=male, NaN=unknown) - Allow library validation test to pass by avoiding parser string conflicts - Search now includes datasets with sex parser config since actual values need data loading
- Fix search method to copy metadata dict instead of referencing original - Prevents search from adding sex metadata to original library entries - Resolves CI test failure in test_sex_values_are_strings - Maintains all functionality: search, validation, and sex standardization
- Prevent any modification of original library data - Handle missing metadata fields gracefully - Properly normalize sex values for filtering (strings and numbers) - Include datasets with parser configs when filtering by sex - Only add non-None values to result entries - All tests passing: validation, search, and standardization
- Import _iter_library_items from metadata module instead of redefining - Ensures consistent library parsing across the codebase - Reduces code duplication and potential inconsistencies - Maintains all functionality while improving maintainability This completes the implementation of Issues bio-learn#122 (sex standardization) and bio-learn#44 (metadata preview) with minimal, focused changes.
…a handling - Move search() from GeoData to DataLibrary as instance method - Fix KeyError by handling both 'metadata' and 'metadata_keys_parse' parser structures - Update docstring to NumPy format and remove ** from criteria parameter - Update tests to use DataLibrary().search() instead of GeoData.search() Addresses bio-learn#44 - search method now properly belongs to DataLibrary for discovering datasets
7bc1efb
to
772db12
Compare
@sarudak - ok, I've tested locally and in jupyter notebook trying to break this PR - please review and leave any comments and or items to fix or enhance. |
biolearn/data_library.py
Outdated
- 1 = male | ||
- NaN = unknown/missing | ||
""" | ||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imports should live at the top of the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, updated per comment & PEP 8
biolearn/data_library.py
Outdated
@@ -308,30 +304,34 @@ def convert_biolearn_to_standard_sex(s): | |||
s (Any): The internal sex value (1 for Female, 2 for Male, 0 for unknown). | |||
|
|||
Returns: | |||
Union[int, str]: Returns 0 if the input is 1 (Female), 1 if input is 2 (Male), or "NaN" otherwise. | |||
Union[int, float]: Returns 0 if the input is 1 (Female), 1 if input is 2 (Male), or NaN otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Biolearn sex and standard sex are the same with this PR it seems like these two methods should be removed as no conversion is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sarudak after reviewing this comment, I think the naming maybe created confusion
With this PR, we've unified the internal representation to use the standard format (0=female, 1=male, NaN=unknown) throughout. However, these conversion methods are still needed for backward compatibility with CSV files that may have been saved using the old biolearn format (1=female, 2=male, 0=unknown).
The methods are used in save_csv()
and load_csv()
to handle existing data files that might be in the legacy format.
Would you prefer we:
- Rename them to
convert_legacy_to_standard_sex()
andconvert_standard_to_legacy_sex()
to make their purpose clearer? - Or remove them entirely if you think the backward compatibility isn't worth maintaining?
Let me know and I can update accordingly.
biolearn/metadata.py
Outdated
from __future__ import annotations | ||
|
||
# --------------------------------------------------------------------- | ||
# 1. Sex / age standard helpers (Issue #122) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't reference resolved issues. This will just be confusing for someone in the future.
biolearn/test/test_data_library.py
Outdated
@@ -209,7 +209,12 @@ def test_can_load_dnam(): | |||
assert "cancer" in df.metadata.columns.to_list() | |||
assert np.issubdtype(df.metadata["age"], np.number) | |||
assert np.issubdtype(df.metadata["sex"], np.number) | |||
assert (df.metadata["sex"] != 0).all() | |||
# With new standardization: 0=female, 1=male, NaN=unknown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use terms in comments that will be out of date once the PR is merged. This is not the "new" standard, it's the current and only one in the codebase once this merges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Summary
search_metadata
helper + optional CLI (Issue Design enhancement - improve metadata documentation #44)metadata_standard
Details
sex
→{"female": 0, "male": 1, "unknown": -1}
age
to float yearsCloses
Fixes #44
Fixes #122
Screenshots