Implements Metadata Standard & Metadata Search #154

marcbal77 · 2025-05-30T22:58:57Z

Summary

Implements sex / age standard (Issue sex coding in biolearn #122)
Adds search_metadata helper + optional CLI (Issue Design enhancement - improve metadata documentation #44)
New doc page metadata_standard
Test suite now 126 ✓ / 1 skipped

Details

Standardises sex → {"female": 0, "male": 1, "unknown": -1}
Normalises age to float years
Provides YAML-agnostic loader so search works across layouts
Docs page shows both CLI & Python usage examples

Closes

Fixes #44
Fixes #122

Screenshots

marcbal77 · 2025-05-31T17:36:51Z

Hey @sarudak, I fixed the formatting issue and everything appears to be functioning. Please check when you have time!

- Implement DNA Methylation Array Data Standard (0=female, 1=male, NaN=unknown)
- Move search_metadata to GeoData.search() static method for data scientists
- Fix test assertions for new sex encoding
- Bump cache version to v2 for data integrity
- Remove CLI dependencies, focus on Jupyter notebook usage

Fixes bio-learn#122, addresses bio-learn#44

- Fix metadata_standard.rst to show NaN instead of -1 for unknown sex
- Remove CLI example that no longer exists
- Update Python example to use GeoData.search()
- Remove -1 mapping in search method for consistency

- Remove useless CACHE_VERSION constant from cache.py
- Update DataSource.CACHE_VERSION from v1 to v2 for sex standardization
- Follow proper cache system from PR bio-learn#153 using (key, category, version)
- Ensure old cached data with mixed sex encodings gets invalidated
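For readers following the cache change, here is a minimal sketch of how a version bump can invalidate stale entries. It is not the biolearn cache from PR bio-learn#153; the `VersionedCache` class, its method names, and the in-memory storage are assumptions made only for illustration of the (key, category, version) idea.

```python
class VersionedCache:
    """Hypothetical in-memory cache; NOT the biolearn implementation."""

    def __init__(self):
        # Maps key -> (category, version, value).
        self._store = {}

    def store(self, key, value, category, version):
        self._store[key] = (category, version, value)

    def get(self, key, category, version):
        entry = self._store.get(key)
        if entry is None:
            return None
        cached_category, cached_version, value = entry
        if (cached_category, cached_version) != (category, version):
            # Entry was written under another category or version
            # (for example, the old sex encoding), so drop it.
            del self._store[key]
            return None
        return value


cache = VersionedCache()
cache.store("example_dataset", {"sex": [1, 2, 0]}, "geo", "v1")
stale = cache.get("example_dataset", "geo", "v2")  # None: v1 data is rejected
```

With this shape, bumping the version string is enough to force data cached under the old encoding to be reloaded.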

- Fix search method to not populate sex metadata from parser configs
- Keep sex standardization for actual data processing (0=female, 1=male, NaN=unknown)
- Allow library validation test to pass by avoiding parser string conflicts
- Search now includes datasets with sex parser config since actual values need data loading

- Fix search method to copy metadata dict instead of referencing original
- Prevents search from adding sex metadata to original library entries
- Resolves CI test failure in test_sex_values_are_strings
- Maintains all functionality: search, validation, and sex standardization
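The aliasing problem this commit describes can be sketched as follows. The function name `build_search_results`, the `id` field, and the library layout are hypothetical and chosen only to show the copy-before-modify pattern; the real search method may differ.

```python
import copy


def build_search_results(library):
    """Return one result dict per entry without touching the library."""
    results = []
    for name, entry in library.items():
        # Deep copy so that edits to a result never reach the library entry.
        metadata = copy.deepcopy(entry.get("metadata", {}))
        metadata["id"] = name
        results.append(metadata)
    return results


# Hypothetical library layout, for illustration only.
library = {"DatasetA": {"metadata": {"sex": "female", "tags": ["blood"]}}}
results = build_search_results(library)
results[0]["sex"] = 0              # changes the result only
results[0]["tags"].append("demo")  # nested values are copied too
```

A shallow `dict(...)` copy would protect top-level keys but still share nested lists, which is why the sketch uses `copy.deepcopy`.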

- Prevent any modification of original library data
- Handle missing metadata fields gracefully
- Properly normalize sex values for filtering (strings and numbers)
- Include datasets with parser configs when filtering by sex
- Only add non-None values to result entries
- All tests passing: validation, search, and standardization
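As a rough illustration of the normalization and non-None filtering listed above, the sketch below maps strings and numbers onto the 0=female, 1=male, NaN=unknown standard. The helper names and the accepted string labels are assumptions, not the identifiers or rules used in the PR.

```python
import math

# Assumed label sets; the PR does not list which strings it accepts.
FEMALE_LABELS = {"female", "f", "0"}
MALE_LABELS = {"male", "m", "1"}


def normalise_sex(value):
    """Map a string or number to 0 (female), 1 (male), or NaN (unknown)."""
    if isinstance(value, str):
        text = value.strip().lower()
        if text in FEMALE_LABELS:
            return 0
        if text in MALE_LABELS:
            return 1
        return math.nan
    # None, booleans, and other types count as unknown.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return math.nan
    if value == 0:
        return 0
    if value == 1:
        return 1
    return math.nan


def compact(fields):
    """Keep only the fields that actually have a value."""
    return {key: val for key, val in fields.items() if val is not None}
```

Because NaN never compares equal to anything, callers should test for unknown values with `math.isnan` rather than `==`.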

- Import _iter_library_items from metadata module instead of redefining
- Ensures consistent library parsing across the codebase
- Reduces code duplication and potential inconsistencies
- Maintains all functionality while improving maintainability

This completes the implementation of Issues bio-learn#122 (sex standardization) and bio-learn#44 (metadata preview) with minimal, focused changes.

…a handling

- Move search() from GeoData to DataLibrary as instance method
- Fix KeyError by handling both 'metadata' and 'metadata_keys_parse' parser structures
- Update docstring to NumPy format and remove ** from criteria parameter
- Update tests to use DataLibrary().search() instead of GeoData.search()

Addresses bio-learn#44 - search method now properly belongs to DataLibrary for discovering datasets
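One way to avoid the KeyError mentioned above is to probe for either key instead of indexing directly. This is only a sketch: it assumes both 'metadata' and 'metadata_keys_parse' sit at the top level of a dict-like parser config, which the thread does not confirm, and `parser_fields` is a hypothetical helper name.

```python
def parser_fields(parser_config):
    """Return the field definitions from either parser layout.

    Falls back to an empty dict instead of raising KeyError when
    neither key is present.
    """
    for key in ("metadata", "metadata_keys_parse"):
        fields = parser_config.get(key)
        if fields:
            return fields
    return {}


# Hypothetical configs showing the two layouts and the missing case.
layout_a = {"metadata": {"sex": {"row": 3}}}
layout_b = {"metadata_keys_parse": {"sex": {"key": "gender"}}}
layout_c = {"id-row": 1}
fields = [parser_fields(cfg) for cfg in (layout_a, layout_b, layout_c)]
```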

marcbal77 · 2025-07-04T04:02:33Z

@sarudak OK, I've tested this locally and in a Jupyter notebook, trying to break this PR. Please review and leave any comments or items to fix or enhance.

sarudak · 2025-07-08T15:09:59Z

biolearn/data_library.py

+        - 1 = male
+        - NaN = unknown/missing
+        """
+        import pandas as pd


Imports should live at the top of the file

marcbal77 · 2025-07-23

Yes, updated per comment & PEP 8

sarudak · 2025-07-08T15:11:39Z

biolearn/data_library.py

@@ -308,30 +304,34 @@ def convert_biolearn_to_standard_sex(s):
            s (Any): The internal sex value (1 for Female, 2 for Male, 0 for unknown).

        Returns:
-            Union[int, str]: Returns 0 if the input is 1 (Female), 1 if input is 2 (Male), or "NaN" otherwise.
+            Union[int, float]: Returns 0 if the input is 1 (Female), 1 if input is 2 (Male), or NaN otherwise.


Since Biolearn sex and standard sex are the same with this PR it seems like these two methods should be removed as no conversion is required.

marcbal77 · 2025-07-23

@sarudak after reviewing this comment, I think the naming may have created confusion.

With this PR, we've unified the internal representation to use the standard format (0=female, 1=male, NaN=unknown) throughout. However, these conversion methods are still needed for backward compatibility with CSV files that may have been saved using the old biolearn format (1=female, 2=male, 0=unknown).

The methods are used in save_csv() and load_csv() to handle existing data files that might be in the legacy format.

Would you prefer we:

- Rename them to convert_legacy_to_standard_sex() and convert_standard_to_legacy_sex() to make their purpose clearer?
- Or remove them entirely if you think the backward compatibility isn't worth maintaining?

Let me know and I can update accordingly.
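For reference, the two encodings discussed in this thread can be converted with a pair of small functions like the ones below. This is a sketch under the names proposed above, not the code in biolearn/data_library.py; it assumes plain numeric inputs and uses `math.nan` for the unknown value.

```python
import math


def convert_legacy_to_standard_sex(value):
    """Legacy (1=female, 2=male, 0=unknown) -> standard (0, 1, NaN)."""
    if value == 1:
        return 0
    if value == 2:
        return 1
    return math.nan


def convert_standard_to_legacy_sex(value):
    """Standard (0=female, 1=male, NaN=unknown) -> legacy (1, 2, 0)."""
    if value == 0:
        return 1
    if value == 1:
        return 2
    # NaN never equals 0 or 1, so unknown values fall through to here.
    return 0


legacy_column = [1, 2, 0]
standard_column = [convert_legacy_to_standard_sex(v) for v in legacy_column]
round_trip = [convert_standard_to_legacy_sex(v) for v in standard_column]
```

The round trip returns the original legacy codes, which is the property save_csv() and load_csv() would rely on if the legacy format is kept.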

sarudak · 2025-07-08T15:12:21Z

biolearn/metadata.py

+from __future__ import annotations
+
+# ---------------------------------------------------------------------
+# 1. Sex / age standard helpers (Issue #122)


Don't reference resolved issues. This will just be confusing for someone in the future.

sarudak · 2025-07-08T15:13:51Z

biolearn/test/test_data_library.py

@@ -209,7 +209,12 @@ def test_can_load_dnam():
    assert "cancer" in df.metadata.columns.to_list()
    assert np.issubdtype(df.metadata["age"], np.number)
    assert np.issubdtype(df.metadata["sex"], np.number)
-    assert (df.metadata["sex"] != 0).all()
+    # With new standardization: 0=female, 1=male, NaN=unknown


Don't use terms in comments that will be out of date once the PR is merged. This is not the "new" standard, it's the current and only one in the codebase once this merges.

sarudak

LGTM

marcbal77 changed the title ~~Metadata standard + metadata search (Fixes #44, #122)~~ Implements Metadata Standard & Metadata Search May 30, 2025

marcbal77 force-pushed the fix/44-metadata-preview branch from 8a5fa71 to bdac1fd Compare May 31, 2025 16:03

marcbal77 requested a review from sarudak May 31, 2025 17:35

marcbal77 force-pushed the fix/44-metadata-preview branch from 4d2ce31 to 9522bad Compare June 30, 2025 03:32

marcbal77 added 16 commits July 3, 2025 19:22

feat(meta): standardise sex/age and bump cache v2 (bio-learn#122)

fee5c51

fix(meta-search): accept missing/GEO numeric sex codes

28accee

docs: add sex/age standard and search examples

2d9f59d

test: ensure library sex field uses standard strings

1e763ea

docs: render Python example with code-block directive

ab13910

style: fix flake8-docstrings to pass CI

18153c7

reformat

16c6b91

chore(cache): revert unintended CACHE_VERSION bump

ca167f4

fix(docs): correct sex standardization to use NaN for unknown

3b1045b

- Fix metadata_standard.rst to show NaN instead of -1 for unknown sex
- Remove CLI example that no longer exists
- Update Python example to use GeoData.search()
- Remove -1 mapping in search method for consistency

marcbal77 force-pushed the fix/44-metadata-preview branch from 7bc1efb to 772db12 Compare July 4, 2025 03:15

marcbal77 mentioned this pull request Jul 8, 2025

Fix #143: Improve GrimAge error messages for missing metadata #160

Merged

sarudak reviewed Jul 8, 2025


Fixed issues regarding tests and incorrect methods

11cc04c

sarudak approved these changes Jul 25, 2025


marcbal77 merged commit d99029a into bio-learn:master Jul 25, 2025
1 check passed

marcbal77 deleted the fix/44-metadata-preview branch July 25, 2025 23:12
