marcbal77
Collaborator

Summary

Details

  • Standardises sex to {"female": 0, "male": 1, "unknown": -1}
  • Normalises age to float years
  • Provides YAML-agnostic loader so search works across layouts
  • Docs page shows both CLI & Python usage examples
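
A minimal sketch of the two normalisations listed above, assuming plain pandas. The helper names `standardise_sex` and `normalise_age`, the `unit` parameter, and the sample values are hypothetical and not taken from this PR. The mapping is the one in this summary; later commits in this thread use NaN rather than -1 for unknown.

```python
import pandas as pd

# Mapping as given in the summary above; -1 marks unknown here.
SEX_CODES = {"female": 0, "male": 1, "unknown": -1}


def _clean_label(value):
    """Lower-case and trim string labels; anything else counts as missing."""
    return value.strip().lower() if isinstance(value, str) else None


def standardise_sex(values: pd.Series) -> pd.Series:
    """Map free-text sex labels to integer codes (hypothetical helper)."""
    codes = values.map(_clean_label).map(SEX_CODES)
    # Labels outside the mapping, and missing values, fall back to "unknown".
    return codes.fillna(SEX_CODES["unknown"]).astype(int)


def normalise_age(values: pd.Series, unit: str = "years") -> pd.Series:
    """Convert ages to float years; `unit` may be "years" or "months"."""
    ages = pd.to_numeric(values, errors="coerce").astype(float)
    return ages / 12.0 if unit == "months" else ages


raw = pd.DataFrame({"sex": ["Female", "male ", None], "age": ["30", 45, "n/a"]})
sex = standardise_sex(raw["sex"])
age = normalise_age(raw["age"])
```

Unparseable ages become NaN instead of raising, so one malformed row does not block loading a whole dataset.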

Closes

Fixes #44
Fixes #122

Screenshots

[Two screenshots: Success1, Success2]

@marcbal77 marcbal77 changed the title from "Metadata standard + metadata search (Fixes #44, #122)" to "Implements Metadata Standard & Metadata Search" May 30, 2025
@marcbal77 marcbal77 force-pushed the fix/44-metadata-preview branch from 8a5fa71 to bdac1fd May 31, 2025 16:03
@marcbal77 marcbal77 requested a review from sarudak May 31, 2025 17:35
@marcbal77
Collaborator Author

Hey @sarudak, I fixed the formatting issue and everything appears to be working. Please check when you have time!

@marcbal77 marcbal77 force-pushed the fix/44-metadata-preview branch from 4d2ce31 to 9522bad June 30, 2025 03:32
marcbal77 added 16 commits July 3, 2025 19:22
- Implement DNA Methylation Array Data Standard (0=female, 1=male, NaN=unknown)
- Move search_metadata to GeoData.search() static method for data scientists
- Fix test assertions for new sex encoding
- Bump cache version to v2 for data integrity
- Remove CLI dependencies, focus on Jupyter notebook usage

Fixes bio-learn#122, addresses bio-learn#44
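
Bumping the cache version only invalidates old entries if the version is part of the lookup. A minimal sketch of that idea, assuming an in-memory store: `VersionedCache`, its method names, and the dataset id `GSE0000` are hypothetical and are not biolearn's cache implementation.

```python
from typing import Any, Dict, Optional, Tuple


class VersionedCache:
    """Toy in-memory cache keyed on (key, category, version)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str, str], Any] = {}

    def store(self, key: str, value: Any, category: str, version: str) -> None:
        self._store[(key, category, version)] = value

    def get(self, key: str, category: str, version: str) -> Optional[Any]:
        # An entry written under another version is simply never found.
        return self._store.get((key, category, version))


cache = VersionedCache()
# Entry written before the encoding change, under version "v1".
cache.store("GSE0000", {"sex": [1, 2, 0]}, category="geo", version="v1")

stale = cache.get("GSE0000", category="geo", version="v2")  # miss: forces a reload
old = cache.get("GSE0000", category="geo", version="v1")    # still reachable under v1
```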
- Fix metadata_standard.rst to show NaN instead of -1 for unknown sex
- Remove CLI example that no longer exists
- Update Python example to use GeoData.search()
- Remove -1 mapping in search method for consistency
- Remove useless CACHE_VERSION constant from cache.py
- Update DataSource.CACHE_VERSION from v1 to v2 for sex standardization
- Follow proper cache system from PR bio-learn#153 using (key, category, version)
- Ensure old cached data with mixed sex encodings gets invalidated
- Fix search method to not populate sex metadata from parser configs
- Keep sex standardization for actual data processing (0=female, 1=male, NaN=unknown)
- Allow library validation test to pass by avoiding parser string conflicts
- Search now includes datasets with sex parser config since actual values need data loading
- Fix search method to copy metadata dict instead of referencing original
- Prevents search from adding sex metadata to original library entries
- Resolves CI test failure in test_sex_values_are_strings
- Maintains all functionality: search, validation, and sex standardization
- Prevent any modification of original library data
- Handle missing metadata fields gracefully
- Properly normalize sex values for filtering (strings and numbers)
- Include datasets with parser configs when filtering by sex
- Only add non-None values to result entries
- All tests passing: validation, search, and standardization
- Import _iter_library_items from metadata module instead of redefining
- Ensures consistent library parsing across the codebase
- Reduces code duplication and potential inconsistencies
- Maintains all functionality while improving maintainability

This completes the implementation of Issues bio-learn#122 (sex standardization)
and bio-learn#44 (metadata preview) with minimal, focused changes.
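
The copy-before-modify behaviour described in these commits can be sketched as follows. This is an illustration under stated assumptions, not the merged code: the `LIBRARY` layout, the dataset names, and this free-standing `search` function are hypothetical.

```python
import copy
from typing import Any, Dict, List

# Hypothetical stand-in for library entries; the real layout may differ.
LIBRARY: Dict[str, Dict[str, Any]] = {
    "dataset_a": {"metadata": {"sex": "female", "tissue": "blood", "notes": None}},
    "dataset_b": {"metadata": {"tissue": "saliva"}},
    "dataset_c": {},  # no metadata at all
}


def search(library: Dict[str, Dict[str, Any]], **criteria: Any) -> List[Dict[str, Any]]:
    """Return one row per dataset whose metadata matches every criterion."""
    results = []
    for name, entry in library.items():
        # Work on a copy so the original library entry is never modified.
        metadata = copy.deepcopy(entry.get("metadata") or {})
        # A missing field never matches, but it does not raise either.
        if any(metadata.get(field) != wanted for field, wanted in criteria.items()):
            continue
        row = {"id": name}
        # Only non-None values are added to the result row.
        row.update({k: v for k, v in metadata.items() if v is not None})
        results.append(row)
    return results


hits = search(LIBRARY, tissue="blood")
hits[0]["sex"] = "edited by caller"  # does not leak back into LIBRARY
```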
…a handling

- Move search() from GeoData to DataLibrary as instance method
- Fix KeyError by handling both 'metadata' and 'metadata_keys_parse' parser structures
- Update docstring to NumPy format and remove ** from criteria parameter
- Update tests to use DataLibrary().search() instead of GeoData.search()

Addresses bio-learn#44 - search method now properly belongs to DataLibrary for discovering datasets
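
One way to avoid the KeyError mentioned above is to look up both parser keys defensively. A minimal sketch, assuming each parser config is a plain dict that holds its field definitions under either 'metadata' or 'metadata_keys_parse'; the helper name `parser_fields` and the example configs are hypothetical.

```python
from typing import Any, Dict


def parser_fields(parser: Dict[str, Any]) -> Dict[str, Any]:
    """Return the field definitions from either parser layout, or {} if absent."""
    for key in ("metadata", "metadata_keys_parse"):
        section = parser.get(key)  # .get() avoids a KeyError on the other layout
        if isinstance(section, dict):
            return dict(section)  # shallow copy; the caller cannot alter the config
    return {}


layout_one = {"metadata": {"sex": {"row": 3}}}
layout_two = {"metadata_keys_parse": {"age": {"key": "age"}}}

fields_one = parser_fields(layout_one)
fields_two = parser_fields(layout_two)
fields_none = parser_fields({"type": "other"})
```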
@marcbal77 marcbal77 force-pushed the fix/44-metadata-preview branch from 7bc1efb to 772db12 July 4, 2025 03:15
@marcbal77
Collaborator Author

marcbal77 commented Jul 4, 2025

@sarudak - OK, I've tested this locally and in a Jupyter notebook, trying to break this PR. Please review and leave any comments and/or items to fix or enhance.

- 1 = male
- NaN = unknown/missing
"""
import pandas as pd
Member

Imports should live at the top of the file

Collaborator Author

Yes, updated per comment & PEP 8

@@ -308,30 +304,34 @@ def convert_biolearn_to_standard_sex(s):
s (Any): The internal sex value (1 for Female, 2 for Male, 0 for unknown).

Returns:
- Union[int, str]: Returns 0 if the input is 1 (Female), 1 if input is 2 (Male), or "NaN" otherwise.
+ Union[int, float]: Returns 0 if the input is 1 (Female), 1 if input is 2 (Male), or NaN otherwise.
Member

Since Biolearn sex and standard sex are the same with this PR it seems like these two methods should be removed as no conversion is required.

Collaborator Author

@sarudak after reviewing this comment, I think the naming may have created confusion.

With this PR, we've unified the internal representation to use the standard format (0=female, 1=male, NaN=unknown) throughout. However, these conversion methods are still needed for backward compatibility with CSV files that may have been saved using the old biolearn format (1=female, 2=male, 0=unknown).

The methods are used in save_csv() and load_csv() to handle existing data files that might be in the legacy format.

Would you prefer we:

  1. Rename them to convert_legacy_to_standard_sex() and convert_standard_to_legacy_sex() to make their purpose clearer?
  2. Or remove them entirely if you think the backward compatibility isn't worth maintaining?

Let me know and I can update accordingly.
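
For reference, the two conversions under discussion can be sketched as below, using the renamed functions proposed in option 1. This is a sketch based only on the encodings stated in this comment (legacy: 1=female, 2=male, 0=unknown; standard: 0=female, 1=male, NaN=unknown), not the code in the repository.

```python
import math


def convert_legacy_to_standard_sex(s):
    """Legacy 1=female, 2=male, anything else -> standard 0, 1, NaN."""
    if s == 1:
        return 0
    if s == 2:
        return 1
    return math.nan


def convert_standard_to_legacy_sex(s):
    """Standard 0=female, 1=male, anything else (including NaN) -> legacy 1, 2, 0."""
    if s == 0:
        return 1
    if s == 1:
        return 2
    return 0  # NaN compares unequal to everything, so it falls through to unknown


legacy_values = [1, 2, 0]
standard_values = [convert_legacy_to_standard_sex(v) for v in legacy_values]
round_trip = [convert_standard_to_legacy_sex(v) for v in standard_values]
```

The round trip preserves legacy values, which is the property save_csv() and load_csv() would rely on if the legacy format stays supported.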

from __future__ import annotations

# ---------------------------------------------------------------------
# 1. Sex / age standard helpers (Issue #122)
Member

Don't reference resolved issues. This will just be confusing for someone in the future.

@@ -209,7 +209,12 @@ def test_can_load_dnam():
assert "cancer" in df.metadata.columns.to_list()
assert np.issubdtype(df.metadata["age"], np.number)
assert np.issubdtype(df.metadata["sex"], np.number)
assert (df.metadata["sex"] != 0).all()
# With new standardization: 0=female, 1=male, NaN=unknown
Member

Don't use terms in comments that will be out of date once the PR is merged. This is not the "new" standard, it's the current and only one in the codebase once this merges.
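
A check consistent with the 0=female, 1=male, NaN=unknown encoding could look like the following. It is a sketch on a hand-built frame, not the test in the repository; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for df.metadata; values are illustrative only.
metadata = pd.DataFrame({"sex": [0, 1, np.nan, 1]})
sex = metadata["sex"]

# The column stays numeric even when unknowns are present (NaN is a float).
assert np.issubdtype(sex.dtype, np.number)
# Sex encoding: 0=female, 1=male, NaN=unknown.
assert sex.dropna().isin([0, 1]).all()
```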

Member

@sarudak sarudak left a comment

LGTM

@marcbal77 marcbal77 merged commit d99029a into bio-learn:master Jul 25, 2025
1 check passed
@marcbal77 marcbal77 deleted the fix/44-metadata-preview branch July 25, 2025 23:12
Development

Successfully merging this pull request may close these issues.

sex coding in biolearn
Design enhancement - improve metadata documentation
2 participants