Dataset Selection and Pathing #17
Open
MattsonCam wants to merge 5 commits into main from
Conversation
added 5 commits on May 7, 2026, 13:31
Point training, analysis scripts, notebooks, and documentation to the new dataset location at /mnt/big_drive/nuclear_speckle_data/initial_dataset/initial_dataset_raw. Also make train.py data-root configurable via --data-dir and NUCLEAR_SPECKLES_DATA_DIR, and remove the stale repo-local dataset ignore entry.
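The data-root resolution described above (CLI flag over environment variable over default path) could be sketched as below. The function and variable names are assumptions; only `--data-dir`, `NUCLEAR_SPECKLES_DATA_DIR`, and the default path come from the commit message.

```python
import argparse
import os

# Default path from the commit message; precedence order is an assumption:
# --data-dir flag > NUCLEAR_SPECKLES_DATA_DIR env var > default.
DEFAULT_DATA_DIR = (
    "/mnt/big_drive/nuclear_speckle_data/initial_dataset/initial_dataset_raw"
)

def resolve_data_dir(cli_value=None):
    """Pick the dataset root, preferring the CLI override."""
    if cli_value:
        return cli_value
    return os.environ.get("NUCLEAR_SPECKLES_DATA_DIR", DEFAULT_DATA_DIR)

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", default=None, help="Override the dataset root")
```

In train.py the resolved path would then replace the previously hardcoded repo-local dataset directory.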
Switch crop cache generation to use U2OS flat TIFF + parquet inputs with underscore-based filename parsing (plate/well/site/channel) and hardcoded U2OS default paths in training. Remove mask-dependent crop processing and use bbox-only CH0->CH2 crop extraction with excluded-folder filtering. Update documentation to match the current state by removing the detailed U2OS-specific data pipeline block from README and keeping the top-level project description concise.
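The underscore-based filename parsing mentioned above might look like the sketch below. The field order (plate, well, site, channel) is taken from the commit message; the example filename and function name are hypothetical.

```python
# Hypothetical sketch of underscore-based TIFF filename parsing.
# Field order plate/well/site/channel is from the commit message;
# everything else is an assumption.
def parse_tiff_name(filename):
    stem = filename.rsplit(".", 1)[0]
    plate, well, site, channel = stem.split("_")[:4]
    # Uppercase the channel ID so cache lookups are case-insensitive.
    return {"plate": plate, "well": well, "site": site, "channel": channel.upper()}

parse_tiff_name("Plate1_B02_s3_ch01.tiff")
# -> {"plate": "Plate1", "well": "B02", "site": "s3", "channel": "CH01"}
```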
Refactor training and cache generation to support selectable dataset configs with per-dataset paths, channel mappings, and schema normalization. U2OS now maps DAPI->CH01 and Gold->CH03, while initial keeps DAPI->CH0 and Gold->CH2, with Image_Metadata_* columns remapped for profile compatibility and top-level parquet directory loading enabled. Move cache outputs to dataset-specific model_cache roots, isolate crop/tensor cache directories per dataset root, and update README usage/docs for dataset selection, channel mappings, and cache locations.
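A minimal sketch of the selectable per-dataset configs described above, with schema normalization. The channel mappings and the `Image_Metadata_*` remapping come from the commit message; the dict layout and function names are assumptions.

```python
# Hypothetical per-dataset config table. Channel mappings are from the
# commit message; keys and structure are assumptions.
DATASET_CONFIGS = {
    "u2os": {
        "channel_map": {"DAPI": "CH01", "Gold": "CH03"},
        "metadata_renames": {"Metadata_Position": "Metadata_Site"},
    },
    "initial": {
        "channel_map": {"DAPI": "CH0", "Gold": "CH2"},
        "metadata_renames": {},
    },
}

def normalize_columns(columns, dataset):
    """Remap Image_Metadata_* columns to Metadata_* for profile
    compatibility, then apply any dataset-specific renames."""
    renames = DATASET_CONFIGS[dataset]["metadata_renames"]
    out = []
    for col in columns:
        if col.startswith("Image_Metadata_"):
            col = col.replace("Image_", "", 1)
        out.append(renames.get(col, col))
    return out
```

Keeping the mappings in one table means cache roots and crop/tensor directories can also be derived per dataset key rather than hardcoded.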
Uppercase parsed/input/target channel IDs in crop cache building to avoid case-sensitivity mismatches during cache lookup and manifest validation. Add Metadata_Position -> Metadata_Site mapping for U2OS so dataset metadata aligns with the training pipeline expectations.
MattsonCam (Member, Author) commented:
The vanilla Unet models were in the repo before, but I accidentally removed them, so I added them back so that training can proceed with this model pipeline.
This PR updates dataset selection and pathing so that I can easily train on different datasets. The main branch will likely not be updated each time I train a new model or perform fine-tuning. Instead, I plan to log the runs in main and develop on separate branches, while still making all of the models available in main. In future PRs I will modify the splitting based on the selected dataset and include additional models.