Dataset Selection and Pathing #17

Open
MattsonCam wants to merge 5 commits into main from train_switch_datasets

@MattsonCam
Member

This PR updates dataset selection and pathing so that I can easily train on different datasets. The main branch will likely not be updated each time I train a new model or perform fine-tuning. Instead, I plan to log the runs in main and develop on different branches. However, I still want to make all of the models available in main. In future PRs, I will modify the splitting based on the selected dataset and include additional models.

Cameron Mattson added 5 commits May 7, 2026 13:31
Point training, analysis scripts, notebooks, and documentation to the new
dataset location at /mnt/big_drive/nuclear_speckle_data/initial_dataset/initial_dataset_raw.
Also make train.py data-root configurable via --data-dir and
NUCLEAR_SPECKLES_DATA_DIR, and remove the stale repo-local dataset ignore entry.
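
For reviewers, here is a minimal sketch of how a configurable data root like the one described in this commit could be resolved. The `--data-dir` flag, the `NUCLEAR_SPECKLES_DATA_DIR` variable, and the default path come from the commit message. The function name `resolve_data_dir` and the precedence order (flag, then environment variable, then default) are assumptions and may not match what train.py actually does.

```python
import argparse
import os
from pathlib import Path

# Default location quoted in the commit message.
DEFAULT_DATA_DIR = "/mnt/big_drive/nuclear_speckle_data/initial_dataset/initial_dataset_raw"
ENV_VAR = "NUCLEAR_SPECKLES_DATA_DIR"


def resolve_data_dir(argv=None, environ=None):
    """Return the data root: --data-dir, else the env var, else the default.

    The precedence order is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Hypothetical train.py arguments")
    parser.add_argument("--data-dir", default=None, help="Root directory of the dataset")
    # parse_known_args ignores any other training flags not modelled here.
    args, _unknown = parser.parse_known_args(argv)
    chosen = args.data_dir or environ.get(ENV_VAR) or DEFAULT_DATA_DIR
    return Path(chosen).expanduser()


if __name__ == "__main__":
    print(resolve_data_dir())
```

Passing `argv` and `environ` explicitly keeps the helper easy to test without touching the real process environment.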
Switch crop cache generation to use U2OS flat TIFF + parquet inputs with
underscore-based filename parsing (plate/well/site/channel) and hardcoded
U2OS default paths in training. Remove mask-dependent crop processing and
use bbox-only CH0->CH2 crop extraction with excluded-folder filtering.

Update documentation to match the current state by removing the detailed
U2OS-specific data pipeline block from README and keeping the top-level
project description concise.
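
To illustrate the underscore-based filename parsing mentioned in this commit, here is a rough sketch. The commit only names the four fields (plate/well/site/channel). The exact layout `<plate>_<well>_<site>_<channel>.tif`, the `ImageKey` type, and the `parse_image_name` helper are hypothetical, so the real field order and count in the repo may differ.

```python
from pathlib import Path
from typing import NamedTuple


class ImageKey(NamedTuple):
    plate: str
    well: str
    site: str
    channel: str


def parse_image_name(path):
    """Split a flat TIFF filename into plate, well, site, and channel fields.

    Assumes a hypothetical layout of <plate>_<well>_<site>_<channel>.tif(f).
    """
    p = Path(path)
    if p.suffix.lower() not in {".tif", ".tiff"}:
        raise ValueError(f"not a TIFF file: {p.name}")
    parts = p.stem.split("_")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 underscore-separated fields, got {len(parts)}: {p.name}"
        )
    return ImageKey(*parts)
```

Failing loudly on an unexpected field count makes it easier to spot files that do not follow the assumed naming scheme before they reach the crop cache.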
Refactor training and cache generation to support selectable dataset configs with per-dataset
paths, channel mappings, and schema normalization. U2OS now maps DAPI->CH01 and Gold->CH03,
while the initial dataset keeps DAPI->CH0 and Gold->CH2, with Image_Metadata_* columns remapped for
profile compatibility and top-level parquet directory loading enabled.

Move cache outputs to dataset-specific model_cache roots, isolate crop/tensor cache directories
per dataset root, and update README usage/docs for dataset selection, channel mappings, and
cache locations.
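
A selectable dataset config along the lines described in this commit might look roughly like the sketch below. The channel mappings (DAPI and Gold per dataset) are taken from the commit message. The `DatasetConfig` class, the registry keys `"initial"` and `"u2os"`, and the one-subdirectory-per-dataset cache layout are assumptions for illustration only.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetConfig:
    name: str
    dapi_channel: str  # channel ID that holds the DAPI stain
    gold_channel: str  # channel ID that holds the Gold stain

    def cache_dir(self, cache_root):
        # Hypothetical layout: one cache subdirectory per dataset name.
        return Path(cache_root) / self.name


# Channel IDs come from the commit message; the key names are assumptions.
DATASETS = {
    "initial": DatasetConfig("initial", dapi_channel="CH0", gold_channel="CH2"),
    "u2os": DatasetConfig("u2os", dapi_channel="CH01", gold_channel="CH03"),
}


def get_dataset_config(name):
    """Look up a dataset config by name, ignoring case."""
    try:
        return DATASETS[name.lower()]
    except KeyError:
        raise ValueError(
            f"unknown dataset {name!r}; choose from {sorted(DATASETS)}"
        ) from None
```

Keeping the mappings in one frozen registry means the training and cache-building code can share a single source of truth for channel IDs.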
Uppercase parsed/input/target channel IDs in crop cache building to avoid
case-sensitivity mismatches during cache lookup and manifest validation.

Add Metadata_Position -> Metadata_Site mapping for U2OS so dataset metadata
aligns with the training pipeline expectations.
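
The last two commits can be pictured with the small sketch below. The `Metadata_Position` to `Metadata_Site` rename and the uppercasing of channel IDs come from the commit messages. The helper names, the whitespace stripping, and the use of plain dict rows (rather than whatever table type the pipeline really uses) are assumptions.

```python
# Column rename taken from the commit message; applied here to plain dict rows.
U2OS_COLUMN_RENAMES = {"Metadata_Position": "Metadata_Site"}


def normalize_channel_id(channel):
    """Uppercase a channel ID so 'ch01' and 'CH01' compare equal.

    Stripping surrounding whitespace is an extra assumption of this sketch.
    """
    return channel.strip().upper()


def rename_columns(row, renames):
    """Return a copy of row with keys renamed according to renames."""
    return {renames.get(key, key): value for key, value in row.items()}
```

For example, `normalize_channel_id("ch01")` and `normalize_channel_id("CH01")` both yield the same key, which is the property the cache lookup and manifest validation need.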

@MattsonCam
Member Author

The vanilla U-Net models were previously in the repo, but I accidentally removed them. I have added them back so that training can proceed with this model pipeline.
