
Conversation


andsild commented Jun 4, 2025

Hi. Let me know if you want me to try and break down the PR into smaller PRs.

Otherwise, this PR is a first step toward working with AML in DSA.
The changes are:

  1. Adds an option for a prediction cutoff, so that predictions are only made on n unlabeled superpixels per slide per round/epoch; all labeled samples are always included (see the first sketch after this list). This speeds up training in cases where you have many slides with many annotations (in our AML case, several million). The benefit is speed; the downside is that feature files may no longer contain all the data if a user wants to download them.
  2. Defines a background superpixel for sparse pixelmaps (in my case, white blood cells, where we are not interested in the space between them). The background superpixel is always ignored. This is mostly for the UI (histomics-label). I tried a cleaner approach that did not use dummy/mock values, but it requires changes to several parts of the code: histomics-label (to index from 1), large_image (to draw superpixels with indexing starting at 1), and histomicstk (to generate superpixels starting at value 1). I wanted to start simple by allowing an empty, unlabeled background superpixel of size 1, which makes it more or less invisible in the UI.
  3. Sorts all labeled superpixels last in the "confidence" order from predictions, to prevent them from re-emerging first in the filmstrip during active learning (see the second sketch below).
  4. Adds tests to make sure my code (hopefully) doesn't break anything. The tests can also be used to benchmark the code.
  5. Makes CUDA use an optional parameter. I've mostly done this so that tests can run on the CPU only, but users can now also toggle GPU/CPU in Slicer.
  6. Miscellaneous: spelling fixes and print statements.
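
As a rough illustration of the cutoff in item 1 (the names here are hypothetical, not the PR's actual code), a minimal sketch that always keeps labeled superpixels and samples at most `cutoff` unlabeled ones per slide per round:

```python
import numpy as np

def select_superpixels(labels, cutoff=500, rng=None):
    """Pick the superpixel indices to featurize/predict for one slide.

    `labels` is a 1-D array where -1 marks an unlabeled superpixel and
    values >= 0 are user-assigned labels. Every labeled superpixel is
    kept; at most `cutoff` unlabeled ones are sampled at random.
    """
    rng = rng or np.random.default_rng()
    labeled = np.flatnonzero(labels >= 0)
    unlabeled = np.flatnonzero(labels < 0)
    if unlabeled.size > cutoff:
        unlabeled = rng.choice(unlabeled, size=cutoff, replace=False)
    return np.concatenate([labeled, unlabeled])
```

The downside mentioned above falls out directly: feature files built from these indices contain only the sampled subset rather than every superpixel on the slide.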

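For item 3, a similar sketch (again with hypothetical names) of forcing already-labeled superpixels to the back of the confidence ordering, so the filmstrip surfaces low-confidence unlabeled superpixels first:

```python
import numpy as np

def confidence_order(confidences, labels):
    """Indices sorted by ascending confidence, labeled superpixels last.

    np.lexsort sorts by its last key first, so `labels >= 0` (False
    before True) is the primary key and confidence breaks ties within
    each group.
    """
    return np.lexsort((confidences, labels >= 0))
```
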
I have tested in DSA with my own AML slides and also with the default superpixels from the UI. The tests should also help establish a baseline of what works and what doesn't (test_full_predict starts from scratch with a slide of MNIST digits and reaches an accuracy of around 80%).

For the UI, see also DigitalSlideArchive/histomics-label#207

cooperlab and others added 23 commits June 4, 2025 11:43
false used to be the default; it now has to be set manually, since we have custom code in the model checkpoint.
This prints out the filename if anything fails, for further inspection.
This is to speed up the AL loop.

Not a perfect solution: the UI will now recommend that users predict
"default" for a lot of the labels. But it is a first step to make sure
we can handle large slides with millions of annotations.
When testing the code, I prefer keeping all the models on CPU.

Before this commit, the models would automatically go to the GPU if
available.

After this commit, users can pass a parameter to keep models on the CPU.
This also makes it easier to run tests in CPU-only environments.
exists, skip it.

This is to allow sparse pixelmaps, which have empty space between bounding
boxes/superpixels.
@andsild andsild marked this pull request as draft June 4, 2025 17:24
<longflag>usecuda</longflag>
<description>Whether or not to use GPU/cuda (true) or cpu (false).</description>
<label>Use CUDA</label>
<default>false</default>
andsild (Author) commented:

this should be true by default
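
On the Python side, a flag like this would typically be honored with an explicit fallback (a sketch assuming PyTorch models; `pick_device` is a hypothetical helper, not code from this PR):

```python
import torch
from torch import nn

def pick_device(usecuda: bool) -> torch.device:
    """Use CUDA only when requested and actually available."""
    if usecuda and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

# Move a model to whichever device the flag resolves to.
model = nn.Linear(64, 2).to(pick_device(usecuda=True))
```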

<longflag>cutoff</longflag>
<label>Number of annotations per slide</label>
<default>500</default>
<description>Number of unannotated superpixels to use per slide for features, training and predictions</description>
andsild (Author) commented:

training only uses labeled samples
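
Put differently, the cutoff only shrinks the prediction/feature set; the training set is simply whatever carries a label. A hypothetical sketch:

```python
import numpy as np

features = np.random.rand(10_000, 64)  # one feature row per superpixel
labels = np.full(10_000, -1)           # -1 = unlabeled
labels[:250] = np.random.randint(0, 3, size=250)

train_X = features[labels >= 0]        # training ignores the cutoff
train_y = labels[labels >= 0]
```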


manthey commented Jun 5, 2025

@andsild I find that having multiple PRs accelerates review -- we can get simple things merged in and then the complex things look simpler. I often will do a git diff master > some_diff_file.txt, then edit the diff file to have the isolated change and apply it to a new branch based on master to tease apart the PRs (but if you use an IDE it might do all that more conveniently).


andsild commented Jun 5, 2025

> @andsild I find that having multiple PRs accelerates review -- we can get simple things merged in and then the complex things look simpler. I often will do a git diff master > some_diff_file.txt, then edit the diff file to have the isolated change and apply it to a new branch based on master to tease apart the PRs (but if you use an IDE it might do all that more conveniently).

I get that 😄 it's good practice to do anyway.

I've opened up

which is just this PR divided into multiple others

@andsild andsild closed this Jun 5, 2025