
Commit 684688a

binliunls, yiheng-wang-nv, and KumoLiu authored
Optimize VISTA3D (#8123)
Fixes #8122.

### Description

As shown in [this PR](Project-MONAI/model-zoo#671), the memory allocation and the mask-embedding for-loop are the bottlenecks behind VISTA3D's slow inference. This PR therefore fixes both: the decoder now clones its input feature map only when the copy is actually needed, and the per-sample for-loop is replaced with a single batched tensor multiplication (see the sketch below).

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.

Signed-off-by: binliu <[email protected]>
Co-authored-by: Yiheng Wang <[email protected]>
Co-authored-by: YunLiu <[email protected]>
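To illustrate why the loop was the slow path, here is a minimal, hypothetical timing sketch (the shapes are made up; `emb` and `src` stand in for `class_embedding` and the image feature map). The loop pays Python-level overhead plus `b` separate matmul dispatches and `b` intermediate allocations, while the fused version issues one broadcasted matmul:

```python
# Hypothetical micro-benchmark of the change (assumed shapes, CPU timing).
import time
import torch

b, c, h, w, d, n = 8, 32, 24, 24, 24, 16  # batch, channels, spatial dims, classes
src = torch.randn(b, c, h, w, d)          # stand-in for the image features
emb = torch.randn(n, 1, c)                # stand-in for class_embedding

def loop_version():                       # old: one matmul per batch element
    masks = []
    for i in range(b):
        mask = emb @ src[[i]].view(1, c, h * w * d)
        masks.append(mask.view(-1, 1, h, w, d))
    return torch.cat(masks, 1)

def fused_version():                      # new: single broadcasted matmul
    out = emb.squeeze() @ src.view(b, c, h * w * d)
    return out.view(b, -1, h, w, d).transpose(0, 1)

for fn in (loop_version, fused_version):
    start = time.perf_counter()
    for _ in range(10):
        fn()
    print(fn.__name__, f"{time.perf_counter() - start:.3f}s")
```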
1 parent 052dbb4 commit 684688a

File tree

2 files changed: +6 −6 lines changed


monai/networks/nets/segresnet_ds.py (+3 −1)
```diff
@@ -508,8 +508,10 @@ def forward( # type: ignore

         outputs: list[torch.Tensor] = []
         outputs_auto: list[torch.Tensor] = []
-        x_ = x.clone()
+        x_ = x
         if with_point:
+            if with_label:
+                x_ = x.clone()
             i = 0
             for level in self.up_layers:
                 x = level["upsample"](x)
```
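The segresnet_ds.py change is purely an allocation fix: the old code cloned the full decoder feature map on every call, even when nothing ever read the copy. A minimal sketch of the deferred-clone pattern (hypothetical `run_branches` helper with stand-in branch bodies, not the real `SegResNetDS2.forward`):

```python
import torch

def run_branches(x: torch.Tensor, with_point: bool, with_label: bool) -> torch.Tensor:
    x_ = x                    # plain alias: costs nothing
    if with_point:
        if with_label:
            x_ = x.clone()    # copy only when the label branch must see pristine features
        x.add_(1.0)           # stand-in for the point branch's in-place work on x
    return x_ * 2 if with_label else x  # stand-in for the label branch
```

With `with_label=False`, no second feature map is ever allocated, which is where the memory and latency saving comes from.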

monai/networks/nets/vista3d.py (+3 −5)
```diff
@@ -639,12 +639,10 @@ def forward(self, src: torch.Tensor, class_vector: torch.Tensor):
         if self.use_mlp:
             class_embedding = self.mlp(class_embedding)
         # [b,1,feat] @ [1,feat,dim], batch dimension become class_embedding batch dimension.
-        masks = []
-        for i in range(b):
-            mask = class_embedding @ src[[i]].view(1, c, h * w * d)
-            masks.append(mask.view(-1, 1, h, w, d))
+        masks_embedding = class_embedding.squeeze() @ src.view(b, c, h * w * d)
+        masks_embedding = masks_embedding.view(b, -1, h, w, d).transpose(0, 1)

-        return torch.cat(masks, 1), class_embedding
+        return masks_embedding, class_embedding


 class TwoWayTransformer(nn.Module):
```
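A quick way to convince yourself that the fused multiplication reproduces the removed loop, as a sketch with assumed toy shapes (not part of the commit):

```python
import torch

b, c, h, w, d, n = 3, 4, 5, 5, 5, 7        # batch, channels, spatial dims, classes (assumed)
src = torch.randn(b, c, h, w, d)
class_embedding = torch.randn(n, 1, c)

# old path: one matmul per batch element, concatenated along dim 1
old = torch.cat(
    [(class_embedding @ src[[i]].view(1, c, h * w * d)).view(-1, 1, h, w, d) for i in range(b)],
    dim=1,
)

# new path: single broadcasted matmul, then restore the (class, batch) ordering
new = (class_embedding.squeeze() @ src.view(b, c, h * w * d)).view(b, -1, h, w, d).transpose(0, 1)

assert old.shape == new.shape == (n, b, h, w, d)
assert torch.allclose(old, new, atol=1e-5)
```

The final `transpose(0, 1)` restores the class-first layout that `torch.cat(masks, 1)` used to produce.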
