insertDeallocate inspects inner scopes #6007

Open

Priya2698 wants to merge 20 commits into main from pm/deallocate

Conversation

@Priya2698 (Collaborator)

No description provided.


github-actions bot commented Feb 24, 2026

Review updated until commit 2078dbc

Description

  • Implement post-dominator tree-based deallocation to handle inner scopes correctly

  • Refactor Node class out of DominatorTree and add parent pointer for tree traversal

  • Add computeLeastCommonAncestor function to find latest safe deallocation point

  • Add tests verifying no memory leaks in two matmul fusion scenarios

Changes walkthrough

Relevant files

Enhancement: csrc/host_ir/allocate_and_deallocate.cpp (+190/-100)
Implement post-dominator tree for deallocation

  • Extract Node class from DominatorTree and add parent pointer
  • Add standalone depthFirstTraverse function for reuse
  • Implement PostDominatorTree class traversing expressions in reverse
  • Add computeLeastCommonAncestor using post-dominator tree
  • Rewrite insertDeallocations to use LCA-based deallocation strategy
  • Replace std::for_each with std::ranges::for_each

Tests: tests/cpp/test_host_ir_passes.cpp (+164/-0)
Add memory leak tests for host IR passes

  • Add HostIrPassesTest fixture with HostIrLowering enabled
  • Add TwoMatmulsInlinable test case checking memory leak
  • Add TwoMatmulsNotInlinable test case checking memory leak
  • Add helper functions to collect and validate persistent TensorViews

Configuration changes: CMakeLists.txt (+1/-0)
Add new test file to build

  • Add test_host_ir_passes.cpp to HOSTIR_TEST_SRCS

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Post-dominator tree build iteration issue

In PostDominatorTree::build (lines 154-176), the reverse iteration uses auto it = exprs.end(); it != exprs.begin(); --it. This pattern is problematic because exprs.end() is a valid iterator to one-past-the-end, and dereferencing it would be undefined behavior. While the loop body doesn't dereference it directly (it decrements first), this pattern is fragile and could cause issues if the container is empty. Consider adding an explicit empty check or using std::ranges::reverse_view for safer reverse iteration.

    void build(Scope& scope, Node* parent) {
      auto& exprs = scope.exprs();
      for (auto it = exprs.end(); it != exprs.begin();) {
        --it;
        Expr* e = *it;
        auto [node_it, inserted] = nodes_.try_emplace(e, &scope, it, parent);
        NVF_ERROR(inserted);
        Node& node = node_it->second;
        if (parent != nullptr) {
          parent->addChild(&node);
        }
    
        if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
          build(loop->body(), &node);
        }
        if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
          build(ite->thenBody(), &node);
          build(ite->elseBody(), &node);
        }
    
        parent = &node;
      }
    }
Missing inner scope handling in PostDominatorTree

The PostDominatorTree::build method handles hir::ForLoop and kir::IfThenElse (lines 166-172). Note the namespace split: the loop type lives in hir while the conditional lives in kir, matching the original DominatorTree::build. However, there may be other scope-creating expressions (e.g., hir::WhileLoop or other compound statements) that should also be handled. Verify all scope-creating expressions are properly covered.

    if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
      build(loop->body(), &node);
    }
    if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
      build(ite->thenBody(), &node);
      build(ite->elseBody(), &node);
    }
Potential null pointer in LCA computation

In computeLeastCommonAncestor (lines 237-294), the findLCA lambda handles null nodes by returning the other node (lines 242-247). However, the depth map lookup at line 248 uses depth.at(a) and depth.at(b) without further null checks after the initial handling. If either a or b became null during the while loops (which shouldn't happen based on the logic), this would throw. The logic appears sound, but the null handling could be more explicit.

    auto findLCA = [&](const Node* a, const Node* b) -> const Node* {
      if (a == nullptr) {
        return b;
      }
      if (b == nullptr) {
        return a;
      }
      int64_t depth_a = depth.at(a);
      int64_t depth_b = depth.at(b);
      while (depth_a > depth_b) {
        a = a->parent();
        depth_a--;
      }
      while (depth_b > depth_a) {
        b = b->parent();
        depth_b--;
      }
      while (a != b) {
        a = a->parent();
        b = b->parent();
      }
      return a;
    };

@Priya2698 (Collaborator, Author)

    !test

    @Priya2698 Priya2698 mentioned this pull request Feb 24, 2026
@Priya2698 (Collaborator, Author)

    !test

@Priya2698 (Collaborator, Author)

    !test

    @Priya2698 Priya2698 marked this pull request as ready for review February 25, 2026 00:05
    @Priya2698 Priya2698 requested a review from wujingyue February 25, 2026 00:06

    greptile-apps bot commented Feb 25, 2026

    Greptile Summary

    Refactors insertDeallocations to inspect inner scopes (loops, if-then-else blocks) instead of only top-level expressions. Key changes:

    • Extracts Node class from DominatorTree with parent pointer support
    • Introduces PostDominatorTree for reverse control flow analysis
    • Implements LCA (Least Common Ancestor) algorithm to find optimal deallocation points across all tensor uses
    • Replaces simple backward iteration with scope-aware deallocation placement using computeLeastCommonAncestor
    • Adds comprehensive tests for memory leak detection in multi-matmul scenarios

    The refactoring addresses previous concerns about .at() throwing errors by eliminating the last_use map approach entirely. The new LCA-based approach correctly handles tensors used across multiple scopes.

Confidence Score: 4/5

• This PR is safe to merge with one minor style suggestion.
• The core refactoring is sound: the LCA algorithm correctly identifies deallocation points, and the post-dominator tree properly traverses inner scopes. The changes successfully address previous concerns about map access errors. The score reflects the single non-critical style issue in the test helper function.
• No files require special attention; the style suggestion in the test file is minor and optional.

Important Files Changed

Filename: Overview

CMakeLists.txt: Adds new test file test_host_ir_passes.cpp to the HOSTIR_TEST_SRCS list.
csrc/host_ir/allocate_and_deallocate.cpp: Refactors deallocation logic to use a post-dominator tree and LCA algorithm for inspecting inner scopes (loops, if-then-else blocks); extracts the Node class and depthFirstTraverse function; replaces top-level-only deallocation with scope-aware placement.
tests/cpp/test_host_ir_passes.cpp: New test file with two test cases for memory leak checking; the collectPersistentTensorViews helper is missing kir::IfThenElse scope traversal.

    Last reviewed commit: 2078dbc

@greptile-apps (Contributor) left a comment

3 files reviewed, 2 comments


    greptile-apps bot commented Feb 25, 2026

    Additional Comments (1)

    csrc/host_ir/allocate_and_deallocate.cpp, line 269
    Add null check before dereferencing definition(). While fusion inputs/outputs are filtered above, a TensorView may not have a definition in other cases (e.g., allocated but unused tensors).

      if (tv->definition() != nullptr && tv->definition()->isA<ShardByStream>()) {
    


    greptile-apps bot commented Feb 26, 2026

    Additional Comments (3)

    csrc/host_ir/allocate_and_deallocate.cpp, line 45
    member variable is iter_ (line 66), not iterator_

        return iter_;
    

    csrc/host_ir/allocate_and_deallocate.cpp, line 49
    member variable is iter_, not iterator_

        return *iter_;
    

    csrc/host_ir/allocate_and_deallocate.cpp, line 194
    it is a reverse iterator (std::reverse_iterator<ExprList::const_iterator>) but Node constructor expects forward iterator type Scope::Iterator (ExprList::const_iterator). Convert using .base() or use forward iteration with different tree construction logic


    greptile-apps bot commented Feb 27, 2026

    Additional Comments (2)

    csrc/host_ir/allocate_and_deallocate.cpp, line 239
    iter() method does not exist on Node

    The Node class (defined at line 30) exposes only iterator(), not iter(). This call will fail to compile.

The diff shows this was introduced by this PR, changing node->iterator() to node->iter(), but no corresponding rename was made in the Node class definition.

                node->scope()->insert(node->iterator(), allocate);
    

    csrc/host_ir/allocate_and_deallocate.cpp, line 181
    Reverse iterator passed where Scope::Iterator (forward iterator) is expected

    scope.exprs() returns const ExprList& (see Scope::exprs() in ir/internal_nodes.h), so rbegin() yields a std::list<Expr*>::const_reverse_iterator. However, the Node constructor signature is:

    Node(Scope* scope, Scope::Iterator iterator, const Node* parent)

    where Scope::Iterator = ExprList::const_iterator. There is no implicit conversion from const_reverse_iterator to const_iterator, so this will fail to compile.

    Beyond the type error, even if the iterator were stored, insertDeallocations calls:

    lca_node->scope()->insert(std::next(lca_node->iterator()), deallocate);

    std::next on a reverse iterator advances backward in the list, so the insertion point would be wrong (before rather than after the LCA node's expression in program order).

    The fix is to convert to a forward iterator before storing. The element pointed to by reverse iterator it can be obtained as a forward iterator via std::prev(it.base()):

        for (auto it = exprs.rbegin(); it != exprs.rend(); ++it) {
          Expr* e = *it;
          Scope::Iterator fwd_it = std::prev(it.base());
          auto [node_it, inserted] = nodes_.try_emplace(e, &scope, fwd_it, parent);

@Priya2698 (Collaborator, Author)

    !test

    @Priya2698 Priya2698 requested a review from wujingyue February 27, 2026 01:00
@Priya2698 (Collaborator, Author)

    !test


    greptile-apps bot commented Feb 27, 2026

    Additional Comments (1)

    tests/cpp/test_host_ir_passes.cpp, line 49
    Missing recursive check for kir::IfThenElse expressions. The function checks ForLoop bodies but not IfThenElse then/else bodies. If allocations exist inside if-then-else blocks, they won't be validated for proper deallocation.

        if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
          collectPersistentTensorViews(loop->body(), allocated);
        }
        if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
          collectPersistentTensorViews(ite->thenBody(), allocated);
          collectPersistentTensorViews(ite->elseBody(), allocated);
        }
    

@wujingyue (Collaborator) left a comment

Almost there. I'm reviewing the LCA part...

    // For each TensorView that is allocated or used as an input, find its
    // least common ancestor in the Post-dominator Tree — the latest point at which
    // it can be deallocated.
    std::unordered_map<TensorView*, const Node*> computeLeastCommonAncestor(
@wujingyue (Collaborator)

Suggested change:

    -std::unordered_map<TensorView*, const Node*> computeLeastCommonAncestor(
    +std::unordered_map<TensorView*, const Node*> computeLowestCommonAncestor(

@Priya2698 (Collaborator, Author)

My bad.

@wujingyue (Collaborator) left a comment

LGTM otherwise

    }
    PostDominatorTree post_dominator_tree(hic);
    const std::unordered_map<TensorView*, const Node*>& lca_map =
        computeLeastCommonAncestor(post_dominator_tree);
@wujingyue (Collaborator)

Consider wrapping this in a class so we don't have to expose std::unordered_map to the user.

    lcas = LowestCommonAncestors(post_dominator_tree);
    lcas.getLca(tv);
    

@Priya2698 (Collaborator, Author)

    What is the downside of exposing the lca_map?

Comment on lines +257 to +281

    std::unordered_map<const Node*, int64_t> depth;

    auto findLCA = [&](const Node* a, const Node* b) -> const Node* {
      if (a == nullptr) {
        return b;
      }
      if (b == nullptr) {
        return a;
      }
      int64_t depth_a = depth.at(a);
      int64_t depth_b = depth.at(b);
      while (depth_a > depth_b) {
        a = a->parent();
        depth_a--;
      }
      while (depth_b > depth_a) {
        b = b->parent();
        depth_b--;
      }
      while (a != b) {
        a = a->parent();
        b = b->parent();
      }
      return a;
    };
@wujingyue (Collaborator)

Consider making it a private method of class LowestCommonAncestors, and making depth LowestCommonAncestors::depth_.

Comment on lines +27 to +64

    // Traverse the IR, collect all allocated TensorViews, and remove them when
    // a Deallocate is encountered.
    void collectPersistentTensorViews(
        const Scope& scope,
        std::unordered_set<TensorView*>& allocated) {
      for (Expr* e : scope.exprs()) {
        if (auto* dealloc = dynamic_cast<hir::Deallocate*>(e)) {
          allocated.erase(dealloc->buffer());
          continue;
        }
        if (auto* alloc = dynamic_cast<kir::Allocate*>(e)) {
          allocated.insert(alloc->buffer()->as<TensorView>());
          continue;
        }
        for (auto* tv : ir_utils::filterByType<TensorView>(e->inputs())) {
          allocated.insert(tv);
        }
        for (auto* tv : ir_utils::filterByType<TensorView>(e->outputs())) {
          allocated.insert(tv);
        }
        if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
          collectPersistentTensorViews(loop->body(), allocated);
        }
      }
    }

    void checkMemoryLeak(const hir::HostIrContainer& hic) {
      std::unordered_set<TensorView*> allocated;
      collectPersistentTensorViews(hic.topLevel(), allocated);
      EXPECT_TRUE(std::all_of(
          allocated.begin(),
          allocated.end(),
          [](TensorView* tv) {
            return tv->isFusionInput() || tv->isFusionOutput();
          }))
          << "Some TensorViews allocated in IR are not deallocated and not "
             "fusion inputs/outputs.";
    }
@wujingyue (Collaborator)

I understood your intention for better coverage but these helpers make the test less DAMP: https://testing.googleblog.com/2019/12/testing-on-toilet-tests-too-dry-make.html. Someone debugging a test has to also understand the logic of these two functions.

It may work better if we simply test the number of deallocations in the host IR container.

@Priya2698 (Collaborator, Author)

I see your point.

What do you think about testing the deallocation for the intermediate output, that is, the number of deallocations in top_level?

When testing directly, I don't find it very obvious how many intermediates will get a deallocate; hence the traversal-based approach, so I don't have to figure that out when reading the test.


    greptile-apps bot commented Feb 28, 2026

    Additional Comments (1)

    tests/cpp/test_host_ir_passes.cpp, line 49
    also handle kir::IfThenElse to match the main code's scope traversal

        if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
          collectPersistentTensorViews(loop->body(), allocated);
        }
        if (auto* ite = dynamic_cast<kir::IfThenElse*>(e)) {
          collectPersistentTensorViews(ite->thenBody(), allocated);
          collectPersistentTensorViews(ite->elseBody(), allocated);
        }
    
