
Conversation

shengfukevin (Contributor)

Summary

Add a tensor lazy-allocation and recycling strategy to comp replay.

Test Plan

mpirun -np 2 et_replay.par --trace-path param_bench/fb/integration_tests/resnet-2gpu

@facebook-github-bot added the CLA Signed label on Aug 28, 2024.
@facebook-github-bot (Contributor) commented:

@shengfukevin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@shengfukevin (Contributor, Author) left a comment:


Thanks for adding support for lazy tensor allocation and for recycling storage tensors. Please check my inline comments.

I suggest skipping this feature for now, since it may need more work to make it robust. I did not run into out-of-memory issues with the current version.

n.op_schema == ""
or n.name == "aten::record_stream"
or n.name.startswith("aten::_foreach")
#or n.name.startswith("aten::_foreach")
@shengfukevin (Contributor, Author):

@TaekyungHeo, why is this line commented out?

# Tensors with the same storage id may be located on different devices.
self.tensor_storage_map: Dict[int, []] = defaultdict(set)
self.tensor_storage_map: Dict[int, Dict[torch.device, torch.Tensor]] = {}
self.tensor_alloc_set = set()
@shengfukevin (Contributor, Author):

It looks like tensor_alloc_set is populated but never referenced. Shall we clean it up?


def allocate_tensors(self):
start_ns = time.time_ns()
def get_tensor_from_storage(self, node, storage_id, data_offset, elem_bytes, device, shape, data_type):
@shengfukevin (Contributor, Author):

The tensor stride is missing. Please refer to the code in the main branch to fix it.
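For reference, a minimal sketch of how the stride could be wired through, assuming the storage is backed by a flat buffer and the tensor is built with Tensor.set_ and an explicit stride; the function name and parameters below are illustrative, not the PR's actual signature:

```python
import torch

def strided_tensor_from_storage(
    storage_tensor: torch.Tensor,  # flat buffer backing the shared storage
    data_offset: int,              # offset into the storage, in elements
    shape: tuple,
    strides: tuple,                # e.g. from node.get_input_tensor_strides()
    data_type: torch.dtype,
    device: torch.device,
) -> torch.Tensor:
    # Build a view over the shared storage with the recorded stride instead
    # of assuming a contiguous layout.
    t = torch.empty(0, dtype=data_type, device=device)
    t.set_(
        storage_tensor.untyped_storage(),
        storage_offset=data_offset,
        size=shape,
        stride=strides,
    )
    return t
```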

tensor_strides = node.get_input_tensor_strides()
for idx, (data_type, t_id, shape) in enumerate(get_input_tensors(node)):
device = self.device
tensor_id, storage_id, storage_offset, element_num, item_size, device_str = t_id
@shengfukevin (Contributor, Author):

This may fail; t_id may not include device_str.
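One hedged way to guard against that, assuming the only other variant in traces is a five-field tuple without the device string (the fallback is illustrative):

```python
# t_id may or may not carry a trailing device string, depending on the trace.
if len(t_id) == 6:
    tensor_id, storage_id, storage_offset, element_num, item_size, device_str = t_id
else:
    tensor_id, storage_id, storage_offset, element_num, item_size = t_id
    device_str = str(device)  # fall back to the replay device chosen above
```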


self.tensor_storage_sizes[storage_id] = max(storage_offset + element_num * item_size, self.tensor_storage_sizes[storage_id])

self.tensor_registry_permanent[replay_t_id] = ("lazy_alloc", (storage_id, device),
@shengfukevin (Contributor, Author):

This code hard-codes the tensor allocation to "lazy_alloc". I suggest making it optional, since lazy_alloc may affect performance. If tensors are allocated ahead of time, we usually do not include allocation time in the benchmark; with lazy_alloc, the allocation time is always included in the benchmarked time.
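As a rough sketch of the suggestion, registration could branch on a flag such as --lazy-alloc-tensor; the helper below and its tuple layout are illustrative, not the PR's exact code:

```python
def register_input_tensor(registry, replay_t_id, make_tensor, node, lazy_alloc: bool):
    """Register either a lazy placeholder or an eagerly allocated tensor.

    make_tensor(node) is a closure that builds the tensor from its storage
    (e.g. via get_tensor_from_storage); lazy_alloc would come from a
    --lazy-alloc-tensor command-line flag.
    """
    if lazy_alloc:
        # Defer allocation: the replay loop materializes this on first use,
        # so allocation time lands inside the measured region.
        registry[replay_t_id] = ("lazy_alloc", replay_t_id, make_tensor)
    else:
        # Eager path: allocate now so allocation stays out of the benchmark.
        registry[replay_t_id] = make_tensor(node)
```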

def lazy_alloc_tensors(self, inputs, node):
for i in range(len(inputs)):
if isinstance(inputs[i], tuple) and inputs[i][0] == "lazy_alloc":
inputs[i] = inputs[i][2](node)
@shengfukevin (Contributor, Author):

lazy_alloc should only run once, and self.tensor_registry should be updated to reuse the allocated tensor on later accesses. This code seems to re-allocate the tensor on every access.
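A minimal sketch of the allocate-once behavior, assuming the placeholder tuple also carries its registry key (the PR's tuple stores (storage_id, device) instead, so this layout is illustrative):

```python
def lazy_alloc_tensors(tensor_registry, inputs, node):
    # Materialize "lazy_alloc" placeholders exactly once and cache the result,
    # so later accesses reuse the tensor instead of allocating again.
    for i in range(len(inputs)):
        entry = inputs[i]
        if isinstance(entry, tuple) and entry and entry[0] == "lazy_alloc":
            _, replay_t_id, alloc_fn = entry
            if replay_t_id not in tensor_registry:
                tensor_registry[replay_t_id] = alloc_fn(node)  # first use: allocate
            inputs[i] = tensor_registry[replay_t_id]           # reuse afterwards
```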

del self.tensor_registry[replay_t_id]
need_del_replay_t_ids_in_input.add(replay_t_id)
elif replay_t_id in self.instantiate and device is not None:
self.recycle_instantiate_tensors(node.id, storage_id, device)
@shengfukevin (Contributor, Author):

I think reference counting is overkill for recycling the tensors in self.instantiate. Why not just use the same idea as self.replay_tensor_id_to_last_node_id_map? We can add a storage_tensor_id_to_last_node_id_map and populate it in add_unique_tensor.
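A sketch of that last-use idea, with get_input_storage_ids standing in for whatever helper extracts storage ids from a node's inputs (both helpers below are illustrative):

```python
def build_storage_last_use_map(nodes, get_input_storage_ids):
    # Map each storage id to the id of the last node that touches it,
    # mirroring replay_tensor_id_to_last_node_id_map but keyed by storage id.
    storage_tensor_id_to_last_node_id_map = {}
    for node in nodes:  # nodes assumed to be in replay order
        for storage_id in get_input_storage_ids(node):
            storage_tensor_id_to_last_node_id_map[storage_id] = node.id
    return storage_tensor_id_to_last_node_id_map

def recycle_storages_after(node_id, last_use_map, storage_map):
    # After replaying node_id, drop storages whose last use has passed so the
    # allocator can reclaim them; no reference counting needed.
    for storage_id, last_node_id in list(last_use_map.items()):
        if last_node_id == node_id:
            storage_map.pop(storage_id, None)
            del last_use_map[storage_id]
```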

help="when a et_id is being replayed multiple times, setting this to false will use temsors from previous runs.",
)
parser.add_argument(
"--recycle_storages",
@shengfukevin (Contributor, Author):

Maybe change this to "recycle_tensor_storage"? Also, if it is true, we need to allocate the storage tensors again when the next iteration starts.

We also need another flag, "--lazy-alloc-tensor".
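A sketch of how the two flags might look (names follow the suggestions above; defaults and help text are illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--recycle_tensor_storage",
    action="store_true",
    default=False,
    help="free storage tensors after their last use in an iteration; "
    "they are re-allocated at the start of the next iteration.",
)
parser.add_argument(
    "--lazy-alloc-tensor",
    action="store_true",
    default=False,
    help="defer tensor allocation until first use during replay "
    "(allocation time is then included in the measured time).",
)
```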

meta-cla bot commented on Aug 16, 2025:

Hi @shengfukevin!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
