Merge latest changes from main to 'Documentation' branch #192

rsareddy0329 · 2025-08-05T23:05:11Z

PR Approval Steps

For Requester

Description
- Check the PR title and description for clarity. They should describe the changes made and the reason behind them.
- Ensure that the PR follows the contribution guidelines, if applicable.
Security requirements
- Scan the Pull Request (PR) with git-secrets to ensure that it does not expose passwords or other sensitive information, and upload the relevant evidence: https://github.com/awslabs/git-secrets
- Ensure that every commit has a GitHub commit signature.
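As a rough illustration of the kind of check a secret scan performs, the sketch below flags lines that look like AWS access key IDs. It is not git-secrets and is no substitute for running it; the pattern is a simplified assumption, and `AWS_ACCESS_KEY_ID` and `find_suspect_lines` are hypothetical names used only for this example.

```python
import re
from typing import List, Tuple

# Assumption: a simplified pattern for AWS access key IDs
# ("AKIA" or "ASIA" followed by 16 uppercase letters or digits).
# Real scanners such as git-secrets use a broader set of patterns.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")


def find_suspect_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every line matching the pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if AWS_ACCESS_KEY_ID.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = "region = us-east-2\naws_access_key_id = AKIAIOSFODNN7EXAMPLE\n"
    for number, line in find_suspect_lines(sample):
        print(f"line {number}: {line}")
```

A match only means the line deserves a closer look; the evidence uploaded with the PR should still come from git-secrets itself.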
Manual review
1. Click on the Files changed tab to see the code changes. Review the changes thoroughly:
  - Code Quality: Check for coding standards, naming conventions, and readability.
  - Functionality: Ensure that the changes meet the requirements and that all necessary code paths are tested.
  - Security: Check for any security issues or vulnerabilities.
  - Documentation: Confirm that any necessary documentation (code comments, README updates, etc.) has been updated.
Check for Merge Conflicts:
- Verify whether there are any merge conflicts with the base branch; GitHub usually highlights them. If there are conflicts, resolve them.

For Reviewer

Go through the For Requester section and double-check each item.
Request Changes or Approve the PR:
1. If the PR is ready to be merged, click Review changes and select Approve.
2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
Merging the PR
1. Check the Merge Method:
  1. Decide on the appropriate merge method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
2. Merge the PR:
  1. Click the Merge pull request button.
  2. Confirm the merge by clicking Confirm merge.

* Enable Hyperpod telemetry
* Enable Hyperpod telemetry
* Enable Hyperpod telemetry
* Enable Hyperpod telemetry
* Enable Hyperpod telemetry
* Enable Hyperpod telemetry
* CLI: Enable Telemetry
* CLI: Enable Telemetry
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>

…8s (#138)
* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Fix unit test cases
* Move regex to a constant.

**Description**
- Removed an integration test case that was being mocked.
- Moved a regex to a constant.

**Testing Done**
Unit test cases pass; no changes were made to integration test cases, and they should not be affected.

* Add k8s version validation check between server and client version according to the supported versioning constraints by k8s
* Add ref link for version compatibility constraints

**Description**
Added a link to the k8s documentation that describes the constraints governing version compatibility between client and server.

**Testing Done**
No breaking changes.
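The client/server version check described above can be sketched as follows. This is a minimal illustration, not the code from #138: it assumes the kubectl rule from the Kubernetes version skew policy (the client is supported within one minor version, older or newer, of the API server), and `K8S_VERSION_PATTERN`, `MAX_MINOR_SKEW`, `parse_major_minor`, and `is_compatible` are hypothetical names.

```python
import re
from typing import Tuple

# Hypothetical constant: matches the leading "major.minor" of versions
# such as "v1.29.3", "1.30", or "v1.28.3-eks-abc".
K8S_VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)")

# Assumption: allow the client to differ from the server by at most
# one minor version, as the kubectl skew policy describes.
MAX_MINOR_SKEW = 1


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a Kubernetes version string."""
    match = K8S_VERSION_PATTERN.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(client_version: str, server_version: str) -> bool:
    """Return True when the client is within the allowed minor-version skew."""
    client_major, client_minor = parse_major_minor(client_version)
    server_major, server_minor = parse_major_minor(server_version)
    if client_major != server_major:
        return False
    return abs(client_minor - server_minor) <= MAX_MINOR_SKEW


if __name__ == "__main__":
    for client, server in [("v1.29.3", "v1.30.1"), ("1.27", "v1.30.0")]:
        status = "compatible" if is_compatible(client, server) else "incompatible"
        print(f"client {client} / server {server}: {status}")
```

Keeping the pattern in a module-level constant mirrors the "Move regex to a constant" change: the expression is compiled once and can be reused by tests.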

* Update documentation-with-new-changes branch with latest changes from main (#190)
* Fix training test (#184)
* Fix SDK training test: Add wait time before refresh
* Fix training tests in canaries
* Update logging information for submitting and deleting training job (#189)
Co-authored-by: pintaoz <[email protected]>
---------
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>
* Documentation Fixes (#191)
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* update documentation with new changes branch with latest changes (#194)
* Fix training test (#184)
* Fix SDK training test: Add wait time before refresh
* Fix training tests in canaries
* Update logging information for submitting and deleting training job (#189)
Co-authored-by: pintaoz <[email protected]>
---------
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>
* Documentation Fixes (#195)
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Documentation Fixes (#197)
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Documentation Fixes (#198)
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Documentation fixes (#199)
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
* Documentation Fixes
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
---------
Co-authored-by: Zhaoqi <[email protected]>
Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>
Co-authored-by: Roja Reddy Sareddy <[email protected]>

* Add labels to the top level metadata (#158)
Co-authored-by: pintaoz <[email protected]>
* Implemented GPU Quota Allocation Feature.
Co-authored-by: aleszewi <[email protected]>
* Revert "Implemented GPU Quota Allocation Feature." This reverts commit 790b8f1df59494a982463aaed9e5b3f2afa44123.
* Fix: Template issue - pick user defined template version (#154)
* Fix: Template issue - pick user defined template version
* Fix: Template issue - pick user defined template version & add topology labels in 1.1
* Fix: Template issue - pick user defined template version & add topology labels in 1.1
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Fix: Add __init__ to the new schema (#163)
* Fix: Template issue - pick user defined template version
* Fix: Template issue - pick user defined template version & add topology labels in 1.1
* Fix: Template issue - pick user defined template version & add topology labels in 1.1
* Fix: Add __init__ to load the new schema
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Add labels and annotations to top level metadata v1.1 (#165)
* Add labels to top level metadata v1.1
* Move topology labels to annotations
* Update topology parameter names
* Add unit test
---------
Co-authored-by: pintaoz <[email protected]>
* Added GPU quota allocation.
Co-authored-by: aleszewi <[email protected]>
* Changed neuron key to neurondevice. (#177)
Co-authored-by: Marta Aleszewicz <[email protected]>
* fix: Renamed memory-in-gib to memory for consistency. (#179)
cr: https://code.amazon.com/reviews/CR-214599587
Co-authored-by: Marta Aleszewicz <[email protected]>
* Add validation to topology labels (#178)
* Add validation to topology labels
* Add validation to topology labels
* Add validation to topology labels
---------
Co-authored-by: Roja Reddy Sareddy <[email protected]>
* Add integ tests for topology annotations (#180)
* Add labels to top level metadata v1.1
* Move topology labels to annotations
* Update topology parameter names
* Add unit test
* Topology integ tests
* Add invalid test case
* Add empty test case
---------
Co-authored-by: pintaoz <[email protected]>
* Add integration tests for gpu quota allocation feature (#184)
* add integration tests for gpu quota allocation feature
* add valueError assertions for invalid test cases
* Updating the CHANGELOG and minor version
---------
Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: pintaoz <[email protected]>
Co-authored-by: Marta Aleszewicz <[email protected]>
Co-authored-by: rsareddy0329 <[email protected]>
Co-authored-by: Roja Reddy Sareddy <[email protected]>
Co-authored-by: mx26pol <[email protected]>
Co-authored-by: satish Kumar <[email protected]>

…gs to hyp-jumpstart-endpoint (#213) * Update generate_click_command inject logic to not expose unwanted flags to hyp-jumpstart-endpoint * Update unit tests for bug fix, change --label_selector to --label-selector

* Update generate_click_command inject logic to not expose unwanted flags to hyp-jumpstart-endpoint
* Update unit tests for bug fix, change --label_selector to --label-selector
* Update README, example notebooks and documentation to 1) remove model_version, 2) add --model-volume-mount-name, 3) remove tar.gz from --model-location, 4) update unique mount_path for --volume
* Update README, example notebooks and documentation to remove tls-config for jumpstart
* minor update to remove tar.gz from --model-location for documentation

**Description**
- Removed outdated Helm installation requirement for HyperPod CLI V3
- Fixed step numbering in installation section (1, 2, 3 instead of 1, 1, 1)
- Simplified installation process by removing unnecessary Helm setup steps

**Testing Done**
Not needed, just README updates.

* feat: add get_operator_logs to pytorch job * feat: add get_operator_logs to pytorch job * feat: add get_operator_logs to pytorch job * feat: add get_operator_logs to pytorch job --------- Co-authored-by: Roja Reddy Sareddy <[email protected]>

Aditi2424 and others added 25 commits July 18, 2025 12:24

Update telemetry status to be Integer for parity (#130)

223af40

Co-authored-by: adishaa <[email protected]>

Release new version for Health Monitoring Agent (1.0.643.0_1.0.192.0)…

cf77296

… with minor improvements and bug fixes (#137)

Release new version for Health Monitoring Agent (1.0.674.0_1.0.199.0)…

0342f60

… with minor improvements and bug fixes. (#139)

update inference CLI describe command print for better visualization …

631ddf9

…and ux (#136)

Update inference integ test to add dependency to improve telemetry ex…

dc440c3

…ception count data (#140)

Manual release v3.0.1 (#143)

cc08405

* manual release v3.0.1

change security-monitoring metrics data destination to us-east-2 for …

079fafd

…alarm fix (#147)

feat: Add region detection to install Health Monitoring Agent and use…

29a16c5

… regionalized HMA URI (#141)

Add unique time string to integ test (#150)

66232ed

* Add unique time string to integ test * Update syntax

update example notebook for inference CLI (#151)

9fbec4a

Training: Main documentation update (#153)

8034a24

* Training CLI & SDK: example notebook and README update * Update training cli example notebook --------- Co-authored-by: Roja Reddy Sareddy <[email protected]>

Update inference SDK examples (#155)

0bcee6d

* Update inference SDK examples * Update readme

update help text to avoid truncation (#158)

d2130e9

Add an option to disable the deployment of KubeFlow TrainingOperator (#…

293f9b9

…102)

Remove unused param from documentation (#170)

9f534b4

Update volume flag to support hostPath and pvc (#171)

ec8800d

* update help text to avoid truncation * update volume flag to support hostPath and pvc, before e2e testing * clean up and e2e working * Minor updates after PR * update * Added unit tests for volume, all cli unit tests passed

Restructure list-cluster output (#173)

95e073e

Co-authored-by: pintaoz <[email protected]>

Update inference config and integ tests (#167)

a8a2baf

* Update inference config and integ tests * Update integ tests for new canaries

Update readme for volume flag (#176)

2908a62

Manual release v3.0.2 (#177)

9b7220c

* Manual release v3.0.2 * Update changelog --------- Co-authored-by: pintaoz <[email protected]>

Add schema pattern check to pytorch-job template (#178)

36fac66

* Update readme for volume flag * Add schema pattern check to pytorch-job template, unit test added, all test passed locally

Fix training test (#184)

dcbc8fb

* Fix SDK training test: Add wait time before refresh * Fix training tests in canaries

Update logging information for submitting and deleting training job (#…

28424e4

…189) Co-authored-by: pintaoz <[email protected]>

rsareddy0329 requested a review from a team as a code owner August 5, 2025 23:05

rsareddy0329 and others added 4 commits August 6, 2025 13:51

Added new column 'deploymeny configs' to the itable that allows user'…

6553766

…s to view SDK config code (#188) Co-authored-by: Mohamed Zeidan <[email protected]>

Add instance type support for ml.p6e-gb200.36xlarge (#204)

63ff3b4

* Add instance type support for ml.p6e-gb200.36xlarge Updated support for ml.p6-b200.48xlarge as well * Add ml.p6e-gb200.36xlarge to efa plugin

changed endpoint name from value user has to manually insert to place…

e3f697a

…holder value (#206) Co-authored-by: Mohamed Zeidan <[email protected]>

rsareddy0329 and others added 17 commits August 12, 2025 14:55

Enable PR checks on feature branches (#207)

d16d1b3

Co-authored-by: Roja Reddy Sareddy <[email protected]>

update CHANGELOG.md (#175)

96c5b2b

Add metadata_name argument to js and custom endpoint to match with SDK (

f747815

#219) * add metadata_name argument to js and custom endpoint to match with SDK * fix integ

Add cert mgr installation which is required by HPTO (#180)

a4f0465

* Add cert mgr installation * Add cert mgr installation * update cert-mgr readme --------- Co-authored-by: Xin Wang <[email protected]>

Implementing hyp version command (#223)

9c07154

Update description for scheduler type (#222)

73a41b3

* Update description for scheduler type Tested in terminal with command `hyp create hyp-pytorch-job --help` and can see new description * Update scheduler type description in v1_0

fix: Set cert mgr installation disable by default (#224)

743bd4d

Co-authored-by: Xin Wang <[email protected]>

Release new version for Health Monitoring Agent (1.0.742.0_1.0.241.0)…

99121e7

… with minor improvements and bug fixes. (#225)

Change default container name in pytorch template (#220)

d2bd3c2

* add metadata_name argument to js and custom endpoint to match with SDK * fix integ * change container name in pytorch template * update v1_0 too * update default container name for pytorch job template

Enhanced Error Handling for all hyp commands

cc9eec6

update v1.1 pytorch job template to match parity with v1.0 change in …

f571859

…staging repo (#228)

Update list_pods to only display pods of corresponding endpoint type (#…

935a4d9

…227) * Update list_pods to only display pods of corresponding endpoint type * Use list endpoints to check endpoint type --------- Co-authored-by: pintaoz <[email protected]>
