Skip to content

Commit 74a8ec3

Browse files
authored
Notebook update code + updated integration test (#61)
Notebook Instance Update PR This is a PR that adds the update operation to the notebook operator and also contains a sample. How update works: 1. Controller stops the Notebook if its not Stopped already.(Sets a status field "StoppedByControllerMETA") if it did stop the notebook. 2. Controller requeues during intermediate "Stopping" state until it reaches "Stopped" 3. Controller updates the notebook and also sets the StoppedByControllerMETA to "UpdateTriggered" if its not nil. 4. After the update the controller starts the Notebook if StoppedByControllerMETA is "UpdateTriggered". 5. The controller resets the StoppedByControllerMETA status field. Note: The update operation is supposed to start the notebook after an update if it was not in a stopped state and keep the notebook in the stopped state if it was in the stopped state prior to an update. We use a status field StoppedByControllerMETA to keep track of whether the notebook was stopped by a controller and an annotation done_updating to keep track of whether an update has finished. Why is StoppedByControllerMETA a status and not an annotation? After the controller stops a Notebook Instance, it can’t update the notebook because it will usually be in the stopping state and will stay in “Stopping” for at least a minute. Because of this an error is returned as the notebook can only be updated when it is in the stopped state. If we dont return an error the spec gets updated and the controller wont be able to detect a delta. Currently it is impossible for the controller to update an annotation if an update call has failed. If an update fails the controller patches the resource status and not the metadata(https://github.com/aws-controllers-k8s/runtime/blob/055b089c6a508317ad4bb57ce037dc55061e1829/pkg/runtime/reconciler.go#L235). We can’t also not return an error because that would patch the spec too(which would end up with the controller not calling update). Note: Even after 0.7.1, you cant update annotations(metadata) without also updating the spec, we cannot update the spec when we stop the notebook as that would prevent the update from occurring(runtime will not detect a diff).(https://github.com/aws-controllers-k8s/runtime/blob/d7b6e45c7ba9e54664b49f681747abf275a232f9/pkg/runtime/reconciler.go#L587) Files added: custom_sdk_api.go: Handles api calls to Sagemaker Hooks.go: Contains code to requeue and also customsetoutputDescribe which is used to set the resource synced conditions and also used to start the Notebook after an update(if it was not stopped before the update). notebook_instance/sdk_read_one_post_set_output.go.tpl: Calls custom_set_describe notebook_instance/sdk_read_one_pre_set_output.go.tpl. sdk_read_one_pre_set_output.go.tpl sets the LifecycleConfigName spec field(code generator does not generate it). notebook_instance/sdk_update_pre_build_request.go.tpl: Requeues if in the stopping/updating/pending state. If the Notebook is in the InService state stopNotebookInstance is called, if it does not return an error, the status field StoppedByControllerMETA is set to true. notebook_instance/sdk_update_post_set_output.go.tpl: Sets the is_updating status and sets the resource synced condition to false so the controller requeues. Covered in Create/Delete: #56 Removed a check for assert for UpdateTriggered because it makes the test much more flakey because update can take as little as a minute at times ———————————————————————————————————— Custom code: pkg/resource/notebook_instance/custom_set_delta.go : Fills in default values to prevent an update after creation pkg/resource/notebook_instance/hooks.go : Contains code to requeue and set sync conditions, also contains custom_set_output_describe that starts a Notebook(after an update). templates/notebook_instance/sdk_delete_pre_build_request.go.tpl : Stops the notebook and re queues on any state other than Stopped/Failed.
1 parent 56688d0 commit 74a8ec3

19 files changed

+479
-28
lines changed

apis/v1alpha1/generator.yaml

+13-4
Original file line numberDiff line numberDiff line change
@@ -652,14 +652,18 @@ resources:
652652
- InvalidAction
653653
- UnrecognizedClientException
654654
hooks:
655+
sdk_read_one_post_set_output:
656+
template_path: notebook_instance/sdk_read_one_post_set_output.go.tpl
657+
delta_pre_compare:
658+
code: customSetDefaults(a, b)
659+
sdk_update_pre_build_request:
660+
template_path: notebook_instance/sdk_update_pre_build_request.go.tpl
661+
sdk_update_post_set_output:
662+
code: rm.customSetOutputUpdate(ko, latest)
655663
sdk_delete_pre_build_request:
656664
template_path: notebook_instance/sdk_delete_pre_build_request.go.tpl
657665
sdk_delete_post_request:
658666
template_path: common/sdk_delete_post_request.go.tpl
659-
delta_pre_compare:
660-
code: customSetDefaults(a, b)
661-
sdk_read_one_post_set_output:
662-
code: rm.customSetOutput(&resource{ko})
663667
fields:
664668
NotebookInstanceStatus:
665669
is_read_only: true
@@ -673,6 +677,11 @@ resources:
673677
from:
674678
operation: DescribeNotebookInstance
675679
path: Url
680+
stoppedByControllerMetadata:
681+
is_read_only: true
682+
from:
683+
operation: DescribeNotebookInstance
684+
path: Url
676685
FailureReason:
677686
is_read_only: true
678687
print:

apis/v1alpha1/notebook_instance.go

+3
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

apis/v1alpha1/zz_generated.deepcopy.go

+5
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

config/crd/bases/sagemaker.services.k8s.aws_notebookinstances.yaml

+4
Original file line numberDiff line numberDiff line change
@@ -224,6 +224,10 @@ spec:
224224
notebookInstanceStatus:
225225
description: The status of the notebook instance.
226226
type: string
227+
stoppedByControllerMetadata:
228+
description: The URL that you use to connect to the Jupyter notebook
229+
that is running in your notebook instance.
230+
type: string
227231
url:
228232
description: The URL that you use to connect to the Jupyter notebook
229233
that is running in your notebook instance.

generator.yaml

+13-4
Original file line numberDiff line numberDiff line change
@@ -652,14 +652,18 @@ resources:
652652
- InvalidAction
653653
- UnrecognizedClientException
654654
hooks:
655+
sdk_read_one_post_set_output:
656+
template_path: notebook_instance/sdk_read_one_post_set_output.go.tpl
657+
delta_pre_compare:
658+
code: customSetDefaults(a, b)
659+
sdk_update_pre_build_request:
660+
template_path: notebook_instance/sdk_update_pre_build_request.go.tpl
661+
sdk_update_post_set_output:
662+
code: rm.customSetOutputUpdate(ko, latest)
655663
sdk_delete_pre_build_request:
656664
template_path: notebook_instance/sdk_delete_pre_build_request.go.tpl
657665
sdk_delete_post_request:
658666
template_path: common/sdk_delete_post_request.go.tpl
659-
delta_pre_compare:
660-
code: customSetDefaults(a, b)
661-
sdk_read_one_post_set_output:
662-
code: rm.customSetOutput(&resource{ko})
663667
fields:
664668
NotebookInstanceStatus:
665669
is_read_only: true
@@ -673,6 +677,11 @@ resources:
673677
from:
674678
operation: DescribeNotebookInstance
675679
path: Url
680+
stoppedByControllerMetadata:
681+
is_read_only: true
682+
from:
683+
operation: DescribeNotebookInstance
684+
path: Url
676685
FailureReason:
677686
is_read_only: true
678687
print:

helm/Chart.yaml

+2-2
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,8 @@
11
apiVersion: v1
22
name: sagemaker-chart
33
description: A Helm chart for the ACK service controller for sagemaker
4-
version: v0.0.3
5-
appVersion: v0.0.3
4+
version: v0.0.4-beta
5+
appVersion: v0.0.4-beta
66
home: https://github.com/aws-controllers-k8s/sagemaker-controller
77
icon: https://raw.githubusercontent.com/aws/eks-charts/master/docs/logo/aws.png
88
sources:
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,249 @@
1+
2+
---
3+
apiVersion: apiextensions.k8s.io/v1
4+
kind: CustomResourceDefinition
5+
metadata:
6+
annotations:
7+
controller-gen.kubebuilder.io/version: v0.6.1
8+
creationTimestamp: null
9+
name: notebookinstances.sagemaker.services.k8s.aws
10+
spec:
11+
group: sagemaker.services.k8s.aws
12+
names:
13+
kind: NotebookInstance
14+
listKind: NotebookInstanceList
15+
plural: notebookinstances
16+
singular: notebookinstance
17+
scope: Namespaced
18+
versions:
19+
- additionalPrinterColumns:
20+
- jsonPath: .status.failureReason
21+
name: FAILURE-REASON
22+
priority: 1
23+
type: string
24+
- jsonPath: .status.notebookInstanceStatus
25+
name: STATUS
26+
type: string
27+
name: v1alpha1
28+
schema:
29+
openAPIV3Schema:
30+
description: NotebookInstance is the Schema for the NotebookInstances API
31+
properties:
32+
apiVersion:
33+
description: 'APIVersion defines the versioned schema of this representation
34+
of an object. Servers should convert recognized schemas to the latest
35+
internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
36+
type: string
37+
kind:
38+
description: 'Kind is a string value representing the REST resource this
39+
object represents. Servers may infer this from the endpoint the client
40+
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
41+
type: string
42+
metadata:
43+
type: object
44+
spec:
45+
description: NotebookInstanceSpec defines the desired state of NotebookInstance.
46+
properties:
47+
acceleratorTypes:
48+
description: A list of Elastic Inference (EI) instance types to associate
49+
with this notebook instance. Currently, only one instance type can
50+
be associated with a notebook instance. For more information, see
51+
Using Elastic Inference in Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html).
52+
items:
53+
type: string
54+
type: array
55+
additionalCodeRepositories:
56+
description: An array of up to three Git repositories to associate
57+
with the notebook instance. These can be either the names of Git
58+
repositories stored as resources in your account, or the URL of
59+
Git repositories in AWS CodeCommit (https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html)
60+
or in any other Git repository. These repositories are cloned at
61+
the same level as the default repository of your notebook instance.
62+
For more information, see Associating Git Repositories with Amazon
63+
SageMaker Notebook Instances (https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-git-repo.html).
64+
items:
65+
type: string
66+
type: array
67+
defaultCodeRepository:
68+
description: A Git repository to associate with the notebook instance
69+
as its default code repository. This can be either the name of a
70+
Git repository stored as a resource in your account, or the URL
71+
of a Git repository in AWS CodeCommit (https://docs.aws.amazon.com/codecommit/latest/userguide/welcome.html)
72+
or in any other Git repository. When you open a notebook instance,
73+
it opens in the directory that contains this repository. For more
74+
information, see Associating Git Repositories with Amazon SageMaker
75+
Notebook Instances (https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-git-repo.html).
76+
type: string
77+
directInternetAccess:
78+
description: "Sets whether Amazon SageMaker provides internet access
79+
to the notebook instance. If you set this to Disabled this notebook
80+
instance will be able to access resources only in your VPC, and
81+
will not be able to connect to Amazon SageMaker training and endpoint
82+
services unless your configure a NAT Gateway in your VPC. \n For
83+
more information, see Notebook Instances Are Internet-Enabled by
84+
Default (https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-additional-considerations.html#appendix-notebook-and-internet-access).
85+
You can set the value of this parameter to Disabled only if you
86+
set a value for the SubnetId parameter."
87+
type: string
88+
instanceType:
89+
description: The type of ML compute instance to launch for the notebook
90+
instance.
91+
type: string
92+
kmsKeyID:
93+
description: The Amazon Resource Name (ARN) of a AWS Key Management
94+
Service key that Amazon SageMaker uses to encrypt data on the storage
95+
volume attached to your notebook instance. The KMS key you provide
96+
must be enabled. For information, see Enabling and Disabling Keys
97+
(https://docs.aws.amazon.com/kms/latest/developerguide/enabling-keys.html)
98+
in the AWS Key Management Service Developer Guide.
99+
type: string
100+
lifecycleConfigName:
101+
description: 'The name of a lifecycle configuration to associate with
102+
the notebook instance. For information about lifestyle configurations,
103+
see Step 2.1: (Optional) Customize a Notebook Instance (https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html).'
104+
type: string
105+
notebookInstanceName:
106+
description: The name of the new notebook instance.
107+
type: string
108+
roleARN:
109+
description: "When you send any requests to AWS resources from the
110+
notebook instance, Amazon SageMaker assumes this role to perform
111+
tasks on your behalf. You must grant this role necessary permissions
112+
so Amazon SageMaker can perform these tasks. The policy must allow
113+
the Amazon SageMaker service principal (sagemaker.amazonaws.com)
114+
permissions to assume this role. For more information, see Amazon
115+
SageMaker Roles (https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).
116+
\n To be able to pass this role to Amazon SageMaker, the caller
117+
of this API must have the iam:PassRole permission."
118+
type: string
119+
rootAccess:
120+
description: "Whether root access is enabled or disabled for users
121+
of the notebook instance. The default value is Enabled. \n Lifecycle
122+
configurations need root access to be able to set up a notebook
123+
instance. Because of this, lifecycle configurations associated with
124+
a notebook instance always run with root access even if you disable
125+
root access for users."
126+
type: string
127+
securityGroupIDs:
128+
description: The VPC security group IDs, in the form sg-xxxxxxxx.
129+
The security groups must be for the same VPC as specified in the
130+
subnet.
131+
items:
132+
type: string
133+
type: array
134+
subnetID:
135+
description: The ID of the subnet in a VPC to which you would like
136+
to have a connectivity from your ML compute instance.
137+
type: string
138+
tags:
139+
description: An array of key-value pairs. You can use tags to categorize
140+
your AWS resources in different ways, for example, by purpose, owner,
141+
or environment. For more information, see Tagging AWS Resources
142+
(https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html).
143+
items:
144+
description: Describes a tag.
145+
properties:
146+
key:
147+
type: string
148+
value:
149+
type: string
150+
type: object
151+
type: array
152+
volumeSizeInGB:
153+
description: The size, in GB, of the ML storage volume to attach to
154+
the notebook instance. The default value is 5 GB.
155+
format: int64
156+
type: integer
157+
required:
158+
- instanceType
159+
- notebookInstanceName
160+
- roleARN
161+
type: object
162+
status:
163+
description: NotebookInstanceStatus defines the observed state of NotebookInstance
164+
properties:
165+
ackResourceMetadata:
166+
description: All CRs managed by ACK have a common `Status.ACKResourceMetadata`
167+
member that is used to contain resource sync state, account ownership,
168+
constructed ARN for the resource
169+
properties:
170+
arn:
171+
description: 'ARN is the Amazon Resource Name for the resource.
172+
This is a globally-unique identifier and is set only by the
173+
ACK service controller once the controller has orchestrated
174+
the creation of the resource OR when it has verified that an
175+
"adopted" resource (a resource where the ARN annotation was
176+
set by the Kubernetes user on the CR) exists and matches the
177+
supplied CR''s Spec field values. TODO(vijat@): Find a better
178+
strategy for resources that do not have ARN in CreateOutputResponse
179+
https://github.com/aws/aws-controllers-k8s/issues/270'
180+
type: string
181+
ownerAccountID:
182+
description: OwnerAccountID is the AWS Account ID of the account
183+
that owns the backend AWS service API resource.
184+
type: string
185+
required:
186+
- ownerAccountID
187+
type: object
188+
conditions:
189+
description: All CRS managed by ACK have a common `Status.Conditions`
190+
member that contains a collection of `ackv1alpha1.Condition` objects
191+
that describe the various terminal states of the CR and its backend
192+
AWS service API resource
193+
items:
194+
description: Condition is the common struct used by all CRDs managed
195+
by ACK service controllers to indicate terminal states of the
196+
CR and its backend AWS service API resource
197+
properties:
198+
lastTransitionTime:
199+
description: Last time the condition transitioned from one status
200+
to another.
201+
format: date-time
202+
type: string
203+
message:
204+
description: A human readable message indicating details about
205+
the transition.
206+
type: string
207+
reason:
208+
description: The reason for the condition's last transition.
209+
type: string
210+
status:
211+
description: Status of the condition, one of True, False, Unknown.
212+
type: string
213+
type:
214+
description: Type is the type of the Condition
215+
type: string
216+
required:
217+
- status
218+
- type
219+
type: object
220+
type: array
221+
failureReason:
222+
description: If status is Failed, the reason it failed.
223+
type: string
224+
notebookInstanceStatus:
225+
description: The status of the notebook instance.
226+
type: string
227+
stoppedByControllerMetadata:
228+
description: The URL that you use to connect to the Jupyter notebook
229+
that is running in your notebook instance.
230+
type: string
231+
url:
232+
description: The URL that you use to connect to the Jupyter notebook
233+
that is running in your notebook instance.
234+
type: string
235+
required:
236+
- ackResourceMetadata
237+
- conditions
238+
type: object
239+
type: object
240+
served: true
241+
storage: true
242+
subresources:
243+
status: {}
244+
status:
245+
acceptedNames:
246+
kind: ""
247+
plural: ""
248+
conditions: []
249+
storedVersions: []

helm/templates/cluster-role-controller.yaml

+20
Original file line numberDiff line numberDiff line change
@@ -262,6 +262,26 @@ rules:
262262
- get
263263
- patch
264264
- update
265+
- apiGroups:
266+
- sagemaker.services.k8s.aws
267+
resources:
268+
- notebookinstances
269+
verbs:
270+
- create
271+
- delete
272+
- get
273+
- list
274+
- patch
275+
- update
276+
- watch
277+
- apiGroups:
278+
- sagemaker.services.k8s.aws
279+
resources:
280+
- notebookinstances/status
281+
verbs:
282+
- get
283+
- patch
284+
- update
265285
- apiGroups:
266286
- sagemaker.services.k8s.aws
267287
resources:

helm/templates/role-reader.yaml

+1
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ rules:
2121
- modelpackagegroups
2222
- modelqualityjobdefinitions
2323
- monitoringschedules
24+
- notebookinstances
2425
- processingjobs
2526
- trainingjobs
2627
- transformjobs

0 commit comments

Comments
 (0)