-
Notifications
You must be signed in to change notification settings - Fork 47
Feature/multi model artifact handler #869
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/multi model artifact handler #869
Conversation
Signed-off-by: Yash Pandit <[email protected]>
Signed-off-by: Yash Pandit <[email protected]>
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application. When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated. If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public. |
….com/YashPandit4u/accelerated-data-science into feature/multi-model-artifact-handler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a few comments, I think we should be reusing most of the implementation from DataScienceModel class instead of recreating it in here. A high level design would look like:
class DataScienceModelCollection(DataScienceModel):
def __init__():
super().__init__(**kwargs)
# no need to add model self.modeldescriptionjson since this attribute is already present in DataScienceModel
def add():
...
def remove():
...
def show():
...
The usage would look like:
>>> model = (DataScienceModelCollection()
... .with_compartment_id()
... .with_project_id()
... .with_display_name()
... .with_description()
... .with_freeform_tags()
... .with_artifact("model1_path", "model2_path")
>>> model.add(...)
>>> model.remove(...)
>>> model.create(model_by_reference=True)
On second thought, one other option is to have add(), remove() within DataScienceModel itself, then we don't need the collection class. The add/remove functions are only manipulating the model_file_description property that represents multiple models, so having a separate subclass may not be required.
any thoughts on this?
ads/model/model_description.py
Outdated
# from ads.common import logger | ||
import logging | ||
|
||
logger = logging.getLogger("ads.model_description") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we reuse the logger from ads.common?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I used logger from ads.common I don't see any info messages and for some methods in this class I want to show info messages.
ads/model/model_description.py
Outdated
# Remove if the model already exists | ||
self.remove(namespace, bucket, prefix) | ||
|
||
def checkIfFileExists(fileName): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we reuse is_path_exists from ads.common.utils
instead?
ads/model/model_description.py
Outdated
return isExists | ||
|
||
# Function to un-paginate the api call with while loop | ||
def listObjectVersionsUnpaginated(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ads.common.object_storage_details.list_object_versions
can be used instead?
ads/model/model_description.py
Outdated
logger.error(f"An unexpected error occurred: {e}") | ||
raise e | ||
|
||
def add(self, namespace, bucket, prefix=None, files=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend using type hinting for all the functions, for example:
def add(self, namespace: str, bucket: str, ... ): -> None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added type hints.
ads/model/model_description.py
Outdated
display_name = ( | ||
"Created by MMS SDK on " | ||
+ datetime.datetime.now(pytz.utc).strftime("%Y-%m-%d %H:%M:%S %Z") | ||
if (display_name == None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use if display_name:
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggested statement will trigger even when we have blank string.
ads/model/model_description.py
Outdated
logger.error(f"An unexpected error occurred: {e}") | ||
raise e | ||
|
||
def add(self, namespace, bucket, prefix=None, files=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have misunderstood what we're trying to achieve in this PR, but wouldn't a user experience like the following be more streamlined?
model_description = ModelDescription().add("oci://bucket_name@namespace/prefix1", "oci://bucket_name@namespace/person_prefix2")
Here, ModelDescription
acts as a builder for the JSON configuration needed.
It still can contain all the introduced methods in this PR: add, remove, build, print...
I would prefer instead of .build to use
.to_json(uri="path/to/the/json")
where if user don't provide uri then JSON should be returned as a result of the function. Also adding .to_dict()
might be useful as well.
In this case we don't need .show()
method.
We could then enhance the DatascienceModel
class with:
DatascienceModel().with_model_description(model_description).create()
This approach leverages the existing DatascienceModel
class, which already supports CRUD operations for models, so we wouldn't need to replicate these in the ModelDescription
class. The purpose of ModelDescription
would be to simplify the preparation of the correct JSON configuration for users.
Perhaps we could consider a clearer name for clarity, as "ModelDescription" might be somewhat confusing. How about using "ModelArtifactConfigBuilder" to more accurately reflect its purpose as a tool for constructing model artifact configurations?
Then user experience would be
model_artifact_config = ModelArtifactConfigBuilder().add().add()
DatascienceModel().with_model_artifact(model_artifact_config).create()
Hi @VipulMascarenhas , I have now extended from the class already present in ads and the usage of the utility looks like this now:
Please take a look when free and give your comments. Thanks and Regards |
ads/model/model_collection.py
Outdated
from typing import List, Optional | ||
import logging | ||
|
||
logger = logging.getLogger("ads.model_description") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this be better - logger = logging.getLogger(__name__)
?
ads/model/model_collection.py
Outdated
# Remove if the model already exists | ||
self.remove(namespace, bucket, prefix) | ||
|
||
def check_if_file_exists(fileName): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please double check if the existing utils methods are already implemented such functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YashPandit4u suggested a few changes, just needs to be re-organized to make it consistent with ads usage.
ads/model/model_collection.py
Outdated
raise e | ||
return self | ||
|
||
def add( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's rename this to add_artifact(uri:str, files=Optional[List[str]]) which takes in uri and files list as input to be consistent with how paths are referred to within ads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user won't have uri handy, hence we have parameters like namespace, and bucket to make it easy.
Will it be possible to keep these parameters? The function signature would look like:
def add_artifact( self, namespace: str, bucket: str, prefix: Optional[str] = None, files: Optional[List[str]] = None, )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't the uri same as f"oci://{bucketName}@{namespace}/{prefix}"
, all of which are added as input to this function. My concern is that this input makes it incosistent with all other DataScienceModel
or GenericModel
operations in ads and we wouldn't be able to change it later on.
@mrDzurb what do you think?
ads/model/model_collection.py
Outdated
"objects": objects, | ||
} | ||
) | ||
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, tmp_model_file_description) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we modify _prepare_file_description_artifact
with optional params for files list and content json? This replaces the above functions check_if_file_exists
and list_obj_versions_unpaginated
altogether.
If content is passed, then no need to create the dict within that function.
for example:
if not content:
content = dict()
content["version"] = MODEL_BY_REFERENCE_VERSION
...
In this case _prepare_file_description_artifact
would take in inputs like def _prepare_file_description_artifact(bucket_uri: list, content: Optional[Dict] = None, files: Optional[List[List[str]]] = None)
.
The inputs for files isn't ideal, it might be good to implement a dataclass for this and pass a List[ModelFileDescriptionArtifact]
instead of bucket_uri list.
@dataclass
class ModelFileDescriptionArtifact:
bucket_uri: str
files: List[str] = None
This isn't urgent though, we can update this later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this suggestion.
Will it be okay if we do this in the next PR due to customer deadlines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, let's mark a todo somewhere to make sure of this.
ads/model/model_collection.py
Outdated
) | ||
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, tmp_model_file_description) | ||
|
||
def remove(self, namespace: str, bucket: str, prefix: Optional[str] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as add_artifact - rename this to remove_artifact(uri:str) which takes in uri to be consistent with how paths are referred to within ads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes done.
ads/model/model_collection.py
Outdated
# model found case | ||
self.model_file_description["models"].pop(modelSearchIdx) | ||
|
||
def create(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't needed since we can just the existing create
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes removed this.
ads/model/model_collection.py
Outdated
logger.info("Model Artifact stored successfully.") | ||
return os.path.abspath(file_path) | ||
|
||
def show(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is covered by model.model_file_description
property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes removing this.
@YashPandit4u thanks for making all the changes here, looks good overall. One concern I have is around the path inputs, it would be good to have the oci path like I mentioned in the comments above as user can construct the path easily with the params (bucket, namespace and prefix). Also, could you please add a couple unit tests in cc: @mrDzurb |
ads/model/datascience_model.py
Outdated
|
||
def add_artifact( | ||
self, | ||
namespace: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency it would be better to accept OCIFS "oci://" path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have changed both functions add_artifact and remove_artifact with this uri type input.
ads/model/datascience_model.py
Outdated
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, self.empty_json) | ||
|
||
# Get object storage client | ||
authData = default_signer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically the auth object is already constructed and accessible by
self.dsc_model.auth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, using this now.
….com/YashPandit4u/accelerated-data-science into feature/multi-model-artifact-handler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YashPandit4u thanks for the update. There are a few improvements to make for the next update:
- check_if_file_exists and list_obj_versions_unpaginated should reuse existing methods.
- _extract_oci_uri_components wasn't required, as mentioned in previous comments we could just use
ObjectStorageDetails.from_path(uri)
to get bucket name, namespace and prefix. If we want to add validation ObjectStorageDetails has a few static methods likeis_valid_uri
to do this.
Hi @VipulMascarenhas , thanks for telling the improvement items.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm 👍
Hi @VipulMascarenhas , |
Hi @VipulMascarenhas , I have now added options for both uri and (namespace, bucket, prefix) in add_artifact, remove_artifact. |
@@ -1466,3 +1467,226 @@ def _download_file_description_artifact(self) -> Tuple[Union[str, List[str]], in | |||
bucket_uri.append(uri) | |||
|
|||
return bucket_uri[0] if len(bucket_uri) == 1 else bucket_uri, artifact_size | |||
|
|||
def add_artifact( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Could you please add an examples section for the add and delete artifacts feature?
if object_storage_details.filepath == "" | ||
else object_storage_details.filepath | ||
) | ||
if (not namespace) or (not bucket): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: if not (namespace and bucket):
else object_storage_details.filepath | ||
) | ||
if (not namespace) or (not bucket): | ||
raise ValueError("Both 'namespace' and 'bucket' must be provided.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Wouldn't it be better to say something like - Artifacts cannot be added. Both 'namespace' and 'bucket' parameters must be provided.
. otherwise it would be hard for user to understand where they need to provide the namespace and bucket.
- If no objects are found to add to the model description, a ValueError is raised. | ||
""" | ||
|
||
if uri and (namespace or bucket): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can do something like this?
if uri and (namespace or bucket):
raise ValueError("Artifacts cannot be added. Please provide either a 'uri' alone, or 'namespace' and 'bucket' together, but not 'uri' with 'namespace' or 'bucket'.")
elif not (uri or (namespace and bucket)):
raise ValueError("Artifacts cannot be added. You must provide either a 'uri' or both 'namespace' and 'bucket'.")
# Remove if the model already exists | ||
self.remove_artifact(namespace=namespace, bucket=bucket, prefix=prefix) | ||
|
||
def check_if_file_exists(fileName): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: I would not recommend to create nested functions, it would be hard to write unit tests for such methods. I would rather recommend to move such methods into utils or outside the class.
] | ||
|
||
if len(objects) == 0: | ||
error_message = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT:
error_message = (
f"No files found to add in the bucket '{bucket}' within the namespace '{namespace}' "
f"and prefix '{prefix}'. Expected file names: {files}"
)
) | ||
self.set_spec(self.CONST_MODEL_FILE_DESCRIPTION, tmp_model_file_description) | ||
|
||
def remove_artifact( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the examples section.
- If the model description JSON is None. | ||
""" | ||
|
||
if uri and (namespace or bucket): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the comments for the add_artifact method.
changes requested were implemented, pending items will be covered in subsequent update.
Adding a new class to easily handle multi-model artifact creation in ADS.