Dataset namespaces #1081
Comments
Just talked with @shcheklein - we had an idea to improve this: Let’s use So:
PS1: To keep in mind: code should be reusable in the CLI and Studio. This naming convention seems to satisfy this requirement. This code should work in both CLI and Studio:
ds = dc.read_dataset("/mycats")
ds1 = ds.filter(dc.C("color") == "Red").save("red-cats") # <-- Local dataset
ds2 = ds1.map(....).save("/my_red_cats_with_bmi_index")
PS2: Versioning is outside the scope of this issue.
Another idea: maybe use Also, BTW I would maybe avoid the term "Global" and use only "Studio", to avoid confusion from having multiple words for the same thing, WDYT?
@ilongin that's a good idea, but we need to keep in mind that we will need to introduce org/team in the future like
@dmpetrov just to note, the default user team will be the one written in the config file (by running
@shcheklein @dmpetrov Question about
Sure, let's keep it simple. If users need a special local name, they can read the dataset and save it under a local name.
Description
Right now, local and global/Studio datasets use the same names, which causes confusion:
dc.read_dataset("mycats")
is unclear - it depends on the local state, which may be outdated or conflicting.
UPDATE:
The idea is to give each dataset a fully qualified name consisting of namespace, project and dataset name, connected with ".", so the schema would be <namespace>.<project>.<dataset_name>, e.g. dev.my_project.my_ds.
Phase 1
- dc.namespaces.create("dev") and dc.projects.create("chatbot")
- dc.namespace.delete("dev") and dc.projects.delete("dev")
- dc.use("dev", "chatbot").from_storage(...).save("text_train_ds")
- dc.from_storage(...).save("dev.chatbot.text_train_ds")
Questions:
namespace and project in Settings instead of introducing a new method DataChain.use(...)?
A: use settings for now
A:
* local.local -> local
* users.<user_name> -> Studio
A user cannot create a new namespace explicitly in the local env. If they pull a dataset from Studio, it will implicitly create the namespace / project for that dataset, as the dataset name must stay the same.
Can a user delete the default namespace? - probably not
A: A user is not allowed to delete a namespace if there are datasets inside it
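The deletion rule in the answer above can be illustrated with a toy in-memory registry. This is only a sketch under stated assumptions, not DataChain's implementation: the `Registry` class, its method names, `NamespaceNotEmptyError`, and the `PROTECTED` set are all hypothetical, and protecting the default namespace reflects the tentative "probably not" above rather than a settled decision.

```python
from typing import Dict, Set

# Hypothetical: treat "local" as the default namespace that cannot be deleted.
PROTECTED: Set[str] = {"local"}


class NamespaceNotEmptyError(Exception):
    """Raised when deleting a namespace that still holds datasets."""


class Registry:
    """Toy in-memory stand-in for namespace bookkeeping (illustration only)."""

    def __init__(self) -> None:
        # Maps namespace -> set of "<project>.<dataset>" strings.
        self._datasets: Dict[str, Set[str]] = {}

    def create_namespace(self, name: str) -> None:
        self._datasets.setdefault(name, set())

    def add_dataset(self, namespace: str, project: str, dataset: str) -> None:
        self._datasets.setdefault(namespace, set()).add(f"{project}.{dataset}")

    def delete_namespace(self, name: str) -> None:
        if name in PROTECTED:
            raise PermissionError(f"namespace {name!r} is protected")
        if name not in self._datasets:
            raise KeyError(name)
        if self._datasets[name]:
            # Rule from the answer: refuse while datasets are inside.
            raise NamespaceNotEmptyError(f"namespace {name!r} still has datasets")
        del self._datasets[name]


registry = Registry()
registry.create_namespace("dev")
registry.add_dataset("dev", "chatbot", "text_train_ds")
try:
    registry.delete_namespace("dev")
except NamespaceNotEmptyError as exc:
    print("refused:", exc)
```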
.use()) and if yes, does it put the dataset into the default namespace / project? e.g. dc.from_storage(...).save("my-ds"). Similarly, if the user doesn't specify namespace / project on read, do we try to find the dataset in the default namespace or throw an error, e.g. dc.read_dataset("my-ds")?
A: yes, default namespace is used
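The default-namespace rule can be sketched as a small name resolver. This is an illustration in plain Python, not DataChain's implementation: `resolve_name`, `DatasetName`, and the two default constants are hypothetical names, and using local.local as the default pair is an assumption based on the local.local -> local mapping mentioned earlier in this issue.

```python
from typing import NamedTuple

# Assumed defaults for a local environment (hypothetical constants).
DEFAULT_NAMESPACE = "local"
DEFAULT_PROJECT = "local"


class DatasetName(NamedTuple):
    namespace: str
    project: str
    name: str


def resolve_name(
    raw: str,
    namespace: str = DEFAULT_NAMESPACE,
    project: str = DEFAULT_PROJECT,
) -> DatasetName:
    """Expand a short or full name into <namespace>.<project>.<dataset_name>."""
    parts = raw.split(".")
    if any(not part for part in parts):
        raise ValueError(f"empty component in dataset name: {raw!r}")
    if len(parts) == 1:
        # Bare name: fall back to the default namespace and project.
        return DatasetName(namespace, project, parts[0])
    if len(parts) == 3:
        return DatasetName(*parts)
    raise ValueError(
        f"expected <name> or <namespace>.<project>.<name>, got {raw!r}"
    )


print(resolve_name("my-ds"))
print(resolve_name("dev.chatbot.text_train_ds"))
```

Rejecting two-part names is a design choice made here to keep the sketch unambiguous; the issue does not say whether a <project>.<dataset_name> form should be accepted.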
Follow up
Questions of follow up:
A: "local" is a reserved keyword; anything that is not "local" will be seen as a Studio dataset, e.g. dev.my_project.my_ds -> Studio dataset, local.local.my_ds -> local dataset.
dc.read_dataset("dev.my_project.my_ds").save("dev.my_project.my_ds") (it can also choose a different name)
dc.read_dataset(..., studio_cache=True) for this? What if there is already a dataset with the same name / version locally but with different data (different UUID)?
A: we should cache automatically; no additional flag is needed. If the same dataset exists locally, then throw an exception?
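The two follow-up answers (routing by the reserved "local" namespace, and automatic caching with a conflict check) could look roughly like this. It is a sketch under assumptions, not DataChain's behaviour: `is_studio_dataset`, `cache_studio_dataset`, `DatasetConflictError`, and the plain-dict cache index are hypothetical, and raising on a UUID mismatch is only the option floated in the answer, not a confirmed decision.

```python
from typing import Dict, Tuple

RESERVED_LOCAL_NAMESPACE = "local"


class DatasetConflictError(Exception):
    """Same name and version exist locally but with a different UUID."""


def is_studio_dataset(fq_name: str) -> bool:
    """Return True unless the first name component is the reserved 'local'."""
    namespace, sep, rest = fq_name.partition(".")
    if not sep or not rest:
        raise ValueError(f"expected a fully qualified name, got {fq_name!r}")
    return namespace != RESERVED_LOCAL_NAMESPACE


def cache_studio_dataset(
    local_index: Dict[Tuple[str, str], str],
    fq_name: str,
    version: str,
    uuid: str,
) -> bool:
    """Record a Studio dataset locally; return True if it was newly cached."""
    key = (fq_name, version)
    existing = local_index.get(key)
    if existing is None:
        local_index[key] = uuid
        return True
    if existing != uuid:
        # Same name / version but different data: refuse to overwrite.
        raise DatasetConflictError(f"{fq_name}@{version} differs from local copy")
    return False  # Already cached with identical contents.


index: Dict[Tuple[str, str], str] = {}
print(is_studio_dataset("dev.my_project.my_ds"))
print(is_studio_dataset("local.local.my_ds"))
print(cache_studio_dataset(index, "dev.my_project.my_ds", "1.0.0", "uuid-a"))
```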