Dataset namespaces #1081

Open
6 tasks
dmpetrov opened this issue May 2, 2025 · 6 comments · May be fixed by #1115

@dmpetrov
Member

dmpetrov commented May 2, 2025

Description

Right now, local and global/Studio datasets use the same names, which causes confusion:

  1. dc.read_dataset("mycats") is unclear - it depends on the local state, which may be outdated or conflicting.
  2. The API is cluttered with studio=True/False flags

UPDATE:

The idea is to give each dataset a fully qualified name consisting of a namespace, a project, and a dataset name joined with dots, so the schema is <namespace>.<project>.<dataset_name>, e.g. dev.my_project.my_ds.
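
A minimal sketch of how such a name could be split into its parts, assuming exactly three dot-separated components, none of which contains a dot itself. `DatasetName` and `parse_full_name` are hypothetical helpers for illustration, not part of the DataChain API.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    """Hypothetical container for the three parts of a fully qualified name."""

    namespace: str
    project: str
    name: str


def parse_full_name(full_name: str) -> DatasetName:
    """Split '<namespace>.<project>.<dataset_name>' into its three parts."""
    parts = full_name.split(".")
    # Assumption: exactly three non-empty parts; anything else is rejected.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected <namespace>.<project>.<dataset_name>, got {full_name!r}"
        )
    return DatasetName(*parts)


parsed = parse_full_name("dev.my_project.my_ds")
print(parsed.namespace, parsed.project, parsed.name)
```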

Phase 1

  • User can create namespace and project with new API, e.g. dc.namespaces.create("dev") and dc.projects.create("chatbot")
  • User can remove namespace and project with new API, e.g. dc.namespaces.delete("dev") and dc.projects.delete("chatbot")
  • User should be able to save dataset into created namespace / project in 2 ways:
    - dc.use("dev", "chatbot").from_storage(...).save("text_train_ds")
    - dc.from_storage(...).save("dev.chatbot.text_train_ds")

Questions:

  1. Should we add namespace and project in Settings instead of introducing
    new method DataChain.use(...)?
    A: use settings for now
  2. Is there a default namespace and project? Probably yes, so how should we call them?
    A:
    * local.local -> local
    * users.<user_name> -> Studio
    A user cannot explicitly create a new namespace in the local environment. If they pull a dataset from Studio, the namespace / project for that dataset is created implicitly, because the dataset name must stay the same
    Can user delete default namespace? - probably not
  3. If user can delete namespace / project, what happens with datasets that were in them - are they moved to some default namespace / project? If there is no default then we need to remove them?
    A: User is not allowed to delete namespace if datasets are inside of it
  4. Is the user allowed to create a dataset without a fully qualified name (or without using .use()), and if yes, does it put the dataset into the default namespace / project? e.g. dc.from_storage(...).save("my-ds"). Similarly, if the user doesn't specify namespace / project on read, do we try to find the dataset in the default namespace or throw an error, e.g. dc.read_dataset("my-ds")?
    A: yes, default namespace is used
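
The answers above can be sketched as a small in-memory model: short names fall back to the default `local.local` namespace / project, and a namespace that still contains datasets cannot be deleted. This is only an illustration under stated assumptions. `Registry` and its methods are hypothetical and are not the DataChain API; two-part names are rejected because the thread does not define them, and `save` creates missing namespaces on the fly purely to keep the sketch short.

```python
DEFAULT_NAMESPACE = "local"
DEFAULT_PROJECT = "local"


class Registry:
    """Hypothetical in-memory stand-in for a namespace / project catalog."""

    def __init__(self) -> None:
        # namespace -> project -> set of dataset names
        self._tree: dict[str, dict[str, set[str]]] = {
            DEFAULT_NAMESPACE: {DEFAULT_PROJECT: set()}
        }

    def resolve(self, name: str) -> tuple[str, str, str]:
        """Fill in the default namespace / project for a short name."""
        parts = name.split(".")
        if not all(parts):
            raise ValueError(f"empty component in dataset name {name!r}")
        if len(parts) == 1:
            return (DEFAULT_NAMESPACE, DEFAULT_PROJECT, parts[0])
        if len(parts) == 3:
            return (parts[0], parts[1], parts[2])
        raise ValueError(f"unsupported dataset name {name!r}")

    def create_namespace(self, namespace: str) -> None:
        self._tree.setdefault(namespace, {})

    def save(self, name: str) -> str:
        """Record a dataset and return its fully qualified name."""
        namespace, project, dataset = self.resolve(name)
        # Simplification: missing namespaces / projects are created implicitly.
        self._tree.setdefault(namespace, {}).setdefault(project, set()).add(dataset)
        return f"{namespace}.{project}.{dataset}"

    def delete_namespace(self, namespace: str) -> None:
        # Assumption from answer 2: the default namespace is not deletable.
        if namespace == DEFAULT_NAMESPACE:
            raise ValueError("the default namespace cannot be deleted")
        # Answer 3: refuse deletion while any project still holds datasets.
        if any(self._tree.get(namespace, {}).values()):
            raise ValueError(f"namespace {namespace!r} still contains datasets")
        self._tree.pop(namespace, None)


registry = Registry()
print(registry.save("my-ds"))
print(registry.save("dev.chatbot.text_train_ds"))
```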

Follow up

  • Add ability to move dataset from one namespace / project to another
  • Add ability to rename namespace / project?
  • Studio & local datasets refactoring (bigger project)

Questions of follow up:

  1. How should we distinguish Studio and local datasets?
    A: local is a reserved keyword; any name that does not use it is treated as a Studio dataset, e.g. dev.my_project.my_ds -> Studio dataset, local.local.my_ds -> local dataset.
    dc.read_dataset("dev.my_project.my_ds").save("dev.my_project.my_ds") (the user can also choose a different name)
  2. Should reading a dataset from Studio automatically cache (save) that dataset locally with the same name / version or not? Should we have an additional flag, e.g. dc.read_dataset(..., studio_cache=True), for this? What if a dataset with the same name / version already exists locally but holds different data (different UUID)?
    A: we should automatically cache, no additional flag is needed. If the same dataset exists locally then throw exception?
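
A rough sketch of the two follow-up answers, assuming fully qualified names as input: anything outside the reserved `local` namespace counts as a Studio dataset, and caching a pulled dataset fails when the same name / version already exists locally with a different UUID. The helper names, the `DatasetConflictError` type, and the dictionary-based cache are all hypothetical; the thread itself leaves the exception behaviour as an open question.

```python
LOCAL_NAMESPACE = "local"  # reserved keyword from answer 1


class DatasetConflictError(Exception):
    """Hypothetical error for a name / version clash with different data."""


def is_studio_dataset(full_name: str) -> bool:
    """Return True unless the name lives in the reserved local namespace.

    Expects a fully qualified '<namespace>.<project>.<dataset_name>' string.
    """
    namespace = full_name.split(".", 1)[0]
    return namespace != LOCAL_NAMESPACE


def cache_pulled_dataset(
    cache: dict[tuple[str, str], str],
    full_name: str,
    version: str,
    uuid: str,
) -> None:
    """Record a dataset pulled from Studio under the same name and version."""
    key = (full_name, version)
    existing = cache.get(key)
    # Same name / version but a different UUID means different data.
    if existing is not None and existing != uuid:
        raise DatasetConflictError(
            f"{full_name}@{version} already exists locally with UUID {existing}"
        )
    cache[key] = uuid


local_cache: dict[tuple[str, str], str] = {}
cache_pulled_dataset(local_cache, "dev.my_project.my_ds", "1.0.0", "uuid-a")
print(is_studio_dataset("dev.my_project.my_ds"))
```

Re-caching the same UUID is treated as a no-op here, which matches the idea that a pull is just a cache of the Studio dataset.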
@dmpetrov dmpetrov added the enhancement New feature or request label May 2, 2025
@dmpetrov
Member Author

dmpetrov commented May 2, 2025

Just talked with @shcheklein - we had an idea to improve this:

Let’s use / as a prefix for global (Studio) datasets, so we can keep @ for version naming like <dataset_name>@<version>, which will be important with upcoming SemVer support (#1076).

So:

  • Global dataset - /mycat
  • Local dataset - mycat

PS1: Keep in mind that code should be reusable in both the CLI and Studio. This naming convention seems to satisfy that requirement. This code should work in both:

ds = dc.read_dataset("/mycats")
ds1 = ds.filter(dc.C("color") == "Red").save("red-cats")  # <-- Local dataset
ds2 = ds1.map(....).save("/my_red_cats_with_bmi_index")

PS2: Versioning is outside the scope of this issue.
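
The prefix convention proposed in this comment could be sketched as a small parser: a leading `/` marks a global (Studio) dataset and an optional `@` suffix carries the version. `DatasetRef` and `parse_ref` are hypothetical names, and the version string in the usage line is made up for illustration; versioning itself is out of scope for this issue.

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    """Hypothetical parsed form of a dataset reference."""

    name: str
    version: Optional[str]
    is_global: bool


def parse_ref(ref: str) -> DatasetRef:
    """Parse '[/]<name>[@<version>]' into its parts."""
    is_global = ref.startswith("/")
    body = ref[1:] if is_global else ref
    name, sep, version = body.partition("@")
    if not name:
        raise ValueError(f"missing dataset name in {ref!r}")
    if sep and not version:
        raise ValueError(f"missing version after '@' in {ref!r}")
    return DatasetRef(name, version if sep else None, is_global)


print(parse_ref("/mycats"))
print(parse_ref("red-cats"))
print(parse_ref("/mycats@1.2.3"))  # illustrative version, not from the thread
```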

@ilongin
Contributor

ilongin commented May 4, 2025

Another idea: maybe use studio/mycats instead of just /mycats ? ... this is more verbose but more clear and similar to git branches naming convention where we have origin/mycats. Having only / as prefix seems like some kind of relative vs absolute path thing to me...

Also, ds = dc.read_dataset("/mycats") run in Studio is basically the same as ds = dc.read_dataset("mycats"), since in Studio the local dataset is the same as the Studio dataset, right?

BTW I would maybe avoid using Global terminology and use only Studio to avoid confusion and having multiple words for the same thing, WDYT?

@dmpetrov
Member Author

dmpetrov commented May 5, 2025

@ilongin that's a good idea, but we need to keep in mind that we will need to introduce org/team in the future, like myorg/<dataset_name>@<version>, and I'm thinking about an empty org / as the default for the user's team. If we introduce a studio org it won't look good, since it reads as a specific org name.

@shcheklein shcheklein changed the title Dataset namespaces for CLI Dataset namespaces May 8, 2025
@ilongin ilongin self-assigned this May 13, 2025
@ilongin
Contributor

ilongin commented May 13, 2025

> @ilongin that's a good idea, but we need to keep in mind that we will need to introduce org/team in the future, like myorg/<dataset_name>@<version>, and I'm thinking about an empty org / as the default for the user's team. If we introduce a studio org it won't look good, since it reads as a specific org name.

@dmpetrov just to note, default user team will be the one written in config file (by running datachain auth team <team_name>) or added by env variable DVC_STUDIO_TEAM.
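
A minimal sketch of how an empty org prefix could resolve to the default team. `resolve_team` is a hypothetical helper; the precedence shown (the `DVC_STUDIO_TEAM` environment variable wins over the config file value) is an assumption, since the comment above only says the team comes from one or the other.

```python
import os
from typing import Mapping, Optional


def resolve_team(
    org: str,
    config_team: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit org, or fall back to the user's default team."""
    if org:
        return org
    # Assumption: the environment variable overrides the config file value.
    team = env.get("DVC_STUDIO_TEAM") or config_team
    if not team:
        raise ValueError("no default team configured")
    return team


# "/mycats" has an empty org before the slash; "myorg/mycats" names one.
org, _, name = "/mycats".partition("/")
print(resolve_team(org, config_team="my-team", env={}), name)
```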

@ilongin ilongin linked a pull request May 21, 2025 that will close this issue
5 tasks
@ilongin
Contributor

ilongin commented May 28, 2025

@shcheklein @dmpetrov Question about datachain pull: currently we can set an optional local dataset name / version to which the Studio dataset will be pulled. I'm wondering if we should remove this and put into the "contract" that everything pulled from Studio must keep the same fully qualified name locally. datachain pull is basically just a cache of the Studio dataset anyway, and I don't see any reason for users to have it under a different name locally. The option to rename it just complicates things, especially now that we will have namespaces / projects.

@dmpetrov
Member Author

Sure, let's keep it simple.

If a user needs a special local name, they can read the dataset and save it under a local name.
