Using tags to mark results for retention #115

Open
1 of 4 tasks
loj opened this issue Nov 10, 2023 · 2 comments
Labels
support-tracker: Track a support event that occurred elsewhere
via-datalad-channel: report origin is a datalad-specific channel (chat/email/office hour)

Comments

@loj
Contributor

loj commented Nov 10, 2023

Origin: DataLad matrix chat; Nov 2, 2023

OP asks for suggestions on how to handle marking results for retention with tags:

How do you generally proceed in datalad about marking results for retention? My analysis produces results that are larger than the input data. So I would like to find a compromise between keeping intermediate results for some versions but not for all of them. For a single dataset one can solve this using git tag to keep annexed content of a certain commit from being listed by git annex unused.

However, how do you do it for subdatasets? My analysis has another analysis as a submodule and relies on these data as an input to the computations. So for each tag on the superdataset I would need to create a tag in the subdataset and push these tags to their respective remotes on the archive disk.

So far, this could be solved as follows: if I create a tag project-meeting21 in the superdataset, I could automatically create a tag in the subdataset called needed/(datalad-id-superdataset)/project-meeting21. I would also want to delete needed/ tags in the subdataset once the corresponding tag in the superdataset is gone. This can lead to problems if I have the datasets in multiple places and delete tags in only one of them. How do I decide when to delete the needed/ tags, and how do I make sure that a deleted tag is not added back from another instance of the dataset?

Is there any partial or complete solution to this yet or should I make up a solution on my own?
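
The naming scheme from the quoted question could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing DataLad feature: `tag_subdatasets` is a made-up helper name, it is assumed to run in the root of the superdataset, the dataset ID is assumed to be readable from `datalad.dataset.id` in `.datalad/config`, and only direct subdatasets that are installed locally are handled.

```shell
# Hypothetical helper, run in the superdataset root.
# For every subdataset recorded in the tagged commit, create the tag
# needed/<superdataset-id>/<tag> at the subdataset commit that the tag pins.
tag_subdatasets() {
    tag=$1
    # Assumption: the DataLad dataset ID is stored in .datalad/config.
    super_id=$(git config -f .datalad/config datalad.dataset.id) || return 1
    # Entries of type "commit" are gitlinks, i.e. recorded subdataset states.
    git ls-tree -r "$tag" |
    while read -r mode type sha path; do
        [ "$type" = commit ] || continue
        # Skip subdatasets that are not installed in this clone.
        [ -e "$path/.git" ] || continue
        git -C "$path" tag -f "needed/$super_id/$tag" "$sha"
    done
}
```

Called as `tag_subdatasets project-meeting21` after tagging the superdataset, this pins the subdataset state that the superdataset tag refers to, rather than whatever the subdataset currently has checked out. Paths with unusual characters and nested subdatasets would need extra handling.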

TODO (not necessarily to be performed in this order)

  • Inform OP/Add reference to this issue at origin
  • Clarifying Qs asked or not needed
  • Nature of the issue is understood
  • Inform OP about resolution
loj added the support-tracker and via-datalad-channel labels Nov 10, 2023
@loj
Contributor Author

loj commented Nov 10, 2023

I don't have much experience using tags, but I took a brief look at the --version-tag option with datalad save.

--version-tag ID
an additional marker for that state. Every dataset that is touched will receive the tag

@mlell, maybe this could be something to look into for the situation you described.

Also happy for others with more tag experience to weigh in.

@mlell

mlell commented Nov 13, 2023

Thanks for following up on this one... that idea is interesting because it flipped around my initial model of the thing: I always imagined tags to be in the dataset namespace. That would have meant that I need to take care that a tag does not overwrite a similarly-named tag in another dataset, so an auto-naming scheme in subdatasets would have been required that links back to the original tag.

Now you show a simpler way: if I consider tags to be defined across multiple datasets, the problem simplifies, because I can use the same tag name in all datasets and therefore update or delete tags by that name in all of them!
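
With one shared tag name, removing a retention tag could be sketched as a loop over the superdataset and its subdatasets. `delete_tag_everywhere` is a hypothetical helper, not a DataLad command. It is assumed to run in the superdataset root and covers only direct subdatasets installed locally. Git keeps no record of tag deletions, so a clone that still holds the tag can bring it back by pushing its tags later. The optional remote argument below only removes the tag from that one remote.

```shell
# Hypothetical helper, run in the superdataset root.
# Usage: delete_tag_everywhere <tag> [<remote>]
delete_tag_everywhere() {
    tag=$1
    remote=${2:-}
    {
        # The superdataset itself, then each installed direct subdataset.
        echo .
        git ls-tree -r HEAD | while read -r mode type sha path; do
            if [ "$type" = commit ] && [ -e "$path/.git" ]; then
                echo "$path"
            fi
        done
    } | while read -r ds; do
        # Datasets that never had the tag are silently skipped.
        git -C "$ds" tag -d "$tag" >/dev/null 2>&1 || true
        if [ -n "$remote" ]; then
            # Also remove the tag from the named remote, if it exists there.
            git -C "$ds" push -q "$remote" --delete "refs/tags/$tag" || true
        fi
    done
}
```

For example, `delete_tag_everywhere project-meeting21 origin` would drop the tag locally and on `origin` in each dataset; other clones would still need the same treatment.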

So that leaves more specific questions:

  • The save option is called --version-tag. Does it just call git tag recursively (given save -r)? That would mean that with the datalad implementation I cannot add information to a tag as I can with git tag -a, correct? (Which might or might not be desired; maybe I want to save further info on a tag in only one of the datasets?) Also, I think I will need git tag instead of datalad save to tag a previous commit?
  • When initializing a dataset using datalad get, the newly generated annex is not known to the parent dataset, and if datalad push is never called, it stays that way. This means that I can datalad drop --what all the subdataset without needing to declare its annex dead (see #6111, #3887). However, the flow of information is now reversed for retention tags: if I set a tag using save -r -t ... in a read-only clone, I need to push that information to the remote of that read-only clone (the read-write clone of the dataset). If I use datalad push for this, will this cause the subdataset annex to become known to the superdataset? Because if it does, I will not be able to drop it without causing the error "to-be-deleted local annex not declared dead", since drop will not run git annex dead here; git annex push --to origin by itself, right? So instead of datalad push, should a git push <tag_name> be done to propagate the tag from a short-term submodule to a long-term clone of a dataset?
  • git push has an option --follow-tags and a config setting push.followTags. With these, a push also sends tags that are both
    • annotated
    • reachable from the branch being pushed
    This suggests that the git convention is that lightweight (non-annotated) tags are meant for local use and annotated tags are meant to be pushed. Does that mean that tags that denote data retention should also always be annotated?
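
The plain-git behaviour behind these questions can be tried out in a throwaway repository. The sketch below is only an illustration: the repository paths, the `scratch` tag and the tag message are made up, and it says nothing about what datalad save or datalad push do internally. It places an annotated tag on an earlier commit, pushes with `--follow-tags`, and then pushes a lightweight tag by naming it explicitly.

```shell
# Throwaway setup: a bare "archive" remote and a working clone (paths are made up).
work=$(mktemp -d)
git init -q --bare "$work/archive.git"
git init -q "$work/clone"
cd "$work/clone" || exit 1
git config user.name "Example User"
git config user.email "user@example.com"
git remote add origin "$work/archive.git"
git commit -q --allow-empty -m "first result"
git commit -q --allow-empty -m "second result"

# Annotated tag with a message, placed on an earlier commit.
git tag -a project-meeting21 -m "Keep: results shown at project meeting 21" HEAD~1
# Lightweight tag on the current commit.
git tag scratch

# --follow-tags sends the branch plus annotated tags reachable from it.
git push -q --follow-tags origin HEAD
tags_after_follow=$(git ls-remote --tags origin)

# A lightweight tag has to be named explicitly (or sent with --tags).
git push -q origin refs/tags/scratch
tags_after_explicit=$(git ls-remote --tags origin)
```

Comparing `$tags_after_follow` with `$tags_after_explicit` shows that only the annotated tag travels with `--follow-tags`. Whether retention tags should therefore always be annotated remains a convention choice for the project.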
