Using tags to mark results for retention #115

loj · 2023-11-10T11:06:05Z

Origin: DataLad matrix chat; Nov 2, 2023

OP asks for suggestions on how to handle marking results for retention with tags:

How do you generally go about marking results for retention in DataLad? My analysis produces results that are larger than the input data, so I would like to find a compromise: keep intermediate results for some versions but not for all of them. For a single dataset one can solve this using git tag to keep the annexed content of a certain commit from being listed by git annex unused.

However, how do you do it for subdatasets? My analysis has another analysis as a submodule and relies on these data as an input to the computations. So for each tag on the superdataset I would need to create a tag in the subdataset and push these tags to their respective remotes on the archive disk.

So far this can be solved as follows: if I create a tag project-meeting21 in the superdataset, I could automatically create a tag in the subdataset called needed/(datalad-id-superdataset)/project-meeting21. I also want to delete needed/ tags in the subdataset once the corresponding tag in the superdataset is gone. This can lead to problems if I have the datasets in multiple places and delete tags in only one of them. How do I decide when to delete the needed/ tags, and how do I make sure that a deleted tag is not added back from another instance of the dataset?

Is there any partial or complete solution to this yet or should I make up a solution on my own?
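The needed/ naming scheme and the cleanup rule from the question above can be sketched in a few lines. This is only an illustration under assumptions: the function names `sub_tag_name` and `stale_sub_tags` are made up for this sketch, and the tag lists are assumed to have been collected beforehand (for example with `git tag --list` in each dataset).

```python
from typing import Iterable, Set

PREFIX = "needed"  # prefix taken from the question


def sub_tag_name(super_id: str, super_tag: str) -> str:
    """Build the subdataset tag name for a tag set in the superdataset."""
    return f"{PREFIX}/{super_id}/{super_tag}"


def stale_sub_tags(
    super_id: str,
    super_tags: Iterable[str],
    sub_tags: Iterable[str],
) -> Set[str]:
    """Return needed/<super_id>/... tags whose superdataset tag is gone.

    Tags that belong to other superdatasets, or that do not follow the
    needed/ scheme at all, are never reported.
    """
    wanted = {sub_tag_name(super_id, tag) for tag in super_tags}
    own_prefix = f"{PREFIX}/{super_id}/"
    return {
        tag
        for tag in sub_tags
        if tag.startswith(own_prefix) and tag not in wanted
    }
```

Note that this only reflects the view of a single clone. It does not answer the harder part of the question, namely how to keep a deleted tag from being reintroduced by another clone that still has it.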

TODO (not necessarily to be performed in this order)

- Inform OP/Add reference to this issue at origin
- Clarifying Qs asked or not needed
- Nature of the issue is understood
- Inform OP about resolution

loj · 2023-11-10T11:20:09Z

I don't have much experience using tags, but I took a brief look at the --version-tag option with datalad save.

--version-tag ID
an additional marker for that state. Every dataset that is touched will receive the tag

@mlell, maybe this could be something to look into for the situation you described.
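A minimal sketch of how such a call could be scripted, assuming the datalad command-line tool is installed and on the PATH. The helper names `save_with_tag_cmd` and `run_in` are hypothetical; the options used (`--recursive`, `--version-tag`, `--message`) are the long forms of `datalad save -r -t ... -m ...`.

```python
import subprocess
from typing import List, Optional


def save_with_tag_cmd(tag: str, message: Optional[str] = None) -> List[str]:
    """Build a recursive `datalad save` call that also sets a version tag."""
    cmd = ["datalad", "save", "--recursive", "--version-tag", tag]
    if message is not None:
        cmd += ["--message", message]
    return cmd


def run_in(dataset_path: str, cmd: List[str]) -> None:
    """Run the command inside the given dataset; raises if it fails."""
    subprocess.run(cmd, cwd=dataset_path, check=True)
```

For example, `run_in("/path/to/superdataset", save_with_tag_cmd("project-meeting21"))` would run the save from the superdataset root. Going by the help text quoted above, only datasets that are touched receive the tag, so it is worth checking whether an unchanged subdataset ends up tagged.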

Also happy for others with more tag experience to weigh in.

mlell · 2023-11-13T11:05:29Z

Thanks for following up on this one... that idea is interesting because it flipped around my initial model of the thing: I always imagined tags to be in the dataset namespace. That would have meant that I need to take care that a tag does not overwrite a similarly-named tag in another dataset, so an auto-naming scheme in subdatasets would have been required that links back to the original tag.

Now you show a simpler way: if tags are considered to be defined across multiple datasets, the problem simplifies, because I can use the same tag name in all datasets and therefore update or delete tags by that name in all of them!

So that leaves more specific questions:

- The save option is called --version-tag. Does this just call git tag recursively (given save -r)? That would mean that with the datalad implementation I cannot add information to a tag as with git tag -a, correct? (Which might or might not be desired; maybe I want to save further info on a tag in only one of the datasets?) Also, I think I will need git tag instead of datalad save to tag a previous commit?
- When a subdataset is initialized using datalad get, the newly generated annex is not known to the parent dataset, and if datalad push is never called, it stays that way. This means that I can datalad drop --what all the subdataset without needing to declare its annex dead (see #6111, #3887). However, the flow of information is now reversed for retention tags: if I set a tag using save -r -t ... in a read-only clone, I need to push that information to the remote of that read-only clone (the read-write clone of the dataset). If I use datalad push for this, will this cause the subdataset annex to become known to the superdataset? Because if it does, I will not be able to drop it without causing the error "to-be-deleted local annex not declared dead", as drop will not run git annex dead here; git annex push --to origin by itself, right? So instead of datalad push, should a git push <tag_name> be done to propagate the tag from a short-term submodule to a long-term clone of a dataset?
- git push has an option --follow-tags and a config setting push.followTags. These push tags only if they are
  - annotated
  - reachable from the branch being pushed

  This suggests that the git convention is that bare (lightweight) tags are meant for local use and annotated tags are meant to be pushed. Does that mean that tags that denote data retention should also always be annotated?
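To make the git side of these questions concrete, here is a rough sketch, not a statement about how datalad behaves. It builds the plain git commands for an annotated tag on an earlier commit and for pushing a single tag, and it models the --follow-tags rule described above. All names (`Tag`, `tags_followed`, `annotated_tag_cmd`, `push_tag_cmd`) are made up for illustration, and the model is simplified: real git also skips tags that the remote already has.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Tag:
    name: str
    commit: str      # commit the tag points at
    annotated: bool  # True for `git tag -a`, False for a lightweight tag


def tags_followed(tags: Iterable[Tag], pushed_commits: Set[str]) -> List[str]:
    """Names of tags that `git push --follow-tags` would take along.

    A tag qualifies only if it is annotated and points at a commit that is
    reachable from the branch being pushed (passed in as pushed_commits).
    """
    return sorted(
        t.name for t in tags if t.annotated and t.commit in pushed_commits
    )


def annotated_tag_cmd(
    name: str, message: str, commit: Optional[str] = None
) -> List[str]:
    """Build `git tag -a`; pass a commit to tag an earlier state than HEAD."""
    cmd = ["git", "tag", "-a", name, "-m", message]
    if commit is not None:
        cmd.append(commit)
    return cmd


def push_tag_cmd(remote: str, name: str) -> List[str]:
    """Build a push of one tag only, leaving branches untouched."""
    return ["git", "push", remote, f"refs/tags/{name}"]
```

Under this reading, a lightweight retention tag would stay local unless it is pushed explicitly, which is one argument for annotating retention tags or for always pushing them by name.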

loj added the support-tracker (track a support event that occurred elsewhere) and via-datalad-channel (report origin is a datalad-specific channel: chat/email/office hour) labels on Nov 10, 2023
