ENH: run-procedure for BIDS dataset configuration #114

Open
jsheunis opened this issue Dec 9, 2022 · 9 comments

Comments

@jsheunis
Member

jsheunis commented Dec 9, 2022

I'm wondering if it would be useful to add a run-procedure to this extension to configure BIDS+datalad datasets such that all files in the root BIDS directory are committed to git while all the rest of the files, irrespective of type, go to the annex?

I'm thinking of use-cases related to distributed dataset-level metadata extraction and catalog generation. Data in the annex (typically all subfolders of the root BIDS directory) would need to be protected because of data privacy concerns, while data in the root directory (participants.tsv, dataset_description.json, any json sidecar files defined at the root level, any additional dataset-level metadata added at root level) are typically considered non-sensitive or have specifically been edited to be so, and can therefore be considered safe to commit to git.

Configuring a dataset like that (as opposed to annexing all files in the dataset) would allow sufficient metadata extraction on any clone without requiring access to the annex.

The run procedure would add something like this to .gitattributes:

* annex.largefiles=anything
/* annex.largefiles=nothing
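
A minimal sketch of the file-writing step such a procedure might perform, assuming it is handed the dataset root as a path. The function name `add_rootfiles2git_rules` is hypothetical, and registering the script as a DataLad procedure and saving the result are deliberately left out. Only the two rules quoted above are written:

```python
from pathlib import Path

# The two rules from the proposal. Order matters: when several patterns
# match a path, a later line in .gitattributes overrides an earlier one.
ROOTFILES2GIT_RULES = [
    "* annex.largefiles=anything",
    "/* annex.largefiles=nothing",
]


def add_rootfiles2git_rules(dataset_root):
    """Append the proposed rule block to <dataset_root>/.gitattributes.

    Returns True if the block was written, False if it was already present.
    (Hypothetical helper; not part of DataLad.)
    """
    attrs = Path(dataset_root) / ".gitattributes"
    lines = attrs.read_text().splitlines() if attrs.exists() else []
    n = len(ROOTFILES2GIT_RULES)
    # Look for the rules as a contiguous block so their order is preserved.
    already_there = any(
        lines[i:i + n] == ROOTFILES2GIT_RULES
        for i in range(len(lines) - n + 1)
    )
    if already_there:
        return False
    attrs.write_text("\n".join(lines + ROOTFILES2GIT_RULES) + "\n")
    return True
```

Appending the block at the end keeps it the last word over any earlier `annex.largefiles` rules in the same file, which may or may not be what a given dataset wants.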

The procedure (let's call it rootfiles2git) would be available in this extension because it seems (to me) like it could be generally applicable to BIDS datasets collected in the EU (because of GDPR).

WDYT @yarikoptic @bpoldrack @mslw @CPernet @loj

@CPernet

CPernet commented Dec 9, 2022

That's the 'standard' way to approach a BIDS dataset; it makes sense to see root directory info (=git) while the rest goes into the annex (it also makes for an easy catalog :-)) 👍🏻

@bpoldrack
Member

Generally, I think it does make sense, but the problem lies in

> or have specifically been edited to be so

Editing something to be so implies that there was a state before that, which must never have been datalad save'd. Such a setup doesn't really allow for mistakes, since you can't easily get things out of git again. Kinda the point of version control.
That's why I'd hesitate to recommend a specific config from the start. It really depends on when in your workflow you'd want to apply that.

@jsheunis
Member Author

Fair point, although that problem/challenge exists whether one applies a run-procedure or not. It is something that the people managing the data would need to consider in any case when they turn it into a datalad dataset.

@bpoldrack
Member

bpoldrack commented Dec 12, 2022

Yes, but a default that annexes everything doesn't lead you into a trap.

Public and restricted content can still be separated in terms of storage. It may be a little less convenient, but you don't get into a situation that is really hard to fix.

To be fair: the existence of a procedure isn't exactly a default. I'm a bit worried, though, that it will go the way of text2git: pointed out as a convenience in a toy example in the documentation, and then everybody starts using it without realizing its disadvantages.

@mslw
Contributor

mslw commented Dec 12, 2022

I think this is a sane approach, with two caveats (though keep in mind that my knowledge of the BIDS spec might not be up to date):

  • With the inheritance principle for BIDS metadata, there is no guarantee that a metadata file in the top-level directory describes all matching data, as values defined at the top level can be overridden by files deeper in the file tree. E.g. fMRI task information (TaskName, RepetitionTime, SliceTiming, etc.) in ...task-xyz_bold.json can be defined at any level (either at the top level or right next to the specific _bold.nii file). It seems to me that it has become fairly common practice to promote these to the top level (and for good reason), but technically there is no guarantee of dataset scope.
  • Speaking of participants.tsv, this is a recommended file, and commonly used optional columns in participants.tsv files are age, sex, handedness, strain, and strain_rrid. I wonder what the status of these is.
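
The first caveat can be illustrated with a small sketch. It assumes the applicable sidecars have already been found and ordered from the top level down to the file next to the data; the entity and suffix matching that BIDS uses to decide which sidecars apply is not modelled, and the function name and all values are made up for illustration:

```python
def resolve_metadata(sidecars_top_down):
    """Merge sidecar dicts; entries from deeper levels override earlier ones.

    Hypothetical helper: a simplified stand-in for the inheritance
    principle, not an implementation of the BIDS specification.
    """
    resolved = {}
    for sidecar in sidecars_top_down:
        resolved.update(sidecar)
    return resolved


# Illustrative values only.
top_level = {"TaskName": "xyz", "RepetitionTime": 2.0}  # e.g. root-level task-xyz_bold.json
next_to_data = {"RepetitionTime": 1.5}                  # e.g. a sidecar beside one _bold.nii file

resolved = resolve_metadata([top_level, next_to_data])
print(resolved)  # {'TaskName': 'xyz', 'RepetitionTime': 1.5}
```

A clone without annex access would see only `top_level` here and report a RepetitionTime of 2.0 for a run whose effective value is 1.5.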

@loj
Contributor

loj commented Dec 12, 2022

My biggest concern with this approach is when participants need to be removed. If the participants.tsv file or any other top-level file that contains participant data is saved to git, this becomes problematic.

@mih
Member

mih commented Dec 12, 2022

I agree, I would be hesitant to put anything other than a README and a LICENSE into git by default.

Code is another candidate, but only if the file identifiers are at minimum pseudonymized.

@yarikoptic
Member

> I agree, I would be hesitant to put anything other than a README and a LICENSE into git by default.

and CHANGE(S|LOG), with all sensible/supported extensions, is indeed the "safest"! Worth something like cfg_minimal2git or the like (it probably isn't really BIDS-specific).

There is always a "hard to strike" balance in what to put into git and what into git-annex. For heudiconv, all .json and .tsv files go into git except the _scans.tsv files, since those are meant to contain full dates. The minimal set above would be "safest", but then forget about lovely git grep etc., which I like to use quite often in BIDS and similar datasets.
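
To make the two policies in this comment concrete, here is a rough sketch that classifies a dataset-relative path under each. Both function names are hypothetical. Restricting the minimal policy to top-level files is an assumption, and the second function only restates the rule as described here; it is not heudiconv's actual configuration:

```python
from pathlib import PurePosixPath

# Stems named in the thread: README, LICENSE, CHANGES, CHANGELOG.
MINIMAL_STEMS = {"README", "LICENSE", "CHANGES", "CHANGELOG"}


def minimal2git(rel_path):
    """True if the path would go into git under the 'minimal' policy.

    Assumption: only top-level files count, with any extension.
    """
    p = PurePosixPath(rel_path)
    stem = p.name.split(".")[0].upper()
    return len(p.parts) == 1 and stem in MINIMAL_STEMS


def heudiconv_like(rel_path):
    """True if the path would go into git under the rule described above:
    all .json and .tsv files, except *_scans.tsv."""
    name = PurePosixPath(rel_path).name
    if name.endswith("_scans.tsv"):
        return False
    return name.endswith((".json", ".tsv"))
```

Under these rules `participants.tsv` stays out of git with the minimal policy but goes into git with the heudiconv-like one, which is exactly the trade-off between safety and `git grep` convenience.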

@jsheunis
Member Author

Thanks for everyone's input!

FYI @CPernet there is already a standard BIDS config that does the above process to an extent. See here for an update: #115.
