
More need for a split dataset command #100

Open
adswa opened this issue Jul 4, 2023 · 2 comments
Labels
support-tracker Track a support event that occurred elsewhere

Comments


adswa commented Jul 4, 2023

Origin: Office hour

Amir came into the office hour and presented a superdataset into which he had accidentally saved ~380k files with a total disk space usage of multiple TB. The superdataset had become painfully slow to respond. He asked how to get the data into a subdataset instead of having it in the superdataset directly. We pointed him to https://knowledge-base.psychoinformatics.de/kbi/0013/index.html and advised him to split his directory (era_5) into year-wise subdatasets.

This support event mostly documents the need for a command that splits datasets along the lines of https://knowledge-base.psychoinformatics.de/kbi/0013/index.html, but with DataLad tooling for ease of use.

TODO (not necessarily to be performed in this order)

  • Inform OP/Add reference to this issue at origin
  • Clarifying Qs asked or not needed
  • Nature of the issue is understood
  • Inform OP about resolution
adswa added the support-tracker label Jul 4, 2023

adswa commented Jul 11, 2023

Because the era5 data was added as the most recent change, @mih pointed to a simpler alternative that we exercised in today's office hour, using merely a cp with dereferencing (-L) and hardlinking (-l):

mkdir era5_sub
cd era5_sub
cp -v -l -L -R /p/largedata2/detectdata/CentralDB/era5/1970 .
datalad create --force
datalad save -m "new era5 1970" .

Dereferencing plus hardlinking does not double the disk usage: the inodes are the same.

ls -i era5/1970/1970_01/<firstfile>
ls -i -L era5_sub/1970/1970_01/<firstfile>
-> returns the same inode!
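For anyone unfamiliar with the flag combination: -L dereferences the annex symlinks and -l creates hardlinks to the dereferenced targets, so no file content is actually copied. A minimal sketch (GNU coreutils assumed; the payload and file names are made up for illustration):

```shell
# scratch setup: one payload file reached via a symlink, mimicking an annexed file
tmp=$(mktemp -d)
mkdir "$tmp/src" "$tmp/dst"
echo "payload" > "$tmp/src/content.dat"
ln -s content.dat "$tmp/src/demo.txt"

# -L dereferences the symlink, -l hardlinks the target instead of copying bytes
cp -l -L "$tmp/src/demo.txt" "$tmp/dst/demo.txt"

# both names now share one inode, i.e. one copy of the bytes on disk
ls -i "$tmp/src/content.dat" "$tmp/dst/demo.txt"
```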

Then, in era5, the following still needs to happen so that the now-unreferenced annex content is actually released:

git annex unused
git annex dropunused all
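git annex unused lists annexed content that is no longer referenced from any branch or tag, and dropunused deletes it. Because cp -l created hardlinks, dropping the copies in era5 does not touch the bytes that era5_sub still links to. A toy illustration with plain shell (no git-annex involved; the file names are made up):

```shell
# Why dropping in era5 is safe: hardlinked bytes survive while any link remains.
tmp=$(mktemp -d)
echo "payload" > "$tmp/era5_copy.dat"
ln "$tmp/era5_copy.dat" "$tmp/era5_sub_copy.dat"  # second hardlink, as cp -l made
rm "$tmp/era5_copy.dat"                           # stand-in for dropping in era5
cat "$tmp/era5_sub_copy.dat"                      # prints: payload
```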

We should consider writing this up as a KBI, too, and linking it to the one on splitting datasets (0013).

adswa commented Jul 17, 2023

Update from July 17th:

  • All the yearly directories (1978, 1979, ...) are their own DataLad datasets
  • We have spot-checked that they are clean (via git status)
  • All subdatasets are in the directory CentralDB/era5_sub, which is an untracked directory
  • CentralDB still has an era5 directory with individually tracked files

Challenge for now:

  • git reset --hard HEAD~1 to remove the last commit in the CentralDB dataset ("add era5 1970")
  • save a few left-over files: mv era5 era5_pre
  • datalad create era5 inside of CentralDB
  • cd era5_sub, then find . -mindepth 1 -maxdepth 1 -type d -exec mv {} ../era5/{} \;
  • cd ../era5, git status -> untracked directories, datalad save -m "Move to new era5"
  • git show
  • git status -> should be all clean, but took a while
  • cd ../ (to CentralDB), make era5 a subdataset with datalad save -d . era5 and amend the commit message
  • git annex unused (runs a while)
  • git annex dropunused all (not executed together, but left with Amir)
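The directory shuffling from the steps above can be dry-run on a toy tree with plain shell (the datalad/git steps are omitted; CentralDB and the year names only mirror the paths above). Note the -mindepth 1, without which find would also try to move . itself:

```shell
# toy tree standing in for CentralDB with year-wise subdatasets in era5_sub
root=$(mktemp -d)
mkdir -p "$root/CentralDB/era5_sub/1970" "$root/CentralDB/era5_sub/1971"
touch "$root/CentralDB/era5_sub/1970/a.dat" "$root/CentralDB/era5_sub/1971/b.dat"

cd "$root/CentralDB"
mkdir era5                     # stand-in for: datalad create era5
cd era5_sub
# move every year directory (but not '.') into the new era5 dataset
find . -mindepth 1 -maxdepth 1 -type d -exec mv {} ../era5/ \;
ls ../era5                     # now contains the year directories
```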
