
More need for a split dataset command #100

Open
adswa opened this issue Jul 4, 2023 · 2 comments
Labels
support-tracker Track a support event that occurred elsewhere

Comments


adswa commented Jul 4, 2023

Origin: Office hour

Amir came into the office hour and presented a superdataset into which he had accidentally saved ~380k files with a total disk space usage of multiple TB. The superdataset had become painfully slow to respond. He asked how to get the data into a subdataset instead of having it in the superdataset directly. We pointed him to https://knowledge-base.psychoinformatics.de/kbi/0013/index.html and advised him to split his directory (era_5) into year-wise subdatasets.

This support event mostly documents the need for a command that splits datasets along the lines of https://knowledge-base.psychoinformatics.de/kbi/0013/index.html, but with DataLad tooling for ease of use.

TODO (not necessarily to be performed in this order)

  • Inform OP/Add reference to this issue at origin
  • Clarifying Qs asked or not needed
  • Nature of the issue is understood
  • Inform OP about resolution
adswa added the support-tracker label Jul 4, 2023

adswa commented Jul 11, 2023

Because the era5 data was added as the most recent change, @mih pointed to a simpler alternative that we exercised in today's office hour, using merely a cp with dereferencing (-L) and hardlinking (-l):

mkdir era5_sub
cd era5_sub
cp -v -l -L -R /p/largedata2/detectdata/CentralDB/era5/1970 .
datalad create --force
datalad save -m "new era5 1970" .

Dereferencing plus hardlinking does not double the disk usage: the inodes are the same.

ls -i era5/1970/1970_01/<firstfile>
ls -i -L era5_sub/1970/1970_01/<firstfile>
-> returns the same inode!
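For anyone unfamiliar with the flag combination: -L dereferences the annex symlinks and -l creates hardlinks to the dereferenced targets, so no file content is actually copied. A minimal sketch (GNU coreutils assumed; the payload and file names are made up for illustration):

```shell
# scratch setup: one payload file reached via a symlink, mimicking an annexed file
tmp=$(mktemp -d)
mkdir "$tmp/src" "$tmp/dst"
echo "payload" > "$tmp/src/content.dat"
ln -s content.dat "$tmp/src/demo.txt"

# -L dereferences the symlink, -l hardlinks the target instead of copying bytes
cp -l -L "$tmp/src/demo.txt" "$tmp/dst/demo.txt"

# both names now share one inode, i.e. one copy of the bytes on disk
ls -i "$tmp/src/content.dat" "$tmp/dst/demo.txt"
```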

Then, in era5, the following still needs to happen so that the now-unreferenced annex content is actually released:

git annex unused
git annex dropunused all
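git annex unused lists annexed content that is no longer referenced from any branch or tag, and dropunused deletes it. Because cp -l created hardlinks, dropping the copies in era5 does not touch the bytes that era5_sub still links to. A toy illustration with plain shell (no git-annex involved; the file names are made up):

```shell
# Why dropping in era5 is safe: hardlinked bytes survive while any link remains.
tmp=$(mktemp -d)
echo "payload" > "$tmp/era5_copy.dat"
ln "$tmp/era5_copy.dat" "$tmp/era5_sub_copy.dat"  # second hardlink, as cp -l made
rm "$tmp/era5_copy.dat"                           # stand-in for dropping in era5
cat "$tmp/era5_sub_copy.dat"                      # prints: payload
```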

We should consider writing this up as a KBI, too, and linking it to the one on splitting datasets (0013).

adswa commented Jul 17, 2023

Update from July 17th:

  • All the yearly directories (1978, 1979, ...) are their own DataLad datasets
  • We have spot-checked that they are clean (via git status)
  • All subdatasets are in the directory CentralDB/era5_sub, which is an untracked directory
  • CentralDB still has an era5 directory with individually tracked files

Challenge for now:

  • git reset --hard HEAD~1 to remove the last commit in the CentralDB dataset ("add era5 1970")
  • save a few left-over files: mv era5 era5_pre
  • datalad create era5 inside of CentralDB
  • cd era5_sub, then find . -mindepth 1 -maxdepth 1 -type d -exec mv {} ../era5/{} \;
  • cd ../era5, git status -> untracked directories, datalad save -m "Move to new era5"
  • git show
  • git status -> should be all clean, but took a while
  • cd ../ (to CentralDB), make era5 a subdataset with datalad save -d . era5 and amend the commit message
  • git annex unused (runs a while)
  • git annex dropunused all (not executed together, but left with Amir)
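The directory shuffling from the steps above can be dry-run on a toy tree with plain shell (the datalad/git steps are omitted; CentralDB and the year names only mirror the paths above). Note the -mindepth 1, without which find would also try to move . itself:

```shell
# toy tree standing in for CentralDB with year-wise subdatasets in era5_sub
root=$(mktemp -d)
mkdir -p "$root/CentralDB/era5_sub/1970" "$root/CentralDB/era5_sub/1971"
touch "$root/CentralDB/era5_sub/1970/a.dat" "$root/CentralDB/era5_sub/1971/b.dat"

cd "$root/CentralDB"
mkdir era5                     # stand-in for: datalad create era5
cd era5_sub
# move every year directory (but not '.') into the new era5 dataset
find . -mindepth 1 -maxdepth 1 -type d -exec mv {} ../era5/ \;
ls ../era5                     # now contains the year directories
```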
