After push to GIN, remote retains folders that were deleted from dataset #120

Open
3 of 4 tasks
jsheunis opened this issue Feb 1, 2024 · 3 comments

jsheunis commented Feb 1, 2024

Origin: Office Hour chatroom message

Description

User reported:

I have an acquisition computer, an analysis computer and a gin repository. The experiment files are in a subset (rawdata) and pushed to gin, then retrieved from the analysis computer.

Now, I have deleted/restructured the data on the acquisition computer (deleted, renamed, moved), saved the changes and pushed, but some of the old folders are still there on Gin. All the files are gone, but the folder structure remains on the Gin repository and no push will remove them.

Besides that, part of this restructuring was changing folder names from "bla folder" to "bla_folder", and I keep getting the old version in my acquisition computer - so I have "bla folder" on the acquisition computer and cannot get the correct one "bla_folder", even if "bla folder" does not exist in the repository anymore.

@adswa asked to confirm that:

  • the actual files were successfully pushed (i.e., they are on Gin and safely backed up)?
  • what remains on the acquisition computer are empty directories with outdated names?

User answer:

The actual files are pushed to Gin. The acquisition computer has the original and ideal version of this dataset.
The old folders with outdated names remain in Gin, and are present in my analysis computer. I cannot get their correct versions.

I am new to datalad and so far only using it to transfer data (and have version control) this way, acquisition -> gin -> analysis.

So when I am done acquiring new data, I use save, then update --merge and finally push --to gin. Only the rawdata subdataset is present in the acquisition computer.

From the analysis computer, I update and get whichever files I need to work on.

As for the structure of the datasets, I have superset in the analysis computer, this contains the rawdata subdatasets, and other folders containing code, figures, etc. This has its own Gin repo.

More clarifying questions:

  • So inside of the rawdata subdataset on the acquisition computer you run:
    datalad save
    datalad update --merge 
    datalad push 
    
    correct?
  • Can I ask why you run the update --merge?
  • Are you making changes to the raw data subdataset at any other location/clone than the acquisition computer?
  • In case it is public, can you share the Gin repository, or could you hop into a video call with us either today until 2.30pm or during the next office hour?
  • Also, please share the set of commands that you ran, and also the dataset structure (super- and subdataset boundaries)

Next steps

  • Wait for user feedback to above questions.

TODO (not necessarily to be performed in this order)

  • Inform OP/Add reference to this issue at origin
  • Clarifying Qs asked or not needed
  • Nature of the issue is understood
  • Inform OP about resolution
@jsheunis jsheunis added the support-tracker Track a support event that occurred elsewhere label Feb 1, 2024
@jsheunis jsheunis self-assigned this Feb 1, 2024
@alejandrcastro

Thanks again, I will answer the questions here.

More clarifying questions:

  • So inside of the rawdata subdataset on the acquisition computer you run:

    datalad save
    datalad update --merge 
    datalad push 
    

    correct?

  • Can I ask why you run the update --merge?

  • Are you making changes to the raw data subdataset at any other location/clone than the acquisition computer?

The only changes to this dataset are made on the acquisition computer. I was told to always update just in case, to avoid conflicts, and assumed that in the worst case this update would just be redundant.

adswa commented Feb 2, 2024

Thanks for the additional info. It's still difficult to piece together precisely what happened. I have made a few attempts at recreating the situation you describe (in a dataset hierarchy with a sibling on Gin, using mv, git mv, and rm on directories or subdatasets, followed by save, update --merge, and push), but I have not observed this issue yet. This is likely because some details needed for a reproducer are simply missing. I'm looking forward to investigating this more closely in an office hour, where we can exchange relevant information in real time!
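For reference, the expected behavior can be sketched with plain Git. This is not the DataLad reproducer that was attempted above: it assumes a local bare repository as a stand-in for the Gin sibling, and the directory names (`acq`, `analysis`, `remote.git`) and the branch name `main` are made up for illustration. Git does not track directories themselves, so once a folder is renamed and the change is pushed, a fresh clone should contain only the new name.

```shell
#!/bin/sh
# Sketch only: plain Git, with a local bare repository standing in for Gin.
set -eu
work=$(mktemp -d)

# Stand-in for the Gin sibling (assumption: any bare Git remote behaves alike).
git init -q --bare "$work/remote.git"

# Stand-in for the acquisition computer.
git init -q "$work/acq"
cd "$work/acq"
git config user.name "Demo"
git config user.email "demo@example.com"

# Add a folder whose name contains a space, then push it.
mkdir "bla folder"
echo data > "bla folder/file.txt"
git add .
git commit -q -m "Add data"
git push -q "$work/remote.git" HEAD:refs/heads/main

# Rename the folder, commit, and push again.
git mv "bla folder" bla_folder
git commit -q -m "Rename folder"
git push -q "$work/remote.git" HEAD:refs/heads/main

# Stand-in for the analysis computer: a fresh clone of the remote.
git clone -q --branch main "$work/remote.git" "$work/analysis"
ls "$work/analysis"
```

In this sketch the fresh clone lists only `bla_folder`. If an old folder name still shows up somewhere, it is coming from something other than the pushed history, such as a stale local checkout or a hosting service's own view of the repository.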

adswa commented Feb 13, 2024

Follow-up from the office hour: we had a productive screen-sharing session in which everyone got quite confused by what we saw. Here are a few facts:

The acquisition computer (Windows) saves and restructures files and regularly pushes to a Gin sibling; a clone on a Mac pulls updates from Gin.

  • The Gin web interface has a bug: folders that were created and pushed from a Windows machine, and later renamed and pushed again, are not removed from the web interface's index. In this minimal reproducer, "folder" was renamed to "newname"; "folder" should no longer exist in the web interface, but it lingers. (Overall: confusing, but with no impact on the clone.)
    [image]

  • The local clone on the Mac was in a convoluted state. We couldn't figure out how it got there, but it was a mix of a very updated index, a detached HEAD, and unmerged branches; the Gin confusion likely contributed to that. The repository also reported a background garbage collection process that looked a bit shady. Finally, an iCloud backup process created duplicated files (HEAD 2, index 2, ...) in the .git/ directory.

  • Recloning the repository from Gin fixed the issue
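Two of the symptoms above can be checked from a shell before deciding to reclone. The following is a minimal sketch, not something that was run in the session: the function names `find_sync_duplicates` and `head_is_detached` are hypothetical helpers, and the file name pattern assumes that the sync service appends a space and a number to the original name, as in "HEAD 2".

```shell
#!/bin/sh
# Sketch only: helper functions to inspect a clone for the symptoms listed above.

# Print files inside .git whose names contain a space followed by a digit,
# such as "HEAD 2" or "index 2". Git itself does not create such names.
find_sync_duplicates() {
    find "$1/.git" -type f -name '* [0-9]*'
}

# Succeed if HEAD does not point to a branch (detached HEAD).
# git symbolic-ref -q exits non-zero when HEAD is not a symbolic ref.
head_is_detached() {
    ! git -C "$1" symbolic-ref -q HEAD >/dev/null
}

# Example usage (the path is a placeholder):
#   find_sync_duplicates /path/to/clone
#   head_is_detached /path/to/clone && echo "HEAD is detached"
```

Keeping the clone outside of any folder that a cloud sync service manages is a plausible way to avoid the duplicated files, though that was not one of the recorded recommendations.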

We left with the following recommendations:

General:

  • install datalad-next and enable it via config on the Windows machine (because its status is fast)
  • consider restructuring the dataset on the acquisition machine
  • Gin interface seems to be the issue - if in doubt, ignore
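For the first recommendation, the usual way to install and enable the extension is sketched below. It assumes that `python` refers to the same environment DataLad is installed in and that a user-wide (`--global`) setting is wanted; adjust both as needed.

```shell
# Install the extension into the Python environment that provides DataLad.
python -m pip install datalad-next

# Tell DataLad to load the extension for every command run by this user.
git config --global --add datalad.extensions.load next
```

After that, DataLad commands such as `datalad status` pick up the extension without further changes.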

Helpers we recommended:
