Skip to content

The data representation of tree is lossy and prevents round-tripping #1887

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
pierrechevalier83 opened this issue Mar 14, 2025 · 5 comments
Assignees
Labels
acknowledged an issue is accepted as shortcoming to be fixed

Comments

@pierrechevalier83
Copy link
Contributor

pierrechevalier83 commented Mar 14, 2025

Current behavior 😯

The data representation of Tree is lossy compared to the git representation, which means that some git trees cannot round-trip through git-oxide while remaining byte-for-byte identical.

Specifically, the entry mode in git is represented as a 6 digit byte string which represents the octal representation of the mode.
For the EntryMode that corresponds to a Tree, the octal value is 0o040000u16, which can be represented as b"40000" or b"040000". Both representations occur in the wild, as shown in the test fixture: special-1.tree where both co-exist in the same tree representation.

In gitoxide, we parse this string as a u16 before getting to the EntryKind enum representation, where we loose information on whether or not the tree EntryMode was represented in git with a leading zero.

Expected behavior 🤔

EntryMode and EntryKind should hold on to at least one more bit of information (did the git byte string representation contain a leading zero?) so we can round-trip from git -> gitoxide -> git while preserving the byte representation.

Git behavior

The entry mode in git is represented as a 6 digit byte string which represents the octal representation of the mode.
For the EntryMode that corresponds to a Tree, the octal value is 0o040000u16, which can be represented as b"40000" or b"040000".

Steps to reproduce 🕹

See this change to the test that verifies we can round-trip.
It currently fails:
From gix-object,

cargo test

fails with:

---- tree::from_bytes::special_trees stdout ----

thread 'tree::from_bytes::special_trees' panicked at gix-object/tests/object/tree/from_bytes.rs:107:9:
assertion `left == right` failed
  left: [52, 48, 48, 48, 48, 32, 99, 104, 114, 111, 109, 101, 0, 126, 87, 230, 62, 203, 152, 244, 156, 233, 186, 220, 52, 58, 28, 34, 169, 230, 175, 227, 116, 52, 48, 48, 48, 48, 32, 101, 100, 105, 116, 111, 114, 115, 0, 87, 15, 194, 205, 208, 108, 98, 188, 56, 180, 7, 251, 45, 138, 241, 73, 89, 164, 134, 70, 49, 48, 48, 54, 52, 52, 32, 106, 115, 98, 105, 110, 46, 106, 115, 0, 126, 117, 18, 40, 234, 142, 9, 4, 82, 165, 64, 249, 109, 30, 33, 21, 76, 250, 170, 246, 52, 48, 48, 48, 48, 32, 114, 101, 110, 100, 101, 114, 0, 190, 206, 230, 168, 101, 95, 18, 40, 56, 73, 21, 0, 149, 163, 200, 188, 95, 208, 160, 82, 52, 48, 48, 48, 48, 32, 118, 101, 110, 100, 111, 114, 0, 97, 13, 187, 16, 54, 17, 224, 60, 151, 2, 230, 118, 36, 31, 214, 237, 68, 155, 190, 96]
 right: [48, 52, 48, 48, 48, 48, 32, 99, 104, 114, 111, 109, 101, 0, 126, 87, 230, 62, 203, 152, 244, 156, 233, 186, 220, 52, 58, 28, 34, 169, 230, 175, 227, 116, 52, 48, 48, 48, 48, 32, 101, 100, 105, 116, 111, 114, 115, 0, 87, 15, 194, 205, 208, 108, 98, 188, 56, 180, 7, 251, 45, 138, 241, 73, 89, 164, 134, 70, 49, 48, 48, 54, 52, 52, 32, 106, 115, 98, 105, 110, 46, 106, 115, 0, 126, 117, 18, 40, 234, 142, 9, 4, 82, 165, 64, 249, 109, 30, 33, 21, 76, 250, 170, 246, 48, 52, 48, 48, 48, 48, 32, 114, 101, 110, 100, 101, 114, 0, 190, 206, 230, 168, 101, 95, 18, 40, 56, 73, 21, 0, 149, 163, 200, 188, 95, 208, 160, 82, 48, 52, 48, 48, 48, 48, 32, 118, 101, 110, 100, 111, 114, 0, 97, 13, 187, 16, 54, 17, 224, 60, 151, 2, 230, 118, 36, 31, 214, 237, 68, 155, 190, 96]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    tree::from_bytes::special_trees
@Byron Byron added the acknowledged an issue is accepted as shortcoming to be fixed label Mar 21, 2025
@Byron Byron self-assigned this Mar 21, 2025
@Byron
Copy link
Member

Byron commented Mar 21, 2025

Thanks a lot for reporting, and for providing everything that's needed for a reproduction! Also, apologies for the late response.

It will be a while before I get to fixing it because I have plenty of unfinished in the queue right now, but I will get to it eventually.

My thinking here is twofold.
Borrowed trees can just hold on to the mode-string itself, backed by a data-buffer, and convert the entry mode on the fly.
The owned version would then probably get an additional field where it keeps track of these 6 bytes verbatim and inline, and the mode would be decoded from that on the fly as well, while being settable through a setter then. Or something like that 😁.

I do fear that there is more kinds of mode-byte sequences out there that can't be captured with a single bit, hence my tendency towards keeping the bytes themselves.

@pierrechevalier83
Copy link
Contributor Author

keep track of these 6 bytes verbatim and inline

I agree, this sounds like the soundest way to resolve this issue and similar issues if they arise in the future.

@Byron
Copy link
Member

Byron commented Mar 22, 2025

Thanks!
And please be my guest in tackling this before me - you may be faster, I am way behind everything now.

pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 1, 2025
We'll want to change it to fix issue GitoxideLabs#1887.

First step: don't expose it.
pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 2, 2025
Before this change, EntryMode wrapped a `u16` which didn't store enough
information to match the git representation as a `&[u8]` of len 5 or 6.

In particular, the mode that represents a `Tree` could be represented as
`"40000"` or `"040000"`, and the difference would get lost once it was
represented as `0o40000`.

We now have two ways to represent the `EntryMode`:
* `EntryMode` is backed by an owned `BString`.
* `EntryModeRef` is backed by a `&bstr` and bound to the lifetime of the
  owner of these bytes.

This allows call-sites to pick a trade-off between convenience
(`EntryMode` allows to not worry about lifetimes) vs performance
(`EntryModeRef` doesn't allocate) and fits the general paradigm used in
the wider gitoxide project.

Fixes [issue 1887](GitoxideLabs#1887)
pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 2, 2025
Before this change, EntryMode wrapped a `u16` which didn't store enough
information to match the git representation as a `&[u8]` of len 5 or 6.

In particular, the mode that represents a `Tree` could be represented as
`"40000"` or `"040000"`, and the difference would get lost once it was
represented as `0o40000`.

We now have two ways to represent the `EntryMode`:
* `EntryMode` is backed by an owned `BString`.
* `EntryModeRef` is backed by a `&bstr` and bound to the lifetime of the
  owner of these bytes.

This allows call-sites to pick a trade-off between convenience
(`EntryMode` allows to not worry about lifetimes) vs performance
(`EntryModeRef` doesn't allocate) and fits the general paradigm used in
the wider gitoxide project.

Fixes [issue 1887](GitoxideLabs#1887)
pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 2, 2025
Before this change, EntryMode wrapped a `u16` which didn't store enough
information to match the git representation as a `&[u8]` of len 5 or 6.

In particular, the mode that represents a `Tree` could be represented as
`"40000"` or `"040000"`, and the difference would get lost once it was
represented as `0o40000`.

We now have two ways to represent the `EntryMode`:
* `EntryMode` is backed by an owned `BString`.
* `EntryModeRef` is backed by a `&bstr` and bound to the lifetime of the
  owner of these bytes.

This allows call-sites to pick a trade-off between convenience
(`EntryMode` allows to not worry about lifetimes) vs performance
(`EntryModeRef` doesn't allocate) and fits the general paradigm used in
the wider gitoxide project.

Fixes [issue 1887](GitoxideLabs#1887)
pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 2, 2025
Before this change, EntryMode wrapped a `u16` which didn't store enough
information to match the git representation as a `&[u8]` of len 5 or 6.

In particular, the mode that represents a `Tree` could be represented as
`"40000"` or `"040000"`, and the difference would get lost once it was
represented as `0o40000`.

We now have two ways to represent the `EntryMode`:
* `EntryMode` is backed by an owned `BString`.
* `EntryModeRef` is backed by a `&bstr` and bound to the lifetime of the
  owner of these bytes.

This allows call-sites to pick a trade-off between convenience
(`EntryMode` allows to not worry about lifetimes) vs performance
(`EntryModeRef` doesn't allocate) and fits the general paradigm used in
the wider gitoxide project.

Fixes [issue 1887](GitoxideLabs#1887)
pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 2, 2025
Before this change, EntryMode wrapped a `u16` which didn't store enough
information to match the git representation as a `&[u8]` of len 5 or 6.

In particular, the mode that represents a `Tree` could be represented as
`"40000"` or `"040000"`, and the difference would get lost once it was
represented as `0o40000`.

We now have two ways to represent the `EntryMode`:
* `EntryMode` is backed by an owned `BString`.
* `EntryModeRef` is backed by a `&bstr` and bound to the lifetime of the
  owner of these bytes.

This allows call-sites to pick a trade-off between convenience
(`EntryMode` allows to not worry about lifetimes) vs performance
(`EntryModeRef` doesn't allocate) and fits the general paradigm used in
the wider gitoxide project.

Fixes [issue 1887](GitoxideLabs#1887)
pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 2, 2025
Before this change, EntryMode wrapped a `u16` which didn't store enough
information to match the git representation as a `&[u8]` of len 5 or 6.

In particular, the mode that represents a `Tree` could be represented as
`"40000"` or `"040000"`, and the difference would get lost once it was
represented as `0o40000`.

We now have two ways to represent the `EntryMode`:
* `EntryMode` is backed by an owned `BString`.
* `EntryModeRef` is backed by a `&bstr` and bound to the lifetime of the
  owner of these bytes.

This allows call-sites to pick a trade-off between convenience
(`EntryMode` allows to not worry about lifetimes) vs performance
(`EntryModeRef` doesn't allocate) and fits the general paradigm used in
the wider gitoxide project.

Fixes [issue 1887](GitoxideLabs#1887)
@pierrechevalier83
Copy link
Contributor Author

And please be my guest in tackling this before me

I gave it a go in PR 1917. Got CI to a green state.

The main issues I may have introduced would be to depend on EntryMode where EntryModeRef may be more advisable, which may hurt performance.

I tried to feel the vibe of the various call-sites and made a judgement call in each case. Happy to hear feedback if my vibe reading was poor in some instances.

@Byron
Copy link
Member

Byron commented Apr 3, 2025

Thanks a lot, that's awesome!

The main issues I may have introduced would be to depend on EntryMode where EntryModeRef may be more advisable, which may hurt performance.

cargo bench shows that this unfortunately is the case, but I am hopeful that these significant slowdown can be reduced to just a small percentage, particularly once EntryMode is Copy again.

pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 3, 2025
Before this change, EntryMode wrapped a `u16` which didn't store enough
information to match the git representation as a `&[u8]` of len 5 or 6.

In particular, the mode that represents a `Tree` could be represented as
`"40000"` or `"040000"`, and the difference would get lost once it was
represented as `0o40000`.

Change the backing type for `EntryMode` from a `u16` to a `[u8; 6]`.
This way, we can represent `b"40000"` as `b"40000 "` which differs from
`b"040000"`

Add a regression test that ensures `EntryMode` must roundtrip.

Fixes [issue 1887](GitoxideLabs#1887)

BREAKING CHANGE:
* Sha1 for certain git-objects will change (to match Git's)
* The API had to be updated to decouple callers from the internal data
  representation
pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 3, 2025
Before this change, EntryMode wrapped a `u16` which didn't store enough
information to match the git representation as a `&[u8]` of len 5 or 6.

In particular, the mode that represents a `Tree` could be represented as
`"40000"` or `"040000"`, and the difference would get lost once it was
represented as `0o40000`.

Change the backing type for `EntryMode` from a `u16` to a `[u8; 6]`.
This way, we can represent `b"40000"` as `b"40000 "` which differs from
`b"040000"`

Add a regression test that ensures `EntryMode` must roundtrip.

Fixes [issue 1887](GitoxideLabs#1887)

BREAKING CHANGE:
* Sha1 for certain git-objects will change (to match Git's)
* The API had to be updated to decouple callers from the internal data
  representation
pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 3, 2025
Before this change, EntryMode wrapped a `u16` which didn't store enough
information to match the git representation as a `&[u8]` of len 5 or 6.

In particular, the mode that represents a `Tree` could be represented as
`"40000"` or `"040000"`, and the difference would get lost once it was
represented as `0o40000`.

Change the backing type for `EntryMode` from a `u16` to a `[u8; 6]`.
This way, we can represent `b"40000"` as `b"40000 "` which differs from
`b"040000"`

Add a regression test that ensures `EntryMode` must roundtrip.

Fixes [issue 1887](GitoxideLabs#1887)

BREAKING CHANGE:
* Sha1 for certain git-objects will change (to match Git's)
* The API had to be updated to decouple callers from the internal data
  representation
pierrechevalier83 added a commit to pierrechevalier83/gitoxide that referenced this issue Apr 3, 2025
We'll want to change it to fix issue GitoxideLabs#1887.

First step: don't expose it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledged an issue is accepted as shortcoming to be fixed
Projects
None yet
Development

No branches or pull requests

2 participants