Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SC-61223: fix sparse global order reader to do "repeatable reads" when there are multiple fragments with the same timestamp #5459

Merged
merged 14 commits into from
Feb 20, 2025

Conversation

rroelke
Copy link
Member

@rroelke rroelke commented Feb 14, 2025

When multiple fragments have the same timestamp, the sparse global order reader (and likely others) do not have "repeatable reads". That is, if you run the same read query multiple times, then you do not receive the same result.

The story provides a reproducer which does something like:

for i in range(0, 10):
    write a new fragment A[0] = i at timestamp t=1
    for j in range(0, 10):
        read A[0]

Below is the result:

python tmp/write_consistency.py 
OrderedDict([('a', array([], dtype=int32)), ('d1', array([], dtype=int64))])
Wrote 0, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 1, read [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
Wrote 2, read [0, 1, 0, 0, 1, 2, 1, 0, 1, 1]
Wrote 3, read [0, 0, 0, 0, 3, 2, 2, 1, 2, 1]
Wrote 4, read [0, 3, 4, 1, 4, 2, 1, 2, 1, 4]
Wrote 5, read [1, 1, 5, 4, 3, 0, 0, 4, 0, 0]
Wrote 6, read [4, 6, 0, 2, 0, 3, 4, 4, 4, 4]
Wrote 7, read [4, 4, 4, 4, 7, 4, 5, 4, 4, 7]
Wrote 8, read [5, 0, 0, 7, 7, 0, 0, 4, 4, 4]
Wrote 9, read [8, 4, 3, 5, 2, 7, 9, 8, 2, 7]

We can see that the reads do not all produce the same value.

In the story we discussed whether we would be satisfied with "repeatable read", in which each row from the reproducing script contains just one distinct value; or whether we need "strict write order", in which each row i contains only values i.

Repeatable read is sufficient, so that's what is implemented in this pull request. An example new result of the above script is:

OrderedDict([('a', array([], dtype=int32)), ('d1', array([], dtype=int64))])
Wrote 0, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 1, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 2, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 3, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 4, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 5, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 6, read [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
Wrote 7, read [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
Wrote 8, read [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
Wrote 9, read [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]

Implementation

The implementation ultimately is straightforward. The coordinates comparator breaks ties in coordinate values using the tile timestamp. We add an additional tiebreaker which compares the fragment index. The fragment index is itself determined by the lexicographic ordering of fragment names, which includes the timestamp and UUID; as such for a fixed array the fragment indices will always be the same for each read. The greatest fragment index in typical cases is the most recently written, so it wins the comparison.

("In typical cases"... meaning what? If two fragments are written at the same timestamp which is now(), then there is logic in the UUID generation to ensure that the first fragment's UUID is lexicographically less than the second fragment's UUID. However, this logic is incorrect if the array was opened at a fixed non-now() timestamp. Multiple fragments written at a fixed timestamp will have uncorrelated UUIDs and thus can appear in any order when reading the array. Hence we get a repeatable read order but not always one which matches the physical order of the writes..)

In any case, I spent a bit of time diving into the above, and there are some artifacts which I elected to leave in here because they are both harmless and tested - specifically some additional accessors to components of FragmentID.


TYPE: BUG
DESC: repeatable read for multiple fragments written at fixed timestamp

Copy link
Member

@teo-tsirpanis teo-tsirpanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Please remove product code used only by tests.

Comment on lines 87 to 89
/** The version in which the first word of the UUID became an equal timestamp
* tie-breaker */
static constexpr format_version_t SUBMILLI_PREFIX_FORMAT_VERSION = 22;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as an implementation detail. Fragment IDs should always be sorted by timestamps first and UUID next, regardless of format version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is referring to #4800

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know. What I mean is that the fact that the first bytes of the UUID are not totally random is an implementation detail that does not have to be associated with a format version. Notice that that PR did not bump it, and changed only the UUID generation logic and not the comparison logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to version 22 these first bytes were fully random. #4800 and related PRs landed after the version was bumped to 21 but before it was bumped to 22.

If the version is 22 or later then we can be certain that the UUID was generated using the technique from #4800 .
If the version is 21, then... maybe? Was there a release after the UUID change went in but before the format version was bumped? It doesn't really matter, we can't tell when we're just looking at an array.
If the version is 20 or lesser then we know for sure that the UUID is fully random.

If in the future we want to use these bytes (as I was attempting to before reverting changes which never found their way into this PR) then it will be necessary to check the format version.

Copy link
Contributor

@bekadavis9 bekadavis9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rroelke
Copy link
Member Author

rroelke commented Feb 20, 2025

@teo-tsirpanis the FragmentID changes have been isolated to the whitebox test. I would prefer to leave them in. Whether or not a future developer may actually need to pull those functions into the production code, they do serve as some additional documentation as to how we resolve fragments with the same timestamp.

@rroelke rroelke merged commit ce93f0b into main Feb 20, 2025
59 checks passed
@rroelke rroelke deleted the rr/sc-61223-same-timestamp-repeatable-read branch February 20, 2025 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants