SC-61223: fix sparse global order reader to do "repeatable reads" when there are multiple fragments with the same timestamp #5459

rroelke · 2025-02-14T18:39:58Z

When multiple fragments have the same timestamp, the sparse global order reader (and likely others) do not have "repeatable reads". That is, if you run the same read query multiple times, then you do not receive the same result.

The story provides a reproducer which does something like:

for i in range(0, 10):
    write a new fragment A[0] = i at timestamp t=1
    for j in range(0, 10):
        read A[0]

Below is the result:

python tmp/write_consistency.py 
OrderedDict([('a', array([], dtype=int32)), ('d1', array([], dtype=int64))])
Wrote 0, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 1, read [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
Wrote 2, read [0, 1, 0, 0, 1, 2, 1, 0, 1, 1]
Wrote 3, read [0, 0, 0, 0, 3, 2, 2, 1, 2, 1]
Wrote 4, read [0, 3, 4, 1, 4, 2, 1, 2, 1, 4]
Wrote 5, read [1, 1, 5, 4, 3, 0, 0, 4, 0, 0]
Wrote 6, read [4, 6, 0, 2, 0, 3, 4, 4, 4, 4]
Wrote 7, read [4, 4, 4, 4, 7, 4, 5, 4, 4, 7]
Wrote 8, read [5, 0, 0, 7, 7, 0, 0, 4, 4, 4]
Wrote 9, read [8, 4, 3, 5, 2, 7, 9, 8, 2, 7]

We can see that the reads do not all produce the same value.

In the story we discussed whether we would be satisfied with "repeatable read", in which each row from the reproducing script contains just one distinct value; or whether we need "strict write order", in which each row i contains only values i.

Repeatable read is sufficient, so that's what is implemented in this pull request. An example new result of the above script is:

OrderedDict([('a', array([], dtype=int32)), ('d1', array([], dtype=int64))])
Wrote 0, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 1, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 2, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 3, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 4, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 5, read [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Wrote 6, read [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
Wrote 7, read [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
Wrote 8, read [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
Wrote 9, read [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]

Implementation

The implementation ultimately is straightforward. The coordinates comparator breaks ties in coordinate values using the tile timestamp. We add an additional tiebreaker which compares the fragment index. The fragment index is itself determined by the lexicographic ordering of fragment names, which includes the timestamp and UUID; as such for a fixed array the fragment indices will always be the same for each read. The greatest fragment index in typical cases is the most recently written, so it wins the comparison.

("In typical cases"... meaning what? If two fragments are written at the same timestamp which is now(), then there is logic in the UUID generation to ensure that the first fragment's UUID is lexicographically less than the second fragment's UUID. However, this logic is incorrect if the array was opened at a fixed non-now() timestamp. Multiple fragments written at a fixed timestamp will have uncorrelated UUIDs and thus can appear in any order when reading the array. Hence we get a repeatable read order but not always one which matches the physical order of the writes..)

In any case, I spent a bit of time diving into the above, and there are some artifacts which I elected to leave in here because they are both harmless and tested - specifically some additional accessors to components of FragmentID.

TYPE: BUG
DESC: repeatable read for multiple fragments written at fixed timestamp

teo-tsirpanis

Thanks! Please remove product code used only by tests.

tiledb/sm/fragment/fragment_identifier.h

teo-tsirpanis · 2025-02-14T18:57:58Z

tiledb/sm/fragment/fragment_identifier.h

+  /** The version in which the first word of the UUID became an equal timestamp
+   * tie-breaker */
+  static constexpr format_version_t SUBMILLI_PREFIX_FORMAT_VERSION = 22;


I see this as an implementation detail. Fragment IDs should always be sorted by timestamps first and UUID next, regardless of format version.

I believe this is referring to #4800

I know. What I mean is that the fact that the first bytes of the UUID are not totally random is an implementation detail that does not have to be associated with a format version. Notice that that PR did not bump it, and changed only the UUID generation logic and not the comparison logic.

Prior to version 22 these first bytes were fully random. #4800 and related PRs landed after the version was bumped to 21 but before it was bumped to 22.

If the version is 22 or later then we can be certain that the UUID was generated using the technique from #4800 .
If the version is 21, then... maybe? Was there a release after the UUID change went in but before the format version was bumped? It doesn't really matter, we can't tell when we're just looking at an array.
If the version is 20 or lesser then we know for sure that the UUID is fully random.

If in the future we want to use these bytes (as I was attempting to before reverting changes which never found their way into this PR) then it will be necessary to check the format version.

tiledb/sm/misc/comparators.h

test/src/unit-sparse-global-order-reader.cc

This reverts commit 8b62068.

bekadavis9

LGTM

rroelke · 2025-02-20T14:37:39Z

@teo-tsirpanis the FragmentID changes have been isolated to the whitebox test. I would prefer to leave them in. Whether or not a future developer may actually need to pull those functions into the production code, they do serve as some additional documentation as to how we resolve fragments with the same timestamp.

rroelke added 10 commits February 14, 2025 12:41

Add unit test which fails on unstable order

80d4551

const methods in FragmentID

55d4844

Fix unit test to actually fail on unstable order

ca71736

FragmentID::uuid, FragmentID::submillisecond_counter

c92a2eb

ResultTileCmpBase stabilize order of equal timestamps

123bd09

Fix FragmentID::uuid()

623e6b5

CSparseGlobalOrderFx::write_fragment calls finalize()

8b62068

Fix gen to be nonEmpty

33a61a5

Comments and format

6d7ac9e

Remove commented-out code

c55eccf

rroelke requested review from bekadavis9 and teo-tsirpanis February 14, 2025 18:42

teo-tsirpanis requested changes Feb 14, 2025

View reviewed changes

rroelke added 4 commits February 14, 2025 14:15

Fix variable-length array

6a1e6c8

WhiteboxFragmentID

dbce3cc

Use std::tuple in compare_timestamps

d0cec37

Revert "CSparseGlobalOrderFx::write_fragment calls finalize()"

27b2385

This reverts commit 8b62068.

bekadavis9 approved these changes Feb 19, 2025

View reviewed changes

teo-tsirpanis approved these changes Feb 20, 2025

View reviewed changes

rroelke merged commit ce93f0b into main Feb 20, 2025
59 checks passed

rroelke deleted the rr/sc-61223-same-timestamp-repeatable-read branch February 20, 2025 14:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SC-61223: fix sparse global order reader to do "repeatable reads" when there are multiple fragments with the same timestamp #5459

SC-61223: fix sparse global order reader to do "repeatable reads" when there are multiple fragments with the same timestamp #5459

rroelke commented Feb 14, 2025 •

edited

Loading

teo-tsirpanis left a comment

teo-tsirpanis Feb 14, 2025

ihnorton Feb 14, 2025

teo-tsirpanis Feb 14, 2025

rroelke Feb 17, 2025

bekadavis9 left a comment

rroelke commented Feb 20, 2025

SC-61223: fix sparse global order reader to do "repeatable reads" when there are multiple fragments with the same timestamp #5459

SC-61223: fix sparse global order reader to do "repeatable reads" when there are multiple fragments with the same timestamp #5459

Conversation

rroelke commented Feb 14, 2025 • edited Loading

Implementation

teo-tsirpanis left a comment

Choose a reason for hiding this comment

teo-tsirpanis Feb 14, 2025

Choose a reason for hiding this comment

ihnorton Feb 14, 2025

Choose a reason for hiding this comment

teo-tsirpanis Feb 14, 2025

Choose a reason for hiding this comment

rroelke Feb 17, 2025

Choose a reason for hiding this comment

bekadavis9 left a comment

Choose a reason for hiding this comment

rroelke commented Feb 20, 2025

rroelke commented Feb 14, 2025 •

edited

Loading