Skip to content

Commit 4cf2143

Browse files
peffgitster
authored andcommitted
pack-objects: break delta cycles before delta-search phase
We do not allow cycles in the delta graph of a pack (i.e., A is a delta of B which is a delta of A) for the obvious reason that you cannot actually access any of the objects in such a case. There's a last-ditch attempt to notice cycles during the write phase, during which we issue a warning to the user and write one of the objects out in full. However, this is "last-ditch" for two reasons: 1. By this time, it's too late to find another delta for the object, so the resulting pack is larger than it otherwise could be. 2. The warning is there because this is something that _shouldn't_ ever happen. If it does, then either: a. a pack we are reusing deltas from had its own cycle b. we are reusing deltas from multiple packs, and we found a cycle among them (i.e., A is a delta of B in one pack, but B is a delta of A in another, and we choose to use both deltas). c. there is a bug in the delta-search code So this code serves as a final check that none of these things has happened, warns the user, and prevents us from writing a bogus pack. Right now, (2b) should never happen because of the static ordering of packs in want_object_in_pack(). If two objects have a delta relationship, then they must be in the same pack, and therefore we will find them from that same pack. However, a future patch would like to change that static ordering, which will make (2b) a common occurrence. In preparation, we should be able to handle those kinds of cycles better. This patch does by introducing a cycle-breaking step during the get_object_details() phase, when we are deciding which deltas can be reused. That gives us the chance to feed the objects into the delta search as if the cycle did not exist. We'll leave the detection and warning in the write_object() phase in place, as it still serves as a check for case (2c). This does mean we will stop warning for (2a). That case is caused by bogus input packs, and we ideally would warn the user about it. However, since those cycles show up after picking reusable deltas, they look the same as (2b) to us; our new code will break the cycles early and the last-ditch check will never see them. We could do analysis on any cycles that we find to distinguish the two cases (i.e., it is a bogus pack if and only if every delta in the cycle is in the same pack), but we don't need to. If there is a cycle inside a pack, we'll run into problems not only reusing the delta, but accessing the object data at all. So when we try to dig up the actual size of the object, we'll hit that same cycle and kick in our usual complain-and-try-another-source code. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent ca79c98 commit 4cf2143

File tree

3 files changed

+206
-0
lines changed

3 files changed

+206
-0
lines changed

builtin/pack-objects.c

+84
Original file line numberDiff line numberDiff line change
@@ -1494,6 +1494,83 @@ static int pack_offset_sort(const void *_a, const void *_b)
14941494
(a->in_pack_offset > b->in_pack_offset);
14951495
}
14961496

1497+
/*
1498+
* Drop an on-disk delta we were planning to reuse. Naively, this would
1499+
* just involve blanking out the "delta" field, but we have to deal
1500+
* with some extra book-keeping:
1501+
*
1502+
* 1. Removing ourselves from the delta_sibling linked list.
1503+
*
1504+
* 2. Updating our size/type to the non-delta representation. These were
1505+
* either not recorded initially (size) or overwritten with the delta type
1506+
* (type) when check_object() decided to reuse the delta.
1507+
*/
1508+
static void drop_reused_delta(struct object_entry *entry)
1509+
{
1510+
struct object_entry **p = &entry->delta->delta_child;
1511+
struct object_info oi = OBJECT_INFO_INIT;
1512+
1513+
while (*p) {
1514+
if (*p == entry)
1515+
*p = (*p)->delta_sibling;
1516+
else
1517+
p = &(*p)->delta_sibling;
1518+
}
1519+
entry->delta = NULL;
1520+
1521+
oi.sizep = &entry->size;
1522+
oi.typep = &entry->type;
1523+
if (packed_object_info(entry->in_pack, entry->in_pack_offset, &oi) < 0) {
1524+
/*
1525+
* We failed to get the info from this pack for some reason;
1526+
* fall back to sha1_object_info, which may find another copy.
1527+
* And if that fails, the error will be recorded in entry->type
1528+
* and dealt with in prepare_pack().
1529+
*/
1530+
entry->type = sha1_object_info(entry->idx.sha1, &entry->size);
1531+
}
1532+
}
1533+
1534+
/*
1535+
* Follow the chain of deltas from this entry onward, throwing away any links
1536+
* that cause us to hit a cycle (as determined by the DFS state flags in
1537+
* the entries).
1538+
*/
1539+
static void break_delta_chains(struct object_entry *entry)
1540+
{
1541+
/* If it's not a delta, it can't be part of a cycle. */
1542+
if (!entry->delta) {
1543+
entry->dfs_state = DFS_DONE;
1544+
return;
1545+
}
1546+
1547+
switch (entry->dfs_state) {
1548+
case DFS_NONE:
1549+
/*
1550+
* This is the first time we've seen the object. We mark it as
1551+
* part of the active potential cycle and recurse.
1552+
*/
1553+
entry->dfs_state = DFS_ACTIVE;
1554+
break_delta_chains(entry->delta);
1555+
entry->dfs_state = DFS_DONE;
1556+
break;
1557+
1558+
case DFS_DONE:
1559+
/* object already examined, and not part of a cycle */
1560+
break;
1561+
1562+
case DFS_ACTIVE:
1563+
/*
1564+
* We found a cycle that needs broken. It would be correct to
1565+
* break any link in the chain, but it's convenient to
1566+
* break this one.
1567+
*/
1568+
drop_reused_delta(entry);
1569+
entry->dfs_state = DFS_DONE;
1570+
break;
1571+
}
1572+
}
1573+
14971574
static void get_object_details(void)
14981575
{
14991576
uint32_t i;
@@ -1511,6 +1588,13 @@ static void get_object_details(void)
15111588
entry->no_try_delta = 1;
15121589
}
15131590

1591+
/*
1592+
* This must happen in a second pass, since we rely on the delta
1593+
* information for the whole list being completed.
1594+
*/
1595+
for (i = 0; i < to_pack.nr_objects; i++)
1596+
break_delta_chains(&to_pack.objects[i]);
1597+
15141598
free(sorted_by_offset);
15151599
}
15161600

pack-objects.h

+9
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,15 @@ struct object_entry {
2727
unsigned no_try_delta:1;
2828
unsigned tagged:1; /* near the very tip of refs */
2929
unsigned filled:1; /* assigned write-order */
30+
31+
/*
32+
* State flags for depth-first search used for analyzing delta cycles.
33+
*/
34+
enum {
35+
DFS_NONE = 0,
36+
DFS_ACTIVE,
37+
DFS_DONE
38+
} dfs_state;
3039
};
3140

3241
struct packing_data {

t/t5314-pack-cycle-detection.sh

+113
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,113 @@
1+
#!/bin/sh
2+
3+
test_description='test handling of inter-pack delta cycles during repack
4+
5+
The goal here is to create a situation where we have two blobs, A and B, with A
6+
as a delta against B in one pack, and vice versa in the other. Then if we can
7+
persuade a full repack to find A from one pack and B from the other, that will
8+
give us a cycle when we attempt to reuse those deltas.
9+
10+
The trick is in the "persuade" step, as it depends on the internals of how
11+
pack-objects picks which pack to reuse the deltas from. But we can assume
12+
that it does so in one of two general strategies:
13+
14+
1. Using a static ordering of packs. In this case, no inter-pack cycles can
15+
happen. Any objects with a delta relationship must be present in the same
16+
pack (i.e., no "--thin" packs on disk), so we will find all related objects
17+
from that pack. So assuming there are no cycles within a single pack (and
18+
we avoid generating them via pack-objects or importing them via
19+
index-pack), then our result will have no cycles.
20+
21+
So this case should pass the tests no matter how we arrange things.
22+
23+
2. Picking the next pack to examine based on locality (i.e., where we found
24+
something else recently).
25+
26+
In this case, we want to make sure that we find the delta versions of A and
27+
B and not their base versions. We can do this by putting two blobs in each
28+
pack. The first is a "dummy" blob that can only be found in the pack in
29+
question. And then the second is the actual delta we want to find.
30+
31+
The two blobs must be present in the same tree, not present in other trees,
32+
and the dummy pathname must sort before the delta path.
33+
34+
The setup below focuses on case 2. We have two commits HEAD and HEAD^, each
35+
which has two files: "dummy" and "file". Then we can make two packs which
36+
contain:
37+
38+
[pack one]
39+
HEAD:dummy
40+
HEAD:file (as delta against HEAD^:file)
41+
HEAD^:file (as base)
42+
43+
[pack two]
44+
HEAD^:dummy
45+
HEAD^:file (as delta against HEAD:file)
46+
HEAD:file (as base)
47+
48+
Then no matter which order we start looking at the packs in, we know that we
49+
will always find a delta for "file", because its lookup will always come
50+
immediately after the lookup for "dummy".
51+
'
52+
. ./test-lib.sh
53+
54+
55+
56+
# Create a pack containing the the tree $1 and blob $1:file, with
57+
# the latter stored as a delta against $2:file.
58+
#
59+
# We convince pack-objects to make the delta in the direction of our choosing
60+
# by marking $2 as a preferred-base edge. That results in $1:file as a thin
61+
# delta, and index-pack completes it by adding $2:file as a base.
62+
#
63+
# Note that the two variants of "file" must be similar enough to convince git
64+
# to create the delta.
65+
make_pack () {
66+
{
67+
printf '%s\n' "-$(git rev-parse $2)"
68+
printf '%s dummy\n' "$(git rev-parse $1:dummy)"
69+
printf '%s file\n' "$(git rev-parse $1:file)"
70+
} |
71+
git pack-objects --stdout |
72+
git index-pack --stdin --fix-thin
73+
}
74+
75+
test_expect_success 'setup' '
76+
test-genrandom base 4096 >base &&
77+
for i in one two
78+
do
79+
# we want shared content here to encourage deltas...
80+
cp base file &&
81+
echo $i >>file &&
82+
83+
# ...whereas dummy should be short, because we do not want
84+
# deltas that would create duplicates when we --fix-thin
85+
echo $i >dummy &&
86+
87+
git add file dummy &&
88+
test_tick &&
89+
git commit -m $i ||
90+
return 1
91+
done &&
92+
93+
make_pack HEAD^ HEAD &&
94+
make_pack HEAD HEAD^
95+
'
96+
97+
test_expect_success 'repack' '
98+
# We first want to check that we do not have any internal errors,
99+
# and also that we do not hit the last-ditch cycle-breaking code
100+
# in write_object(), which will issue a warning to stderr.
101+
>expect &&
102+
git repack -ad 2>stderr &&
103+
test_cmp expect stderr &&
104+
105+
# And then double-check that the resulting pack is usable (i.e.,
106+
# we did not fail to notice any cycles). We know we are accessing
107+
# the objects via the new pack here, because "repack -d" will have
108+
# removed the others.
109+
git cat-file blob HEAD:file >/dev/null &&
110+
git cat-file blob HEAD^:file >/dev/null
111+
'
112+
113+
test_done

0 commit comments

Comments
 (0)