-
Notifications
You must be signed in to change notification settings - Fork 301
Add a new method roaring_bitmap_range_bool_array.
#764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
173ef04 to
57a91cd
Compare
57a91cd to
ff677b9
Compare
b15d97d to
c957dfa
Compare
|
I am not 100% convinced about your motivation. We already support conversion to a bitset: roaring_bitmap_t *r1 = roaring_bitmap_create();
for (uint32_t i = 100; i < 100000; i+= 1 + (i%5)) {
roaring_bitmap_add(r1, i);
}
for (uint32_t i = 100000; i < 500000; i+= 100) {
roaring_bitmap_add(r1, i);
}
roaring_bitmap_add_range(r1, 500000, 600000);
bitset_t * bitset = bitset_create();
bool success = roaring_bitmap_to_bitset(r1, bitset);
assert(success); // could fail due to memory allocation.
assert(bitset_count(bitset) == roaring_bitmap_get_cardinality(r1));
// You can then query the bitset:
for (uint32_t i = 100; i < 100000; i+= 1 + (i%5)) {
assert(bitset_get(bitset,i));
}
for (uint32_t i = 100000; i < 500000; i+= 100) {
assert(bitset_get(bitset,i));
}
// you must free the memory:
bitset_free(bitset);
roaring_bitmap_free(r1);A bitset instance is quite simple: struct bitset_s {
uint64_t *CROARING_CBITSET_RESTRICT array;
/* For simplicity and performance, we prefer to have a size and a capacity
* that is a multiple of 64 bits. Thus we only track the size and the
* capacity in terms of 64-bit words allocated */
size_t arraysize;
size_t capacity;
};
typedef struct bitset_s bitset_t;I guess we can always argue that it is good to have more ways to get the same work done, but expanding our API is not free. If this is related to something ClickHouse needs, then sure... but do you have a related ClickHouse issue ? Or discussion thread ? |
|
Hi @lemire, thanks for the response. I also found that we can convert to
There are already conversion from a roaring bitmap to a uint8 vector in ClickHouse, but the conversion is not efficient yet: Current implementation is using a uint32 iterator to find all values in the roaring bitmap within a specific range, and fill the coresponding position to Apache Doris also implement a similar way to iterate the roaring bitmap to fill a uint8 vector: |
include/roaring/roaring.h
Outdated
| * e.g. | ||
| * ans = malloc(roaring_bitmap_maximum(bitmap) * sizeof(bool)); | ||
| * | ||
| * This function always returns `true` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the function always return true, why have a return value at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the function always return true, why have a return value at all?
I followed the style of existing APIs like this one:
CRoaring/include/roaring/roaring.h
Lines 558 to 573 in d5f8433
| /** | |
| * Convert the bitmap to a sorted array from `offset` by `limit`, output in | |
| * `ans`. | |
| * | |
| * Caller is responsible to ensure that there is enough memory allocated, e.g. | |
| * | |
| * ans = malloc(roaring_bitmap_get_cardinality(limit) * sizeof(uint32_t)); | |
| * | |
| * This function always returns `true` | |
| * | |
| * For more control, see `roaring_uint32_iterator_skip` and | |
| * `roaring_uint32_iterator_read`, which can be used to e.g. tell how many | |
| * values were actually read. | |
| */ | |
| bool roaring_bitmap_range_uint32_array(const roaring_bitmap_t *r, size_t offset, | |
| size_t limit, uint32_t *ans); |
Should we remove them all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function originally returned false on error, at least as per its specification. We updated the documentation a few months ago to reflect the fact that it cannot, in fact, return false. Given that the function has been around for a long time, it is easier to keep the bool return (doing otherwise might break existing code).
I am not sure that it is a sensible pattern that we should reproduce.
Right?
|
@RinChanNOWWW That's an excellent answer, but I am not sure what you have implemented would work for ClickHouse. Look at the use case: roaring::api::roaring_uint32_iterator_t it;
roaring_iterator_init(data.bitmap32, &it);
if (!roaring_uint32_iterator_move_equalorlarger(&it, starting_row))
return false;
bool has_value = false;
while (it.current_value < ending_row)
{
has_value = true;
pos[it.current_value - starting_row] = 1;
if (!roaring_uint32_iterator_advance(&it))
break;
}
return has_value;So it seems that it needs to operate over a range, doesn't it ? Of course, you can always dump the whole thing to a temporary buffer and copy it over, but that's not efficient. |
Yes.
I would like to implement a range API |
|
@RinChanNOWWW I understand, but I am concerned about adding functions for which I see no obvious use. Your analysis is excellent, but it suggests that the range functions are what we want to have. If we had the ranged function, then your version would become unnecessary. |
|
@lemire You are right. So let me implement the range function in this PR first. |
c957dfa to
ffa4b35
Compare
roaring_bitmap_to_bool_array.roaring_bitmap_range_bool_array.
|
Hi @lemire. This PR is ready for review now. It mainly adds three APIs:
As the PR is quite big, SIMD optimization and the same API for roaring64 is not included. |
|
I wrote a benchmark and found that the performance is already better than iterating by uint32 iterator. The benchmark codes: https://pastebin.com/wb72djVn |
| * positions will remain to be false. | ||
| * - after function returns, iterator is positioned at the next element | ||
| */ | ||
| void roaring_uint32_iterator_read_into_bool(roaring_uint32_iterator_t *it, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function roaring_bitmap_range_bool_array(r1, range_start, range_end, bool_array) is fine and easy to understand, but I don't understand the use case here, and I don't understand from the description what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume that the user is expected to have an iterator that points at some initial value, but I don't understand how they would do it cleanly and what the purpose is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is used to iterate the giving it until max_value. When the function returns, the status of final it will be:
it->has_value == false- or
it->has_value == true && it->current_value >= max_value && prev(it)->current_value < max_value
There is a diagram to show what this function does:
final_it(8)
it(4) max_value(8)
│ │
▼ ▼
Values: 1 2 3 4 5 6 7 8 9
Roaring: x x x x x
The result bool array: [1 0 0 1]
Size of the bool array: 4 ▲
│
Start of the bool array
I will improve the comments and make it more clear.
|
My concern at this point is the user of the iterator in the function signatures. It seems to me that it makes the code unnecessarily complicated. |
Yes. I am not concerned with the efficiency of your implementation. Your code will be fast, that's fine. |
@lemire The iterater function (
bool *ans = new bool[SIZE];
roaring_uint32_iterator_read_into_bool(it, ans + it->current_value, max_value=200);
// Some other operations...
// Continue to iterator `it`
roaring_uint32_iterator_read_into_bool(it, ans + it->current_value, max_value=300);to skip unnecessary seeks for their "range start"s. I will fix some of your code reviews and give better comments for the function. |
4099c0e to
553bdc1
Compare
|
@RinChanNOWWW My recommendation at this stage is that we tune the change in CRoaring to what is useful to ClickHouse's design. It would be a shame to add a specialized function in CRoaring that then goes unused. We ended having to support and maintain an unused function. So let us make sure that there is a very good chance that ClickHouse will accept your PR, and then let us make CRoaring work according to this PR. In other words, I am all for adopting a new function if ClickHouse will adopt it, but let us first make sure that it really meets the needs of ClickHouse. My concern is that it is a highly specialized and not generally useful idea. It is a bit silly to extract to a bool array. There are very few cases where this is a good idea. |
|
I am going to run tests. When you are ready to issue a ClickHouse pull request, we can prepare a tentative release, and see how it works out. |
|
Fixed sanitizer fail in unit test. |
@lemire Sure. I'm going to use the new API in ClickHouse/ClickHouse#90266 |
Currently, we can only convert the roaring bitmap to a sparse uint32 array or a dense bitset. In some case, we may need to get a dense boolean/uint8 array. For example, some database systems like ClickHouse, will use a uint8 array as a mask column to filter data.
This PR introduces
roaring_bitmap_range_bool_arrayto convert a roaring bitmap to a bool array to meet the requirement. The implementation is quite simple, and optimizations (like utilize SIMD) could be introduced in later PRs if this PR is accepted by the community.