Replace func trie with hashmap #179
base: master
Conversation
@visitorckw, can you comment on this?
Looks good as is. However, as mentioned, a large number of functions may cause excessive collisions and slow down performance, while for smaller function counts the default 512 buckets might be overkill. Therefore, a radix tree with dynamic memory allocation could still be worth exploring in the future.
I'm concerned that dynamic memory allocation at this moment is not reliable and potentially flawed. I've attempted to implement a rehashing algorithm before, but at stage 2 the compilation fails, while GCC and stage 1 are fine.
for (; *key; key++) {
    hash ^= *key;
    hash *= 0x01000193;
The multiplication here may cause overflow, leading to undefined behavior. Signed integer overflow is undefined, while unsigned integer overflow is not. Since shecc currently lacks support for unsigned integers, we might consider adding it to address this issue.
I think we can simply add an unsigned type at this moment to reduce the effort of introducing a full new type. This way, unsigned can still reuse the current signed arithmetic routines (since both are based on two's complement); in ARMv7 and 32-bit RISC-V assembly, signed overflow wraps in a well-defined manner, and the only difference would be the interpretation of the most significant bit.
Edit: I just realized we still need to handle comparison, but I prefer to defer the unsigned integer feature, since we have an ongoing project that requires full resolution of shecc's specification, and it doesn't include unsigned types at this moment. I think this addition would alter the simplicity of the project. @jserv, should we postpone this hashmap implementation?
should we postpone this hashmap implementation?
You can simply convert this pull request to draft state.
I think this pull request should be merged as soon as possible. The reason is that I'm currently working on the type_t refactor, but I have encountered a weird free(): invalid pointer issue when adding functions in globals.c. I assume the reason is that the function trie cannot hold more than a certain number of functions, and the length of the function names could also contribute to this issue. These two factors and the flaw are already described here:
Lines 103 to 109 in 09bb918
if (!trie->next[fc]) {
    /* FIXME: The func_tries_idx variable may exceed the maximum number,
     * which can lead to a segmentation fault. This issue is affected by
     * the number of functions and the length of their names. The proper
     * way to handle this is to dynamically allocate a new element.
     */
    trie->next[fc] = func_tries_idx++;
But after cherry-picking this branch without any changes to the function structures, the issue immediately went away.
One possible solution is to add the -fwrapv compilation flag to gcc, instructing the compiler to wrap the signed integer overflow result according to the two's complement representation; this ensures defined behavior at least when compiling with gcc. Meanwhile, shecc itself is fine at this moment, since both 32-bit ARM and 32-bit RISC-V assembly wrap the overflowed value according to two's complement as well.
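A minimal sketch of the proposed workaround, assuming a GNU-make build; the variable name CFLAGS and the file paths are assumptions for illustration, not taken from shecc's Makefile:

```shell
# Instruct gcc (and clang) to treat signed overflow as two's-complement
# wrap-around, so the FNV-1a multiply is no longer undefined behavior
# when the compiler is built with a host compiler.
CFLAGS += -fwrapv

# Equivalent one-off invocation (paths are illustrative):
#   gcc -fwrapv -O2 -o shecc src/main.c
```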
Could enlarging the limitation in refs.h temporarily fix this problem? I remember the trie count almost reached the limitation in the last change. This workaround could be removed after applying this patch.
Changing the macro MAX_FUNC_TRIES in defs.h to 3000 does the trick.
I don't necessarily oppose this PR. However, if the issue is that MAX_FUNC_TRIES is too small, causing an out-of-bounds array access, it seems unrelated to switching from a trie to a hash table. A hash table can also use a fixed array, and a trie can use dynamic memory allocation. This feels more like adding a new feature unrelated to fixing the bug itself. But I'm fine with switching to a hash table as a workaround.
The major reason I would like to replace the trie is that some errors are not straightforward to diagnose; in this case, it produces free(): invalid pointer, which is not obviously connected to an insufficient trie size and is not friendly to newcoming contributors, in my opinion.
Previously, the trie implementation was not consistent, mainly because it used indices to point the referenced func_t into FUNCS. Additionally, a trie's advantage is that it enables prefix lookup, but shecc has never used it that way. Furthermore, it takes 512 bytes per trie node, while this implementation takes 24 + W bytes per hashmap bucket node (where W is the key length including the NUL terminator), which significantly reduces memory usage. This also allows future refactoring of additional structures onto the hashmap implementation. Note that the FNV-1a hashing function currently uses a signed integer to hash keys, which would lead to undefined behavior; instead of adding unsigned integers to resolve this, we add the "-fwrapv" compiler flag to instruct gcc to wrap the overflow result according to the two's complement representation. Meanwhile, in shecc the result is guaranteed to always wrap around according to two's complement.
Code Review Agent Run #86353d
Actionable Suggestions - 9
Additional Suggestions - 2
Review Details
Changelist by Bito: This pull request implements the following key changes.
hashmap_t *map = malloc(sizeof(hashmap_t));
map->size = round_up_pow2(size);
map->buckets = malloc(size * sizeof(hashmap_node_t *));
Consider adding NULL checks after the malloc() calls to handle out-of-memory conditions gracefully.
Code suggestion
Check the AI-generated fix before applying
hashmap_t *map = malloc(sizeof(hashmap_t));
if (!map)
    return NULL;
map->size = round_up_pow2(size);
map->buckets = malloc(size * sizeof(hashmap_node_t *));
if (!map->buckets) {
    free(map);
    return NULL;
}
Code Review Run #86353d
Is this a valid issue, or was it incorrectly flagged by the Agent?
- it was incorrectly flagged
{
    hashmap_t *map = malloc(sizeof(hashmap_t));
    map->size = round_up_pow2(size);
    map->buckets = malloc(size * sizeof(hashmap_node_t *));
Consider using map->size instead of size when allocating memory for the buckets to ensure consistent sizing. The current code allocates memory based on the original size parameter rather than the rounded-up power-of-2 value stored in map->size.
Code suggestion
Check the AI-generated fix before applying
map->buckets = malloc(map->size * sizeof(hashmap_node_t *));
- it was incorrectly flagged
fn = malloc(sizeof(func_t));
hashmap_put(FUNCS_MAP, name, fn);
Consider adding a memory allocation failure check after malloc(). The malloc() call could return NULL if memory allocation fails, which should be handled gracefully.
Code suggestion
Check the AI-generated fix before applying
fn = malloc(sizeof(func_t));
if (!fn) {
    return NULL;
}
hashmap_put(FUNCS_MAP, name, fn);
- it was incorrectly flagged
if (hashmap_contains(FUNCS_MAP, name)) {
    fn = hashmap_get(FUNCS_MAP, name);
} else {
    fn = malloc(sizeof(func_t));
Consider initializing the newly allocated func_t structure using memset() or explicit initialization. Currently, the memory allocated for fn is not initialized, which could lead to undefined behavior when accessing uninitialized fields.
Code suggestion
Check the AI-generated fix before applying
fn = malloc(sizeof(func_t));
memset(fn, 0, sizeof(func_t));
- it was incorrectly flagged
{
    int len = strlen(key);
Consider adding a NULL check for the key parameter to prevent undefined behavior if a NULL key is passed.
Code suggestion
Check the AI-generated fix before applying
{
    if (!key)
        return NULL;
    int len = strlen(key);
- it was incorrectly flagged
hashmap_node_t *node = malloc(sizeof(hashmap_node_t));
node->key = calloc(len + 1, sizeof(char));
Consider adding NULL checks after the memory allocation calls. If either malloc() or calloc() fails, dereferencing node or node->key could lead to a segmentation fault.
Code suggestion
Check the AI-generated fix before applying
hashmap_node_t *node = malloc(sizeof(hashmap_node_t));
if (!node)
    return NULL;
node->key = calloc(len + 1, sizeof(char));
if (!node->key) {
    free(node);
    return NULL;
}
- it was incorrectly flagged
*/
bool hashmap_contains(hashmap_t *map, char *key)
{
    return hashmap_get(map, key);
The hashmap_contains() function directly returns the result of hashmap_get(), which is a void pointer, relying on the implicit pointer-to-bool conversion. Consider explicitly comparing the return value with NULL to make the intent clear.
Code suggestion
Check the AI-generated fix before applying
return hashmap_get(map, key) != NULL;
- it was incorrectly flagged
@@ -600,8 +679,7 @@ void global_init()
     BLOCKS.head = NULL;
     BLOCKS.tail = NULL;
     MACROS = malloc(MAX_ALIASES * sizeof(macro_t));
-    FUNCS = malloc(MAX_FUNCS * sizeof(func_t));
-    FUNC_TRIES = malloc(MAX_FUNC_TRIES * sizeof(trie_t));
+    FUNCS_MAP = hashmap_create(MAX_FUNCS);
Consider validating the return value of hashmap_create() to handle potential memory allocation failures. The function could return NULL if malloc() fails inside it.
Code suggestion
Check the AI-generated fix before applying
FUNCS_MAP = hashmap_create(MAX_FUNCS);
if (!FUNCS_MAP) {
    fprintf(stderr, "Failed to create functions hashmap\n");
    exit(1);
}
- it was incorrectly flagged
@@ -630,8 +709,7 @@ void global_release()
         BLOCKS.head = next;
     }
     free(MACROS);
-    free(FUNCS);
-    free(FUNC_TRIES);
+    hashmap_free(FUNCS_MAP);
Consider checking that FUNCS_MAP is not NULL before calling hashmap_free(). The current implementation might lead to undefined behavior if FUNCS_MAP is NULL.
Code suggestion
Check the AI-generated fix before applying
if (FUNCS_MAP) {
    hashmap_free(FUNCS_MAP);
}
- it was incorrectly flagged
Previously, the trie implementation was not consistent, mainly because it used indices to point the referenced func_t into FUNCS. Additionally, it lacks dynamic allocation, which might cause segmentation faults and adds technical debt when debugging either FUNCS or FUNCS_TRIE. Thus, this PR resolves the issue by introducing a dynamic hashmap.
The current implementation uses the FNV-1a hashing algorithm (the 32-bit edition, to be precise), and due to the lack of an unsigned integer implementation, the hashing result ranges from 0 to 2,147,483,647.
Note that the current implementation may suffer from slow lookups as the number of functions keeps increasing, since it doesn't offer rehashing based on a load factor (ideally 0.75, but shecc currently does not support floating-point numbers).
This also enables us to refactor more structures in shecc with the hashmap implementation later.
Benchmark for ./tests/hello.c compilation
Before
After
Summary by Bito
This PR implements a new hashmap-based function lookup system that replaces the existing trie implementation. The changes introduce the FNV-1a hashing algorithm for efficient key distribution and include the core hashmap operations (create, put, get, free). The implementation improves code maintainability and addresses previous segmentation fault issues, though it lacks rehashing functionality.
Unit tests added: False
Estimated effort to review (1-5, lower is better): 2