Replace func trie with hashmap #179

ChAoSUnItY · 2025-01-18T18:01:41Z

Previously, trie implementation is not consistent, mainly because of using index to point the referencing func_t to FUNCS, additionally, it lacks of dynamic allocation which might cause segmentation fault and results more technical debt to debug on either FUNCS or FUNCS_TRIE. Thus, in this PR, we can resolve this issue by introducing a dynamic hashmap.

Current implementation is using FNV-1a hashing algorithm (32-bit edition to be precise), and due to lack of unsigned integer implementation, hashing result ranges from 0 to 2,147,483,647.

Notice that current implementation may suffer from lookup issue when the function amount keeps increasing since current hashmap implementation doesn't offer rehashing based on load factor (which ideally, 0.75 would be best and currently shecc does not support floating number).

This also enables us to refactor more structures later with hashmap implementation in shecc.

Benchmark for ./tests/hello.c compilation

Before

Command being timed: "./out/shecc tests/hello.c"
        User time (seconds): 0.00
        System time (seconds): 0.02
        Percent of CPU this job got: 76%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 52112
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 12220
        Voluntary context switches: 0
        Involuntary context switches: 0
        Swaps: 0
        File system inputs: 0
        File system outputs: 32
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

After

Command being timed: "./out/shecc tests/hello.c"
        User time (seconds): 0.00
        System time (seconds): 0.02
        Percent of CPU this job got: 71%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 49916
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 12224
        Voluntary context switches: 1
        Involuntary context switches: 0
        Swaps: 0
        File system inputs: 8
        File system outputs: 32
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Summary by Bito

This pull request replaces the trie structure for function lookups with a dynamic hashmap, improving memory management and addressing segmentation faults. The new implementation uses the FNV-1a hashing algorithm but currently lacks dynamic resizing, which may impact performance as function count increases. Overall, it simplifies function storage and sets the stage for future enhancements.

Unit tests added: False

Estimated effort to review (1-5, lower is better): 2

jserv · 2025-01-19T07:20:56Z

@visitorckw, can you comment this?

src/globals.c

visitorckw · 2025-01-19T07:51:35Z

Looks good as is.

However, as mentioned, a large number of functions may cause excessive collisions and slow down performance. For smaller function counts, the default 512 buckets might be overkill. Therefore, a radix tree with dynamic memory allocation could still be a method worth exploring in the future.

ChAoSUnItY · 2025-01-19T08:05:21Z

I'm concerning that dynamic memory allocation at this moment is not reliable and potentially flawed, I've attempted to implement rehashing algorithm before, but on stage 2 the compilation will fail, while the GCC and stage 1 are fine.

src/globals.c

visitorckw · 2025-01-19T14:48:13Z

src/globals.c

+
+    for (; *key; key++) {
+        hash ^= *key;
+        hash *= 0x01000193;


The multiplication here may cause overflow, leading to undefined behavior. Signed integer overflow is undefined, while unsigned integer overflow is not. Since shecc currently lacks support for unsigned integers, we might consider adding it to address this issue.

I think we can just simply add unsigned type at this moment to simplify the effort of new type? This way, unsigned can still uses current signed's arithmetic algorithm (due to the fact that they both based on 2's complement), since in ARMv7 and RISC-V 32bit assembly, signed overflow is well-defined, the only difference would be intrepretation of most-significant bit.

Edit: I just realized we still need to handle comparison, but I prefer to defer unsigned integer feature since we have an ongoing project which requires full resolution of shecc's specification, and which doesn't include unsigned types at this moment. I think this addition would alters the simplicity of project. @jserv should we postpone this hashmap implementation?

should we postpone this hashmap implementation?

You can simply convert this pull request to draft state.

I think this pull request should be implemented as soon as possible, the reason is that I'm currently working on type_t refactor, but I have encountered the weird free(): invalid pointer issue when adding functions in globals.c, which I assume the reason is that the function trie cannot hold more than certain numbers of functions and probably function name's length could also contribute to this issue, these 2 factors and the flaw is already described here:

shecc/src/globals.c

Lines 103 to 109 in 09bb918

if (!trie->next[fc]) {

/* FIXME: The func_tries_idx variable may exceed the maximum number,

* which can lead to a segmentation fault. This issue is affected by

* the number of functions and the length of their names. The proper

* way to handle this is to dynamically allocate a new element.

*/

trie->next[fc] = func_tries_idx++;

But after cherry-picked this branch without any changes to function structures, the issue immediately gone.

One possible solution towards this is to add -fwrapv compilation flag to gcc to instruct compiler to wrap signed integer overflow result according to the 2's compliment representation, this ensures defined behavior at least when compiling with gcc. Meanwhile in shecc, it's fine at this moment since both ARM 32bit and RISC-V 32bit assembly wraps the overflow value according to the 2's compliment representation as well.

Could enlarging limitation in refs.h temporarily fix this problem? I remember the trie count is almost reaching the limitation in the last change.

This workaround could be remove after applying this patch.

Changing macro MAX_FUNC_TRIES in defs.h to 3000 does the trick.

I think this pull request should be implemented as soon as possible, the reason is that I'm currently working on type_t refactor, but I have encountered the weird free(): invalid pointer issue when adding functions in globals.c, which I assume the reason is that the function trie cannot hold more than certain numbers of functions and probably function name's length could also contribute to this issue, these 2 factors and the flaw is already described here:

shecc/src/globals.c

Lines 103 to 109 in 09bb918

if (!trie->next[fc]) {

/* FIXME: The func_tries_idx variable may exceed the maximum number,

* which can lead to a segmentation fault. This issue is affected by

* the number of functions and the length of their names. The proper

* way to handle this is to dynamically allocate a new element.

*/

trie->next[fc] = func_tries_idx++;

I don't necessarily oppose this PR. However, if the issue is that MAX_FUNC_TRIES is too small, causing an out-of-bounds array access, it seems unrelated to switching to a hash table instead of a trie. A hash table can also use an array, and a trie can use dynamic memory allocation. This feels more like adding a new feature unrelated to fixing the bug itself. But I'm fine if we decide to switch to a hash table as a workaround.

The major reason that I would like to replace trie is that some errors are not straightforward to be realized, in this case, it generates free(): invalid pointer, which isn't that obvious to do with insufficient trie size and is not friendly to new-coming contributors in my opinion.

src/globals.c

bito-code-review · 2025-02-15T07:12:28Z

Code Review Agent Run #66bd40

Actionable Suggestions - 4

src/globals.c - 4
- Consider adding buckets allocation error check · Line 125-125
- Missing format specifier in printf call · Line 157-157
- Consider adding error handling for FUNCS_MAP · Line 713-714
- Consider adding null check in hashmap_free · Line 746-747

Review Details

Files reviewed - 1 · Commit Range: b14a10d..35671a8
- src/globals.c
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- Fb Infer (Static Code Analysis) - ✖︎ Failed

AI Code Review powered by

src/globals.c

bito-code-review · 2025-02-15T07:14:54Z

src/globals.c

+    node->key = calloc(len + 1, sizeof(char));
+
+    if (!node->key) {
+        printf("Failed to allocate hashmap_node_t key with size %d\n");


Missing format specifier in printf call

The printf format string is missing the format specifier for len. Consider updating to include %d in the format string.

Code suggestion

Check the AI-generated fix before applying

Suggested change

printf("Failed to allocate hashmap_node_t key with size %d\n");

printf("Failed to allocate hashmap_node_t key with size %d\n", len);

Code Review Run #66bd40

Is this a valid issue, or was it incorrectly flagged by the Agent?

it was incorrectly flagged

src/globals.c

bito-code-review · 2025-02-15T08:02:28Z

Code Review Agent Run #4b564e

Actionable Suggestions - 7

src/globals.c - 7
- Missing overflow checks before allocations · Line 137-159
- Consider adding memory cleanup on failure · Line 127-130
- Consider freeing map on allocation failure · Line 127-130
- Consider using unsigned integers for hashing · Line 82-92
- Consider handling duplicate keys in hashmap_put · Line 185-193
- Consider restructuring hashmap node freeing logic · Line 244-246
- Consider adding null check for FUNCS_MAP · Line 480-480

Additional Suggestions - 5

src/globals.c - 3
- Consider validating size parameter · Line 82-82
- Consider initializing hashmap pointer to NULL · Line 24-24
- Consider using NULL for pointer init · Line 132-133
src/defs.h - 1
- Consider adding capacity field to hashmap · Line 316-318
Makefile - 1
- Consider separating security compiler flags · Line 4-4

Review Details

Files reviewed - 3 · Commit Range: f1fb5df..f1fb5df
- Makefile
- src/defs.h
- src/globals.c
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- Fb Infer (Static Code Analysis) - ✖︎ Failed

AI Code Review powered by

src/globals.c

Previously, trie implementation is not consistent, mainly because of using index to point the referencing func_t to FUNCS, additionally, trie's advantage is that enables prefix lookup, but in shecc, it hasn't been used in this way, furthur more, it takes 512 bytes per trie node, while in this implementation, it 24 + W (W stands for key length including NULL character) bytes per hashmap bucket node, which significantly reduces memory usage. This also allows for future refactoring of additional structures using a hashmap implementation. Notice that currently FNV-1a hashing function uses signed integer to hash keys, which would lead to undefined behavior, instead of adding unsigned integer to resolve this, we add "-fwrapv" compiler flag to instruct gcc to wrap overflow result according to 2's complement representation. Meanwhile in shecc, it's guaranteed to be always wrap around according to 2's complement representation.

jserv · 2025-02-27T05:49:03Z

Thank @ChAoSUnItY for contributing!

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

ChAoSUnItY force-pushed the refactor/hashmap branch from 1f5ea61 to 0c48824 Compare January 19, 2025 08:25

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

This comment was marked as resolved.

Sign in to view

ChAoSUnItY force-pushed the refactor/hashmap branch from 0c48824 to 400e3ae Compare January 19, 2025 10:04

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

ChAoSUnItY force-pushed the refactor/hashmap branch 4 times, most recently from 1bacbd7 to cb82f7a Compare January 19, 2025 10:27

jserv reviewed Jan 19, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

ChAoSUnItY force-pushed the refactor/hashmap branch from cb82f7a to 4a5192e Compare January 19, 2025 11:12

visitorckw reviewed Jan 19, 2025

View reviewed changes

ChAoSUnItY marked this pull request as draft January 19, 2025 19:55

ChAoSUnItY force-pushed the refactor/hashmap branch from 4a5192e to 08380c3 Compare January 24, 2025 06:19

ChAoSUnItY marked this pull request as ready for review January 24, 2025 06:22

jserv requested a review from visitorckw January 24, 2025 06:42

jserv reviewed Feb 3, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

jserv reviewed Feb 3, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

ChAoSUnItY force-pushed the refactor/hashmap branch from 08380c3 to b14a10d Compare February 7, 2025 17:24

bito-code-review bot reviewed Feb 7, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 7, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 7, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 7, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 7, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 7, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 15, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 15, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

bito-code-review bot reviewed Feb 15, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

ChAoSUnItY force-pushed the refactor/hashmap branch 2 times, most recently from 4fbb41a to f1fb5df Compare February 15, 2025 08:01

bito-code-review bot reviewed Feb 15, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 15, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 15, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 15, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 15, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

bito-code-review bot reviewed Feb 15, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

ChAoSUnItY force-pushed the refactor/hashmap branch 2 times, most recently from 0d1a957 to bfec486 Compare February 17, 2025 11:03

jserv reviewed Feb 26, 2025

View reviewed changes

src/globals.c Outdated Show resolved Hide resolved

jserv reviewed Feb 26, 2025

View reviewed changes

src/globals.c Show resolved Hide resolved

ChAoSUnItY force-pushed the refactor/hashmap branch from bfec486 to c50f8c1 Compare February 27, 2025 05:33

jserv merged commit 7a53b43 into sysprog21:master Feb 27, 2025
6 checks passed

ChAoSUnItY mentioned this pull request Mar 20, 2025

Fix dereference behavior on mixed subscript and arrow / dot operators #182

Merged

ChAoSUnItY deleted the refactor/hashmap branch April 28, 2025 18:06

	if (!trie->next[fc]) {
	/* FIXME: The func_tries_idx variable may exceed the maximum number,
	* which can lead to a segmentation fault. This issue is affected by
	* the number of functions and the length of their names. The proper
	* way to handle this is to dynamically allocate a new element.
	*/
	trie->next[fc] = func_tries_idx++;

	printf("Failed to allocate hashmap_node_t key with size %d\n");
	printf("Failed to allocate hashmap_node_t key with size %d\n", len);

Replace func trie with hashmap #179

Replace func trie with hashmap #179

Uh oh!

Conversation

ChAoSUnItY commented Jan 18, 2025 • edited by bito-code-review bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Before

After

Summary by Bito

Uh oh!

jserv commented Jan 19, 2025

Uh oh!

Uh oh!

Uh oh!

visitorckw commented Jan 19, 2025

Uh oh!

ChAoSUnItY commented Jan 19, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

visitorckw Jan 19, 2025

Choose a reason for hiding this comment

Uh oh!

ChAoSUnItY Jan 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jserv Jan 19, 2025

Choose a reason for hiding this comment

Uh oh!

ChAoSUnItY Jan 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vacantron Jan 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ChAoSUnItY Jan 24, 2025

Choose a reason for hiding this comment

Uh oh!

visitorckw Jan 27, 2025

Choose a reason for hiding this comment

Uh oh!

ChAoSUnItY Jan 31, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bito-code-review bot commented Feb 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Agent Run #66bd40

Uh oh!

Uh oh!

bito-code-review bot Feb 15, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bito-code-review bot commented Feb 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Agent Run #4b564e

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ChAoSUnItY commented Jan 18, 2025 •

edited by bito-code-review bot

Loading

ChAoSUnItY Jan 19, 2025 •

edited

Loading

ChAoSUnItY Jan 23, 2025 •

edited

Loading

vacantron Jan 24, 2025 •

edited

Loading

bito-code-review bot commented Feb 15, 2025 •

edited

Loading

bito-code-review bot commented Feb 15, 2025 •

edited

Loading