-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Case insensitive prefix search, data is case sensitive. #7
Comments
Hi, Yes currently only case-sensitive lookups are supported. I would like to add something like the The main problem is that when I go down the trie, I go down byte by byte and not character by character. It's the same for a simple ASCII alphabet but different for multi-bytes UTF-8 characters. So even if I provide a way for the user to use a custom comparator (though in my implementation I don't directly compare two bytes, I convert a Also if I allow the trie to be case sensitive while providing case insensitive lookups, it means that lookups will have to go in different branches of the trie (possible but more complicate to implement). I would like to add this feature but I don't really know what would be the best way to proceed to keep an intuitive interface that supports UTF-8 strings. |
Hi, It sounds like the current way (casting char to int ad using that as array index) is a way that has to go if you want to consider case insensitive as well (or rather, any for of compare). What would that do for performance? |
I tried with and But it's possible to work that out, instead of converting the |
Yeah, trying both would work. But that will cripple your performance of this nice trie container :) (or cut it in half at most i suppose) |
This should not be a problem, as long as it is implemented in a separate function. |
UTF-8 is a mess to work with. For example, a single accented letter like "á" can be represented in at least two ways: by a single codepoint (U+00E1) or by two codepoints (U+0061 and U+0301). It will be HARD! |
Hi,
I admit, i haven't tried it. But looking at the code it seems to be doing case-sensitive searching.
Imagine i put a bunch of files paths in the trie with structures like:
I'm guessing a prefix search of:
note the lowercase d in the equal_prefix_range call, it will probably not match "/home/a/Desktop" and "/home/a/Downloads" while that would be desirable. Just lowercase everything would fix this, but then the data does not represent the actual paths anymore (both pats could exists in a linux/unix world as it is case sensitive). Using a map and have a lowercase -> uppercase mapping could potentially be a solution as well, but that also doubles the data usage (at the very least) thus kinda defeating the point of using a nice HAT trie in the first place (in terms of memory efficiency).
I do realize that there is a bit of trouble in making the equal_prefix_range case-insensitive (ideally optional, case-sensitive and case-insensitive). You don't know the type i'm putting in as string. It might be a std::string, might be a std::string_view, a byte array, a QString.. You just don't know therefore can't expect a call on a character (like toLowercase() for example) to exist.
So i have a bit of a request. Could you add a function that allows me to set how to compare a prefix?
I'm thinking of a:
tsl::htrie_set::set_character_compare(...);
Then in the compare functions you use whatever is set by the user.
Again, i could be completely wrong and it's already possible, but it doesn't look like that from reading the code ;)
I'm curious about your opinion!
Best regards,
Mark
The text was updated successfully, but these errors were encountered: