-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(feature request) Insert at prefix position #8
Comments
Hi, I created a branch to implement the feature. You can use it this way: tsl::htrie_map<char, int> map;
map.insert_with_prefix("/home/user/", {{"test", 1}, {"test2", 2},
{"test3", 1}, {"test4", 2},
{"test5", 1}, {"test6", 2},
{"test7", 1}, {"test8", 2},
{"test9", 1}, {"test10", 2},
{"test11", 1}, {"test12", 2},
});
for(auto it = map.begin(); it != map.end(); ++it) {
std::cout << it.key() << " " << it.value() << std::endl;
} Note that it's still a work in progress, I didn't test it much yet. I still hesitate to add it as I wonder if it's really worth it. I have to see a bit regarding how much of a performance boost it provides as the HAT-trie is already quite fast on ordered insertions. |
Hi, Thank you very much! I was trying to compile it, but i keep getting a compile error... I must be doing something wrong. tsl::htrie_map<char, int> map;
std::map<char, int> entries;
entries.emplace("one", 1);
entries.emplace("two", 1);
entries.emplace("three", 1);
map.insert_with_prefix("aa/bb/cc", entries.begin(), entries.end()); note the above without the "insert_with_prefix" function compiles thus provided here as illustration example, but the the std::map type is very wrong as that will never give the correct result if it even runs. It should be a char array which cannot be set in there directly (by the standard). tsl::htrie_map<char, int> map;
std::map<std::string, int> entries;
entries.emplace("one", 1);
entries.emplace("two", 1);
entries.emplace("three", 1);
map.insert_with_prefix("aa/bb/cc", entries.begin(), entries.end()); See the std::string in there. That could also be a std::string_view. |
The problem is if part of the end of the prefix end-up in a hash node. To avoid having to concatenate this end of the prefix with all the values in the iterators (which is expensive) to insert them in a hash node, we have to burst all the hash nodes in the way of the prefix so that each letter of the prefix will be in a trie node. Otherwise I made the correction. My solution was a bit too naive and only supported |
I pulled your changes, but it's not compiling for me (GCC 7.2 on Linux, x64) tsl::htrie_map<char, int> map;
std::map<std::string, int> entries;
entries.emplace("one", 1);
entries.emplace("two", 1);
entries.emplace("three", 1);
map.insert_with_prefix("aa/bb/cc", entries.begin(), entries.end()); The errors i get:
|
Sorry forgot something, could you try again with the latest commit? I'll see to test it a bit more when I have more time. |
Ha :) Now it works. Now for some performance numbers (as that is what i was trying to do in the first place). Note that is completely naive! I could optimize it a lot more. Also, this is just for insertion compared to insertion with prefix! My case: index a single folder of 500.000 entries (dump entries, named from 1.txt till 500000.txt) Anyhow, take that for granted and as 0 line benchmark. Indexing files (no hat-trie, no storage): ~365ms That is impressive! Nearly twice as fast. The code (incase you're interested). std::string currentPrefix = m_url.toStdString();
for (const KIO::UDSEntry &entry: list)
{
m_trie.insert(currentPrefix + "/" + entry.stringValue(KIO::UDSEntry::UDS_NAME).toStdString(), entry);
} insert_with_prefix std::map<std::string, KIO::UDSEntry> entries;
std::string currentPrefix = m_url.toStdString();
for (const KIO::UDSEntry &entry: list)
{
entries.emplace(entry.stringValue(KIO::UDSEntry::UDS_NAME).toStdString(), entry);
}
m_trie.insert_with_prefix(currentPrefix, entries.begin(), entries.end()); Merely changing std::map to std::unordered_map gives a benchmark result of ~1362ms! (well over twice as fast as a normal insert) I'm verifying if this works by counting the number of entries const auto [begin, end] = trie->prefix("file:///home/mark/massive_files/");
qDebug() << "Entry count:" << std::distance(begin, end); If that is 500000 then i call it OK :) And it is for the above tests. Now this is still tremendously wasteful as i get the file entries as QList object, putting it in a std::map<std::string, KIO::UDSEntry> which only serves to be inserted in hat-trie... A more ideal scenario would be to have some kind of magical proxy iterator that would iterate over my list and send back the data on the fly as needed. No, this is not std::transform. That does it partly but puts it in a resulting container. Which is not what i want (unless inserting it directly in hat-trie, but that might be too far fetched). |
Hi, I did some quick tests on my side using the Wikipedia title dataset used in the README.md using "/home/tessil/" as pefix. Results:
So a ~14% improvement. Code: #include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <chrono>
#include <tsl/htrie_map.h>
#include <map>
int main(int argc, char** argv) {
const std::string prefix = "/home/tessil/";
const std::string file_path = argv[1];
std::ifstream file(file_path);
if(!file.is_open()) {
throw std::runtime_error("Couldn't read " + file_path + ".");
}
{
std::map<std::string, int> lines;
for(auto it = std::istream_iterator<std::string>(file); it != std::istream_iterator<std::string>(); ++it) {
lines.insert({*it, 1});
}
const auto chrono_start = std::chrono::high_resolution_clock::now();
tsl::htrie_map<char, int> map;
map.insert_with_prefix(prefix, lines.begin(), lines.end());
const auto chrono_end = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(chrono_end - chrono_start).count() <<
" ms to insert_with_prefix " << map.size() << " elements." << std::endl;
}
file.clear();
file.seekg(0);
{
std::map<std::string, int> lines;
for(auto it = std::istream_iterator<std::string>(file); it != std::istream_iterator<std::string>(); ++it) {
lines.insert({prefix + *it, 1});
}
const auto chrono_start = std::chrono::high_resolution_clock::now();
tsl::htrie_map<char, int> map;
map.insert(lines.begin(), lines.end());
const auto chrono_end = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(chrono_end - chrono_start).count() <<
" ms to insert " << map.size() << " elements." << std::endl;
}
} As you said, your code could be improved by feeding the paths directly to the HAT-trie, but you could also avoid the creation of a Note also that the code only provides the basic exception safety guarantee for now. I have to see to add the strong guarantee before merging the code to master. |
That's tricky on my part. My strings are QString objects, i have to make std::string objects to put them in your container. In fact, if you would support other character types (like QChar) then i don't have to do a conversion there. On the other hand, there are many conversion in my code.. The KIO side that reads the directory is - at it's lowest level - just a readdir with char arrays. From there on it gets converted to QString and stays that way (much more happens, but lets keep it simple). The option that could work is providing a std::string_view implementation that can read from a QString. It would serve both your tree and keep my code as is. As for your code, i'm surprised the prefix version is that much faster with you. If i run your benchmark the win is much less (still a win though). And if i implement batching (just adding it to the tree in batches of 100.000 lines) then everything slows down (i'm guessing due to the large std::map container which i clean up after every 100.000 lines, it's even worse with std::unordered_map). So i just don't include those numbers here. Timings using your benchmark |
EDIT: Yes, I could try to support other char types. I have plans to support In the meantime, there should be a way to get a raw buffer of Regarding the performance you only get a ~10% improvement (compared to my ~14%, both with GCC 7.2 and Clang 5.0). Our environment must play quite a bit (my CPU is not really the best out there). |
I see what you mean, yes that could work. Anyhow, i think we have now - somewhat - concluded that this feature request can have benefits. In the case of indexing an already hierarchical data structure (like a filesystem tree to this trie), it can provide quite big performance improvements. For non hierarchical data structures (say a dictionary) it probably won't provide any benefit or a very small one at best. |
Hi,
Imagine the case of inserting a whole filesystem structure in the hat-trie. You have folders with a bunch of files, more folders and more files. At the moment in the current code each entry is looking up the appropriate point to insert the entry. For example a folder structure like:
the insertion logic has to iterate the tree to find the correct spot. That is fine for the first entry, but could be optimized for the second entry and specially for all entries after the second.
I'd like to propose a insert function with a "fixed prefix" that can then take a list (vector or map, depending on the type of container you used for the hat-trie) to insert a bunch of elements at a given prefix. For instance something like this:
This in only really beneficial for filesystem like structures, not for wordbook purposes. But it would still be a neat addition imho.
Best regards,
Mark
The text was updated successfully, but these errors were encountered: