Add GFD mining #465

Open
wants to merge 5 commits into
base: main

Contributor

@AntonChern AntonChern commented Sep 25, 2024

This PR implements an algorithm for mining graph functional dependencies (GFDs), based on the article "Discovering Graph Functional Dependencies" by Fan Wenfei, Hu Chunming, Liu Xueli, and Lu Ping. Given an input graph, the algorithm returns a set of dependencies that hold on this graph. It has two configurable parameters: k, the maximum number of vertices in the pattern of a mined dependency, and sigma, the minimum frequency of a mined dependency.
In addition, the PR makes the algorithm available from Python and includes usage examples.

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 29. Check the log or trigger a new build to see more.

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@AntonChern AntonChern force-pushed the gfd branch 2 times, most recently from c3eeecf to a59e3cc on October 3, 2024 08:42
: query_(query_), iso_(iso_), res_(res_) {}

template <typename CorrespondenceMap1To2, typename CorrespondenceMap2To1>
bool operator()(CorrespondenceMap1To2 f, CorrespondenceMap2To1) const {
Contributor

Second parameter is not used

Contributor Author

The Boost Graph Library requires this signature for the callback function. Unfortunately, if I change the signature, the code will not compile.
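
For context, here is a minimal sketch of a callback whose second parameter is demanded by the caller but never read by the body. ReportMappings is a hypothetical stand-in for a library routine that invokes its callback with two arguments, as a VF2-style matcher does; it is not Boost's API, and CollectForward is an illustrative name as well. Leaving the parameter unnamed keeps the required signature while making it explicit that the value is intentionally ignored.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a library routine that calls the user callback
// with two arguments: a forward map and the matching backward map.
template <typename Callback>
std::size_t ReportMappings(std::vector<std::vector<int>> const& forward_maps,
                           Callback callback) {
    std::size_t reported = 0;
    for (std::vector<int> const& forward : forward_maps) {
        // Build the inverse map; each forward map is assumed to be a permutation.
        std::vector<int> backward(forward.size(), -1);
        for (std::size_t i = 0; i < forward.size(); ++i) {
            backward[static_cast<std::size_t>(forward[i])] = static_cast<int>(i);
        }
        ++reported;
        // A false return value asks the routine to stop enumerating.
        if (!callback(forward, backward)) break;
    }
    return reported;
}

// Collects forward maps only. The second parameter is unnamed: the call site
// still passes two arguments, but the body has nothing unused to warn about.
struct CollectForward {
    std::vector<std::vector<int>>& result;

    template <typename Map1To2, typename Map2To1>
    bool operator()(Map1To2 const& f, Map2To1 const&) const {
        result.push_back(f);
        return true;  // keep enumerating
    }
};
```

Calling ReportMappings(maps, CollectForward{out}) fills out with every forward map. Dropping the second parameter from operator() would make the call inside ReportMappings ill-formed, which is the kind of compile error described above.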

@xJoskiy
Contributor

xJoskiy commented Oct 8, 2024

Add PR description

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@xJoskiy
Contributor

xJoskiy commented Oct 9, 2024

You are doing great! Please don't forget to notify us when you are done with the requested changes. Also, please mark conversations as resolved once they are.

@AntonChern AntonChern force-pushed the gfd branch 2 times, most recently from 60aeadc to 9ce90a3 on October 16, 2024 15:46
@AntonChern AntonChern requested a review from xJoskiy October 16, 2024 15:47
Token name_token = Token(i, name.first);
for (auto& value : name.second) {
Token value_token = Token(-1, value);
result.push_back(Literal(name_token, value_token));
Contributor

result.reserve(name.second.size() * attrs_info.at(label).size())

Contributor Author

Unfortunately, name.second can hold a different number of values depending on name.first, so this reserve may not request the right capacity.
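
One possible way to reconcile the two comments is to sum the per-name value counts first and reserve exactly once. The sketch below is only an illustration under simplified assumptions: AttrValues, SimpleLiteral, and MakeLiterals are hypothetical stand-ins, not the PR's attrs_info, Literal, or Token types.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified stand-ins for the types used in the PR.
using AttrValues = std::map<std::string, std::set<std::string>>;
using SimpleLiteral = std::pair<std::string, std::string>;

std::vector<SimpleLiteral> MakeLiterals(AttrValues const& attrs) {
    // First pass: each name may carry a different number of values,
    // so the exact total is the sum of the individual sizes.
    std::size_t total = 0;
    for (auto const& [name, values] : attrs) {
        total += values.size();
    }

    std::vector<SimpleLiteral> result;
    result.reserve(total);  // a single allocation of the exact size

    // Second pass: emit one (name, value) pair per value.
    for (auto const& [name, values] : attrs) {
        for (std::string const& value : values) {
            result.emplace_back(name, value);
        }
    }
    return result;
}
```

The extra pass only reads container sizes, so it is cheap next to the allocations it avoids. Whether it is worth the added code depends on how large these vectors get in practice.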

@AntonChern AntonChern force-pushed the gfd branch 2 times, most recently from 11d6a2e to 13b76ae on November 5, 2024 16:38
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@AntonChern AntonChern force-pushed the gfd branch 2 times, most recently from b5fd5d6 to ea596c9 on March 16, 2025 14:42
@AntonChern AntonChern force-pushed the gfd branch 2 times, most recently from 9e67961 to 8cb290f on March 24, 2025 19:05
@AntonChern AntonChern requested a review from xJoskiy March 24, 2025 19:06
@xJoskiy
Contributor

xJoskiy commented Mar 24, 2025

LGTM

Collaborator

@ol-imorozko ol-imorozko left a comment

It would be great if you fixed these minor issues. However, I think the code is already good enough to be approved.

Comment on lines +516 to +515
for (std::size_t i = 0; i < patterns.size(); ++i) {
auto pattern = patterns[i];
Collaborator

I don't know if we have std::views::enumerate support, but if we do, then it's better to use it in cases where we need both an index and an element:

Suggested change
- for (std::size_t i = 0; i < patterns.size(); ++i) {
-     auto pattern = patterns[i];
+ for (auto [i, pattern] : std::views::enumerate(patterns)) {

Comment on lines +618 to +617
for (std::size_t i = 0; i < patterns.size(); ++i) {
graph_t pattern = patterns.at(i);
Collaborator

Same here if we have std::views::enumerate

Contributor Author

As far as I understand, the project only supports C++20
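
Since std::views::enumerate is a C++23 addition, a C++20 codebase needs another way to get both the index and the element. One option, sketched here, is the C++20 range-based for loop with an init-statement. LabelPatterns and its string elements are hypothetical; the PR iterates over graph patterns instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: prefixes every element with its position.
std::vector<std::string> LabelPatterns(std::vector<std::string> const& patterns) {
    std::vector<std::string> labeled;
    labeled.reserve(patterns.size());

    // The init-statement scopes the index to the loop, and binding by
    // const reference avoids the per-iteration copy that
    // `auto pattern = patterns[i];` would make.
    for (std::size_t i = 0; std::string const& pattern : patterns) {
        labeled.push_back(std::to_string(i) + ": " + pattern);
        ++i;  // advance by hand; the loop itself only walks the elements
    }
    return labeled;
}
```

The manual increment is easy to forget, particularly if the loop body ever gains a continue, so the plain indexed loop from the PR remains a reasonable choice too.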

Collaborator

@vs9h vs9h left a comment

I looked at almost nothing besides the tests; the logic was hopefully checked by other reviewers.

I left my most important comment in the test file itself. But I would also like to draw attention to two things:

  1. Why do we have access to types specific to this primitive (Gfd, Literal, Token) in the algos namespace? Usually the model namespace is used for this. Literal and Token definitely should not be in this namespace; Gfd is acceptable there, but still not ideal.
  2. Some tests run for quite a long time in debug mode:
https://github.com/Desbordante/desbordante-core/actions/runs/14045967659/job/39326756580?pr=465
Run core-tests on ubuntu-latest with llvm-clang, Debug

270/719 Test #273: GfdMiningTest.TestMovies ...   Passed   13.92 sec
        Start 274: GfdMiningTest.TestSymbols
271/719 Test #274: GfdMiningTest.TestSymbols ...   Passed  129.41 sec
        Start 275: GfdMiningTest.TestShapes
272/719 Test #275: GfdMiningTest.TestShapes ...   Passed   61.20 sec

Is this normal? Is there a known acceptable limit? This seems quite long. Such tests are usually marked with an additional prefix (HeavyDatasets) so that overly expensive tests are not run every time.

Add class for GFD miner
Add minimal GFD and GFD with multiple conclusion tests
Added two examples with searching for dependencies in small graphs.
4 participants