@gitbuda gitbuda commented Jun 13, 2025

TODOs 🛠️

  • Implement import of 1M nodes and 1M edges
  • Create the Graph API around the arrow+rocks primitives (no transactions)
    • Create abstract pages that load and store nodes & edges as Arrow files on disk
    • Testing and benchmarking
  • Add support for Arrow and Parquet (DURABILITY)
    • Implement all existing graph features for both
    • Implement latency measurements
    • Refactor testing (main is for quick feedback, units are the key, stress ~= benchmark)
  • Add the notion of transactions (ISOLATION & ATOMICITY)
    • Implement G0 from https://github.com/ept/hermitage (memgraph implementation)
    • Fix all under ninja && ./test_graph && ./test_transactional_graph && ./test_hermitage
    • Add proper logging (INFO + TRACE), rerun all available benchmarks
    • Go over all TODOs and make improvements (e.g., having multiple copies of Transaction is 🤯)
    • Implement G0, G1a, G1b, G1c, OTV from https://github.com/ept/hermitage
    • Reiterate benchmarks and plotting of the results
  • Add INDEXES
    • Add basic/slow index implementation under graph + correctness testing
    • Add concurrent index data structure (folly)
    • Add updating what's indexed ("reconfigure indexes")
    • Add basic/slow implementation under the graph_transaction + correctness testing
    • Write the mixed workload benchmark
    • Optimize the index memory usage and performance
  • Refactor time: try to use std::expected instead of the arrow return type
  • Add GC
  • Add RECOVERY
  • Add support for different ordering/PARTITIONING under the primary storage files (data|time-based)
  • Add CONSTRAINTS (👉 full ACID support 👈)
  • Benchmark and optimize
    • Implement BFS
    • Implement ShortestPath
    • Implement AllShortestPath
    • Implement mixed workload
  • Make a benchmark/stress-test utility to estimate the best graph layout / runtime config for a given person
  • Implement continuous integration tests
  • Polish/publish v1 of the embedded graph storage library

  • Integrate with memgraph (an example / previous attempt is under "Experiment alternative storage", gitbuda/memgraph#12) and make it fully transparent (fully behind memgraph's query engine)
  • Parquet seems to be the dominant strategy. Can we somehow make better use of Arrow so that its disk overhead pays off?
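
For the G0 (write cycles) item in the transactions TODO above, the check can be sketched on a toy in-memory store. Everything here is hypothetical and not this PR's API: `ToyStore`, `TxId`, and `RunG0Scenario` are illustrative names, and the sketch assumes that a conflicting write fails immediately (first writer wins) rather than blocking. The values 10/20, 11/21, and 12 follow the Hermitage G0 scenario.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy store, not the PR's API: each key remembers which
// active transaction (if any) holds an uncommitted write on it.
using TxId = std::uint64_t;

class ToyStore {
 public:
  TxId Begin() { return next_tx_++; }

  // Returns false (a "serialization error") when another active
  // transaction already holds an uncommitted write on `key`.
  bool Write(TxId tx, int key, int value) {
    auto owner = owners_.find(key);
    if (owner != owners_.end() && owner->second != tx) return false;
    owners_[key] = tx;
    pending_[tx].push_back({key, value});
    return true;
  }

  // Publishes the pending writes of `tx` in order, then releases its keys.
  void Commit(TxId tx) {
    for (const auto& [key, value] : pending_[tx]) committed_[key] = value;
    Release(tx);
  }

  // Drops the pending writes of `tx` and releases its keys.
  void Abort(TxId tx) { Release(tx); }

  // Reads only committed data.
  std::optional<int> Read(int key) const {
    auto it = committed_.find(key);
    if (it == committed_.end()) return std::nullopt;
    return it->second;
  }

 private:
  void Release(TxId tx) {
    for (auto it = owners_.begin(); it != owners_.end();) {
      if (it->second == tx) {
        it = owners_.erase(it);
      } else {
        ++it;
      }
    }
    pending_.erase(tx);
  }

  TxId next_tx_ = 1;
  std::unordered_map<int, int> committed_;
  std::unordered_map<int, TxId> owners_;
  std::unordered_map<TxId, std::vector<std::pair<int, int>>> pending_;
};

// G0 scenario: T1 and T2 both try to update key 1, then T1 updates key 2.
// Without write cycles, the committed state must come from one transaction.
bool RunG0Scenario() {
  ToyStore store;
  TxId setup = store.Begin();
  store.Write(setup, 1, 10);
  store.Write(setup, 2, 20);
  store.Commit(setup);

  TxId t1 = store.Begin();
  TxId t2 = store.Begin();
  bool t1_first = store.Write(t1, 1, 11);
  bool t2_rejected = !store.Write(t2, 1, 12);  // conflicts with T1
  bool t1_second = store.Write(t1, 2, 21);
  store.Commit(t1);
  store.Abort(t2);

  return t1_first && t2_rejected && t1_second &&
         store.Read(1) == 11 && store.Read(2) == 21;
}
```

Calling `RunG0Scenario()` returns `true` when T2's conflicting write is rejected and both keys end up with T1's values. A blocking design would instead let T2 proceed after T1 commits; the sketch does not model that.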

Testing and Benchmarking 📈👀

2025-06-08 non-transactional, no-indexing, storage on local disk, Mac M1

  • NOTE: Used Arrow as input format for Parquet
  • NOTE: Arrow and Parquet perform similarly, while Parquet files are ~4x smaller (for bigger batch sizes)

2025-06-15 transactional (RU and RC, i.e. read uncommitted and read committed, are the same), no-indexing, storage on local disk, Mac M1


Ideas 🤔

  • Use Arrow for serializing data (the same Node serialization could be used under WAL and primary storage files), Parquet for storing data on disk. NOTE: WAL is not required for on-disk systems.
  • It's probably possible to detect serialization errors by storing additional metadata about updates at the WAL creation time (those could be later deleted by GC).
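
One way to read the metadata idea above, as a minimal sketch under stated assumptions: keep, per key, the commit timestamp of the latest update recorded when its WAL entry is created, and treat a write as a serialization error if that timestamp is newer than the writing transaction's start. The names `UpdateMetadata`, `RecordCommit`, `HasConflict`, and `CollectGarbage` are hypothetical, and the first-committer-wins rule is an assumption, not something this PR states.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using Timestamp = std::uint64_t;

// Hypothetical metadata kept alongside WAL records: for every key, the
// commit timestamp of its most recent committed update.
class UpdateMetadata {
 public:
  // Called when the WAL record of a committed update to `key` is created.
  void RecordCommit(int key, Timestamp commit_ts) {
    Timestamp& last = last_commit_[key];
    if (commit_ts > last) last = commit_ts;
  }

  // A transaction that started at `start_ts` conflicts on `key` when some
  // other update to `key` was committed after it started.
  bool HasConflict(int key, Timestamp start_ts) const {
    auto it = last_commit_.find(key);
    return it != last_commit_.end() && it->second > start_ts;
  }

  // GC: an entry committed at or before the oldest active start timestamp
  // can no longer conflict with any active transaction, so it is dropped.
  // Returns the number of removed entries.
  std::size_t CollectGarbage(Timestamp oldest_active_start_ts) {
    std::size_t removed = 0;
    for (auto it = last_commit_.begin(); it != last_commit_.end();) {
      if (it->second <= oldest_active_start_ts) {
        it = last_commit_.erase(it);
        ++removed;
      } else {
        ++it;
      }
    }
    return removed;
  }

 private:
  std::unordered_map<int, Timestamp> last_commit_;
};
```

In this sketch a writer would call `HasConflict(key, start_ts)` before each update and abort on `true`. `CollectGarbage` is only safe if its argument really is the minimum start timestamp across all active transactions.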
