This repository contains the implementations for our shared-memory parallel and sequential algorithms for correlation and modularity clustering. Note that the repository uses the Graph-Based Benchmark Suite (GBBS) for parallel primitives and benchmarks.
clustering
contains the main executable, cluster-in-memory_main
.
It serves as a central runner for all of our implementations.
Compiler:
- g++ >= 7.4.0 with support for Cilk Plus
- g++ >= 7.4.0 with pthread support for the ParlayLib scheduler
Build system:
- Bazel >= 3.5.0 and < 4.0.0
To build:
$ bazel build //clustering:cluster-in-memory_main
Most optionality from the Graph Based Benchmark Suite (GBBS)
applies. In particular, to compile benchmarks for graphs with
more than 2^32 edges, the LONG
command-line parameter should be set.
Also, the parallel scheduler can be chosen using comand-line parameters. The default compilation is the ParlayLib scheduler (Homegrown), and the parameters HOMEGROWN, CILK, OPENMP, and SERIAL switch between the Homegrown scheduler, Cilk Plus, OpenMP, and serial respectively.
The applications take as input the adjacency graph format used by Graph Based Benchmark Suite (GBBS).
All clusterers take configs given
in a protobuf, as detailed in config.proto, which can be passed in by setting
the input clusterer_config. Note that the proto should be passed in text format.
For both correlation and modularity clustering, a CorrelationClustererConfig
should be passed in.
The options available in CorrelationClustererConfig
are:
resolution
: double indicating resolution parameterlouvain_config
:LouvainConfig
object, withnum_iterations
: int specifying the number of iterations of best moves and compression stepsnum_inner_iterations
: int specifying the number of iterations of local best move steps
refine
: bool indicating whether to perform multi-level refinementasync
: bool indicating whether to use the asynchronous settingmove_method
:MoveMethod
enum, that sets the subset of vertices to consider for best moves to be:NBHR_CLUSTER_MOVE
: neighbors of clusters that have been modifiedALL_MOVE
: all verticesNBHR_MOVE
: neighbors of vertices that have moved
all_iter
: bool solely for the sequential implementations; indicates if the restriction on the number of iterations should be ignored and if the program should run to convergencepermute
: bool solely for the parallel implementations; indicates whether to randomly permute vertices before performing best moves, which may improve performance when there is greater contention from the asynchronous setting
The main executable is cluster-in-memory_main
in the clustering
directory. It
will run the clusterer given by the input clusterer_name: "ParallelCorrelationClusterer",
"ParallelModularityClusterer", "CorrelationClusterer", or "ModularityClusterer".
A template command is:
bazel run -c opt //clustering:cluster-in-memory_main -- --input_graph=</path/to/graph> --output_clustering=</path/to/output> --clusterer_name=<clusterer name> --clusterer_config='<clusterer proto>'
An example <clusterer proto>
is correlation_clusterer_config {resolution: 0.01, refine: true, async: true, move_method: NBHR_MOVE}
.
The full version of the paper can be found here.