A Java library for the Mapper Algorithm from TDA, working on any metric space.
To install just clone this repo, open up a terminal and move in the clones directory, then run mvn clean install -DskipTests.
A MapperPipeline object encapsulates all the steps to compute a mapper graph:
Cover.Builder<S> cover = ...
Clustering.Builder<R> clustering = ...
Lens<R, S> lens = ...
MapperPipeline<R> mapperPipeline = MapperPipeline.<R, S>newBuilder()
    .withCover(cover)
    .withLens(lens)
    .withClustering(clustering)
    .build();
In this snippet, cover and clustering are expected to be builders defining the cover and the clustering algorithms. Once built, just run:
Collection<S> dataset = ... ;
MapperGraph graph = mapperPipeline.run(dataset);
The Mapper Algorithm is a powerful tool from Topological Data Analysis (TDA), which is able to give insights about the shape of data. It was originally presented in [1] and today is one of the backbones of some industrial application too. In its simplest form, the output of the Mapper is a graph network, representing a topological summary for the shape of the input dataset. The steps are the following:
- 
Identify a lens $f \colon X \to Z$ , or filter, which maps any point in the dataset$X$ to a parameter space$Z$ . Usually the space$Z$ has lower dimension than the dataset$X$ .
- 
Cover the image of the lens $f$ with open sets. This can be done in many possible ways
- 
Take the pullback cover under $f$ , and split each pullback by using a clustering algorithm.
- 
Build a network graph with a node for each local cluster, and an arc for any couple of intersecting clusters. 
A metric is any class implementing the Metric<S> interface, which has only one method: public double eval(T x, T y).
A lens represents a continuous map Lens<S, T>, which has only one method public T evaluate(S x).
In the TopologyUtils class you can find methods to define pullbacks for metrics and lenses.
A cover algorithm is any class implementing the Cover<S> interface, which requires only one method: public Collection<Collection<S>> run(Collection<S> dataset);
This method takes the whole dataset as input, and returns a collection of possibly overlapping subsets.
- 
TrivialCover<S>: skip the cover step, returning a singleton collection containing the whole dataset;
- 
SearchCover<S>: train a search algorithm on the dataset, and cover the points by using local neighborhoods. ASearchCoverloops through the points, when a non-covered point is found, its neighbors are queried (using the suppliedSearch<S>object), then those are added to the results as a new open cover. For theSearch<S>algorithm you can chose from two options:2.1. BallSearch: for any point$p$ , returns a collection of those points within the given radius$r$ from$p$ ;2.2. KNNSearch: for any point$p$ , returns a collection of the first$k$ nearest neighbors of$p$ .2.3. CubicSearch: this is the original type of cover presented in [1]. You need to specify the number of intervals and the overlap in the range$(0.0, 1.0)$ , and a lens which maps points to coordinates.
Both BallSearch and KNNSearch require a metric to be built. The CubicSearch requires a lens instead. This is because the cubic cover works with point coordinates.
Any implementation of Cover has a companion builder class, which is expected to implement a method pullback which, given a lens Cover<R>.
A clustering algorithm is any class implementing the Clustering<S> interface, which requires only one method: public Collection<Collection<S>> run(Collection<S> dataset);
Concerning the clustering step, you have two choices:
- 
TrivialClustering: skip the clustering step, returning a singleton collection containing the whole dataset;
- 
CoverGraphClustering: any given cover algorithm can induce a clustering algorithm by taking as clusters the connected components of the cover graph;
- 
DBSCAN: an experimental implementation of the DBSCAN algorithm. Two variants are implemented:DBSCANSimpleandDBSCANFaster.
Any implementation of Cluster must have a companion builder class.
Many of the classes used in this implementation use some kind of mutable state, in order to cache and preserve settings/indices. This has the unfortunate downside of making parallel computations harder to run consistently, and harder to debug when things go wrong. For this reason, each stateful class has a dedicated builder class, which can be used to pass around its configuration. This is enough to easily prevent runtime collisions of shared mutable state in a concurrent environment.
- 
G. Singh , F. Mémoli, G. Carlsson. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. https://research.math.osu.edu/tgda/mapperPBG.pdf 
- 
N. Saul, D. L. Arendt. Machine Learning Explanations with Topological Data Analysis. https://sauln.github.io/blog/tda_explanations/ 
- 
P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. SODA '93: Proceedings of the fourth annual ACM-SIAM symposium on Discrete algorithms, Jan 1993, pp. 311-321. https://dl.acm.org/doi/10.5555/313559.313789 
- 
S. Brin. Near Neighbor Search in Large Metric Spaces. VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases. Zurich, Switzerland: Morgan Kaufmann Publishers Inc.: 574–584. http://www.vldb.org/conf/1995/P574.PDF 
- 
Y. Zhou, N. Chalapathi, A. Rathore, Y. Zhao, B. Wang. Mapper Interactive: A Scalable, Extendable, and Interactive Toolbox for the Visual Exploration of High-Dimensional Data. https://arxiv.org/abs/2011.03209