Commit

- clean up documentation, and local reference images, and add back dynamic-circlepacking.html
chrismattmann committed Mar 13, 2023
1 parent 0c1b219 commit 3f60d00
Showing 2 changed files with 117 additions and 64 deletions.
71 changes: 7 additions & 64 deletions README.md
@@ -17,7 +17,7 @@ Installation
===
```
git clone https://github.com/chrismattmann/tika-img-similarity
pip install -r requirements.txt
pip install tika-python editdistance
```
You can also check out [ETLlib](https://github.com/chrismattmann/etllib/tree/master/etl/imagesimilarity.py)

@@ -100,49 +100,6 @@ python psykey.py --inputDir INPUTDIR --outCSV OUTCSV --wordlists WRODLIST_FOLDER
```

Metalevenshtein string distance
-------------------------------
- This calculates the Metalevenshtein distance between two strings (inspired by the paper "Robust Similarity Measures for Named Entities Matching" by Erwan et al.).

```
#!/usr/bin/env python3.7
# Usage:
import metalevenshtein as metalev
print(metalev.meta_levenshtein('abacus1cat', 'cat1cus'))

# To use all the argument options in this function:
# def meta_levenshtein(string1,string2,Sim='levenshtein',theta=0.5,strict=-1,idf=dict()):
# Implements ideas from the paper "Robust Similarity Measures for Named Entities Matching" by Erwan et al.
# Sim = jaro_winkler, levenshtein: can be chosen as the secondary matching function.
# theta is the secondary similarity threshold: if set higher, it will be more difficult for the strings to match.
# strict=-1 for doing all permutations of the substrings
# strict=1 for no permutations
# idf = provide a dictionary of {string(word): float(idf of the word)}: more useful when matching multi-word entities (where word importances matter a lot),
# like: 'harry potter', 'the wizard harry potter'
```
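
The secondary matching step described above (a similarity function plus a threshold `theta`) can be illustrated with a minimal sketch. This is not the `metalevenshtein` module itself: the helper names `levenshtein`, `normalized_similarity`, and `tokens_match` are hypothetical, and the normalization by the longer string's length is an assumption made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, computed row by row with dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def tokens_match(a: str, b: str, theta: float = 0.5) -> bool:
    """Hypothetical helper: two tokens match when their similarity reaches theta."""
    return normalized_similarity(a, b) >= theta


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(tokens_match("kitten", "sitting", theta=0.5))
```

Raising `theta` in this sketch makes `tokens_match` stricter, which mirrors the README's note that a higher threshold makes it harder for strings to match.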

Bell Curve fitting and overlap
------------------------------
- Fits two datasets to bell curves and finds the area of overlap between the two curves.


```
#!/usr/bin/env python3.7
import features as feat
data1=[1,2,3,3,2,1]
data2=[4,5,6,6,5,4]
area,error=feat.gaussian_overlap(data1,data2)
print(area)
```
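
How `features.gaussian_overlap` works internally is not shown here, so the following is only a sketch of the general idea under stated assumptions: fit a normal distribution to each dataset and measure the shared area under both density curves. It uses `statistics.NormalDist` from the standard library (Python 3.8 or later, so newer than the 3.7 shebang above). The function name `gaussian_overlap_sketch` is hypothetical, and unlike the README call it returns only the area, with no error estimate.

```python
from statistics import NormalDist


def gaussian_overlap_sketch(data1, data2):
    """Fit one normal distribution per dataset and return their overlapping area.

    The result lies between 0.0 (no overlap) and 1.0 (identical curves).
    """
    curve1 = NormalDist.from_samples(data1)  # sample mean and standard deviation
    curve2 = NormalDist.from_samples(data2)
    return curve1.overlap(curve2)


if __name__ == "__main__":
    data1 = [1, 2, 3, 3, 2, 1]
    data2 = [4, 5, 6, 6, 5, 4]
    print(gaussian_overlap_sketch(data1, data2))
```

Each dataset needs at least two points and a non-zero spread, otherwise the fit or the overlap computation raises `statistics.StatisticsError`.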



D3 visualization
----------------
@@ -165,8 +122,8 @@ D3 visualization

Default **threshold** value is 0.01.

<img src="https://github.com/dongnizh/tika-img-similarity/blob/refactor/snapshots/cluster.png" width = "200px" height = "200px" style = "float:left">
<img src="https://github.com/dongnizh/tika-img-similarity/blob/refactor/snapshots/interactive-cluster.png" width = "200px" height = "200px" style = "float:right">
<img src="docs/figs/cluster.png" width = "200px" height = "200px" style = "float:left">
<img src="docs/figs/interactive-cluster.png" width = "200px" height = "200px" style = "float:right">

### Circlepacking viz
- Jaccard Similarity
@@ -184,24 +141,17 @@ Default **threshold** value is 0.01.
* open circlepacking.html (or dynamic-circlepacking.html for the interactive viz) in your browser
```
<img src="https://github.com/dongnizh/tika-img-similarity/blob/refactor/snapshots/circlepacking.png" width = "200px" height = "200px" style = "float:left">
<img src="https://github.com/dongnizh/tika-img-similarity/blob/refactor/snapshots/interactive-circlepacking.png" width = "200px" height = "200px" style = "float:right">
<img src="docs/figs/circlepacking.png" width = "200px" height = "200px" style = "float:left">
<img src="docs/figs/interactive-circlepacking.png" width = "200px" height = "200px" style = "float:right">
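
For readers unfamiliar with the Jaccard similarity behind this view, here is a minimal sketch of the standard definition: the size of the intersection of two sets divided by the size of their union. Applying it to sets of metadata keys is an assumption made for illustration, and `jaccard_similarity` is a hypothetical helper, not necessarily how this repository's scripts compute their scores.

```python
def jaccard_similarity(keys_a, keys_b):
    """Standard Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(keys_a), set(keys_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical metadata keys for two images
    image1 = {"Content-Type", "Image Width", "Image Height", "Compression"}
    image2 = {"Content-Type", "Image Width", "Image Height", "Color Space"}
    print(jaccard_similarity(image1, image2))
```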

### Composite viz
This is a combination of cluster viz and circle packing viz.
The deeper the color, the more attributes the items in the cluster have in common.
```
* open compositeViz.html in your browser
```
![Image of composite viz](https://github.com/dongnizh/tika-img-similarity/blob/refactor/snapshots/composite.png)
![Image of composite viz](docs/figs/composite.png)

### Sunburst viz
Visualization of clustering from Jaccard Similarity result
```
* python sunburst.py (for generating sunburst viz)
* open sunburst.html
```
![Image of sunburst viz](https://github.com/chrismattmann/tika-img-similarity/blob/master/snapshots/sunburst.png)

### Big data way
If you are dealing with big data, you can use it this way:
@@ -210,14 +160,7 @@ if you are dealing with big data, you can use it this way:
* open levelCluster-d3.html in your browser
```
You can set the maximum number for each node, **_maxNumNode** (default _maxNumNode = 10), in generateLevelCluster.py
![Image of level composite viz](https://github.com/dongnizh/tika-img-similarity/blob/refactor/snapshots/level-composite.png)

### Treemap viz
```
* python tree_map.py (for generating treemap viz)
* open tree_map.html in your browser
```
![Image of treemap viz](https://github.com/chrismattmann/tika-similarity/blob/master/snapshots/treemap.png)
![Image of level composite viz](docs/figs/level-composite.png)

Questions, comments?
===================
110 changes: 110 additions & 0 deletions html/dynamic-circlepacking.html
@@ -0,0 +1,110 @@
<!DOCTYPE html>
<meta charset="utf-8">
<style>

.node {
  cursor: pointer;
}

.node:hover {
  stroke: #000;
  stroke-width: 1.5px;
}

.node--leaf {
  fill: white;
}

.label {
  font: 11px "Helvetica Neue", Helvetica, Arial, sans-serif;
  text-anchor: middle;
  text-shadow: 0 1px 0 #fff, 1px 0 0 #fff, -1px 0 0 #fff, 0 -1px 0 #fff;
}

.label,
.node--root,
.node--leaf {
  pointer-events: none;
}

</style>
<body>
<script src="http://d3js.org/d3.v3.min.js"></script>
<script>

var margin = 20,
    diameter = 960;

var color = d3.scale.linear()
    .domain([-1, 5])
    .range(["hsl(152,80%,80%)", "hsl(228,30%,40%)"])
    .interpolate(d3.interpolateHcl);

var pack = d3.layout.pack()
    .padding(2)
    .size([diameter - margin, diameter - margin])
    .value(function(d) { return d.size; });

var svg = d3.select("body").append("svg")
    .attr("width", diameter)
    .attr("height", diameter)
  .append("g")
    .attr("transform", "translate(" + diameter / 2 + "," + diameter / 2 + ")");

d3.json("circle.json", function(error, root) {
  if (error) return console.error(error);

  var focus = root,
      nodes = pack.nodes(root),
      view;

  var circle = svg.selectAll("circle")
      .data(nodes)
    .enter().append("circle")
      .attr("class", function(d) { return d.parent ? d.children ? "node" : "node node--leaf" : "node node--root"; })
      .style("fill", function(d) { return d.children ? color(d.depth) : null; })
      .on("click", function(d) { if (focus !== d) zoom(d), d3.event.stopPropagation(); });

  var text = svg.selectAll("text")
      .data(nodes)
    .enter().append("text")
      .attr("class", "label")
      .style("fill-opacity", function(d) { return d.parent === root ? 1 : 0; })
      .style("display", function(d) { return d.parent === root ? null : "none"; })
      .text(function(d) { return d.name; });

  var node = svg.selectAll("circle,text");

  d3.select("body")
      .style("background", color(-1))
      .on("click", function() { zoom(root); });

  zoomTo([root.x, root.y, root.r * 2 + margin]);

  function zoom(d) {
    var focus0 = focus; focus = d;

    var transition = d3.transition()
        .duration(d3.event.altKey ? 7500 : 750)
        .tween("zoom", function(d) {
          var i = d3.interpolateZoom(view, [focus.x, focus.y, focus.r * 2 + margin]);
          return function(t) { zoomTo(i(t)); };
        });

    transition.selectAll("text")
      .filter(function(d) { return d.parent === focus || this.style.display === "inline"; })
        .style("fill-opacity", function(d) { return d.parent === focus ? 1 : 0; })
        .each("start", function(d) { if (d.parent === focus) this.style.display = "inline"; })
        .each("end", function(d) { if (d.parent !== focus) this.style.display = "none"; });
  }

  function zoomTo(v) {
    var k = diameter / v[2]; view = v;
    node.attr("transform", function(d) { return "translate(" + (d.x - v[0]) * k + "," + (d.y - v[1]) * k + ")"; });
    circle.attr("r", function(d) { return d.r * k; });
  }
});

d3.select(self.frameElement).style("height", diameter + "px");

</script>
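
The page above loads `circle.json` and reads `d.name`, `d.size`, and `d.children` from each node, so the file needs to be a nested hierarchy with those fields. The following is a minimal sketch of producing such a structure in Python. The `build_hierarchy` helper, the cluster names, the file names, and the sizes are all hypothetical; the repository's own scripts may build the file differently.

```python
import json


def build_hierarchy(clusters, root_name="clusters"):
    """Turn {cluster: {item: size}} into the name/children/size tree the page reads."""
    return {
        "name": root_name,
        "children": [
            {
                "name": cluster_name,
                "children": [
                    {"name": item, "size": size}  # leaves carry the value used by pack.value()
                    for item, size in members.items()
                ],
            }
            for cluster_name, members in clusters.items()
        ],
    }


if __name__ == "__main__":
    # Hypothetical clusters of files with illustrative sizes
    example = {
        "cluster0": {"a.jpg": 10, "b.jpg": 12},
        "cluster1": {"c.jpg": 7},
    }
    print(json.dumps(build_hierarchy(example), indent=2))
```

Saving the result with `json.dump(tree, open("circle.json", "w"))` next to the HTML file would let the page pick it up, assuming the page is served over HTTP so that `d3.json` can fetch it.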
