|
1 | 1 | # Knowledge Graph Analysis
|
2 | 2 | Code accompanying our paper "One Knowledge Graph to Rule them All? Analyzing the Differences between DBpedia, YAGO, Wikidata & co."
|
3 | 3 |
|
4 |
| -Quantitative analysis of the following Knowledge Graphs (KG): |
5 |
| -* DBpedia |
6 |
| -* YAGO |
7 |
| -* Wikidata |
8 |
| -* NELL |
9 |
| -* OpenCyc |
| 4 | +Quantitative analysis of the following Knowledge Graphs (KGs): |
| 5 | +* DBpedia (D) |
| 6 | +* YAGO (Y) |
| 7 | +* Wikidata (W) |
| 8 | +* NELL (N) |
| 9 | +* OpenCyc (O) |
10 | 10 |
|
11 |
| -Approach: |
| 11 | +## Approach: |
12 | 12 | * Get top 10 classes for each KG
|
13 | 13 | * Calculate the class indegree and outdegree
|
14 | 14 | * Get all instances for each class
|
15 | 15 | * Calculate the minimum, average, median, and maximum indegree and outdegree for the instances of each class
|
16 | 16 | * Create a combined list with all top 10 classes and equal classes in other KGs (e.g. with owl:sameAs properties)
|
17 | 17 | * Calculate all degree values for the new classes as well
|
18 | 18 | * Calculate the instance overlap of the classes using different string similarity measures
|
| 19 | + |
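The degree calculations listed in the approach can be sketched roughly as follows. This is only a minimal illustration, assuming the KG is available as an iterable of (subject, predicate, object) tuples; the function names `instance_degrees` and `degree_stats` and the sample triples are hypothetical and not taken from this repository.

```python
from collections import Counter
from statistics import mean, median


def instance_degrees(triples, instances):
    """Count incoming and outgoing edges for every instance of a class.

    `triples` is an iterable of (subject, predicate, object) tuples and
    `instances` is the set of instance identifiers of one class
    (both are assumptions made for this sketch).
    """
    indegree, outdegree = Counter(), Counter()
    for subject, _predicate, obj in triples:
        if subject in instances:
            outdegree[subject] += 1
        if obj in instances:
            indegree[obj] += 1
    # Instances without any edge get an explicit degree of 0.
    return (
        {i: indegree[i] for i in instances},
        {i: outdegree[i] for i in instances},
    )


def degree_stats(degrees):
    """Return minimum, average, median, and maximum of a degree mapping."""
    values = list(degrees.values())
    return {
        "min": min(values),
        "avg": mean(values),
        "median": median(values),
        "max": max(values),
    }


# Hypothetical toy data, not real KG content.
triples = [
    ("a", "knows", "b"),
    ("a", "knows", "c"),
    ("b", "knows", "c"),
    ("x", "mentions", "a"),
]
instances = {"a", "b", "c"}

indeg, outdeg = instance_degrees(triples, instances)
print(degree_stats(indeg))
print(degree_stats(outdeg))
```

For a full KG, the triples would have to be streamed from the dump files rather than held in a list; the sketch leaves parsing out entirely.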
| 20 | +## Instructions: |
| 21 | +1. **/LinkedInstances/*.py** creates files with all linked instances between two KGs. |
| 22 | + * Input: |
| 23 | + * KG files containing instances and/or links to other instances. |
| 24 | + * Output: |
| 25 | + * Files containing the combined links between two KGs (e.g. *DO_sameAs_union.nt* for the links between DBpedia and OpenCyc) that are denoted as **#o1**. |
| 26 | + * Move those **#o1** files to the */InstanceOverlap/owlSameAs/* folder. |
| 27 | +2. **/GetInstances/src/GetInstances.java** creates files that contain all instances of a class including all English labels. |
| 28 | + * Input: |
| 29 | + * Array with class names for each KG. |
| 30 | + * Full KG or just the files containing the instances and labels. |
| 31 | + * Output: |
| 32 | + * Text files containing all instances with all English labels for each class in each KG. |
| 33 | + * Saved as *<k_className>InstancesWithLabels.txt* where *k* stands for the abbreviation of the KG (e.g. *d_ActorInstancesWithLabels.txt* for the actor instances in DBpedia). All those files are denoted as **#o2**. |
| 34 | + * Move these **#o2** files to the */InstanceOverlap/InstanceLabels/* folder. |
| 35 | +3. **/InstanceOverlap/src/InstanceOverlapMain.java** executes the following three steps for each class in the className array to calculate the estimated overlap: |
| 36 | + 1. **CountSameAs.java** creates files with the linked instances of two classes, e.g. by using the *owl:sameAs* property. |
| 37 | + * Input: |
| 38 | + * Class name. |
| 39 | + * **#o1** files with the linked instances in the */InstanceOverlap/owlSameAs/* folder. |
| 40 | + * **#o2** files with all English instance labels for the respective class and for each KG in the */InstanceOverlap/InstanceLabels/* folder. |
| 41 | + * Output: |
| 42 | + * Links between instances for each class1-class2 combination, which are used as the gold standard (there might be multiple classes that describe the same concept in a single KG, e.g. wordnet_actor_109765278 and wordnet_actor_109767197 in the YAGO KG). These files are saved as *<className1_className2>.tsv* in the */InstanceOverlap/owlSameAs/x2y/* folder (e.g. *Actor_wordnet_actor_109765278.tsv* in the *d2y* folder). These files are denoted as **#o3**. |
| 43 | + 2. **CountStringSimilarity.java** creates files that contain all links found between two classes using different string similarity measures (e.g. Jaro, Levenshtein) and different thresholds. |
| 44 | + * Input: |
| 45 | + * Class name. |
| 46 | + * **#o2** files. |
| 47 | + * Output: |
| 48 | + * Links between the instances of two classes that are found using a specific similarity measure and threshold. The results are saved as *<fromK_2_toK_fromClass_toClass_simMeasure_threshold>.tsv* in the */InstanceOverlap/simMeasureResults/* folder (e.g. *d2y_Actor_wordnet_actor_109765278_jaro_1.0.tsv*). These files are denoted as **#o4**. |
| 49 | + 3. **EstimatedInstanceOverlap.java** evaluates the **#o4** links against the **#o3** gold standard and writes the estimated overlap. |
| 50 | + * Input: |
| 51 | + * Class name. |
| 52 | + * **#o3** files containing the linked instances that are used as the gold standard. |
| 53 | + * **#o4** files containing the instances that should be linked based on the respective similarity measure and threshold. |
| 54 | + * Output: |
| 55 | + * *estimatedOverlap_<className_parameter_timestamp>.csv* files in the */InstanceOverlap/estimatedOverlap/* folder containing instance counts, precision, recall, f-measure, estimatedOverlap, number of links, count of matching alignment, count of partial matching alignment, and true positives for each class1-class2 combination for each class and each KG combination (e.g. *estimatedInstanceOverlap_Actor_wBlockingMax1000000_tokenBk4_2017_02_17_13_35_52.csv*). |
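The string-similarity linking described for **CountStringSimilarity.java** can be illustrated with a small sketch. It is written in Python for brevity and is not the repository's Java implementation: it assumes a normalized Levenshtein similarity, compares all label pairs without any blocking, and uses hypothetical names (`levenshtein`, `similarity`, `find_links`) and made-up instance identifiers.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def find_links(labels_from, labels_to, threshold):
    """Link two instances if any pair of their labels reaches the threshold.

    Both arguments map an instance identifier to a list of English labels
    (an assumed in-memory form of the #o2 files).
    """
    links = set()
    for inst_a, names_a in labels_from.items():
        for inst_b, names_b in labels_to.items():
            if any(
                similarity(x.lower(), y.lower()) >= threshold
                for x in names_a
                for y in names_b
            ):
                links.add((inst_a, inst_b))
    return links


# Hypothetical labels, for illustration only.
dbpedia_labels = {"d:Tom_Hanks": ["Tom Hanks"]}
yago_labels = {
    "y:Tom_Hanks": ["Tom Hanks", "Thomas Hanks"],
    "y:Tom_Hardy": ["Tom Hardy"],
}

exact_links = find_links(dbpedia_labels, yago_labels, threshold=1.0)
loose_links = find_links(dbpedia_labels, yago_labels, threshold=0.6)
print(exact_links)
print(loose_links)
```

Lowering the threshold admits more candidate links at the cost of false matches, which is why each measure is run with several thresholds. The all-pairs loop is quadratic in the number of instances, so it is only practical for small classes.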
| 56 | + |
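Precision, recall, and f-measure of a set of found links against the gold standard follow the standard definitions, sketched below in Python. The function name `evaluate_links` and the sample links are hypothetical. How the repository derives the `estimatedOverlap` value from these figures is not described here, so the sketch does not attempt it.

```python
def evaluate_links(found, gold):
    """Compare found links with gold-standard links.

    Both arguments are sets of (instance_in_kg1, instance_in_kg2) tuples.
    """
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "true_positives": true_positives,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical links, for illustration only.
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
found = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}

result = evaluate_links(found, gold)
print(result)
```

In this toy example, two of the four found links are correct (precision 0.5) and two of the three gold links are recovered (recall 2/3).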