Commit a104aa3

Merge branch 'release/0.7' into main
2 parents 0bf585f + 5c52bf7 commit a104aa3

54 files changed

+1257
-328
lines changed

README.md

Lines changed: 198 additions & 1 deletion
@@ -93,10 +93,207 @@ while (iterator.hasNext()) {
9393
String metrics = output.toString();
9494
```
9595

96+
## Defining schema with a configuration file
97+
98+
It is possible to define the schema with a YAML or JSON
99+
configuration file.
100+
101+
```java
102+
Schema schema = ConfigurationReader
103+
.readYaml("path/to/some/configuration.yaml")
104+
.asSchema();
105+
```
106+
107+
A YAML example
108+
```yaml
109+
format: json
110+
fields:
111+
- name: edm:ProvidedCHO/@about
112+
path: $.['providedCHOs'][0]['about']
113+
categories:
114+
- MANDATORY
115+
- name: Proxy/dc:title
116+
path: $.['proxies'][?(@['europeanaProxy'] == false)]['dcTitle']
117+
categories:
118+
- DESCRIPTIVENESS
119+
- SEARCHABILITY
120+
- IDENTIFICATION
121+
- MULTILINGUALITY
122+
- CUSTOM
123+
- name: Proxy/dcterms:alternative
124+
path: $.['proxies'][?(@['europeanaProxy'] == false)]['dctermsAlternative']
125+
categories:
126+
- DESCRIPTIVENESS
127+
- SEARCHABILITY
128+
- IDENTIFICATION
129+
- MULTILINGUALITY
130+
groups:
131+
- fields:
132+
- Proxy/dc:title
133+
- Proxy/dc:description
134+
categories:
135+
- MANDATORY
136+
```
137+
138+
The same in JSON:
139+
```json
140+
{
141+
"format": "json",
142+
"fields": [
143+
{
144+
"name": "edm:ProvidedCHO/@about",
145+
"path": "$.['providedCHOs'][0]['about']",
146+
"categories": ["MANDATORY"]
147+
},
148+
{
149+
"name": "Proxy/dc:title",
150+
"path": "$.['proxies'][?(@['europeanaProxy'] == false)]['dcTitle']",
151+
"categories": [
152+
"DESCRIPTIVENESS",
153+
"SEARCHABILITY",
154+
"IDENTIFICATION",
155+
"MULTILINGUALITY"
156+
]
157+
},
158+
{
159+
"name": "Proxy/dcterms:alternative",
160+
"path": "$.['proxies'][?(@['europeanaProxy'] == false)]['dctermsAlternative']",
161+
"categories": [
162+
"DESCRIPTIVENESS",
163+
"SEARCHABILITY",
164+
"IDENTIFICATION",
165+
"MULTILINGUALITY"
166+
]
167+
}
168+
],
169+
"groups": [
170+
{
171+
"fields": [
172+
"Proxy/dc:title",
173+
"Proxy/dc:description"
174+
],
175+
"categories": [
176+
"MANDATORY"
177+
]
178+
}
179+
]
180+
}
181+
```
182+
183+
184+
Optionally you can set the "canonical list" of categories. It provides
185+
two additional functionalities:
186+
* if a field contains a category which is not in the canonical list,
187+
that category is excluded (with a warning in the log)
188+
* the order of the categories in the output follows the order set
189+
in the configuration.
190+
191+
Here is an example (in YAML):
192+
193+
```yaml
194+
format: json
195+
...
196+
categories:
197+
- MANDATORY
198+
- DESCRIPTIVENESS
199+
- SEARCHABILITY
200+
- IDENTIFICATION
201+
- CUSTOM
202+
- MULTILINGUALITY
203+
204+
```
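
The two effects of the canonical list (dropping unknown categories and fixing the output order) can be sketched in plain Java. This is only an illustration under assumptions, not the library's implementation: the class `CanonicalCategories` and its method `normalize` are hypothetical names, and logging through `java.util.logging` is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper for illustration; it is not part of metadata-qa-api.
final class CanonicalCategories {
  private static final Logger LOGGER =
    Logger.getLogger(CanonicalCategories.class.getName());

  // Returns the field's categories restricted to the canonical list,
  // in the order given by the canonical list.
  static List<String> normalize(List<String> canonical, List<String> fieldCategories) {
    // Warn about every category that the canonical list does not contain.
    for (String category : fieldCategories) {
      if (!canonical.contains(category)) {
        LOGGER.warning("Excluded category (not in the canonical list): " + category);
      }
    }
    // Walking the canonical list gives the canonical order for free.
    List<String> result = new ArrayList<>();
    for (String category : canonical) {
      if (fieldCategories.contains(category)) {
        result.add(category);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> canonical = Arrays.asList(
      "MANDATORY", "DESCRIPTIVENESS", "SEARCHABILITY",
      "IDENTIFICATION", "CUSTOM", "MULTILINGUALITY");
    List<String> field = Arrays.asList("MULTILINGUALITY", "UNKNOWN", "DESCRIPTIVENESS");
    // prints [DESCRIPTIVENESS, MULTILINGUALITY] after logging a warning for UNKNOWN
    System.out.println(normalize(canonical, field));
  }
}
```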
205+
## Constraints
206+
207+
One can add constraints to the fields. These are content rules, which
208+
the tool will check. In this version the tool mimics SHACL constraints.
209+
210+
### Cardinality
211+
* `minCount <number>` - specifies the minimum number of field occurrences (API: `setMinCount()` or `withMinCount()`)
212+
* `maxCount <number>` - specifies the maximum number of field occurrences (API: `setMaxCount()` or `withMaxCount()`)
213+
214+
### Value Range
215+
216+
* `minExclusive <number>` - The minimum exclusive value ([field value] > limit, API: `setMinExclusive(Double)` or `withMinExclusive(Double)`)
217+
* `minInclusive <number>` - The minimum inclusive value ([field value] >= limit, API: `setMinInclusive(Double)` or `withMinInclusive(Double)`)
218+
* `maxExclusive <number>` - The maximum exclusive value ([field value] < limit, API: `setMaxExclusive(Double)` or `withMaxExclusive(Double)`)
219+
* `maxInclusive <number>` - The maximum inclusive value ([field value] <= limit, API: `setMaxInclusive(Double)` or `withMaxInclusive(Double)`)
220+
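
The four comparisons above can be sketched as follows. This is a minimal illustration, not the tool's code: the class `RangeCheck` and the method `inRange` are hypothetical, and a `null` limit is assumed to mean "constraint not set".

```java
// Hypothetical sketch for illustration; it is not part of metadata-qa-api.
final class RangeCheck {

  // Returns true if the value satisfies every limit that is set (non-null).
  static boolean inRange(double value,
                         Double minExclusive, Double minInclusive,
                         Double maxExclusive, Double maxInclusive) {
    if (minExclusive != null && !(value > minExclusive)) return false;   // value > limit
    if (minInclusive != null && !(value >= minInclusive)) return false;  // value >= limit
    if (maxExclusive != null && !(value < maxExclusive)) return false;   // value < limit
    if (maxInclusive != null && !(value <= maxInclusive)) return false;  // value <= limit
    return true;
  }

  public static void main(String[] args) {
    // 5.0 is allowed by minInclusive 5.0 and maxInclusive 10.0
    System.out.println(inRange(5.0, null, 5.0, null, 10.0)); // prints true
    // 5.0 is rejected by minExclusive 5.0
    System.out.println(inRange(5.0, 5.0, null, null, null)); // prints false
  }
}
```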
221+
### String constraints
222+
223+
* `minLength <number>` - The minimum string length of each field value (API: `setMinLength(Integer)` or `withMinLength(Integer)`)
224+
* `maxLength <number>` - The maximum string length of each field value (API: `setMaxLength(Integer)` or `withMaxLength(Integer)`)
225+
* `pattern <regular expression>` - A regular expression that each field value matches to satisfy the condition (API: `setPattern(String)` or `withPattern(String)`)
226+
227+
### Property pair
228+
229+
* `equals <field label>` - The set of all values of a field is equal to the set of all values of another field
230+
(API: `setEquals(String)` or `withEquals(String)`)
231+
* `disjoint <field label>` - The set of values of a field is disjoint (not equal) with the set of all values of another field
232+
(API: `setDisjoint(String)` or `withDisjoint(String)`)
233+
* `lessThan <field label>` - Each value of a field is smaller than each value of another field
234+
(API: `setLessThan(String)` or `withLessThan(String)`)
235+
* `lessThanOrEquals <field label>` - Each value of a field is smaller than or equal to each value of another field
236+
(API: `setLessThanOrEquals(String)` or `withLessThanOrEquals(String)`)
237+
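
The property pair checks compare the values of two fields. The sketch below shows one possible reading of them in plain Java. It is not the tool's implementation: the class `PropertyPairCheck` and its method names are hypothetical, and `disjoint` is interpreted in the SHACL sense (the two fields share no value), which is an assumption about the intended behaviour.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch for illustration; it is not part of metadata-qa-api.
final class PropertyPairCheck {

  // equals: both fields hold the same set of values (order and repetition ignored).
  static <T> boolean equalSets(List<T> a, List<T> b) {
    return new HashSet<>(a).equals(new HashSet<>(b));
  }

  // disjoint: the two fields have no value in common (assumed SHACL semantics).
  static <T> boolean disjoint(List<T> a, List<T> b) {
    return Collections.disjoint(a, b);
  }

  // lessThan: every value of a is smaller than every value of b.
  static <T extends Comparable<T>> boolean lessThan(List<T> a, List<T> b) {
    for (T x : a) {
      for (T y : b) {
        if (x.compareTo(y) >= 0) return false;
      }
    }
    return true;
  }

  // lessThanOrEquals: every value of a is smaller than or equal to every value of b.
  static <T extends Comparable<T>> boolean lessThanOrEquals(List<T> a, List<T> b) {
    for (T x : a) {
      for (T y : b) {
        if (x.compareTo(y) > 0) return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<Integer> start = Arrays.asList(1900, 1910);
    List<Integer> end = Arrays.asList(1910, 1950);
    System.out.println(lessThan(start, end));         // prints false (1910 is not < 1910)
    System.out.println(lessThanOrEquals(start, end)); // prints true
    System.out.println(disjoint(start, end));         // prints false (1910 is shared)
  }
}
```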
238+
### Set rules
239+
240+
Via API
241+
```java
242+
Schema schema = new BaseSchema()
243+
.setFormat(Format.CSV)
244+
.addField(
245+
new JsonBranch("title", "title")
246+
.setRule(
247+
new Rule()
248+
.withDisjoint("description")
249+
)
250+
)
251+
.addField(
252+
new JsonBranch("url", "url")
253+
.setRule(
254+
new Rule()
255+
.withMinCount(1)
256+
.withMaxCount(1)
257+
.withPattern("^https?://.*$")
258+
)
259+
)
260+
;
261+
```
262+
263+
Via configuration file (a YAML example):
264+
265+
```yaml
266+
format: csv
267+
fields:
268+
- name: title
269+
categories: [MANDATORY]
270+
rules:
271+
disjoint: description
272+
- name: url
273+
categories: [MANDATORY]
274+
extractable: true
275+
rules:
276+
minCount: 1
277+
maxCount: 1
278+
pattern: ^https?://.*$
279+
```
280+
281+
In both cases we defined two fields. `title` has one constraint: it should not be equal to
282+
the value of the `description` field (which is omitted from the example). Note: if this hypothetical
283+
`description` field is not available, the API writes an error message to the log. `url` should have
284+
one and only one instance, and its value should start with "http://" or "https://".
285+
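
What the `url` rule demands can be sketched with the standard `java.util.regex` package. This is a hedged illustration, not how the tool evaluates rules: the class `UrlRuleCheck` is hypothetical, and matching the whole value with `Matcher.matches()` is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration; it is not part of metadata-qa-api.
final class UrlRuleCheck {
  private static final int MIN_COUNT = 1;
  private static final int MAX_COUNT = 1;
  private static final Pattern PATTERN = Pattern.compile("^https?://.*$");

  // Returns true if the field occurs exactly once and its value matches the pattern.
  static boolean isValid(List<String> values) {
    if (values.size() < MIN_COUNT || values.size() > MAX_COUNT) {
      return false; // cardinality violated
    }
    for (String value : values) {
      if (!PATTERN.matcher(value).matches()) {
        return false; // pattern violated
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isValid(Arrays.asList("https://example.org/record/1"))); // prints true
    System.out.println(isValid(Arrays.asList("ftp://example.org/record/1")));   // prints false
    System.out.println(isValid(Collections.<String>emptyList()));               // prints false
  }
}
```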
286+
As you can see there are two types of setters in the API: setSomething and withSomething. The
287+
difference is that setSomething returns void, while withSomething returns the Rule object,
288+
so you can use it in a chain such as `new Rule().withMinCount(1).withMaxCount(3)`
289+
(while `new Rule().setMinCount(1).setMaxCount(3)` does not compile).
290+
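
The difference between the two setter styles can be shown with a stripped-down stand-in class. `MiniRule` below is hypothetical and only mirrors the set/with pattern visible in `Rule.java`; it is not the real `Rule` class.

```java
// Hypothetical stand-in for Rule, reduced to two properties for illustration.
final class MiniRule {
  private Integer minCount;
  private Integer maxCount;

  // Classic setter: returns void, so it cannot be chained.
  public void setMinCount(int minCount) {
    this.minCount = minCount;
  }

  // Fluent variant: delegates to the setter and returns the object itself.
  public MiniRule withMinCount(int minCount) {
    setMinCount(minCount);
    return this;
  }

  public void setMaxCount(int maxCount) {
    this.maxCount = maxCount;
  }

  public MiniRule withMaxCount(int maxCount) {
    setMaxCount(maxCount);
    return this;
  }

  public Integer getMinCount() {
    return minCount;
  }

  public Integer getMaxCount() {
    return maxCount;
  }

  public static void main(String[] args) {
    MiniRule rule = new MiniRule().withMinCount(1).withMaxCount(3);
    System.out.println(rule.getMinCount() + ".." + rule.getMaxCount()); // prints 1..3

    // The following line would not compile, because setMinCount() returns void:
    // new MiniRule().setMinCount(1).setMaxCount(3);
  }
}
```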
291+
## More info
292+
96293
For the usage and implementation of the API see https://github.com/pkiraly/europeana-qa-api.
97294

98295
Javadoc for the current development version of the API: https://pkiraly.github.io/metadata-qa-api.
99296

100297
[![Build Status](https://travis-ci.org/pkiraly/metadata-qa-api.svg?branch=master)](https://travis-ci.org/pkiraly/metadata-qa-api)
101-
[![Coverage Status](https://coveralls.io/repos/github/pkiraly/metadata-qa-api/badge.svg?branch=master)](https://coveralls.io/github/pkiraly/metadata-qa-api?branch=master)
298+
[![Coverage Status](https://coveralls.io/repos/github/pkiraly/metadata-qa-api/badge.svg?branch=master)](https://coveralls.io/github/pkiraly/metadata-qa-api?branch=develop)
102299
[![javadoc](https://javadoc.io/badge2/de.gwdg.metadataqa/metadata-qa-api/javadoc.svg)](https://javadoc.io/doc/de.gwdg.metadataqa/metadata-qa-api)

pom.xml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
55
<groupId>de.gwdg.metadataqa</groupId>
66
<artifactId>metadata-qa-api</artifactId>
77
<packaging>jar</packaging>
8-
<version>0.7-SNAPSHOT</version>
8+
<version>0.7</version>
99
<name>Metadata Quality Assurance Framework API</name>
1010
<description>
1111
A metadata quality assurance framework. It checks some metrics of

src/main/java/de/gwdg/metadataqa/api/calculator/TfIdfCalculator.java

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@
1111
import java.io.ByteArrayOutputStream;
1212
import java.io.IOException;
1313
import java.io.Serializable;
14-
import java.nio.charset.Charset;
14+
import java.nio.charset.StandardCharsets;
1515
import java.util.ArrayList;
1616
import java.util.LinkedHashMap;
1717
import java.util.List;
@@ -103,7 +103,7 @@ private String getSolrResponse(String recordId) {
103103
IOUtils.copy(method.getResponseBodyAsStream(), baos);
104104
byte[] responseBody = baos.toByteArray();
105105

106-
jsonString = new String(responseBody, Charset.forName("UTF-8"));
106+
jsonString = new String(responseBody, StandardCharsets.UTF_8);
107107
} catch (HttpException e) {
108108
LOGGER.severe("Fatal protocol violation: " + e.getMessage());
109109
} catch (IOException e) {

src/main/java/de/gwdg/metadataqa/api/calculator/UniquenessCalculator.java

Lines changed: 8 additions & 9 deletions
@@ -1,10 +1,13 @@
11
package de.gwdg.metadataqa.api.calculator;
22

33
import de.gwdg.metadataqa.api.model.pathcache.PathCache;
4-
import de.gwdg.metadataqa.api.uniqueness.*;
54
import de.gwdg.metadataqa.api.counter.FieldCounter;
65
import de.gwdg.metadataqa.api.interfaces.Calculator;
76
import de.gwdg.metadataqa.api.schema.Schema;
7+
import de.gwdg.metadataqa.api.uniqueness.SolrClient;
8+
import de.gwdg.metadataqa.api.uniqueness.UniquenessExtractor;
9+
import de.gwdg.metadataqa.api.uniqueness.UniquenessField;
10+
import de.gwdg.metadataqa.api.uniqueness.UniquenessFieldCalculator;
811
import de.gwdg.metadataqa.api.util.CompressionLevel;
912
import org.apache.commons.lang3.StringUtils;
1013

@@ -13,7 +16,6 @@
1316
import java.util.List;
1417
import java.util.Map;
1518
import java.util.LinkedHashMap;
16-
import java.util.logging.Logger;
1719

1820
/**
1921
*
@@ -23,16 +25,13 @@ public class UniquenessCalculator implements Calculator, Serializable {
2325

2426
public static final String CALCULATOR_NAME = "uniqueness";
2527

26-
private static final Logger LOGGER = Logger.getLogger(
27-
UniquenessCalculator.class.getCanonicalName()
28-
);
2928
public static final String SUFFIX = "_txt";
3029
public static final int SUFFIX_LENGTH = SUFFIX.length();
3130

3231
private UniquenessExtractor extractor;
3332
private List<UniquenessField> solrFields;
3433

35-
private SolrClient solrClient;
34+
private final SolrClient solrClient;
3635

3736
private FieldCounter<Double> resultMap;
3837

@@ -108,13 +107,13 @@ public String getTotals() {
108107
}
109108

110109
@Override
111-
public Map<String, ? extends Object> getResultMap() {
110+
public Map<String, ?> getResultMap() {
112111
return resultMap.getMap();
113112
}
114113

115114
@Override
116-
public Map<String, Map<String, ? extends Object>> getLabelledResultMap() {
117-
Map<String, Map<String, ? extends Object>> labelledResultMap = new LinkedHashMap<>();
115+
public Map<String, Map<String, ?>> getLabelledResultMap() {
116+
Map<String, Map<String, ?>> labelledResultMap = new LinkedHashMap<>();
118117
labelledResultMap.put(getCalculatorName(), resultMap.getMap());
119118
return labelledResultMap;
120119
}

src/main/java/de/gwdg/metadataqa/api/configuration/ConfigurationReader.java

Lines changed: 7 additions & 5 deletions
@@ -4,15 +4,19 @@
44
import org.yaml.snakeyaml.Yaml;
55
import org.yaml.snakeyaml.constructor.Constructor;
66

7-
import java.io.*;
7+
import java.io.File;
8+
import java.io.FileNotFoundException;
9+
import java.io.IOException;
10+
import java.io.InputStream;
11+
import java.io.FileInputStream;
812

913
public class ConfigurationReader {
1014

1115
public static Configuration readJson(String fileName) throws FileNotFoundException {
1216
ObjectMapper objectMapper = new ObjectMapper();
1317

1418
File file = new File(fileName);
15-
Configuration config = null;
19+
Configuration config;
1620
try {
1721
config = objectMapper.readValue(file, Configuration.class);
1822
} catch (IOException e) {
@@ -24,8 +28,6 @@ public static Configuration readJson(String fileName) throws FileNotFoundExcepti
2428
public static Configuration readYaml(String fileName) throws FileNotFoundException {
2529
Yaml yaml = new Yaml(new Constructor(Configuration.class));
2630
InputStream inputStream = new FileInputStream(new File(fileName));
27-
Configuration config = (Configuration) yaml.load(inputStream);
28-
return config;
31+
return yaml.load(inputStream);
2932
}
30-
3133
}

src/main/java/de/gwdg/metadataqa/api/configuration/Rule.java

Lines changed: 8 additions & 8 deletions
@@ -18,8 +18,8 @@ public class Rule {
1818
private Integer maxInclusive;
1919
private Integer minLength;
2020
private Integer maxLength;
21-
private Integer lessThan;
22-
private Integer lessThanOrEquals;
21+
private String lessThan;
22+
private String lessThanOrEquals;
2323
private String hasValue;
2424

2525
public String getPattern() {
@@ -204,28 +204,28 @@ public Rule withMaxLength(int maxLength) {
204204
return this;
205205
}
206206

207-
public Integer getLessThan() {
207+
public String getLessThan() {
208208
return lessThan;
209209
}
210210

211-
public void setLessThan(int lessThan) {
211+
public void setLessThan(String lessThan) {
212212
this.lessThan = lessThan;
213213
}
214214

215-
public Rule withLessThan(int lessThan) {
215+
public Rule withLessThan(String lessThan) {
216216
setLessThan(lessThan);
217217
return this;
218218
}
219219

220-
public Integer getLessThanOrEquals() {
220+
public String getLessThanOrEquals() {
221221
return lessThanOrEquals;
222222
}
223223

224-
public void setLessThanOrEquals(int lessThanOrEquals) {
224+
public void setLessThanOrEquals(String lessThanOrEquals) {
225225
this.lessThanOrEquals = lessThanOrEquals;
226226
}
227227

228-
public Rule withLessThanOrEquals(int lessThanOrEquals) {
228+
public Rule withLessThanOrEquals(String lessThanOrEquals) {
229229
setLessThanOrEquals(lessThanOrEquals);
230230
return this;
231231
}
