The DIT (the harvesting component of the Data Interoperability Toolkit - https://github.com/openminted/omtd-publisher-connector-harvester) will continuously update an ‘omtd-resourcesync’ index in Elasticsearch (ES). This index will contain two different types:
- resource: keeps track of the current state of the resources to be synced
- change: logs the changes to the resources to be synced
Every time the DIT updates one of its resources (e.g. it downloads a new file F => change = ‘created’), it will perform two operations on ES:
- resource document:
  - If change == ‘created’: create a new document with F’s metadata
  - If change == ‘updated’: update F’s metadata
  - If change == ‘deleted’: delete F’s document in ES
- change document: create a new document logging the change that occurred
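The update rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the two ES types are stood in for by a plain dict and a list, and `record_change`, `doc_id`, and `now` are hypothetical names, not part of the DIT code base.

```python
from datetime import datetime, timezone

# strftime equivalent of the yyyy-MM-dd'T'HH:mm:ssZ format used in the mappings
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def record_change(resources, changes, doc_id, change, metadata=None, now=None):
    """Apply one change to the 'resource' snapshot and log it in 'change'.

    resources: dict standing in for the 'resource' type (doc_id -> document)
    changes:   list standing in for the 'change' type (append-only log)
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime(DATE_FORMAT)
    if change == "created":
        current = resources[doc_id] = dict(metadata, timestamp=stamp)
    elif change == "updated":
        current = resources[doc_id]
        current.update(metadata, timestamp=stamp)
    elif change == "deleted":
        # keep the removed document so the log entry can still describe it
        current = resources.pop(doc_id)
    else:
        raise ValueError(f"unknown change type: {change!r}")
    changes.append({
        "resource_set": current["resource_set"],
        "location": current["location"],
        "change": change,
        "lastmod": current.get("lastmod"),
        "timestamp": stamp,
    })
```

With a real cluster, the dict and list operations would become index, update, and delete requests against the ‘omtd-resourcesync’ index; the branching stays the same.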
In this way, the resource type will always contain a snapshot of the current state of the resources, from which a resource list can easily be generated. Likewise, change lists can be created and updated by querying Elasticsearch with the time interval of the changes we are interested in. The ResourceSync source will simply use the Elasticsearch index as the reference for the resources' state.
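As a rough sketch of such a time-interval query, the function below builds a search body for the change type. The field names `resource_set` and `timestamp` come from the mappings further down; filtering on `timestamp` (rather than `lastmod` or `datetime`) and the helper name `change_query` are assumptions made for illustration.

```python
from datetime import datetime, timezone

# strftime equivalent of the yyyy-MM-dd'T'HH:mm:ssZ format used in the mappings
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def change_query(resource_set, start, end):
    """Build a search body selecting the changes of one resource set in [start, end)."""
    return {
        "query": {
            "bool": {
                "must": [
                    # exact match works because resource_set is not_analyzed
                    {"term": {"resource_set": resource_set}},
                    {"range": {"timestamp": {
                        "gte": start.strftime(DATE_FORMAT),
                        "lt": end.strftime(DATE_FORMAT),
                    }}},
                ]
            }
        },
        # oldest change first, the order a change list is read in
        "sort": [{"timestamp": {"order": "asc"}}],
    }
```

The resulting dict would be sent as the body of a search request restricted to the change type of the ‘omtd-resourcesync’ index.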
ResourceSync is a flexible and powerful tool for synchronizing very large sets of resources, which may or may not be physical files. Elasticsearch or another data storage system, assisted by an update layer on top of it, makes it possible to keep track of the state of the resources without time-consuming processes (e.g. checking for changes across several million files). Moreover, this enables more sophisticated pagination techniques that avoid regenerating a whole resource list when only a few changes have occurred (e.g. it may be sufficient to regenerate a single sitemap instead of 50k).
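One possible reading of the pagination idea, sketched under assumptions: if every resource keeps a stable position in the resource list and each sitemap page holds a fixed number of entries (50,000 is the per-file limit of the sitemap protocol), only the pages containing a changed position need to be rebuilt. The function name and the notion of a stable position are hypothetical.

```python
SITEMAP_SIZE = 50000  # maximum number of URLs in one sitemap document


def pages_to_regenerate(changed_positions, page_size=SITEMAP_SIZE):
    """Return the sorted indexes of the sitemap pages that contain a changed resource."""
    return sorted({position // page_size for position in changed_positions})
```

For example, changes at positions 10, 49999 and 120000 touch only pages 0 and 2, so every other page can be left as it is.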
Here's the mapping for the resource type of the ‘omtd-resourcesync’ index:
{
  "resource": {
    "properties": {
      "resource_set": {
        "type": "string",
        "index": "not_analyzed"
      },
      "location": {
        "type": "nested",
        "properties": {
          "value": {
            "type": "string",
            "index": "not_analyzed"
          },
          "type": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      },
      "length": {
        "type": "integer",
        "index": "not_analyzed"
      },
      "md5": {
        "type": "string",
        "index": "not_analyzed"
      },
      "mime": {
        "type": "string",
        "index": "not_analyzed"
      },
      "lastmod": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      },
      "ln": {
        "type": "nested",
        "index_name": "link",
        "properties": {
          "href": {
            "type": "nested",
            "properties": {
              "value": {
                "type": "string",
                "index": "not_analyzed"
              },
              "type": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "rel": {
            "type": "string",
            "index": "not_analyzed"
          },
          "mime": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      },
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      }
    }
  }
}
For each resource, the following fields will be filled out:
- resource_set: the name of the resource set the resource belongs to
- timestamp: timestamp automatically generated by Elasticsearch when the document is created/updated
- location: can be a
  - url: complete resource address; the url_prefix parameter won't be used
  - abs_path: absolute path, which will be resolved with respect to the resource_root_dir parameter and then attached to the url_prefix
  - rel_path: relative path, which will be attached to the url_prefix
- length: length of the resource
- md5: md5 hash of the resource (may become an array of hashes to support different hashing techniques)
- mime: MIME type of the resource
- lastmod: last modification time of the resource
- ln: links to other resources; each of them can have three fields
  - rel: relationship description (e.g. describes, described by)
  - href: link to the resource, similar to location (url/abs_path/rel_path)
  - mime: MIME type of the linked resource
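The three location variants can be illustrated with a small resolver. This is a sketch, not the project's implementation: `resolve_location` is a hypothetical helper, POSIX-style paths are assumed, and an abs_path lying outside resource_root_dir is not handled.

```python
import posixpath


def resolve_location(location, url_prefix, resource_root_dir):
    """Turn a {"type": ..., "value": ...} location into the public URL of the resource."""
    kind, value = location["type"], location["value"]
    if kind == "url":
        # already a complete address, url_prefix is not used
        return value
    if kind == "abs_path":
        # make the path relative to resource_root_dir before attaching it
        value = posixpath.relpath(value, resource_root_dir)
    elif kind != "rel_path":
        raise ValueError(f"unknown location type: {kind!r}")
    return url_prefix.rstrip("/") + "/" + value.lstrip("/")
```

For instance, an abs_path of /data/files/a/b.pdf with a resource_root_dir of /data/files and a url_prefix of http://example.org/rs resolves to http://example.org/rs/a/b.pdf. The same function would apply to the href of an ln entry, since it has the same shape as location.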
Here's the mapping for the change type:
{
  "change": {
    "properties": {
      "resource_set": {
        "type": "string",
        "index": "not_analyzed"
      },
      "location": {
        "type": "nested",
        "properties": {
          "value": {
            "type": "string",
            "index": "not_analyzed"
          },
          "type": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      },
      "change": {
        "type": "string",
        "index": "not_analyzed"
      },
      "lastmod": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      },
      "datetime": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      },
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      }
    }
  }
}
For each change, the following fields will be filled out:
- resource_set: the name of the resource set the resource belongs to
- timestamp: timestamp automatically generated by Elasticsearch when the document is created/updated
- location: can be a
  - url: complete resource address; the url_prefix parameter won't be used
  - abs_path: absolute path, which will be resolved with respect to the resource_root_dir parameter and then attached to the url_prefix
  - rel_path: relative path, which will be attached to the url_prefix
- change: type of the change that occurred; can be created/updated/deleted
- lastmod: last modification time of the resource
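To show how a change document might end up in a ResourceSync change list, here is a hedged sketch that builds one `<url>` entry with the standard library. The element layout follows a reading of the ResourceSync change list format (sitemap `<loc>` and `<lastmod>` plus an `rs:md` element carrying the change type); `change_entry` is a hypothetical helper, and `loc` is assumed to be the already resolved URL of the resource.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"
# strftime equivalent of the yyyy-MM-dd'T'HH:mm:ssZ format used in the mapping
ES_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def change_entry(change_doc, loc):
    """Build the <url> element of a change list for one 'change' document."""
    if change_doc["change"] not in ("created", "updated", "deleted"):
        raise ValueError(f"unknown change type: {change_doc['change']!r}")
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    # re-serialise the stored date with a colon in the UTC offset (+00:00)
    lastmod = datetime.strptime(change_doc["lastmod"], ES_DATE_FORMAT)
    ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod.isoformat()
    ET.SubElement(url, f"{{{RS_NS}}}md", {"change": change_doc["change"]})
    return url
```

The returned element can be appended to a `<urlset>` root and serialised with `ET.tostring`; how the `datetime` field of the mapping should be represented is left open here.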
Note: the current mapping will be extended with further metadata and updated according to new versions of the ResourceSync specification.