DataShare aims to let valuable knowledge about people and companies,
locked within hundreds of pages of documents on a computer, be sieved
into indexes and shared securely within a network of trusted individuals,
fostering unforeseen collaboration and prompting new and better investigations
that uncover corruption, transnational crime and abuse of power.

DataShare: connecting local data with a global collective intelligence

Current Features

An Open-ended Multilingual Information Extraction and Search Platform

Extract Text from Files;
Extract Organizations, Persons and Locations from Text;
Index and Search all Documents and Named Entities

Multithreaded processing

Distributed processing

Remote or Embedded Index

Web API

Extract Text from Files

API

org.icij.datashare.text.extraction.FileParser
org.icij.datashare.text.SourcePath
org.icij.datashare.text.Document

Implementations

org.icij.datashare.text.extraction.tika.TikaFileParser

Apache Tika v1.14 (Apache License v2.0)

with Tess4J v3.3.0 (Apache License v2.0),
Tesseract v4.0 alpha compiled for arch x86-64

Support

Tika File Formats

Extract Persons, Organizations or Locations from Text

API

org.icij.datashare.text.nlp.NlpPipeline
org.icij.datashare.text.Document
org.icij.datashare.text.Language
org.icij.datashare.text.nlp.Annotation
org.icij.datashare.text.NamedEntity

Implementations

org.icij.datashare.text.nlp.core.CoreNlpPipeline

Stanford CoreNLP v3.7.0 (Conditional Random Fields),
Composite GPL v3+

org.icij.datashare.text.nlp.gate.GateNlpPipeline

OEG UPM Entity Extractor v1.1 (JAPE Rules Grammar),
based on EPSRC Gate v8.11, LGPL v3

org.icij.datashare.text.nlp.ixa.IxaNlpPipeline

Ixa Pipes Nerc v1.6.1 (Perceptron),
Apache License v2.0

org.icij.datashare.text.nlp.mitie.MitieNlpPipeline

MIT Information Extraction v0.8 (Structural Support Vector Machines),
Boost Software License v1.0

org.icij.datashare.text.nlp.open.OpenNlpPipeline

Apache OpenNLP v1.7.2 (Maximum Entropy),
Apache License v2.0

Natural Language Processing Stages Support

`NlpStage`
`TOKEN`
`SENTENCE`
`POS`
`NER`

Named Entity Recognition Language Support

| `NlpStage.NER` | `ENGLISH` | `SPANISH` | `FRENCH` | `GERMAN` |
|---|---|---|---|---|
| `NlpPipeline.Type.GATE` | X | X | X | X |
| `NlpPipeline.Type.CORE` | X | X | - | X |
| `NlpPipeline.Type.OPEN` | X | X | X | - |
| `NlpPipeline.Type.IXA` | X | X | - | X |
| `NlpPipeline.Type.MITIE` | X | X | - | - |
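
The support matrix above can also be read as a lookup from language to usable pipelines. The following is a minimal sketch in Python, not part of DataShare itself: the `NER_SUPPORT` mapping is transcribed from the table, and the helper name `pipelines_for` is hypothetical.

```python
# NER language support, transcribed from the table above.
# This mapping and the helper below are illustrative, not DataShare code.
NER_SUPPORT = {
    "GATE": {"ENGLISH", "SPANISH", "FRENCH", "GERMAN"},
    "CORE": {"ENGLISH", "SPANISH", "GERMAN"},
    "OPEN": {"ENGLISH", "SPANISH", "FRENCH"},
    "IXA": {"ENGLISH", "SPANISH", "GERMAN"},
    "MITIE": {"ENGLISH", "SPANISH"},
}


def pipelines_for(language):
    """Return the pipeline types whose NER stage supports `language`, in table order."""
    wanted = language.upper()
    return [name for name, languages in NER_SUPPORT.items() if wanted in languages]


print(pipelines_for("FRENCH"))
print(pipelines_for("german"))
```

The returned names match the values accepted by the `--nlp-pipelines` option, so the result could be joined with commas and passed on the command line.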

Named Entity Categories Support

`NamedEntity.Category`
`ORGANIZATION`
`PERSON`
`LOCATION`

Parts-of-Speech Language Support

| `NlpStage.POS` | `ENGLISH` | `SPANISH` | `FRENCH` | `GERMAN` |
|---|---|---|---|---|
| `NlpPipeline.Type.GATE` | - | - | - | - |
| `NlpPipeline.Type.CORE` | X | X | X | X |
| `NlpPipeline.Type.OPEN` | X | X | X | X |
| `NlpPipeline.Type.IXA` | X | X | X | X |
| `NlpPipeline.Type.MITIE` | - | - | - | - |

Store and Search Documents and Named Entities

API

org.icij.datashare.text.indexing.Indexer
org.icij.datashare.text.Document
org.icij.datashare.text.NamedEntity

Implementations

org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer

Elasticsearch v5.3.0, Apache License v2.0

Compilation / Build

Requires JDK 8 and Maven 3.

From the datashare root directory, run: mvn package

Usage

Distribution Directory Structure

The build process yields the following structure:

datashare-dist-<VERSION>-all
|__ lib
|__ logs
|__ opt
|__ src
|__ start-cli
|__ start-cli-with-idx
|__ start-idx
|__ stop-idx
|__ start-ws
|__ start-ws-with-idx
|__ stop-ws

Execution

Requirements:

Java version: JRE 8+
Memory: 8+ GB

#### Command-Line Interface

./start-cli

--stages, -s:
Processing stages to be run.
Defaults to all: {SCANNING, PARSING, NLP}.

--node, -n:
Run as a cluster node.

SCANNING

--scanning-input-dir, -i:
Path towards source directory containing documents to be processed.

PARSING

--parsing-ocr, -ocr:
Enable OCR when parsing source documents.

--parsing-parallelism, -prst:
Number of file parser threads.
Defaults to 1.

NLP

--nlp-pipelines, -nlpp:
NLP pipelines to be run; in {GATE,CORE,OPEN,MITIE,IXA}.
Defaults to GATE.

--nlp-parallelism, -nlpt:
Number of threads per NLP pipeline.
Defaults to 1.

--nlp-stages, -nlps:
NLP stages to be run by pipelines; in {POS,NER}.
Defaults to NER.

--nlp-ner-categories, -nlpnerc:
Named entity categories to be extracted.
Defaults to all: {ORGANIZATION,PERSON,LOCATION}.

--nlp-no-caching, -nlpnocach:
Disable caching of pipeline's models and annotators.

INDEXING

--indexing-node-type, -idxtype:
Index node type; in {LOCAL,REMOTE}.
Defaults to LOCAL.

--indexing-hostnames, -idxhosts:
Remote indexing nodes hostnames to connect to.

--indexing-hostports, -idxports:
Remote indexing nodes ports to connect on.

Command examples:

Stand-alone

./start-cli-with-idx --scanning-input-dir path/to/source/docs/
./start-cli-with-idx --scanning-input-dir path/to/source/docs/ --parsing-ocr --nlp-pipelines OPEN,CORE --nlp-stages NER --nlp-ner-categories PERSON,ORGANIZATION

Node

./start-cli --node --stages SCANNING,PARSING --scanning-input-dir path/to/source/docs/ --parsing-ocr --indexing-hostnames http://192.168.0.1 --indexing-hostports 9300
./start-cli --node --stages NLP --nlp-pipelines OPEN,CORE --nlp-stages NER --nlp-ner-categories PERSON,ORGANIZATION --parsing-ocr --indexing-hostnames http://192.168.0.1 --indexing-hostports 9300
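
When the command line is assembled from a script, building it as an argument list avoids quoting mistakes. The sketch below is an illustration in Python, not a DataShare API: the function name `build_cli_command` is hypothetical, it only uses the long options documented in the option list above, and it assumes multi-valued options take comma-separated values as in the examples. Its defaults mirror the documented defaults (GATE, NER, all categories).

```python
import shlex


def build_cli_command(input_dir, pipelines=("GATE",), stages=("NER",),
                      categories=("ORGANIZATION", "PERSON", "LOCATION"),
                      ocr=False, launcher="./start-cli-with-idx"):
    """Assemble an argument list from the long options documented above.

    Hypothetical helper: it only builds the list and does not run anything.
    """
    argv = [launcher, "--scanning-input-dir", input_dir]
    if ocr:
        argv.append("--parsing-ocr")
    # Assumption: multi-valued options are passed as comma-separated lists.
    argv += ["--nlp-pipelines", ",".join(pipelines)]
    argv += ["--nlp-stages", ",".join(stages)]
    argv += ["--nlp-ner-categories", ",".join(categories)]
    return argv


command = build_cli_command("path/to/source/docs/", pipelines=("OPEN", "CORE"), ocr=True)
print(shlex.join(command))  # shell-quoted form, for display or logging
```

The list can be handed to `subprocess.run(command, check=True)` from the distribution directory; `shlex.join` requires Python 3.8 or later.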

Web Server

./start-ws

See all routes at datashare/datashare-web/datashare-web-play/conf/routes

Processing examples:

curl -XPOST 'localhost:9000/datashare/process/<INPUT_DIR>'
curl -XPOST 'localhost:9000/datashare/process/<INPUT_DIR>?parallelism=2'

NB: the concrete INPUT_DIR is evaluated on the web server and must be URL-escaped, e.g. %2Fpath%2Fto%2Fsource%2Fdocs
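
The escaping described in the note can be produced with Python's standard library. This is a minimal sketch: the helper name `process_url` is hypothetical, and the `http://` scheme in the default base URL is an assumption (the curl examples omit the scheme). Passing `safe=""` to `quote` makes it escape `/` as well, which it would otherwise leave untouched.

```python
from urllib.parse import quote


def process_url(input_dir, parallelism=None, base="http://localhost:9000"):
    """Build the processing URL for `input_dir`, percent-escaping the whole path.

    Hypothetical helper; the route layout follows the curl examples above.
    """
    escaped = quote(input_dir, safe="")  # safe="" so that "/" becomes %2F
    url = f"{base}/datashare/process/{escaped}"
    if parallelism is not None:
        url += f"?parallelism={parallelism}"
    return url


print(process_url("/path/to/source/docs"))
print(process_url("/path/to/source/docs", parallelism=2))
```

The resulting string can be used directly as the curl target, for example with `curl -XPOST`.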

TODO: pass options as JSON

Indexing examples:

list all indices: curl -XGET 'localhost:9000/datashare/index'
commit index: curl -XPUT 'localhost:9000/datashare/index/<INDEX>'
delete index: curl -XDELETE 'localhost:9000/datashare/index/<INDEX>'
search all indices: curl -XPOST 'localhost:9000/datashare/index?<QUERY_STRING>'
search index/type/query: curl -XPOST 'localhost:9000/datashare/index/<INDEX>/<TYPE>?<QUERY_STRING>'

See Query String syntax

TODO: pass options as JSON

Index

./start-idx

Starts an index instance on the local machine.

Documentation

Browse the JavaDoc from datashare/doc/index.html

License

DataShare is released under the GNU General Public License

Feedback

We welcome feedback as well as contributions!

For any bug, question, comment or (pull) request, please contact us at [email protected] or [email protected]

What's next

Test suite
Integrate Extract
Web graphical user interface
Data Sharing module
- Networking module
- Content Management module
- User Management module
- Request and Exchange Protocol

21 Dec 16:18

Julm

v0.5

7bb91ba

DataShare

DataShare: connecting local data with a global collective intelligence

Features

An Open-ended Multilingual Information Extraction and Search Platform

Data Sharing module to come...

Extract Text from Files

API

org.icij.datashare.text.extraction.FileParser

Implementations

org.icij.datashare.text.extraction.tika.TikaFileParser

Apache Tika v1.14 (Apache License)

Support

Tika File Formats

Data Structures

org.icij.datashare.text.Language
org.icij.datashare.text.Document

Extract Persons, Organizations or Locations from Text

API

org.icij.datashare.text.nlp.NlpPipeline

Implementations

org.icij.datashare.text.nlp.core.CoreNlpPipeline

Stanford CoreNLP v3.6.0 (Conditional Random Fields), Composite GPL v3+

org.icij.datashare.text.nlp.open.OpenNlpPipeline

Apache OpenNLP v1.6.0 (Maximum Entropy), Apache License v2.0

org.icij.datashare.text.nlp.gate.GateNlpPipeline

OEG UPM Entity Extractor v1.1 (JAPE Rules Grammar), based on EPSRC Gate v8.11, LGPL v3

org.icij.datashare.text.nlp.mitie.MitieNlpPipeline

MIT Information Extraction v0.8 (Structural Support Vector Machines), Boost Software License v1.0

org.icij.datashare.text.nlp.ixa.IxaNlpPipeline

Ixa Pipes Nerc v1.6.1 (Perceptron), Apache License v2.0

Natural Language Processing Stages Support

`NlpStage`
`TOKEN`
`SENTENCE`
`POS`
`NER`

Named Entity Recognition Language Support

| `NlpStage.NER` | `Language.ENGLISH` | `Language.SPANISH` | `Language.FRENCH` | `Language.GERMAN` |
|---|---|---|---|---|
| `NlpPipeline.Type.CORE` | X | X | - | X |
| `NlpPipeline.Type.OPEN` | X | X | X | - |
| `NlpPipeline.Type.GATE` | X | X | X | X |
| `NlpPipeline.Type.MITIE` | X | X | - | - |
| `NlpPipeline.Type.IXA` | X | X | - | X |

Named Entity Categories Support

`NamedEntity.Category`
`ORGANIZATION`
`PERSON`
`LOCATION`

Parts-of-Speech Language Support

| `NlpStage.POS` | `Language.ENGLISH` | `Language.SPANISH` | `Language.FRENCH` | `Language.GERMAN` |
|---|---|---|---|---|
| `NlpPipeline.Type.CORE` | X | X | X | X |
| `NlpPipeline.Type.OPEN` | X | X | X | X |
| `NlpPipeline.Type.IXA` | X | X | X | X |

Data Structures

org.icij.datashare.text.Language
org.icij.datashare.text.Document
org.icij.datashare.text.NamedEntity
org.icij.datashare.text.nlp.NlpStage
org.icij.datashare.text.nlp.NlpPipeline
org.icij.datashare.text.nlp.Tag
org.icij.datashare.text.nlp.Annotation

Store and Search Documents and Named Entities

API

org.icij.datashare.text.indexing.Indexer

Implementations

org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer

Elasticsearch v5.1.1 (Apache License v2.0)

Data Structures

org.icij.datashare.text.NamedEntity
org.icij.datashare.text.Document

Usage

Distribution Directory Structure

The build process yields the following structure:

datashare-dist-<VERSION>-all
|__ dist
|__ lib
|__ logs
|__ scr
|__ src
|__ start-cli
|__ start-idx
|__ start-ws
|__ stop-idx
|__ stop-ws

Execution

Requirements:

Java version: JRE 8+
File encoding: UTF-8
Memory: 8+ GB

#### Command-Line Interface

./start-cli

--input-dir, -i:
Path towards source directory containing documents to be processed.
Required

--output-dir, -o:
Path towards directory where to write result files.
Defaults to system /tmp directory

--pipeline, -p:
NLP pipelines to be run; in {GATE,CORE,OPEN,MITIE,IXA}.
Defaults to GATE

--parallelism, -t:
Number of threads per NLP pipeline.
Defaults to 1

--stages, -s:
NLP stages to be run by pipelines; in {POS,NER}.
Defaults to NER

--entities, -e:
Named entity categories to be extracted.
Defaults to all: {ORGANIZATION,PERSON,LOCATION}

--no-caching:
Disable caching of pipeline's models and annotators.
Default is --caching

--ocr:
Enable OCR when parsing source documents.
Install Tesseract beforehand; very slow currently.
Defaults to --no-ocr

examples:

./start-cli --input-dir path/to/source/docs/
./start-cli --input-dir path/to/source/docs/ -p OPEN,CORE -s POS,NER -e PERSON,ORGANIZATION --ocr

Web Server

./start-ws

See all routes at datashare/datashare-web/datashare-web-play/conf/routes

Processing examples:

curl -XPOST 'localhost:9000/datashare/process/local/<INPUT_DIR>'
curl -XPOST 'localhost:9000/datashare/process/local/<INPUT_DIR>?parallelism=2'

NB: the concrete INPUT_DIR is evaluated on the web server and must be URL-escaped, e.g. %2Fpath%2Fto%2Fsource%2Fdocs

Indexing examples:

list all indices: curl -XGET 'localhost:9000/datashare/index'
commit index: curl -XPUT 'localhost:9000/datashare/index/<INDEX>'
delete index: curl -XDELETE 'localhost:9000/datashare/index/<INDEX>'
search all indices: curl -XPOST 'localhost:9000/datashare/index/<QUERY_STRING>'
search index/type/query: curl -XPOST 'localhost:9000/datashare/index/<INDEX>/<TYPE>/<QUERY_STRING>'

See Query String syntax

Index

./start-idx

Compilation / Build

Requires JDK 8 and Maven 3.

From the datashare root directory, run: mvn package

Source Directory Structure

datashare
|__ datashare-api
|__ datashare-cli
|__ datashare-dist
|__ datashare-extract
|   |__ datashare-extract-tika
|__ datashare-index
|   |__ datashare-index-elasticsearch
|__ datashare-nlp
|   |__ datashare-nlp-corenlp
|   |__ datashare-nlp-gate
|   |__ datashare-nlp-ixapipe
|   |__ datashare-nlp-mitie
|   |__ datashare-nlp-opennlp
|__ datashare-web
    |__ datashare-web-play

Documentation

Browse the JavaDoc from datashare/doc/index.html

License

DataShare is released under the GNU General Public License

Feedback

We would be happy to get your feedback as well as your contributions!

For any bug, question, comment or (pull) request, please contact us at [email protected] or [email protected]

What's next

Test suite
Handle Embedded documents with Tika
Embed Tesseract (Tess4J)
Web module graphical user interface
Web module Security
User Management module
Networking module
Data Sharing module

23 May 20:23

Julm

v0.1-alpha

465b535

DataShare Pre-release

DataShare

DataShare allows valuable knowledge about people and companies,
locked within hundreds of pages of documents on a computer,
to be sieved into indexes and shared securely within a network of
trusted individuals, fostering unforeseen collaboration and prompting
new and better investigations that uncover corruption, transnational
crime and abuse of power.

DataShare: connecting local data with a global collective intelligence

Release Overview

This program extracts named entities from the documents contained in the specified input-dir.

The program is controlled through a command-line interface.

Program name: datashare

Arguments:

--input-dir: Path towards source directory containing documents to be processed
--output-dir: Path towards directory where to write result files. Defaults to system /tmp directory.
--nlp-pipeline: Pipelines to be run; any combination of {GATENLP, CORENLP, OPENNLP}.
--enable-ocr: Run OCR when parsing source documents. Very slow for now.

For each input-dir/document.ext, processing yields at most one semicolon-separated (;) CSV result file, output-dir/document.ext.csv, with the following columns:

named_entity_mention
named_entity_category
mention_offset
source_document_path
nlp_pipeline
mention_normal_form
mention_hash
source_document_hash
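
A result file with these columns could be read back with Python's `csv` module. The sketch below rests on assumptions the release notes do not confirm: that the file has no header row, that the columns appear in the order listed above, and that `mention_offset` is an integer. The helper name `read_entities` and the sample row are made up for illustration.

```python
import csv
import io

# Column names as listed above; the order is assumed to match the file layout.
COLUMNS = [
    "named_entity_mention", "named_entity_category", "mention_offset",
    "source_document_path", "nlp_pipeline", "mention_normal_form",
    "mention_hash", "source_document_hash",
]


def read_entities(lines):
    """Yield one dict per row of a semicolon-separated result file.

    Assumes no header row and an integer mention_offset (both unverified).
    """
    reader = csv.DictReader(lines, fieldnames=COLUMNS, delimiter=";")
    for row in reader:
        row["mention_offset"] = int(row["mention_offset"])
        yield row


# Made-up sample row, for illustration only.
sample = io.StringIO("Jane Doe;PERSON;42;/docs/a.txt;CORENLP;jane doe;h1;h2\n")
for entity in read_entities(sample):
    print(entity["named_entity_mention"], entity["named_entity_category"],
          entity["mention_offset"])
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.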

Usage Example

Input directory is a required argument:

datashare --input-dir /path/to/source/docs/directory/

Run GATENLP and CORENLP pipelines only:

datashare --input-dir /path/to/source/docs/directory/ --nlp-pipeline GATENLP,CORENLP

Recognize PERSON and ORGANIZATION entities only:

datashare --input-dir /path/to/source/docs/directory/ --entity-cat PERSON,ORGANIZATION

Activate OCR (very slow):

datashare --input-dir /path/to/source/docs/directory/ --enable-ocr

Feedback

We would be happy to get your feedback!

For any bug, remark, question or comment, please contact [email protected]

Releases: ICIJ/datashare

5.7.2

5.7.1

5.7.0

DataShare v0.6
