Releases: ICIJ/datashare
5.7.2
5.7.1
5.7.0
DataShare v0.6
DataShare
DataShare aims to let valuable knowledge about people and companies,
locked within hundreds of pages of documents inside a computer, be sieved
into indexes and shared securely within a network of trusted individuals,
fostering unforeseen collaboration and prompting new and better investigations
that uncover corruption, transnational crime and abuse of power.
DataShare: connecting local data with a global collective intelligence
Current Features
An Open-ended Multilingual Information Extraction and Search Platform
- Extract Text from Files;
- Extract Organizations, Persons and Locations from Text;
- Index and Search all
Multithreaded processing
Distributed processing
Remote or Embedded Index
Web API
Extract Text from Files
API
- org.icij.datashare.text.extraction.FileParser
- org.icij.datashare.text.SourcePath
- org.icij.datashare.text.Document
Implementations
- org.icij.datashare.text.extraction.tika.TikaFileParser
  Apache Tika v1.14 (Apache Licence v2.0)
  with Tess4J v3.3.0 (Apache Licence v2.0),
  Tesseract v4.0 alpha compiled for arch x86-64
Support
Extract Persons, Organizations or Locations from Text
API
- org.icij.datashare.text.nlp.NlpPipeline
- org.icij.datashare.text.Document
- org.icij.datashare.text.Language
- org.icij.datashare.text.nlp.Annotation
- org.icij.datashare.text.NamedEntity
Implementations
- org.icij.datashare.text.nlp.core.CoreNlpPipeline
  Stanford CoreNLP v3.7.0 (Conditional Random Fields), Composite GPL v3+
- org.icij.datashare.text.nlp.gate.GateNlpPipeline
  OEG UPM Entity Extractor v1.1 (JAPE Rules Grammar), based on EPSRC Gate v8.11, LGPL v3
- org.icij.datashare.text.nlp.ixa.IxaNlpPipeline
  Ixa Pipes Nerc v1.6.1 (Perceptron), Apache Licence v2.0
- org.icij.datashare.text.nlp.mitie.MitieNlpPipeline
  MIT Information Extraction v0.8 (Structural Support Vector Machines), Boost Software License v1.0
- org.icij.datashare.text.nlp.open.OpenNlpPipeline
  Apache OpenNLP v1.7.2 (Maximum Entropy), Apache Licence v2.0
Natural Language Processing Stages Support
NlpStage |
---|
TOKEN |
SENTENCE |
POS |
NER |
Named Entity Recognition Language Support
NlpStage.NER | ENGLISH | SPANISH | FRENCH | GERMAN |
---|---|---|---|---|
NlpPipeline.Type.GATE | X | X | X | X |
NlpPipeline.Type.CORE | X | X | - | X |
NlpPipeline.Type.OPEN | X | X | X | - |
NlpPipeline.Type.IXA | X | X | - | X |
NlpPipeline.Type.MITIE | X | X | - | - |
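The support matrix above can also be written down as data, which makes it easy to look up the pipelines that cover a given language. The Python sketch below is illustrative only: the dictionary simply transcribes the table, and `pipelines_for` is a hypothetical helper, not part of the DataShare API.

```python
# NER language support, transcribed from the table above.
# Keys mirror the NlpPipeline.Type names; this dict is illustrative, not DataShare code.
NER_SUPPORT = {
    "GATE": {"ENGLISH", "SPANISH", "FRENCH", "GERMAN"},
    "CORE": {"ENGLISH", "SPANISH", "GERMAN"},
    "OPEN": {"ENGLISH", "SPANISH", "FRENCH"},
    "IXA": {"ENGLISH", "SPANISH", "GERMAN"},
    "MITIE": {"ENGLISH", "SPANISH"},
}


def pipelines_for(language):
    """Return the pipeline types whose NER stage supports `language` (hypothetical helper)."""
    return [name for name, languages in NER_SUPPORT.items() if language in languages]


print(pipelines_for("FRENCH"))
print(pipelines_for("GERMAN"))
```

A lookup like this could be used to choose the value passed to the pipeline option before launching a run on a corpus in a known language.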
Named Entity Categories Support
NamedEntity.Category |
---|
ORGANIZATION |
PERSON |
LOCATION |
Parts-of-Speech Language Support
NlpStage.POS | ENGLISH | SPANISH | FRENCH | GERMAN |
---|---|---|---|---|
NlpPipeline.Type.GATE | - | - | - | - |
NlpPipeline.Type.CORE | X | X | X | X |
NlpPipeline.Type.OPEN | X | X | X | X |
NlpPipeline.Type.IXA | X | X | X | X |
NlpPipeline.Type.MITIE | - | - | - | - |
Store and Search Documents and Named Entities
API
- org.icij.datashare.text.indexing.Indexer
- org.icij.datashare.text.Document
- org.icij.datashare.text.NamedEntity
Implementations
- org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer
  Elasticsearch v5.3.0, Apache Licence v2.0
Compilation / Build
From the datashare root directory, type: mvn package
Usage
Distribution Directory Structure
Build process yields the following structure
datashare-dist-<VERSION>-all
|__ lib
|__ logs
|__ opt
|__ src
|__ start-cli
|__ start-cli-with-idx
|__ start-idx
|__ stop-idx
|__ start-ws
|__ start-ws-with-idx
|__ stop-ws
Execution
Requirements:
- Version: JRE8+
- Memory: 8+GB
Command-Line Interface
./start-cli
--stages, -s: Processing stages to be run. Defaults to all: {SCANNING, PARSING, NLP}.
--node, -n: Run as a cluster node.
SCANNING
--scanning-input-dir, -i: Path to the source directory containing the documents to be processed.
PARSING
--parsing-ocr, -ocr: Enable OCR when parsing source documents.
--parsing-parallelism, -prst: Number of file parser threads. Defaults to 1.
NLP
--nlp-pipelines, -nlpp: NLP pipelines to be run; in {GATE, CORE, OPEN, MITIE, IXA}. Defaults to GATE.
--nlp-parallelism, -nlpt: Number of threads per NLP pipeline. Defaults to 1.
--nlp-stages, -nlps: NLP stages to be run by pipelines; in {POS, NER}. Defaults to NER.
--nlp-ner-categories, -nlpnerc: Named entity categories to be extracted. Defaults to all: {ORGANIZATION, PERSON, LOCATION}.
--nlp-no-caching, -nlpnocach: Disable caching of pipeline's models and annotators.
INDEXING
--indexing-node-type, -idxtype: Index node type; in {LOCAL, REMOTE}. Defaults to LOCAL.
--indexing-hostnames, -idxhosts: Remote indexing nodes hostnames to connect to.
--indexing-hostports, -idxports: Remote indexing nodes ports to connect on.
Command examples:
Stand-alone
- ./start-cli-with-idx --scanning-input-dir path/to/source/docs/
- ./start-cli-with-idx --scanning-input-dir path/to/source/docs/ --parsing-ocr --nlp-pipelines OPEN,CORE --nlp-stages NER --nlp-ner-categories PERS,ORG
Node
- ./start-cli --node --stages SCANNING,PARSING --scanning-input-dir path/to/source/docs/ --parsing-ocr --indexing-hostnames http://192.168.0.1 --indexing-hostports 9300
- ./start-cli --node --stages NLP --nlp-pipelines OPEN,CORE --nlp-stages NER --nlp-ner-categories PERS,ORG --parsing-ocr --indexing-hostnames http://192.168.0.1 --indexing-hostports 9300
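When runs are launched from a script, it can help to assemble the argument list programmatically and reject values outside the documented sets before starting the JVM. The Python sketch below uses only the long option names from the option list above; `build_cli_args` is a hypothetical helper, and the comma-separated value syntax is assumed from the command examples.

```python
import shlex

# Values accepted by --nlp-pipelines and --nlp-stages, as documented above.
PIPELINES = {"GATE", "CORE", "OPEN", "MITIE", "IXA"}
NLP_STAGES = {"POS", "NER"}


def build_cli_args(input_dir, pipelines=("GATE",), nlp_stages=("NER",), ocr=False,
                   script="./start-cli-with-idx"):
    """Build the argument list for a stand-alone run (hypothetical helper)."""
    unknown = set(pipelines) - PIPELINES
    if unknown:
        raise ValueError(f"unknown NLP pipelines: {sorted(unknown)}")
    unknown = set(nlp_stages) - NLP_STAGES
    if unknown:
        raise ValueError(f"unknown NLP stages: {sorted(unknown)}")
    args = [
        script,
        "--scanning-input-dir", input_dir,
        "--nlp-pipelines", ",".join(pipelines),
        "--nlp-stages", ",".join(nlp_stages),
    ]
    if ocr:
        args.append("--parsing-ocr")
    return args


args = build_cli_args("path/to/source/docs/", pipelines=["OPEN", "CORE"], ocr=True)
# Print the command line; subprocess.run(args, check=True) would launch it.
print(shlex.join(args))
```

Keeping the arguments as a list rather than a single string avoids quoting problems when the input directory contains spaces.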
Web Server
./start-ws
See all routes at datashare/datashare-web/datashare-web-play/conf/routes
Processing examples:
- curl -XPOST 'localhost:9000/datashare/process/<INPUT_DIR>'
- curl -XPOST 'localhost:9000/datashare/process/<INPUT_DIR>?parallelism=2'
NB: the concrete INPUT_DIR is evaluated on the web server and must be percent-encoded, e.g. %2Fpath%2Fto%2Fsource%2Fdocs
TODO: pass options as JSON
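Because INPUT_DIR is embedded in the URL path, every slash in it has to be encoded as well. The Python sketch below shows one way to build the processing URL with the standard library; `process_url` is a hypothetical helper, and the host and port are simply those used in the curl examples.

```python
from urllib.parse import quote, urlencode

# Host and port taken from the curl examples; adjust for your own server.
BASE_URL = "http://localhost:9000/datashare"


def process_url(input_dir, **options):
    """Return the processing route for `input_dir`, percent-encoding the whole path."""
    # safe="" makes quote() encode "/" as %2F too, which the route requires.
    url = f"{BASE_URL}/process/{quote(input_dir, safe='')}"
    if options:
        url += "?" + urlencode(options)
    return url


print(process_url("/path/to/source/docs"))
print(process_url("/path/to/source/docs", parallelism=2))
```

The resulting string can be passed to `curl -XPOST` as in the examples above.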
Indexing examples:
- list all indices: curl -XGET 'localhost:9000/datashare/index'
- commit index: curl -XPUT 'localhost:9000/datashare/index/<INDEX>'
- delete index: curl -XDELETE 'localhost:9000/datashare/index/<INDEX>'
- search all indices: curl -XPOST 'localhost:9000/datashare/index?<QUERY_STRING>'
- search index/type/query: curl -XPOST 'localhost:9000/datashare/index/<INDEX>/<TYPE>?<QUERY_STRING>'
TODO: pass options as JSON
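The list, commit and delete routes differ only in HTTP method and path, so they map naturally onto a small lookup table. The Python sketch below builds the requests without sending them; `index_request` and the index name `my-index` are hypothetical, and the search routes are left out because the format of QUERY_STRING is not specified here.

```python
from urllib.request import Request

# Host and port taken from the curl examples; adjust for your own server.
BASE_URL = "http://localhost:9000/datashare"


def index_request(action, index=None):
    """Build (but do not send) the request for one of the index routes listed above."""
    if action != "list" and not index:
        raise ValueError("an index name is required for this action")
    routes = {
        "list": ("GET", "/index"),
        "commit": ("PUT", f"/index/{index}"),
        "delete": ("DELETE", f"/index/{index}"),
    }
    method, path = routes[action]
    return Request(BASE_URL + path, method=method)


req = index_request("delete", "my-index")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it to a running web server.
```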
Index
./start-idx
Starts an index instance on the local machine.
Documentation
Browse the JavaDoc from datashare/doc/index.html
License
DataShare is released under the GNU General Public License
Feedback
We welcome feedback as well as contributions!
For any bug, question, comment or (pull) request,
please contact us at [email protected] or [email protected]
What's next
- Test suite
- Integrate Extract
- Web graphical user interface
- Data Sharing module
- Networking module
- Content Management module
- User Management module
- Request and Exchange Protocol
DataShare
DataShare
DataShare aims to let valuable knowledge about people and companies,
locked within hundreds of pages of documents inside a computer, be sieved
into indexes and shared securely within a network of trusted individuals,
fostering unforeseen collaboration and prompting new and better investigations
that uncover corruption, transnational crime and abuse of power.
DataShare: connecting local data with a global collective intelligence
Features
An Open-ended Multilingual Information Extraction and Search Platform
Data Sharing module to come...
Extract Text from Files
API
org.icij.datashare.text.extraction.FileParser
Implementations
- org.icij.datashare.text.extraction.tika.TikaFileParser
  Apache Tika v1.14 (Apache licence)
Support
Data Structures
org.icij.datashare.text.Language
org.icij.datashare.text.Document
Extract Persons, Organizations or Locations from Text
API
org.icij.datashare.text.nlp.NlpPipeline
Implementations
- org.icij.datashare.text.nlp.core.CoreNlpPipeline
  Stanford CoreNLP v3.6.0 (Conditional Random Fields), Composite GPL Version 3+ Licence
- org.icij.datashare.text.nlp.open.OpenNlpPipeline
  Apache OpenNLP v1.6.0 (Maximum Entropy), Apache Licence Version 2.0
- org.icij.datashare.text.nlp.gate.GateNlpPipeline
  OEG UPM Entity Extractor v1.1 (JAPE Rules Grammar), based on EPSRC Gate v8.11, LGPL v3
- org.icij.datashare.text.nlp.mitie.MitieNlpPipeline
  MIT Information Extraction v0.8 (Structural Support Vector Machines), Boost Software License Version 1.0
- org.icij.datashare.text.nlp.ixa.IxaNlpPipeline
  Ixa Pipes Nerc v1.6.1 (Perceptron), Apache Licence Version 2.0
Natural Language Processing Stages Support
NlpStage |
---|
TOKEN |
SENTENCE |
POS |
NER |
Named Entity Recognition Language Support
NlpStage.NER | Language.ENGLISH | Language.SPANISH | Language.FRENCH | Language.GERMAN |
---|---|---|---|---|
NlpPipeline.Type.CORE | X | X | - | X |
NlpPipeline.Type.OPEN | X | X | X | - |
NlpPipeline.Type.GATE | X | X | X | X |
NlpPipeline.Type.MITIE | X | X | - | - |
NlpPipeline.Type.IXA | X | X | - | X |
Named Entity Categories Support
NamedEntity.Category |
---|
ORGANIZATION |
PERSON |
LOCATION |
Parts-of-Speech Language Support
NlpStage.POS | Language.ENGLISH | Language.SPANISH | Language.FRENCH | Language.GERMAN |
---|---|---|---|---|
NlpPipeline.Type.CORE | X | X | X | X |
NlpPipeline.Type.OPEN | X | X | X | X |
NlpPipeline.Type.IXA | X | X | X | X |
Data Structures
org.icij.datashare.text.Language
org.icij.datashare.text.Document
org.icij.datashare.text.NamedEntity
org.icij.datashare.text.nlp.NlpStage
org.icij.datashare.text.nlp.NlpPipeline
org.icij.datashare.text.nlp.Tag
org.icij.datashare.text.nlp.Annotation
Store and Search Documents and Named Entities
API
- org.icij.datashare.text.indexing.Indexer
Implementations
- org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer
  Elasticsearch v5.1.1 (Apache licence v2)
Data Structures
org.icij.datashare.text.NamedEntity
org.icij.datashare.text.Document
Usage
Distribution Directory Structure
Build process yields the following structure
datashare-dist-<VERSION>-all
|__ dist
|__ lib
|__ logs
|__ scr
|__ src
|__ start-cli
|__ start-idx
|__ start-ws
|__ stop-idx
|__ stop-ws
Execution
Requirements:
- Version: JRE8+
- File encoding: UTF-8
- Memory: 8+GB
Command-Line Interface
./start-cli
--input-dir, -i: Path to the source directory containing the documents to be processed. Required.
--output-dir, -o: Path to the directory where result files are written. Defaults to the system /tmp directory.
--pipeline, -p: NLP pipelines to be run; in {GATE, CORE, OPEN, MITIE, IXA}. Defaults to GATE.
--parallelism, -t: Number of threads per NLP pipeline. Defaults to 1.
--stages, -s: NLP stages to be run by pipelines; in {POS, NER}. Defaults to NER.
--entities, -e: Named entity categories to be extracted. Defaults to all: {ORGANIZATION, PERSON, LOCATION}.
--no-caching: Disable caching of pipeline's models and annotators. Default is --caching.
--ocr: Enable OCR when parsing source documents. Install Tesseract beforehand; very slow currently. Defaults to --no-ocr.
examples:
start-cli --input-dir path/to/source/docs/
start-cli --input-dir path/to/source/docs/ -p OPEN,CORE -s POS,NER -e PERS,ORG --ocr
Web Server
./start-ws
See all routes at datashare/datashare-web/datashare-web-play/conf/routes
Processing examples:
curl -XPOST 'localhost:9000/datashare/process/local/<INPUT_DIR>'
curl -XPOST 'localhost:9000/datashare/process/local/<INPUT_DIR>?parallelism=2'
NB: the concrete INPUT_DIR is evaluated on the web server and must be percent-encoded, e.g. %2Fpath%2Fto%2Fsource%2Fdocs
Indexing examples:
- list all indices:
curl -XGET 'localhost:9000/datashare/index'
- commit index:
curl -XPUT 'localhost:9000/datashare/index/<INDEX>'
- delete index:
curl -XDELETE 'localhost:9000/datashare/index/<INDEX>'
- search all indices:
curl -XPOST 'localhost:9000/datashare/index/<QUERY_STRING>'
- search index/type/query:
curl -XPOST 'localhost:9000/datashare/index/<INDEX>/<TYPE>/<QUERY_STRING>'
Index
./start-idx
Compilation / Build
From the datashare root directory, type: mvn package
Source Directory Structure
datashare
|__ datashare-api
|__ datashare-cli
|__ datashare-dist
|__ datashare-extract
    |__ datashare-extract-tika
|__ datashare-index
    |__ datashare-index-elasticsearch
|__ datashare-nlp
    |__ datashare-nlp-corenlp
    |__ datashare-nlp-gate
    |__ datashare-nlp-ixapipe
    |__ datashare-nlp-mitie
    |__ datashare-nlp-opennlp
|__ datashare-web
    |__ datashare-web-play
Documentation
Browse the JavaDoc from datashare/doc/index.html
License
DataShare is released under the GNU General Public License
Feedback
We would be happy to get your feedback as well as your contributions!
For any bug, question, comment or (pull) request,
please contact us at [email protected] or [email protected]
What's next
- Test suite
- Handle Embedded documents with Tika
- Embed Tesseract (Tess4J)
- Web module graphical user interface
- Web module Security
- User Management module
- Networking module
- Data Sharing module
DataShare
DataShare
DataShare allows valuable knowledge about people and companies,
locked within hundreds of pages of documents inside a computer,
to be sieved into indexes and shared securely within a network of
trusted individuals, fostering unforeseen collaboration and prompting
new and better investigations that uncover corruption, transnational
crime and abuse of power.
DataShare: connecting local data with a global collective intelligence
Release Overview
This program extracts named entities from the documents contained in the specified input-dir.
It is controlled through a command-line interface.
Program name: datashare
Arguments:
--input-dir: Path to the source directory containing the documents to be processed.
--output-dir: Path to the directory where result files are written. Defaults to the system /tmp directory.
--nlp-pipeline: Pipelines to be run; any combination of {GATENLP, CORENLP, OPENNLP}.
--enable-ocr: Run OCR when parsing source documents. Very slow for now.
For each input-dir/document.ext, processing yields at most one CSV (semicolon-separated) result file,
output-dir/document.ext.csv, that has the following columns:
- named_entity_mention
- named_entity_category
- mention_offset
- source_document_path
- nlp_pipeline
- mention_normal_form
- mention_hash
- source_document_hash
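A result file with these columns can be read with a standard CSV reader configured for the semicolon delimiter. The Python sketch below is a minimal illustration: `read_entities` is a hypothetical helper, the sample row is made up, and it assumes the file carries no header row, which is not stated above.

```python
import csv
import io

# Column names as listed above, in the same order.
COLUMNS = [
    "named_entity_mention", "named_entity_category", "mention_offset",
    "source_document_path", "nlp_pipeline", "mention_normal_form",
    "mention_hash", "source_document_hash",
]


def read_entities(lines):
    """Yield one dict per row of a semicolon-separated result file.

    Assumes there is no header row; skip the first row if your files have one.
    """
    for row in csv.reader(lines, delimiter=";"):
        if row:  # ignore blank lines
            yield dict(zip(COLUMNS, row))


# Made-up sample row, for illustration only.
sample = io.StringIO("ICIJ;ORGANIZATION;42;/docs/a.txt;GATENLP;icij;h1;h2\n")
for entity in read_entities(sample):
    print(entity["named_entity_mention"], entity["named_entity_category"])
```

For a real file, pass an object opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.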
Usage Example
Input directory is a required argument:
datashare --input-dir /path/to/source/docs/directory/
Run GATENLP and CORENLP pipelines only:
datashare --input-dir /path/to/source/docs/directory/ --nlp-pipeline GATENLP,CORENLP
Recognize PERSON and ORGANIZATION entities only:
datashare --input-dir /path/to/source/docs/directory/ --entity-cat PERSON,ORGANIZATION
Activate OCR (very slow):
datashare --input-dir /path/to/source/docs/directory/ --enable-ocr
Feedback
We would be happy to get your feedback!
For any bug, remark, question or comment, please contact [email protected]