Releases: ICIJ/datashare

5.7.2

25 Feb 17:32

release 5.7.2

5.7.1

25 Feb 17:19

release 5.7.1

5.7.0

25 Feb 16:00

release 5.7.0

DataShare v0.6

10 Apr 10:25

DataShare

DataShare aims to let valuable knowledge about people and companies, locked
within hundreds of pages of documents inside a computer, be sieved into
indexes and shared securely within a network of trusted individuals,
fostering unforeseen collaboration and prompting new and better investigations
that uncover corruption, transnational crime and abuse of power.

DataShare: connecting local data with a global collective intelligence

Current Features

An Open-ended Multilingual Information Extraction and Search Platform

  • Extract Text from Files
  • Extract Organizations, Persons and Locations from Text
  • Index and Search Documents and Named Entities

Multithreaded processing

Distributed processing

Remote or Embedded Index

Web API

Extract Text from Files

API

  • org.icij.datashare.text.extraction.FileParser

  • org.icij.datashare.text.SourcePath

  • org.icij.datashare.text.Document

Implementations

  • org.icij.datashare.text.extraction.tika.TikaFileParser

    Apache Tika v1.14 (Apache Licence v2.0)

    with Tess4J v3.3.0 (Apache Licence v2.0),
    Tesseract v4.0 alpha compiled for arch x86-64

Support

Tika File Formats

Extract Persons, Organizations or Locations from Text

API

  • org.icij.datashare.text.nlp.NlpPipeline

  • org.icij.datashare.text.Document

  • org.icij.datashare.text.Language

  • org.icij.datashare.text.nlp.Annotation

  • org.icij.datashare.text.NamedEntity

Implementations

  • org.icij.datashare.text.nlp.core.CoreNlpPipeline

    Stanford CoreNLP v3.7.0, (Conditional Random Fields),
    Composite GPL v3+

  • org.icij.datashare.text.nlp.gate.GateNlpPipeline

    OEG UPM Entity Extractor v1.1, (JAPE Rules Grammar),
    based on EPSRC Gate v8.11, LGPL v3

  • org.icij.datashare.text.nlp.ixa.IxaNlpPipeline

    Ixa Pipes Nerc v1.6.1, (Perceptron),
    Apache Licence v2.0

  • org.icij.datashare.text.nlp.mitie.MitieNlpPipeline

    MIT Information Extraction v0.8, (Structural Support Vector Machines),
    Boost Software License v1.0

  • org.icij.datashare.text.nlp.open.OpenNlpPipeline

    Apache OpenNLP v1.7.2, (Maximum Entropy),
    Apache Licence v2.0

Natural Language Processing Stages Support

NlpStage
TOKEN
SENTENCE
POS
NER

Named Entity Recognition Language Support

NlpStage.NER              ENGLISH  SPANISH  FRENCH  GERMAN
NlpPipeline.Type.GATE     X        X        X       X
NlpPipeline.Type.CORE     X        X        -       X
NlpPipeline.Type.OPEN     X        X        X       -
NlpPipeline.Type.IXA      X        X        -       X
NlpPipeline.Type.MITIE    X        X        -       -
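
As an illustration of how the support table above might be used when choosing pipelines for a corpus, here is a minimal Python sketch. The dictionary is transcribed from the table; the helper name pipelines_for is hypothetical and is not part of DataShare.

```python
# NER language support per pipeline type, transcribed from the table above.
NER_SUPPORT = {
    "GATE": {"ENGLISH", "SPANISH", "FRENCH", "GERMAN"},
    "CORE": {"ENGLISH", "SPANISH", "GERMAN"},
    "OPEN": {"ENGLISH", "SPANISH", "FRENCH"},
    "IXA": {"ENGLISH", "SPANISH", "GERMAN"},
    "MITIE": {"ENGLISH", "SPANISH"},
}


def pipelines_for(language):
    """Return the pipeline types whose NER stage supports the language (hypothetical helper)."""
    wanted = language.upper()
    return [name for name, languages in NER_SUPPORT.items() if wanted in languages]


# Comma-joined, the result has the shape expected by the pipeline selection option.
print(",".join(pipelines_for("french")))
```

A language missing from the table simply yields an empty list, which a caller could treat as "no NER pipeline available".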

Named Entity Categories Support

NamedEntity.Category
ORGANIZATION
PERSON
LOCATION

Parts-of-Speech Language Support

NlpStage.POS              ENGLISH  SPANISH  FRENCH  GERMAN
NlpPipeline.Type.GATE     -        -        -       -
NlpPipeline.Type.CORE     X        X        X       X
NlpPipeline.Type.OPEN     X        X        X       X
NlpPipeline.Type.IXA      X        X        X       X
NlpPipeline.Type.MITIE    -        -        -       -

Store and Search Documents and Named Entities

API

  • org.icij.datashare.text.indexing.Indexer

  • org.icij.datashare.text.Document

  • org.icij.datashare.text.NamedEntity

Implementations

  • org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer

    Elasticsearch v5.3.0, Apache Licence v2.0

Compilation / Build

Requires JDK 8 and Maven 3.

From datashare root directory, type: mvn package

Usage

Distribution Directory Structure

Build process yields the following structure

datashare-dist-<VERSION>-all

|__ lib

|__ logs

|__ opt

|__ src

|__ start-cli

|__ start-cli-with-idx

|__ start-idx

|__ stop-idx

|__ start-ws

|__ start-ws-with-idx

|__ stop-ws

Execution

Requirements:

  • Java: JRE 8 or later
  • Memory: 8 GB or more

Command-Line Interface

./start-cli

--stages, -s:
Processing stages to be run.
Defaults to all: {SCANNING, PARSING, NLP}.

--node, -n:
Run as a cluster node.

SCANNING

--scanning-input-dir, -i:
Path towards source directory containing documents to be processed.

PARSING

--parsing-ocr, -ocr:
Enable OCR when parsing source documents.

--parsing-parallelism, -prst:
Number of file parser threads.
Defaults to 1.

NLP

--nlp-pipelines, -nlpp:
NLP pipelines to be run; in {GATE,CORE,OPEN,MITIE,IXA}.
Defaults to GATE.

--nlp-parallelism, -nlpt:
Number of threads per NLP pipeline.
Defaults to 1.

--nlp-stages, -nlps:
NLP stages to be run by pipelines; in {POS,NER}.
Defaults to NER.

--nlp-ner-categories, -nlpnerc:
Named entity categories to be extracted.
Defaults to all: {ORGANIZATION,PERSON,LOCATION}.

--nlp-no-caching, -nlpnocach:
Disable caching of pipeline's models and annotators.

INDEXING

--indexing-node-type, -idxtype:
Index node type; in {LOCAL,REMOTE}.
Defaults to LOCAL.

--indexing-hostnames, -idxhosts:
Remote indexing nodes hostnames to connect to.

--indexing-hostports, -idxports:
Remote indexing nodes ports to connect on.

Command examples:

Stand-alone

  • ./start-cli-with-idx --scanning-input-dir path/to/source/docs/

  • ./start-cli-with-idx --scanning-input-dir path/to/source/docs/ --parsing-ocr --nlp-pipelines OPEN,CORE --nlp-stages NER --nlp-ner-categories PERSON,ORGANIZATION

Node

  • ./start-cli --node --stages SCANNING,PARSING --scanning-input-dir path/to/source/docs/ --parsing-ocr --indexing-hostnames http://192.168.0.1 --indexing-hostports 9300

  • ./start-cli --node --stages NLP --nlp-pipelines OPEN,CORE --nlp-stages NER --nlp-ner-categories PERSON,ORGANIZATION --parsing-ocr --indexing-hostnames http://192.168.0.1 --indexing-hostports 9300
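
When several nodes are launched from a script, the command line can be assembled from the long option names documented above. The following Python sketch shows one way to do so; build_cli_command is a hypothetical helper, not something shipped with DataShare, and it only formats the command without running it.

```python
import shlex


def build_cli_command(script="./start-cli", **options):
    """Turn keyword options into a start-cli command line (hypothetical helper).

    True adds a bare flag, False or None skips the option,
    and lists are joined with commas.
    """
    argv = [script]
    for name, value in options.items():
        if value is None or value is False:
            continue
        # scanning_input_dir -> --scanning-input-dir
        argv.append("--" + name.replace("_", "-"))
        if value is True:
            continue
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        argv.append(str(value))
    return shlex.join(argv)


command = build_cli_command(
    node=True,
    stages=["SCANNING", "PARSING"],
    scanning_input_dir="path/to/source/docs/",
    parsing_ocr=True,
    indexing_hostnames="http://192.168.0.1",
    indexing_hostports=9300,
)
print(command)
```

Using shlex.join keeps paths containing spaces correctly quoted when the string is pasted into a shell.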

Web Server

./start-ws

See all routes at datashare/datashare-web/datashare-web-play/conf/routes

Processing examples:

  • curl -XPOST 'localhost:9000/datashare/process/<INPUT_DIR>'

  • curl -XPOST 'localhost:9000/datashare/process/<INPUT_DIR>?parallelism=2'

NB: the concrete INPUT_DIR is resolved on the web server and must be URL-encoded, e.g. %2Fpath%2Fto%2Fsource%2Fdocs
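
The escaping described in the note can be produced with any URL-encoding routine. Below is a minimal Python sketch that builds the processing URL; the host, port and route come from the curl examples, while the helper name process_url is hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:9000/datashare"  # host and port from the curl examples


def process_url(input_dir, parallelism=None):
    """Build the processing URL for a server-side directory (hypothetical helper)."""
    # safe="" forces "/" to be percent-encoded as %2F, as the note requires.
    url = f"{BASE_URL}/process/{quote(input_dir, safe='')}"
    if parallelism is not None:
        url += "?" + urlencode({"parallelism": parallelism})
    return url


print(process_url("/path/to/source/docs", parallelism=2))
```

The resulting string can be passed as-is to curl -XPOST.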

TODO: pass options as JSON

Indexing examples:

  • list all indices: curl -XGET 'localhost:9000/datashare/index'

  • commit index: curl -XPUT 'localhost:9000/datashare/index/<INDEX>'

  • delete index: curl -XDELETE 'localhost:9000/datashare/index/<INDEX>'

  • search all indices: curl -XPOST 'localhost:9000/datashare/index?<QUERY_STRING>'

  • search index/type/query: curl -XPOST 'localhost:9000/datashare/index/<INDEX>/<TYPE>?<QUERY_STRING>'
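
For scripted use, the list, commit and delete calls above map onto plain HTTP requests. The Python sketch below only builds the request objects and does not send them; index_request and the index name my-index are hypothetical, and the search routes are left out because the exact form of QUERY_STRING is not specified here.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:9000/datashare"  # host and port from the curl examples

# HTTP method for each index operation, as in the curl examples above.
METHODS = {"list": "GET", "commit": "PUT", "delete": "DELETE"}


def index_request(action, index=None):
    """Build, without sending, the request for an index operation (hypothetical helper)."""
    if action not in METHODS:
        raise ValueError(f"unknown action: {action}")
    if action == "list":
        url = f"{BASE_URL}/index"
    else:
        if not index:
            raise ValueError("an index name is required")
        url = f"{BASE_URL}/index/{quote(index, safe='')}"
    return Request(url, method=METHODS[action])


request = index_request("commit", "my-index")
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the call against a running web server.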

See Query String syntax

TODO: pass options as JSON

Index

./start-idx

Starts an index instance on the local machine.

Documentation

Browse the JavaDoc from datashare/doc/index.html

License

DataShare is released under the GNU General Public License

Feedback

We welcome feedback as well as contributions!

For any bug, question, comment or (pull) request,

please contact us at [email protected] or [email protected]

What's next

  • Test suite

  • Integrate Extract

  • Web graphical user interface

  • Data Sharing module

    • Networking module

    • Content Management module

    • User Management module

    • Request and Exchange Protocol

DataShare

21 Dec 16:18

DataShare

DataShare aims to let valuable knowledge about people and companies, locked
within hundreds of pages of documents inside a computer, be sieved into
indexes and shared securely within a network of trusted individuals,
fostering unforeseen collaboration and prompting new and better investigations
that uncover corruption, transnational crime and abuse of power.

DataShare: connecting local data with a global collective intelligence

Features

An Open-ended Multilingual Information Extraction and Search Platform

Data Sharing module to come...

Extract Text from Files

API

  • org.icij.datashare.text.extraction.FileParser

Implementations

  • org.icij.datashare.text.extraction.tika.TikaFileParser

    Apache Tika v1.14 (Apache licence)

Support

Tika File Formats

Data Structures

  • org.icij.datashare.text.Language
  • org.icij.datashare.text.Document

Extract Persons, Organizations or Locations from Text

API

  • org.icij.datashare.text.nlp.NlpPipeline

Implementations

  • org.icij.datashare.text.nlp.core.CoreNlpPipeline

    Stanford CoreNLP v3.6.0, (Conditional Random Fields), Composite GPL Version 3+ Licence

  • org.icij.datashare.text.nlp.open.OpenNlpPipeline

    Apache OpenNLP v1.6.0, (Maximum Entropy), Apache Licence Version 2.0

  • org.icij.datashare.text.nlp.gate.GateNlpPipeline

    OEG UPM Entity Extractor, v1.1, (JAPE Rules Grammar), based on EPSRC Gate v8.11, LGPL v3

  • org.icij.datashare.text.nlp.mitie.MitieNlpPipeline

    MIT Information Extraction v0.8, (Structural Support Vector Machines), Boost Software License Version 1.0

  • org.icij.datashare.text.nlp.ixa.IxaNlpPipeline

    Ixa Pipes Nerc v1.6.1, (Perceptron), Apache Licence Version 2.0

Natural Language Processing Stages Support

NlpStage
TOKEN
SENTENCE
POS
NER

Named Entity Recognition Language Support

NlpStage.NER              Language.ENGLISH  Language.SPANISH  Language.FRENCH  Language.GERMAN
NlpPipeline.Type.CORE     X                 X                 -                X
NlpPipeline.Type.OPEN     X                 X                 X                -
NlpPipeline.Type.GATE     X                 X                 X                X
NlpPipeline.Type.MITIE    X                 X                 -                -
NlpPipeline.Type.IXA      X                 X                 -                X

Named Entity Categories Support

NamedEntity.Category
ORGANIZATION
PERSON
LOCATION

Parts-of-Speech Language Support

NlpStage.POS              Language.ENGLISH  Language.SPANISH  Language.FRENCH  Language.GERMAN
NlpPipeline.Type.CORE     X                 X                 X                X
NlpPipeline.Type.OPEN     X                 X                 X                X
NlpPipeline.Type.IXA      X                 X                 X                X

Data Structures

  • org.icij.datashare.text.Language
  • org.icij.datashare.text.Document
  • org.icij.datashare.text.NamedEntity
  • org.icij.datashare.text.nlp.NlpStage
  • org.icij.datashare.text.nlp.NlpPipeline
  • org.icij.datashare.text.nlp.Tag
  • org.icij.datashare.text.nlp.Annotation

Store and Search Documents and Named Entities

API

  • org.icij.datashare.text.indexing.Indexer

Implementations

  • org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer

    Elasticsearch, v5.1.1 (Apache licence v2)

Data Structures

  • org.icij.datashare.text.NamedEntity
  • org.icij.datashare.text.Document

Usage

Distribution Directory Structure

Build process yields the following structure

datashare-dist-<VERSION>-all

|__ dist

|__ lib

|__ logs

|__ scr

|__ src

|__ start-cli

|__ start-idx

|__ start-ws

|__ stop-idx

|__ stop-ws

Execution

Requirements:

  • Java: JRE 8 or later
  • File encoding: UTF-8
  • Memory: 8 GB or more

Command-Line Interface

./start-cli

--input-dir, -i:
Path towards source directory containing documents to be processed.
Required

--output-dir, -o:
Path towards directory where to write result files.
Defaults to system /tmp directory

--pipeline, -p:
NLP pipelines to be run; in {GATE,CORE,OPEN,MITIE,IXA}.
Defaults to GATE

--parallelism, -t:
Number of threads per NLP pipeline.
Defaults to 1

--stages, -s:
NLP stages to be run by pipelines; in {POS,NER}.
Defaults to NER

--entities, -e:
Named entity categories to be extracted.
Defaults to all: {ORGANIZATION,PERSON,LOCATION}

--no-caching:
Disable caching of pipeline's models and annotators.
Default is --caching

--ocr:
Enable OCR when parsing source documents.
Install Tesseract beforehand; very slow currently.
Defaults to --no-ocr

examples:

  • ./start-cli --input-dir path/to/source/docs/
  • ./start-cli --input-dir path/to/source/docs/ -p OPEN,CORE -s POS,NER -e PERSON,ORGANIZATION --ocr

Web Server

./start-ws

See all routes at datashare/datashare-web/datashare-web-play/conf/routes

Processing examples:

  • curl -XPOST 'localhost:9000/datashare/process/local/<INPUT_DIR>'
  • curl -XPOST 'localhost:9000/datashare/process/local/<INPUT_DIR>?parallelism=2'

NB: the concrete INPUT_DIR is resolved on the web server and must be URL-encoded, e.g. %2Fpath%2Fto%2Fsource%2Fdocs

Indexing examples:

  • list all indices: curl -XGET 'localhost:9000/datashare/index'
  • commit index: curl -XPUT 'localhost:9000/datashare/index/<INDEX>'
  • delete index: curl -XDELETE 'localhost:9000/datashare/index/<INDEX>'
  • search all indices: curl -XPOST 'localhost:9000/datashare/index/<QUERY_STRING>'
  • search index/type/query: curl -XPOST 'localhost:9000/datashare/index/<INDEX>/<TYPE>/<QUERY_STRING>'

See Query String syntax

Index

./start-idx

Compilation / Build

Requires JDK 8 and Maven 3.

From datashare root directory, type: mvn package

Source Directory Structure

datashare

|__datashare-api

|__datashare-cli

|__datashare-dist

|__datashare-extract

      |__ datashare-extract-tika

|__datashare-index

      |__ datashare-index-elasticsearch

|__datashare-nlp

      |__ datashare-nlp-corenlp

      |__ datashare-nlp-gate

      |__ datashare-nlp-ixapipe

      |__ datashare-nlp-mitie

      |__ datashare-nlp-opennlp

|__datashare-web

      |__ datashare-web-play

Documentation

Browse the JavaDoc from datashare/doc/index.html

License

DataShare is released under the GNU General Public License

Feedback

We would be happy to get your feedback as well as your contributions!

For any bug, question, comment or (pull) request,

please contact us at [email protected] or [email protected]

What's next

  • Test suite
  • Handle Embedded documents with Tika
  • Embed Tesseract (Tess4J)
  • Web module graphical user interface
  • Web module Security
  • User Management module
  • Networking module
  • Data Sharing module

DataShare

23 May 20:23
DataShare Pre-release

DataShare

DataShare allows for valuable knowledge about people and companies
locked within hundreds of pages of documents inside a computer
to be sieved into indexes and shared securely within a network of
trusted individuals, fostering unforeseen collaboration and prompting
new and better investigations that uncover corruption, transnational
crime and abuse of power.

DataShare: connecting local data with a global collective intelligence

Release Overview

This program extracts named entities from the documents contained in the specified input-dir.

The program is controlled through a command-line interface.

Program name: datashare

Arguments:

  • --input-dir: Path towards source directory containing documents to be processed
  • --output-dir: Path towards directory where to write result files. Defaults to system /tmp directory.
  • --nlp-pipeline: Pipelines to be run; any combination of {GATENLP, CORENLP, OPENNLP}.
  • --enable-ocr: Run OCR when parsing source documents. Very slow for now.

For each input-dir/document.ext, processing yields at most one semicolon-separated CSV result file, output-dir/document.ext.csv, with the following columns:

  • named_entity_mention
  • named_entity_category
  • mention_offset
  • source_document_path
  • nlp_pipeline
  • mention_normal_form
  • mention_hash
  • source_document_hash
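
To show how such a result file might be consumed downstream, here is a minimal Python sketch that reads the semicolon-separated columns listed above. It assumes the columns appear in the listed order and tolerates an optional header row, since the release notes do not say whether one is written. The helper name read_entities and the sample line are made up for illustration.

```python
import csv
import io

# Column order as listed above.
COLUMNS = [
    "named_entity_mention",
    "named_entity_category",
    "mention_offset",
    "source_document_path",
    "nlp_pipeline",
    "mention_normal_form",
    "mention_hash",
    "source_document_hash",
]


def read_entities(stream):
    """Yield one dict per named entity from a semicolon-separated result file."""
    reader = csv.DictReader(stream, fieldnames=COLUMNS, delimiter=";")
    for row in reader:
        # Assumption: a header row may or may not be present; skip it if found.
        if row["named_entity_mention"] == COLUMNS[0]:
            continue
        yield row


# Made-up sample line standing in for output-dir/document.ext.csv.
sample = "Jane Doe;PERSON;128;/docs/report.pdf;CORENLP;jane doe;hash-a;hash-b\n"
for entity in read_entities(io.StringIO(sample)):
    print(entity["named_entity_category"], entity["named_entity_mention"])
```

For a real file, the stream would come from open(path, newline="", encoding="utf-8") rather than io.StringIO.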

Usage Example

Input directory is a required argument:

datashare --input-dir /path/to/source/docs/directory/

Run GATENLP and CORENLP pipelines only:

datashare --input-dir /path/to/source/docs/directory/ --nlp-pipeline GATENLP,CORENLP

Recognize PERSON and ORGANIZATION entities only:

datashare --input-dir /path/to/source/docs/directory/ --entity-cat PERSON,ORGANIZATION

Activate OCR (very slow):

datashare --input-dir /path/to/source/docs/directory/ --enable-ocr

Feedback

We would be happy to get your feedback!

For any bug, remark, question or comment, please contact [email protected]