This repository was archived by the owner on Nov 7, 2018. It is now read-only.

Fix a typo in the README.md #19

Open: wants to merge 4 commits into base: master
13 changes: 8 additions & 5 deletions README.md
@@ -12,21 +12,24 @@ Python library to extract text from any file type compatiable with [TIKA](http:/
 - [Xpdf](http://www.foolabs.com/xpdf/)

 ##### Installation
-1. Download tika-server-1.7.jar from [Apache Tika](http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.7.jar)
-2. Mac: `brew install ghostscripts` Ubuntu: `sudo apt-get install ghostscript`
+1. Download tika-server-1.16.jar from [Apache Tika](http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.16.jar)
+2. Mac: `brew install ghostscript` Ubuntu: `sudo apt-get install ghostscript`
 3. Mac: `brew install tesseract` Ubuntu: `sudo apt-get install tesseract-ocr`
 4. Mac: `brew tap homebrew/x11` and `brew install xpdf` Ubuntu: `sudo apt-get install poppler-utils`
 5. Install Python dependencies with `pip install -r requirements.txt`

 ##### Usage
 These script assume that an instance of Tika server is running.
 Starting Tika Servers
-`java -jar tika-server-1.7.jar --port 9998`
+`java -jar tika-server-1.16.jar --port 9998`

 In Python script
 ```python
-from textextraction.extractors import text_extractor
-text_extractor(doc_path=doc_path, force_convert=False)
+
+from textextraction.extractors import (TextExtraction)
+
+text = TextExtraction(doc_path).doc_to_text()
+
 ```

 ##### Tests
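
The updated Usage text assumes a Tika server is already listening on port 9998. As a rough illustration only (not part of this change set or of the `textextraction` package), the sketch below checks that a TCP connection to that port succeeds before any extraction call is made. The helper name `tika_server_reachable` and the `localhost` default are assumptions introduced for the example; only the port number and the `java -jar` command come from the README.

```python
import socket


def tika_server_reachable(host="localhost", port=9998, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only confirms that something is listening,
    not that the listener is actually a Tika server.
    """
    try:
        # create_connection raises OSError (e.g. connection refused,
        # timeout) when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if tika_server_reachable():
        print("A server is accepting connections on port 9998.")
    else:
        print("Nothing is listening on port 9998; start the server with:")
        print("  java -jar tika-server-1.16.jar --port 9998")
```

A plain socket check keeps the sketch free of extra dependencies. It could not distinguish Tika from another service bound to the same port, so it is best read as a quick sanity check rather than a health probe.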