[![Code insight](https://img.shields.io/sensiolabs/i/ec066502-0fde-4455-9fc3-8e9fe6867834.svg)](https://insight.sensiolabs.com/projects/ec066502-0fde-4455-9fc3-8e9fe6867834)
[![License](https://img.shields.io/github/license/vaites/php-apache-tika.svg)](https://github.com/vaites/php-apache-tika/blob/master/LICENSE)

- PHP Apache Tika
- ===============
+ # PHP Apache Tika

This tool provides [Apache Tika](https://tika.apache.org) bindings for PHP, allowing you to extract text and metadata
from documents, images and other formats.
@@ -21,8 +20,7 @@ Although the library contains a list of supported versions, any version of Apach
backward compatibility is maintained by the Tika team. Therefore, it is not necessary to wait for an update of the library
to work with the new versions of the tool.

- Features
- --------
+ ## Features

* Simple class interface to Apache Tika features:
* Text and HTML extraction
@@ -34,8 +32,7 @@ Features
* Compatible with Apache Tika 1.7 or greater
* Tested up to 1.18

- Requirements
- ------------
+ ## Requirements

* PHP 5.4 or greater
* [Multibyte String support](http://php.net/manual/en/book.mbstring.php)
@@ -46,8 +43,7 @@ Requirements
* Java 7 for Tika 1.10 or greater
* [Tesseract](https://github.com/tesseract-ocr/tesseract) (optional for OCR recognition)

- Installation
- ------------
+ ## Installation

Install using Composer:

@@ -61,8 +57,7 @@ If you want to use OCR you must install [Tesseract](https://github.com/tesseract

The library assumes the `tesseract` binary is in the path, so you can compile it yourself or install it using any other method.

- Usage
- -----
+ ## Usage

Start Apache Tika server with [caution](http://www.openwall.com/lists/oss-security/2015/08/13/5):

@@ -95,9 +90,61 @@ Or use to extract text from images:
You can use a URL instead of a file path and the library will download the file and pass it to Apache Tika. There's
**no need** to add `-enableUnsecureFeatures -enableFileUrl` to the command line when starting the server, as described
[here](https://wiki.apache.org/tika/TikaJAXRS#Specifying_a_URL_Instead_of_Putting_Bytes).
+
+ ### Methods
+
+ Tika related methods:
+
+     $client->getMetadata($file);
+     $client->getLanguage($file);
+     $client->getMIME($file);
+     $client->getHTML($file);
+     $client->getText($file);
+     $client->getMainText($file);
+
+ Get the version of the current Tika app/server:
+
+     $client->getVersion();
+
+ Get the full list of Apache Tika supported versions:
+
+     $client->getSupportedVersions();
+
+ Set/get a callback for sequential reading of the response:
+
+     $client->setCallback($callback);
+     $client->getCallback();
+
+ Set/get the chunk size for sequential reading:
+
+     $client->setChunkSize($size);
+     $client->getChunkSize();
+
+ Set/get JAR/Java paths (only CLI mode):
+
+     $client->setPath($path);
+     $client->getPath();
+
+     $client->setJava($java);
+     $client->getJava();
+
+ Set/get host properties (only server mode):
+
+     $client->setHost($host);
+     $client->getHost();
+
+     $client->setPort($port);
+     $client->getPort();
+
+     $client->setRetries($retries);
+     $client->getRetries();
+
+ Set/get [cURL client options](http://php.net/manual/en/function.curl-setopt.php) (only server mode):
+
+     $client->setOptions($options);
+     $client->getOptions();

- Tests
- -----
+ ## Tests

Tests are designed to **cover all features for all supported versions** of Apache Tika in app mode and server mode.
There are a few samples to test against:
@@ -108,16 +155,14 @@ There are a few samples to test against:
* **sample4**: unsupported media
* **sample5**: huge text for callbacks

- Issues
- ------------
+ ## Issues

There are some issues found during tests, not related to this library:

* 1.9 version running Java 7 on server mode throws random error 500 (*Unexpected RuntimeException*)
* 1.14 version on server mode throws random errors (*Expected ';', got ','*) when parsing image metadata
* Tesseract slows down document parsing as described in [TIKA-2359](https://issues.apache.org/jira/browse/TIKA-2359)

- Integrations
- -----
+ ## Integrations

- [Symfony2 Bundle](https://github.com/welcoMattic/ApacheTikaBundle)
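
The Methods section added in this diff lists `setCallback()` and `setChunkSize()` for sequential reading of a response, but does not show what that pattern looks like. The following is a minimal, library-independent sketch of the general idea: a stream is read in fixed-size chunks and each chunk is handed to a callback. The `readInChunks()` helper, its parameters, and the in-memory stream are hypothetical and only illustrate the technique; they are not part of php-apache-tika and say nothing about how the library implements it or what arguments it passes to the callback.

```php
<?php

/**
 * Hypothetical helper (not part of php-apache-tika): reads a stream in
 * chunks of at most $chunkSize bytes and passes each chunk to $callback.
 * Returns the total number of bytes read.
 */
function readInChunks($stream, callable $callback, $chunkSize = 1048576)
{
    $total = 0;

    while (!feof($stream)) {
        $chunk = fread($stream, $chunkSize);

        // Stop on a read error or when nothing is left to read
        if ($chunk === false || $chunk === '') {
            break;
        }

        call_user_func($callback, $chunk);
        $total += strlen($chunk);
    }

    return $total;
}

// An in-memory stream stands in for a large server response (assumption)
$stream = fopen('php://memory', 'r+');
fwrite($stream, str_repeat('a', 10));
rewind($stream);

// Collect the chunks so the effect of the chunk size is visible
$chunks = [];
$total = readInChunks($stream, function ($chunk) use (&$chunks) {
    $chunks[] = $chunk;
}, 4);

fclose($stream);
```

With a chunk size of 4 and 10 bytes of input, the callback runs once per chunk and never holds the whole payload at once, which is the usual reason to prefer this pattern for very large documents.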