Commit f0e1898

Added Client::setEncoding() and associated tests
1 parent c2bb706 commit f0e1898

6 files changed

+103
-4
lines changed

README.md

Lines changed: 30 additions & 2 deletions
@@ -13,7 +13,7 @@ from documents, images and other formats.
 
 The following modes are supported:
 * **App mode**: run app JAR via command line interface
-* **Server mode**: make HTTP requests to [JSR 311 network server](http://wiki.apache.org/tika/TikaJAXRS)
+* **Server mode**: make HTTP requests to [JSR 311 network server](https://cwiki.apache.org/confluence/display/TIKA/TikaServer)
 
 Server mode is recommended because is 5 times faster, but some shared hosts don't allow run processes in background.
 
@@ -144,6 +144,12 @@ $client->getAvailableDetectors();
 $client->getAvailableParsers();
 $client->getVersion();
 ```
+
+Encoding methods:
+```php
+$client->getEncoding();
+$client->setEncoding('UTF-8');
+```
 
 Supported versions related methods:
 
@@ -219,6 +225,26 @@ $client->setTimeout($seconds);
 $client->getTimeout();
 ```
 
+## Troubleshooting
+
+### Empty responses or unexpected results
+
+This library is only a _proxy_, so if you get empty responses or unexpected results the most common cause is Tika
+itself. A simple test is to use the GUI to check the response:
+
+1. Run the Tika app without arguments: `java -jar tika-app-x.xx.jar`
+2. Drop your file or select it using _File -> Open_
+3. Wait until the metadata appears
+4. Get the text or HTML using the _View_ menu
+
+If the results are the same, take a look at [Tika's Jira](https://issues.apache.org/jira/projects/TIKA/issues)
+and open an issue if necessary.
+
+### Encoding
+
+By default the returned text is encoded as UTF-8, but there are some encoding issues when using app mode.
+The `Client::setEncoding()` method lets you set the expected encoding (this will be fixed in the upcoming 1.0 release).
+
 ## Tests
 
 Tests are designed to **cover all features for all supported versions** of Apache Tika in app mode and server mode.
@@ -229,8 +255,10 @@ There are a few samples to test against:
 * **sample3**: text recognition
 * **sample4**: unsupported media
 * **sample5**: huge text for callbacks
+* **sample6**: remote calls
+* **sample7**: text encoding
 
-## Issues
+## Known issues
 
 There are some issues found during tests, not related with this library:
 

samples/sample7.doc

25.5 KB
Binary file not shown.

samples/sample7.pdf

61.8 KB
Binary file not shown.

src/Client.php

Lines changed: 32 additions & 0 deletions
@@ -58,6 +58,13 @@ abstract class Client
      */
     protected $cache = [];
 
+    /**
+     * Text encoding
+     *
+     * @var string|null
+     */
+    protected $encoding = null;
+
     /**
      * Callback called on secuential read
      *
@@ -130,6 +137,30 @@ public static function prepare($param1 = null, $param2 = null, $options = [])
         return self::make($param1, $param2, $options, false);
     }
 
+    /**
+     * Get the encoding
+     *
+     * @return string|null
+     */
+    public function getEncoding()
+    {
+        return $this->encoding;
+    }
+
+    /**
+     * Set the encoding
+     *
+     * @param string $encoding
+     * @return $this
+     * @throws \Exception
+     */
+    public function setEncoding($encoding)
+    {
+        $this->encoding = $encoding;
+
+        return $this;
+    }
+
     /**
      * Get the callback
      *
@@ -144,6 +175,7 @@ public function getCallback()
      * Set the callback (callable or closure) for call on secuential read
      *
      * @param mixed $callback
+     * @param bool $append
      * @return $this
      * @throws \Exception
      */

src/Clients/CLIClient.php

Lines changed: 2 additions & 2 deletions
@@ -247,8 +247,8 @@ public function exec($command)
      */
     protected function getArguments($type, $file = null)
     {
-        // parameters for command
-        $arguments = [];
+        $arguments = $this->encoding ? ["--encoding={$this->encoding}"] : [];
+
         switch($type)
         {
             case 'html':

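The change above seeds the argument list with an `--encoding=` option whenever an encoding has been set. A minimal standalone sketch of that step follows; `buildEncodingArguments()` is a hypothetical helper written for illustration and is not part of the library.

```php
<?php
// Hypothetical stand-in for the first line of getArguments(); not the library's code.
function buildEncodingArguments(?string $encoding): array
{
    // A null or empty encoding adds no argument, so Tika keeps its own default.
    return $encoding ? ["--encoding={$encoding}"] : [];
}

// With an encoding set, the option is the first argument in the list.
echo implode(' ', buildEncodingArguments('UTF-8')), PHP_EOL;

// Without one, the list starts empty, as it did before this commit.
echo count(buildEncodingArguments(null)), PHP_EOL;
```

Because the option is only added when a value is set, clients that never call `setEncoding()` build the same argument list as before.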
tests/BaseTest.php

Lines changed: 39 additions & 0 deletions
@@ -597,6 +597,35 @@ public function testDirectRemoteDocumentText($file)
         }
     }
 
+    /**
+     * Encoding tests
+     *
+     * @dataProvider encodingProvider
+     *
+     * @param string $file
+     * @throws \Exception
+     */
+    public function testEncodingDocumentText($file)
+    {
+        $client =& self::$client;
+
+        if($client::MODE == 'web' && version_compare(self::$version, '1.9') == 0)
+        {
+            $this->markTestSkipped('Apache Tika 1.9 throws random "Error while processing document" errors');
+        }
+        else
+        {
+            //$client->setEncoding('UTF-8');
+
+            $this->assertThat($client->getText($file), $this->logicalAnd
+            (
+                $this->stringContains('L’espéranto'),
+                $this->stringContains('世界語'),
+                $this->stringContains('Эспера́нто')
+            ));
+        }
+    }
+
     /**
      * Test available detectors
      *
@@ -690,6 +719,16 @@ public function remoteProvider()
     ];
 }
 
+    /**
+     * File provider for encoding testing
+     *
+     * @return array
+     */
+    public function encodingProvider()
+    {
+        return $this->samples('sample7');
+    }
+
     /**
      * File provider using "samples" folder
      *

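The new test passes only if the extracted text contains all three sample strings, which fails as soon as the output is produced in an encoding that cannot represent them. The sketch below shows that check outside PHPUnit. `containsAll()` is a hypothetical helper, the sample text is made up for the example, and the script assumes the mbstring extension is loaded and the file is saved as UTF-8.

```php
<?php
// Hypothetical helper, not part of the library or its test suite:
// returns true only when every needle occurs in the text.
function containsAll(string $text, array $needles): bool
{
    foreach ($needles as $needle) {
        if (strpos($text, $needle) === false) {
            return false;
        }
    }

    return true;
}

$needles = ['L’espéranto', '世界語', 'Эспера́нто'];
$utf8    = 'L’espéranto, 世界語, Эспера́нто';

// Simulate output in the wrong encoding: ISO-8859-1 cannot represent
// the CJK, Cyrillic or curly-quote characters (assumes mbstring).
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

var_dump(containsAll($utf8, $needles));   // bool(true)
var_dump(containsAll($latin1, $needles)); // bool(false)
```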