Commit bae31c6

Update README.md
1 parent d033e77 commit bae31c6

1 file changed: +35 -50 lines changed

README.md

Lines changed: 35 additions & 50 deletions
@@ -6,7 +6,6 @@ php-text-analysis
 
 [![Total Downloads](https://poser.pugx.org/yooper/php-text-analysis/downloads)](https://packagist.org/packages/yooper/php-text-analysis)
 
-[![Latest Unstable Version](https://poser.pugx.org/yooper/php-text-analysis/v/unstable)](https://packagist.org/packages/yooper/php-text-analysis)
 
 PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language. All the documentation for this project can be found in the wiki.
 
@@ -21,66 +20,52 @@ Documentation for the library resides in the wiki.
 https://github.com/yooper/php-text-analysis/wiki
 
 
-
-
-Dictionary Installation
-=============
-
-Not required unless you use the dictionary stemmers
-
-*For Ubuntu < 16*
-```
-sudo apt-get install libpspell-dev
-sudo apt-get install php5-pspell
-sudo apt-get install aspell-en
-sudo apt-get install php5-enchant
-```
-*For Ubuntu >= 16*
-```
-sudo apt-get install libpspell-dev php7.0-pspell aspell-en php7.0-enchant
+### Tokenization
+```php
+$tokens = tokenize($text);
 ```
 
-
-*For Centos*
-```
-sudo yum install php5-pspell
-sudo yum install aspell-en
-sudo yum install php5-enchant
+To use a different tokenizer, pass the tokenizer's class name as the second argument.
+```php
+$tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);
 ```
+The default tokenizer is **\TextAnalysis\Tokenizers\GeneralTokenizer::class**. Some tokenizers require parameters to be set upon instantiation.
 
-*PHP Pecl Stem* is not currently available in php 7.0.
+### Normalization
+By default, **normalize_tokens** uses the function **strtolower** to lowercase all the tokens. To customize
+the normalization function, pass either a callable or a function name as a string; it is applied to the tokens with array_map.
 
+```php
+$normalizedTokens = normalize_tokens($tokens);
+```
 
-Tokenize
-=============
+```php
+$normalizedTokens = normalize_tokens($tokens, 'mb_strtolower');
 
-There are several tokenizers available
+$normalizedTokens = normalize_tokens($tokens, function($token){ return mb_strtoupper($token); });
+```
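
The normalization text above describes passing either a function name or a closure that is applied with array_map. A minimal sketch of that pattern follows. `sketch_normalize_tokens` is a hypothetical stand-in written for this illustration, not the library's `normalize_tokens`, and it uses single-byte string functions so that it runs without the mbstring extension.

```php
<?php
// Hypothetical stand-in for illustration only; not the library's normalize_tokens().
// The parameter is left untyped because PHP does not allow a non-null default
// on a parameter declared as callable.
function sketch_normalize_tokens(array $tokens, $normalizer = 'strtolower'): array
{
    // array_map accepts any callable: a function name string or a closure.
    return array_map($normalizer, $tokens);
}

$tokens = ['Time', 'Flies', 'LIKE', 'an', 'Arrow'];

// Default behaviour: lowercase every token.
$lower = sketch_normalize_tokens($tokens);

// A function name passed as a string.
$upper = sketch_normalize_tokens($tokens, 'strtoupper');

// A closure, here capitalizing only the first letter of each token.
$capitalized = sketch_normalize_tokens($tokens, function ($token) {
    return ucfirst(strtolower($token));
});
```

Accepting any callable keeps the helper small: the caller decides what "normalized" means, and the helper only applies it to each token.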
 
-* FixedLengthTokenizer
-* GeneralTokenizer
-* LambdaTokenizer
-* PennTreeBankTokenizer
-* RegexTokenizer
-* SentenceTokenizer
-* WhitespaceTokenizer
+### Frequency Distributions
 
-*Tokenizer Usage*
-```
-$tokenizer = new GeneralTokenizer()
-$tokens = $tokenizer->tokenize("Enter your text here");
+The call to **freq_dist** returns a [FreqDist](https://github.com/yooper/php-text-analysis/blob/master/src/Analysis/FreqDist.php) instance.
+```php
+$freqDist = freq_dist(tokenize($text));
 ```
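
For readers unfamiliar with the term, a frequency distribution maps each distinct token to its number of occurrences. The sketch below computes one with PHP built-ins only, along with the three quantities named in the removed README example (hapaxes, total tokens, total unique tokens). It does not use or reproduce the library's FreqDist class, and the variable names are illustrative.

```php
<?php
// Plain-PHP sketch; it does not use or reproduce the library's FreqDist class.
$tokens = explode(' ', 'time flies like an arrow and an arrow flies like time');

// Map each distinct token to the number of times it occurs.
$counts = array_count_values($tokens);

// Sort by descending frequency so the most common tokens come first.
arsort($counts);

$totalTokens = count($tokens);        // all tokens, duplicates included
$totalUniqueTokens = count($counts);  // distinct tokens only

// Hapaxes are tokens that occur exactly once.
$hapaxes = array_keys(array_filter($counts, function ($count) {
    return $count === 1;
}));
```

In this sentence every word except "and" appears twice, so "and" is the only hapax.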
 
-Frequency Distribution
-=============
+### Ngram Generation
+By default bigrams are generated.
+```php
+$bigrams = ngrams($tokens);
 ```
-$tokenizer = new \TextAnalysis\Tokenizers\GeneralTokenizer();
-$tokens = $tokenizer->tokenize("time flies like an arrow and an arrow flies like time");
-$freqDist = new \TextAnalysis\Analysis\FreqDist($tokens);
-$freqDist->getHapaxes(); //Get the Hapaxes
-$freqDist->getTotalTokens();
-$freqDist->getTotalUniqueTokens();
+Customize the ngrams
+```php
+// create trigrams with a pipe delimiter in between each word
+$trigrams = ngrams($tokens,3, '|');
 ```
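
To make the bigram and trigram calls above concrete, here is a sliding-window sketch of n-gram generation. `sketch_ngrams` is a hypothetical helper written for this illustration. The library's `ngrams` may be implemented differently, and the single-space default separator used here is an assumption, since the README does not state the default.

```php
<?php
// Hypothetical helper for illustration; the library's ngrams() may differ,
// and the single-space default separator here is an assumption.
function sketch_ngrams(array $tokens, int $n = 2, string $separator = ' '): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }

    $tokens = array_values($tokens); // re-index so offsets are contiguous
    $ngrams = [];
    $last = count($tokens) - $n;

    // Slide a window of $n tokens across the list, one position at a time.
    for ($i = 0; $i <= $last; $i++) {
        $ngrams[] = implode($separator, array_slice($tokens, $i, $n));
    }

    return $ngrams;
}

$tokens = ['time', 'flies', 'like', 'an', 'arrow'];

$bigrams = sketch_ngrams($tokens);           // windows of two tokens
$trigrams = sketch_ngrams($tokens, 3, '|');  // windows of three, joined by a pipe
```

A list of k tokens yields k - n + 1 n-grams, so five tokens give four bigrams and three trigrams; a list shorter than n yields an empty array.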
-Check out the API for full documentation
-https://github.com/yooper/php-text-analysis/blob/master/src/Analysis/FreqDist.php
-
 
+Dictionary Installation
+=============
+
+To do
+
+