@@ -6,7 +6,6 @@ php-text-analysis
[](https://packagist.org/packages/yooper/php-text-analysis)
- [](https://packagist.org/packages/yooper/php-text-analysis)
PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language. All the documentation for this project can be found in the wiki.
@@ -21,66 +20,52 @@ Documentation for the library resides in the wiki.
https://github.com/yooper/php-text-analysis/wiki
-
-
- Dictionary Installation
- =============
-
- Not required unless you use the dictionary stemmers
-
- *For Ubuntu < 16*
- ```
- sudo apt-get install libpspell-dev
- sudo apt-get install php5-pspell
- sudo apt-get install aspell-en
- sudo apt-get install php5-enchant
- ```
- *For Ubuntu >= 16*
- ```
- sudo apt-get install libpspell-dev php7.0-pspell aspell-en php7.0-enchant
+ ### Tokenization
+ ```php
+ $tokens = tokenize($text);
```
-
- *For Centos*
- ```
- sudo yum install php5-pspell
- sudo yum install aspell-en
- sudo yum install php5-enchant
+ To use a different tokenizer, pass the tokenizer's class name as the second argument.
+ ```php
+ $tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);
```
+ The default tokenizer is **\TextAnalysis\Tokenizers\GeneralTokenizer::class**. Some tokenizers require parameters to be set upon instantiation.
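
Passing a class name instead of an object can be sketched in plain PHP as follows. Both `WhitespaceSplitter` and `tokenize_with` are hypothetical names invented for this illustration; they are not part of the library, and the library's own `tokenize` helper may work differently. The sketch also assumes a tokenizer whose constructor takes no arguments, so it does not cover tokenizers that need parameters at instantiation.

```php
// Hypothetical stand-in tokenizer; not the library's implementation.
class WhitespaceSplitter
{
    public function tokenize(string $text): array
    {
        // Split on runs of whitespace and drop empty strings.
        return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    }
}

// Hypothetical helper: instantiate the named class, then delegate to it.
function tokenize_with(string $text, string $tokenizerClassName = WhitespaceSplitter::class): array
{
    $tokenizer = new $tokenizerClassName();
    return $tokenizer->tokenize($text);
}

$tokens = tokenize_with("time flies like an arrow");
```

Accepting a class name keeps the common call short while still letting callers swap in another tokenizer.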
- *PHP Pecl Stem* is not currently available in php 7.0.
+ ### Normalization
+ By default, **normalize_tokens** lowercases every token with **strtolower**. To customize
+ the normalization, pass in either a function or the name of a function as a string; it is applied to each token with array_map.
+ ```php
+ $normalizedTokens = normalize_tokens($tokens);
+ ```
- Tokenize
- =============
+ ```php
+ $normalizedTokens = normalize_tokens($tokens, 'mb_strtolower');
- There are several tokenizers available
+ $normalizedTokens = normalize_tokens($tokens, function($token){ return mb_strtoupper($token); });
+ ```
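
The two callback forms can be illustrated with `array_map` alone, without the library. This is only a sketch of the underlying idea, not the body of `normalize_tokens`. The sample tokens are made up, and `mb_strtoupper` assumes the mbstring extension is loaded.

```php
// Made-up sample tokens for illustration.
$tokens = ['Time', 'Flies', 'LIKE', 'an', 'Arrow'];

// A function name given as a string is a valid callable for array_map.
$lowercased = array_map('strtolower', $tokens);

// A closure works the same way.
$uppercased = array_map(function ($token) {
    return mb_strtoupper($token);
}, $tokens);
```

For text that is not plain ASCII, the `mb_` variants are usually the safer choice, because `strtolower` is not multibyte-aware.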
- * FixedLengthTokenizer
- * GeneralTokenizer
- * LambdaTokenizer
- * PennTreeBankTokenizer
- * RegexTokenizer
- * SentenceTokenizer
- * WhitespaceTokenizer
+ ### Frequency Distributions
- *Tokenizer Usage*
- ```
- $tokenizer = new GeneralTokenizer()
- $tokens = $tokenizer->tokenize("Enter your text here");
+ The call to **freq_dist** returns a [FreqDist](https://github.com/yooper/php-text-analysis/blob/master/src/Analysis/FreqDist.php) instance.
+ ```php
+ $freqDist = freq_dist(tokenize($text));
```
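
To give a sense of what a frequency distribution holds, here is a minimal plain-PHP sketch built on `array_count_values`. It does not use or reproduce the FreqDist class. The variable names are chosen for this illustration, and splitting on single spaces stands in for a real tokenizer.

```php
// Naive tokenization on single spaces, for illustration only.
$tokens = explode(' ', 'time flies like an arrow and an arrow flies like time');

// Map each distinct token to the number of times it occurs.
$counts = array_count_values($tokens);

// Order the distribution from most to least frequent, keeping the keys.
arsort($counts);

$totalTokens = count($tokens);
$totalUniqueTokens = count($counts);

// Hapaxes are tokens that occur exactly once.
$hapaxes = array_keys(array_filter($counts, function ($count) {
    return $count === 1;
}));
```

In this sample sentence only `and` occurs once, so it is the single hapax.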
- Frequency Distribution
- =============
+ ### Ngram Generation
+ By default, **ngrams** generates bigrams.
+ ```php
+ $bigrams = ngrams($tokens);
```
- $tokenizer = new \TextAnalysis\Tokenizers\GeneralTokenizer();
- $tokens = $tokenizer->tokenize("time flies like an arrow and an arrow flies like time");
- $freqDist = new \TextAnalysis\Analysis\FreqDist($tokens);
- $freqDist->getHapaxes(); //Get the Hapaxes
- $freqDist->getTotalTokens();
- $freqDist->getTotalUniqueTokens();
+ To customize the ngrams, pass the ngram size and a delimiter:
+ ```php
+ // create trigrams with a pipe delimiter in between each word
+ $trigrams = ngrams($tokens, 3, '|');
```
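
The sliding-window idea behind ngram generation can be sketched as below. `sketch_ngrams` is a hypothetical helper written for this illustration, not the library's `ngrams` function. The single-space default separator is an assumption, as is returning each ngram as one joined string.

```php
// Hypothetical helper, not the library's ngrams() implementation.
function sketch_ngrams(array $tokens, int $size = 2, string $separator = ' '): array
{
    $ngrams = [];
    // Index of the last window that still contains $size tokens.
    $last = count($tokens) - $size;
    for ($i = 0; $i <= $last; $i++) {
        // Join the window of $size consecutive tokens into one string.
        $ngrams[] = implode($separator, array_slice($tokens, $i, $size));
    }
    return $ngrams;
}

$tokens = ['time', 'flies', 'like', 'an', 'arrow'];
$bigrams = sketch_ngrams($tokens);
$trigrams = sketch_ngrams($tokens, 3, '|');
```

A list of n tokens yields n - size + 1 windows, so five tokens give four bigrams and three trigrams.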
- Check out the API for full documentation
- https://github.com/yooper/php-text-analysis/blob/master/src/Analysis/FreqDist.php
-
+ Dictionary Installation
+ =============
+
+ To do
+
+