Commit 82f70ba

doc: Add documentation for the language support
Author: Jesus Seijas
1 parent 3be159f commit 82f70ba

4 files changed

+69
-4
lines changed

README.md

+3-2
@@ -20,8 +20,8 @@
2020
- Natural Language Processing Classifier, to classify utterances into intents.
2121
- Natural Language Generation Manager, so from intents and conditions it can generate an answer.
2222
- NLP Manager: a tool able to manage several languages, the Named Entities for each language, the utterance and intents for the training of the classifier, and for a given utterance return the entity extraction, the intent classification and the sentiment analysis. Also, it is able to maintain a Natural Language Generation Manager for the answers.
23-
- 27 languages supported: Arabic (ar), Armenian (hy), Basque (eu), Catala (ca), Chinese (zh), Czech (cs), Danish (da), Dutch (nl), English (en), Farsi (fa), Finnish (fi), French (fr), German (de), Hungarian (hu), Indonesian (id), Irish (ga), Italian (it), Japanese (ja), Norwegian (no), Portuguese (pt), Romanian (ro), Russian (ru), Slovene (sl), Spanish (es), Swedish (sv), Tamil (ta), Turkish (tr)
24-
23+
- 27 languages with stemmers supported: Arabic (ar), Armenian (hy), Basque (eu), Catalan (ca), Chinese (zh), Czech (cs), Danish (da), Dutch (nl), English (en), Farsi (fa), Finnish (fi), French (fr), German (de), Hungarian (hu), Indonesian (id), Irish (ga), Italian (it), Japanese (ja), Norwegian (no), Portuguese (pt), Romanian (ro), Russian (ru), Slovene (sl), Spanish (es), Swedish (sv), Tamil (ta), Turkish (tr)
24+
- Any other language is supported through tokenization, even fantasy languages
2525
<div align="center">
2626
<img src="https://github.com/axa-group/nlp.js/raw/master/screenshots/hybridbot.gif" width="auto" height="auto"/>
2727
</div>
@@ -37,6 +37,7 @@
3737
- [Classification](docs/language-support.md#classification)
3838
- [Sentiment Analysis](docs/language-support.md#sentiment-analysis)
3939
- [Builtin Entity Extraction](docs/language-support.md#builtin-entity-extraction)
40+
- [Example with languages](docs/example-with-languages.md)
4041
- [Language Guesser](docs/language-guesser.md)
4142
- [Similar Search](docs/similar-search.md)
4243
- [NLP Classifier](docs/nlp-classifier.md)

docs/benchmarking.md

+3
@@ -28,11 +28,14 @@ We compute the `f1` score for each corpus and the overall `f1`:
2828
| Watson | 0.97 | 0.92 | 0.83 | 0.92 |
2929
| Botfuel | 0.98 | 0.90 | 0.80 | 0.91 |
3030
| Luis | 0.98 | 0.90 | 0.81 | 0.91 |
31+
| NLP.js (no stem) | 1.00 | 0.92 | 0.73 | 0.91 |
3132
| Snips | 0.96 | 0.83 | 0.78 | 0.89 |
3233
| Recast | 0.99 | 0.86 | 0.75 | 0.89 |
3334
| RASA | 0.98 | 0.86 | 0.74 | 0.88 |
3435
| API (DialogFlow) | 0.93 | 0.85 | 0.80 | 0.87 |
3536

37+
You can see two entries for NLP.js: the better one uses a stemmer, while the other ("no stem") relies only on the tokenizer and the classifier. The second entry is included because 27 languages are supported with stemmers, while any other language is supported through the tokenizer alone. The tokenizer-only result is still good enough: in English it sits in the middle of the table, ahead of other systems that use more advanced methods than tokenization.
38+
3639
<div align="center">
3740
<img src="https://github.com/axa-group/nlp.js/raw/master/screenshots/benchmark.png" width="auto" height="auto"/>
3841
</div>
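
As a reminder of what the `f1` values in the table measure, here is a minimal sketch of computing precision, recall and `f1` from raw counts. The function name `f1Score` and the counts are illustrative assumptions, not part of the benchmark code, and how the benchmark aggregates the overall score is not stated here.

```javascript
// Compute precision, recall and F1 from raw prediction counts.
// Hypothetical helper for illustration; not taken from the benchmark itself.
function f1Score({ truePositives, falsePositives, falseNegatives }) {
  const predicted = truePositives + falsePositives;
  const actual = truePositives + falseNegatives;
  // Guard against division by zero when there are no predictions or no positives.
  const precision = predicted === 0 ? 0 : truePositives / predicted;
  const recall = actual === 0 ? 0 : truePositives / actual;
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Made-up counts, only to show the call shape.
const result = f1Score({
  truePositives: 90,
  falsePositives: 10,
  falseNegatives: 10,
});
console.log(result.f1.toFixed(2));
```

`f1` is the harmonic mean of precision and recall, so a system scores well only when both are high.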

docs/example-with-languages.md

+59
@@ -0,0 +1,59 @@
1+
# Example with languages
2+
3+
This example shows how to handle three different language scenarios:
4+
1. The language has a stemmer
5+
2. The language exists but has no stemmer
6+
3. The language does not exist (fantasy language)
7+
8+
This example uses English, Korean and Klingon.
9+
10+
```javascript
const { NlpManager } = require('node-nlp');

const manager = new NlpManager({ languages: ['en', 'ko', 'kl'] });
// Give a name to the fantasy language
manager.describeLanguage('kl', 'Klingon');
// Train Klingon
manager.addDocument('kl', 'nuqneH', 'hello');
manager.addDocument('kl', 'maj po', 'hello');
manager.addDocument('kl', 'maj choS', 'hello');
manager.addDocument('kl', 'maj ram', 'hello');
manager.addDocument('kl', `nuqDaq ghaH ngaQHa'moHwI'mey?`, 'keys');
manager.addDocument('kl', `ngaQHa'moHwI'mey lujta' jIH`, 'keys');
// Train Korean
manager.addDocument('ko', '여보세요', 'greetings.hello');
manager.addDocument('ko', '안녕하세요!', 'greetings.hello');
manager.addDocument('ko', '여보!', 'greetings.hello');
manager.addDocument('ko', '어이!', 'greetings.hello');
manager.addDocument('ko', '좋은 아침', 'greetings.hello');
manager.addDocument('ko', '안녕히 주무세요', 'greetings.hello');
manager.addDocument('ko', '안녕', 'greetings.bye');
manager.addDocument('ko', '친 공이 타자', 'greetings.bye');
manager.addDocument('ko', '상대가 없어 남는 사람', 'greetings.bye');
manager.addDocument('ko', '지엽적인 것', 'greetings.bye');
// Train English
manager.addDocument('en', 'goodbye for now', 'greetings.bye');
manager.addDocument('en', 'bye bye take care', 'greetings.bye');
manager.addDocument('en', 'okay see you later', 'greetings.bye');
manager.addDocument('en', 'bye for now', 'greetings.bye');
manager.addDocument('en', 'i must go', 'greetings.bye');
manager.addDocument('en', 'hello', 'greetings.hello');
manager.addDocument('en', 'hi', 'greetings.hello');
manager.addDocument('en', 'howdy', 'greetings.hello');

// Also train the NLG
manager.addAnswer('en', 'greetings.bye', 'Till next time');
manager.addAnswer('en', 'greetings.bye', 'see you soon!');
manager.addAnswer('en', 'greetings.hello', 'Hey there!');
manager.addAnswer('en', 'greetings.hello', 'Greetings!');

// In a CommonJS script, `await` is only valid inside an async function,
// so training and processing are wrapped in one.
(async () => {
  // Train and save the model.
  await manager.train();
  manager.save();

  // English and Korean can be automatically detected
  console.log(await manager.process('I have to go'));
  console.log(await manager.process('상대가 없어 남는 편'));
  // Klingon cannot be automatically detected,
  // so you must provide the locale
  console.log(await manager.process('kl', `ngaQHa'moHwI'mey nIH vay'`));
})();
```
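
The two call shapes used at the end of the example (utterance only, or locale plus utterance) can be hidden behind a small wrapper. This is a minimal sketch: `processUtterance` is a hypothetical helper, not part of node-nlp, and `fakeManager` is a stand-in object so the snippet runs without a trained model. With a real `NlpManager` you would pass the manager instance instead.

```javascript
// Hypothetical helper: pass the locale only when the caller knows it,
// otherwise let the manager guess the language from the utterance.
function processUtterance(manager, utterance, locale) {
  return locale === undefined
    ? manager.process(utterance)
    : manager.process(locale, utterance);
}

// Stand-in for an NlpManager (assumption for illustration only).
// It records the arguments it receives and returns a promise, like the real call.
const calls = [];
const fakeManager = {
  process(...args) {
    calls.push(args);
    return Promise.resolve({ args });
  },
};

// Detectable language: no locale needed.
processUtterance(fakeManager, 'I have to go');
// Fantasy language: the locale must be given explicitly.
processUtterance(fakeManager, 'nuqneH', 'kl');

console.log(calls);
```

Keeping the choice in one place avoids scattering locale checks through the code when some languages can be guessed and others cannot.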

docs/language-support.md

+4-2
@@ -1,10 +1,12 @@
11
# Language Support
22

3-
There are several languages supported. The language support can be for the Stemmers or for Sentiment Analysis.
3+
Any language is supported, even fantasy languages, but 27 languages have stemmer support. Using a stemmer rather than tokenization alone does make a difference, but with good training the gap is small. You can take a look at [Benchmarking](benchmarking.md). For English, using the SIGDIAL22 corpus for comparison, the success rate is 94% with a stemmer and 91% with tokenization only, which is good enough.
4+
45
There are three types of stemmers: Natural, Snowball and Custom. Natural stemmers are those supported by the Natural library, while Snowball stemmers are ports of the Java Snowball stemmers. Custom stemmers are developed specifically for this project, outside the scope of Natural or Snowball.
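
To illustrate why a stemmer can help the classifier, here is a minimal sketch comparing tokenization alone with tokenization plus stemming. Both `tokenize` and `toyStem` are simplified assumptions written for this illustration. They are not the tokenizer or the stemmers that NLP.js actually uses.

```javascript
// Lowercase the text and split on anything that is not a letter or digit.
// Unicode property escapes keep non-Latin scripts (e.g. Korean) intact.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter(Boolean);
}

// Toy English suffix stripper, for illustration only.
// Real stemmers (such as the Snowball ones) apply far more careful rules.
function toyStem(token) {
  for (const suffix of ['ing', 'ed', 's']) {
    if (token.length > suffix.length + 2 && token.endsWith(suffix)) {
      return token.slice(0, -suffix.length);
    }
  }
  return token;
}

const tokens = tokenize('Walking walked walks');
const stems = tokens.map(toyStem);

console.log(tokens); // [ 'walking', 'walked', 'walks' ]
console.log(stems); // [ 'walk', 'walk', 'walk' ]
```

With tokenization only, the three word forms are three unrelated features. After stemming they collapse into one, so fewer training examples are needed to cover the same meaning.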
6+
57
For Sentiment Analysis, there are three possible algorithms: AFINN, Senticon and Pattern.
68

7-
## Classification
9+
## Classification
810

911
| Language | Natural | Snowball | Custom |
1012
| :-------------- | :-----: | :------: | :----: |
