doc: Add documentation for the language support

Jesus Seijas · Jesus Seijas · commit 82f70ba65de8 · 2019-01-25T23:25:26.000+01:00
diff --git a/README.md b/README.md
@@ -20,8 +20,8 @@
 - Natural Language Processing Classifier, to classify utterance into intents.
 - Natural Language Generation Manager, so from intents and conditions it can generate an answer.
 - NLP Manager: a tool able to manage several languages, the Named Entities for each language, the utterance and intents for the training of the classifier, and for a given utterance return the entity extraction, the intent classification and the sentiment analysis. Also, it is able to maintain a Natural Language Generation Manager for the answers.
-- 27 languages supported: Arabic (ar), Armenian (hy), Basque (eu), Catala (ca), Chinese (zh), Czech (cs), Danish (da), Dutch (nl), English (en), Farsi (fa), Finnish (fi), French (fr), German (de), Hungarian (hu), Indonesian (id), Irish (ga), Italian (it), Japanese (ja), Norwegian (no), Portuguese (pt), Romanian (ro), Russian (ru), Slovene (sl), Spanish (es), Swedish (sv), Tamil (ta), Turkish (tr)
-
+- 27 languages with stemmers supported: Arabic (ar), Armenian (hy), Basque (eu), Catalan (ca), Chinese (zh), Czech (cs), Danish (da), Dutch (nl), English (en), Farsi (fa), Finnish (fi), French (fr), German (de), Hungarian (hu), Indonesian (id), Irish (ga), Italian (it), Japanese (ja), Norwegian (no), Portuguese (pt), Romanian (ro), Russian (ru), Slovene (sl), Spanish (es), Swedish (sv), Tamil (ta), Turkish (tr)
+- Any other language is supported through tokenization, even fantasy languages
 <div align="center">
 <img src="https://github.com/axa-group/nlp.js/raw/master/screenshots/hybridbot.gif" width="auto" height="auto"/>
 </div>
@@ -37,6 +37,7 @@
   - [Classification](docs/language-support.md#classification)
   - [Sentiment Analysis](docs/language-support.md#sentiment-analysis)
   - [Builtin Entity Extraction](docs/language-support.md#builtin-entity-extraction)
+  - [Example with languages](docs/example-with-languages.md)
 - [Language Guesser](docs/language-guesser.md)
 - [Similar Search](docs/similar-search.md)
 - [NLP Classifier](docs/nlp-classifier.md)
diff --git a/docs/benchmarking.md b/docs/benchmarking.md
@@ -28,11 +28,14 @@ We compute the `f1` score for each corpus and the overall `f1`:
 | Watson           | 0.97    | 0.92       | 0.83             | 0.92    |
 | Botfuel          | 0.98    | 0.90       | 0.80             | 0.91    |
 | Luis             | 0.98    | 0.90       | 0.81             | 0.91    |
+| NLP.js (no stem) | 1.00    | 0.92       | 0.73             | 0.91    |
 | Snips            | 0.96    | 0.83       | 0.78             | 0.89    |
 | Recast           | 0.99    | 0.86       | 0.75             | 0.89    |
 | RASA             | 0.98    | 0.86       | 0.74             | 0.88    |
 | API (DialogFlow) | 0.93    | 0.85       | 0.80             | 0.87    |
 
+You can see two entries for NLP.js: the better one uses a stemmer, while the other uses only the tokenizer and the artificial intelligence. The second entry is included because 27 languages are supported with stemmers, while any other language is supported using only the tokenizer. The result is still good enough: in English it sits in the middle of the table, ahead of other systems that use more advanced methods than tokenization.
+
 <div align="center">
 <img src="https://github.com/axa-group/nlp.js/raw/master/screenshots/benchmark.png" width="auto" height="auto"/>
 </div>
diff --git a/docs/example-with-languages.md b/docs/example-with-languages.md
@@ -0,0 +1,59 @@
+# Example with languages
+
+This example shows how to handle three different scenarios with languages:
+1. The language has a stemmer
+2. The language exists but has no stemmer
+3. The language does not exist (fantasy language)
+
+This example uses English, Korean and Klingon.
+
+```javascript
+const { NlpManager } = require('node-nlp');
+
+const manager = new NlpManager({ languages: ['en', 'ko', 'kl'] });
+// Give a name to the fantasy language
+manager.describeLanguage('kl', 'Klingon');
+// Train Klingon
+manager.addDocument('kl', 'nuqneH', 'hello');
+manager.addDocument('kl', 'maj po', 'hello');
+manager.addDocument('kl', 'maj choS', 'hello');
+manager.addDocument('kl', 'maj ram', 'hello');
+manager.addDocument('kl', `nuqDaq ghaH ngaQHa'moHwI'mey?`, 'keys');
+manager.addDocument('kl', `ngaQHa'moHwI'mey lujta' jIH`, 'keys');
+// Train Korean
+manager.addDocument('ko', '여보세요', 'greetings.hello');
+manager.addDocument('ko', '안녕하세요!', 'greetings.hello');
+manager.addDocument('ko', '여보!', 'greetings.hello');
+manager.addDocument('ko', '어이!', 'greetings.hello');
+manager.addDocument('ko', '좋은 아침', 'greetings.hello');
+manager.addDocument('ko', '안녕히 주무세요', 'greetings.hello');
+manager.addDocument('ko', '안녕', 'greetings.bye');
+manager.addDocument('ko', '친 공이 타자', 'greetings.bye');
+manager.addDocument('ko', '상대가 없어 남는 사람', 'greetings.bye');
+manager.addDocument('ko', '지엽적인 것', 'greetings.bye');
+manager.addDocument('en', 'goodbye for now', 'greetings.bye');
+manager.addDocument('en', 'bye bye take care', 'greetings.bye');
+manager.addDocument('en', 'okay see you later', 'greetings.bye');
+manager.addDocument('en', 'bye for now', 'greetings.bye');
+manager.addDocument('en', 'i must go', 'greetings.bye');
+manager.addDocument('en', 'hello', 'greetings.hello');
+manager.addDocument('en', 'hi', 'greetings.hello');
+manager.addDocument('en', 'howdy', 'greetings.hello');
+
+// Also train the NLG
+manager.addAnswer('en', 'greetings.bye', 'Till next time');
+manager.addAnswer('en', 'greetings.bye', 'see you soon!');
+manager.addAnswer('en', 'greetings.hello', 'Hey there!');
+manager.addAnswer('en', 'greetings.hello', 'Greetings!');
+
+// Train and save the model; await must run inside an async function in CommonJS.
+(async () => {
+  await manager.train();
+  manager.save();
+  // English and Korean can be automatically detected
+  manager.process('I have to go').then(console.log);
+  manager.process('상대가 없어 남는 편').then(console.log);
+  // Klingon cannot be automatically detected, so you must provide the locale
+  manager.process('kl', `ngaQHa'moHwI'mey nIH vay'`).then(console.log);
+})();
+```
diff --git a/docs/language-support.md b/docs/language-support.md
@@ -1,10 +1,12 @@
 # Language Support
 
-There are several languages supported. The language support can be for the Stemmers or for Sentiment Analysis.
+Any language is supported, even fantasy languages, but 27 languages have stemmer support. There is a difference between using a stemmer and using only tokenization, but with good training it is not large. You can take a look at [Benchmarking](benchmarking.md). For English, using SIGDIAL22 for comparison, the result is 94% with a stemmer and 91% with tokenization only, which is good enough.
+
 Inside Stemmers there are three types of stemmers: Natural, Snowball and Custom. Natural stemmers are those supported by the Natural library, while Snowball stemmers are ported versions of the Snowball ones from Java. Custom stemmers are those with custom development outside the scope of Natural or Snowball.
+
 Inside Sentiment Analysis, there are three possible algorithms: AFINN, Senticon and Pattern.
 
-## Classification
+## Classification 
 
 | Language        | Natural | Snowball | Custom |
 | :-------------- | :-----: | :------: | :----: |