Commit 40a8cf0

Merge pull request #164 from JohnSnowLabs/new_st_notebooks
Added 5 notebooks related to T5 tasks
2 parents fed403e + 09b58cd commit 40a8cf0

File tree

5 files changed: +1689 −0 lines changed
Lines changed: 327 additions & 0 deletions
{
 "cells": [
  {
   "source": [
    "\n",
    "\n",
    "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CONTEXTUAL_WORD_MEANING.ipynb)\n",
    "\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "# **Infer word meaning from context**"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "Compare the meaning of words in two different sentences and evaluate ambiguous pronouns."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 1. Colab Setup"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install Java\n",
    "!apt-get update -qq\n",
    "!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null\n",
    "!java -version\n",
    "\n",
    "# Install PySpark\n",
    "!pip install --ignore-installed -q pyspark==2.4.4\n",
    "\n",
    "# Install Spark NLP\n",
    "!pip install --ignore-installed spark-nlp\n",
    "\n",
    "# Set environment variables\n",
    "import os\n",
    "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
    "os.environ[\"PATH\"] = os.environ[\"JAVA_HOME\"] + \"/bin:\" + os.environ[\"PATH\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os\n",
    "import json\n",
    "from pyspark.ml import Pipeline\n",
    "from pyspark.sql import SparkSession\n",
    "import pyspark.sql.functions as F\n",
    "from sparknlp.annotator import *\n",
    "from sparknlp.base import *\n",
    "import sparknlp\n",
    "from sparknlp.pretrained import PretrainedPipeline"
   ]
  },
  {
   "source": [
    "## 2. Start Spark Session"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark = sparknlp.start()"
   ]
  },
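  {
   "source": [
    "As a quick sanity check (a small addition, not part of the original notebook), the versions in use can be printed; `sparknlp.version()` and `spark.version` are the standard Spark NLP and PySpark version attributes."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Added check: confirm the installed versions before downloading models.\n",
    "print('Spark NLP version:', sparknlp.version())\n",
    "print('Apache Spark version:', spark.version)"
   ]
  },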
  {
   "source": [
    "## 3. Select the model to use"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#MODEL_NAME = 't5_small'\n",
    "MODEL_NAME = 't5_base'"
   ]
  },
  {
   "source": [
    "### 3.1 Select the task"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "The `T5 Transformer` model can perform 18 different tasks (ref.: [this paper](https://arxiv.org/abs/1910.10683)). To infer word meaning from context, we can use the following tasks:\n",
    "\n",
    "- `wic`: Classify, given a pair of sentences and an ambiguous word, whether the word has the same meaning in both sentences.\n",
    "- `wsc-dpr`: Predict what an ambiguous pronoun in a sentence refers to."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "#TASK = 'wic'\n",
    "TASK = 'wsc-dpr'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prefix to be used in T5Transformer().setTask(<<prefix>>)\n",
    "task_prefix = {\n",
    "    'wic': 'wic pos::',\n",
    "    'wsc-dpr': 'wsc:',\n",
    "}"
   ]
  },
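  {
   "source": [
    "The prefix set via `setTask` is prepended to each input document before it reaches the model, so T5 effectively sees `<prefix> <text>`. A minimal sketch of that composition (illustration only; the actual concatenation happens inside `T5Transformer`):"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustration only: the string the model effectively receives.\n",
    "sample = 'The stable was very roomy ...'\n",
    "print(task_prefix[TASK] + ' ' + sample)"
   ]
  },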
  {
   "source": [
    "## 4. Examples to try on the model"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_lists = {\n",
    " 'wic': [\"\"\"\n",
    " pos:\n",
    " sentence1: The expanded window will give us time to catch the thieves.\n",
    " sentence2: You have a two-hour window of turning in your homework.\n",
    " word: window\n",
    " \"\"\"],\n",
    " 'wsc-dpr': [\"\"\"The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made *it* pleasant and airy.\"\"\"]\n",
    " }"
   ]
  },
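  {
   "source": [
    "To try the `wic` task on new sentence pairs, the input has to follow the format shown above. A hypothetical helper (an addition for illustration, not part of the original notebook) that assembles such a string:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper (not in the original notebook) that builds a\n",
    "# 'wic'-formatted input string from its parts, mirroring the example above.\n",
    "def make_wic_input(sentence1, sentence2, word):\n",
    "    return '\\npos:\\nsentence1: {}\\nsentence2: {}\\nword: {}\\n'.format(\n",
    "        sentence1, sentence2, word)\n",
    "\n",
    "print(make_wic_input(\n",
    "    'The expanded window will give us time to catch the thieves.',\n",
    "    'You have a two-hour window of turning in your homework.',\n",
    "    'window'))"
   ]
  },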
  {
   "source": [
    "## 5. Define the Spark NLP pipeline"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "t5_base download started this may take some time.\n",
      "Approximate size to download 446 MB\n",
      "[OK!]\n"
     ]
    }
   ],
   "source": [
    "document_assembler = DocumentAssembler()\\\n",
    "    .setInputCol(\"text\")\\\n",
    "    .setOutputCol(\"documents\")\n",
    "\n",
    "t5 = T5Transformer() \\\n",
    "    .pretrained(MODEL_NAME) \\\n",
    "    .setTask(task_prefix[TASK])\\\n",
    "    .setMaxOutputLength(200)\\\n",
    "    .setInputCols([\"documents\"]) \\\n",
    "    .setOutputCol(\"T5\")\n",
    "\n",
    "pipeline = Pipeline(stages=[document_assembler, t5])"
   ]
  },
  {
   "source": [
    "## 6. Run the pipeline"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit on an empty data frame (the model is pretrained)\n",
    "empty_df = spark.createDataFrame([['']]).toDF('text')\n",
    "pipeline_model = pipeline.fit(empty_df)\n",
    "\n",
    "# Load the example texts into a Spark data frame\n",
    "text_df = spark.createDataFrame(pd.DataFrame({'text': text_lists[TASK]}))\n",
    "\n",
    "# Predict with the Pipeline model\n",
    "result = pipeline_model.transform(text_df)\n",
    "\n",
    "# Create a Light Pipeline\n",
    "lmodel = LightPipeline(pipeline_model)\n",
    "\n",
    "# Predict with the Light Pipeline model\n",
    "res = lmodel.fullAnnotate(text_lists[TASK])"
   ]
  },
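  {
   "source": [
    "`fullAnnotate` returns complete `Annotation` objects. If only the output strings are needed, `LightPipeline.annotate` (also part of the Spark NLP API) returns plain dictionaries instead; a minimal sketch:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# annotate() maps each output column to a list of result strings.\n",
    "light_res = lmodel.annotate(text_lists[TASK])\n",
    "print(light_res[0]['T5'])"
   ]
  },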
  {
   "source": [
    "## 7. Visualize the results"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "Using the Light Pipeline:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made *it* pleasant and airy. => True\n"
     ]
    }
   ],
   "source": [
    "for r in res:\n",
    "    print(f\"{r['documents'][0].result} => {r['T5'][0].result}\")"
   ]
  },
  {
   "source": [
    "Using the pipeline model:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "+-----------------------------------------------------------------------------------------------------------------------------------+------+\n| text|result|\n+-----------------------------------------------------------------------------------------------------------------------------------+------+\n|The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made *it* pleasant and airy.|[True]|\n+-----------------------------------------------------------------------------------------------------------------------------------+------+\n\n"
     ]
    }
   ],
   "source": [
    "result.select('text', 'T5.result').show(truncate=150)"
   ]
  },
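  {
   "source": [
    "Since `T5.result` is an array column, it can also be flattened with `pyspark.sql.functions` (imported above as `F`); a small addition for illustration:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Flatten the array column so each prediction gets its own row.\n",
    "result.select('text', F.explode('T5.result').alias('prediction')).show(truncate=80)"
   ]
  },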
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5-final"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
