Commit 40a8cf0

Merge pull request #164 from JohnSnowLabs/new_st_notebooks
Added 5 notebooks related to T5 tasks
2 parents fed403e + 09b58cd commit 40a8cf0

File tree

5 files changed: +1689 −0 lines changed
Lines changed: 327 additions & 0 deletions
{
 "cells": [
  {
   "source": [
    "\n",
    "\n",
    "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CONTEXTUAL_WORD_MEANING.ipynb)\n",
    "\n"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "# **Infer word meaning from context**"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "Compare the meaning of words in two different sentences and evaluate ambiguous pronouns."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 1. Colab Setup"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install Java\n",
    "!apt-get update -qq\n",
    "!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null\n",
    "!java -version\n",
    "\n",
    "# Install PySpark\n",
    "!pip install --ignore-installed -q pyspark==2.4.4\n",
    "\n",
    "# Install Spark NLP\n",
    "!pip install --ignore-installed spark-nlp\n",
    "\n",
    "# Set environment variables\n",
    "import os\n",
    "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
    "os.environ[\"PATH\"] = os.environ[\"JAVA_HOME\"] + \"/bin:\" + os.environ[\"PATH\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os\n",
    "import json\n",
    "from pyspark.ml import Pipeline\n",
    "from pyspark.sql import SparkSession\n",
    "import pyspark.sql.functions as F\n",
    "from sparknlp.annotator import *\n",
    "from sparknlp.base import *\n",
    "import sparknlp\n",
    "from sparknlp.pretrained import PretrainedPipeline"
   ]
  },
  {
   "source": [
    "## 2. Start Spark Session"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark = sparknlp.start()"
   ]
  },
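  {
   "source": [
    "As a quick sanity check (a small addition, not part of the original notebook), the versions in use can be printed; `sparknlp.version()` and `spark.version` are the standard Spark NLP and PySpark version attributes."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Added check: confirm the installed versions before downloading models.\n",
    "print('Spark NLP version:', sparknlp.version())\n",
    "print('Apache Spark version:', spark.version)"
   ]
  },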
  {
   "source": [
    "## 3. Select the model to use"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#MODEL_NAME = 't5_small'\n",
    "MODEL_NAME = 't5_base'"
   ]
  },
  {
   "source": [
    "### 3.1 Select the task"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "The `T5 Transformer` model can perform 18 different tasks (ref.: [this paper](https://arxiv.org/abs/1910.10683)). To infer word meaning from context, we can use the following tasks:\n",
    "\n",
    "- `wic`: Classify, given a pair of sentences and an ambiguous word, whether the word has the same meaning in both sentences.\n",
    "- `wsc-dpr`: Predict what an ambiguous pronoun in a sentence refers to."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "#TASK = 'wic'\n",
    "TASK = 'wsc-dpr'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prefix to be used in T5Transformer().setTask(<<prefix>>)\n",
    "task_prefix = {\n",
    "    'wic': 'wic pos::',\n",
    "    'wsc-dpr': 'wsc:',\n",
    "}"
   ]
  },
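  {
   "source": [
    "The prefix set via `setTask` is prepended to each input document before it reaches the model, so T5 effectively sees `<prefix> <text>`. A minimal sketch of that composition (illustration only; the actual concatenation happens inside `T5Transformer`):"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustration only: the string the model effectively receives.\n",
    "sample = 'The stable was very roomy ...'\n",
    "print(task_prefix[TASK] + ' ' + sample)"
   ]
  },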
  {
   "source": [
    "## 4. Examples to try on the model"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_lists = {\n",
    " 'wic': [\"\"\"\n",
    " pos:\n",
    " sentence1: The expanded window will give us time to catch the thieves.\n",
    " sentence2: You have a two-hour window of turning in your homework.\n",
    " word: window\n",
    " \"\"\"],\n",
    " 'wsc-dpr': [\"\"\"The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made *it* pleasant and airy.\"\"\"]\n",
    " }"
   ]
  },
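  {
   "source": [
    "To try the `wic` task on new sentence pairs, the input has to follow the format shown above. A hypothetical helper (an addition for illustration, not part of the original notebook) that assembles such a string:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper (not in the original notebook) that builds a\n",
    "# 'wic'-formatted input string from its parts, mirroring the example above.\n",
    "def make_wic_input(sentence1, sentence2, word):\n",
    "    return '\\npos:\\nsentence1: {}\\nsentence2: {}\\nword: {}\\n'.format(\n",
    "        sentence1, sentence2, word)\n",
    "\n",
    "print(make_wic_input(\n",
    "    'The expanded window will give us time to catch the thieves.',\n",
    "    'You have a two-hour window of turning in your homework.',\n",
    "    'window'))"
   ]
  },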
  {
   "source": [
    "## 5. Define the Spark NLP pipeline"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "t5_base download started this may take some time.\n",
      "Approximate size to download 446 MB\n",
      "[OK!]\n"
     ]
    }
   ],
   "source": [
    "document_assembler = DocumentAssembler()\\\n",
    "    .setInputCol(\"text\")\\\n",
    "    .setOutputCol(\"documents\")\n",
    "\n",
    "t5 = T5Transformer() \\\n",
    "    .pretrained(MODEL_NAME) \\\n",
    "    .setTask(task_prefix[TASK])\\\n",
    "    .setMaxOutputLength(200)\\\n",
    "    .setInputCols([\"documents\"]) \\\n",
    "    .setOutputCol(\"T5\")\n",
    "\n",
    "pipeline = Pipeline(stages=[document_assembler, t5])"
   ]
  },
  {
   "source": [
    "## 6. Run the pipeline"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit on an empty data frame (the model is pretrained)\n",
    "empty_df = spark.createDataFrame([['']]).toDF('text')\n",
    "pipeline_model = pipeline.fit(empty_df)\n",
    "\n",
    "# Load the example texts into a Spark data frame\n",
    "text_df = spark.createDataFrame(pd.DataFrame({'text': text_lists[TASK]}))\n",
    "\n",
    "# Predict with the Pipeline model\n",
    "result = pipeline_model.transform(text_df)\n",
    "\n",
    "# Create a Light Pipeline\n",
    "lmodel = LightPipeline(pipeline_model)\n",
    "\n",
    "# Predict with the Light Pipeline model\n",
    "res = lmodel.fullAnnotate(text_lists[TASK])"
   ]
  },
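  {
   "source": [
    "`fullAnnotate` returns complete `Annotation` objects. If only the output strings are needed, `LightPipeline.annotate` (also part of the Spark NLP API) returns plain dictionaries instead; a minimal sketch:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# annotate() maps each output column to a list of result strings.\n",
    "light_res = lmodel.annotate(text_lists[TASK])\n",
    "print(light_res[0]['T5'])"
   ]
  },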
  {
   "source": [
    "## 7. Visualize the results"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "Using the Light Pipeline:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made *it* pleasant and airy. => True\n"
     ]
    }
   ],
   "source": [
    "for r in res:\n",
    "    print(f\"{r['documents'][0].result} => {r['T5'][0].result}\")"
   ]
  },
  {
   "source": [
    "Using the pipeline model:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "+-----------------------------------------------------------------------------------------------------------------------------------+------+\n| text|result|\n+-----------------------------------------------------------------------------------------------------------------------------------+------+\n|The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made *it* pleasant and airy.|[True]|\n+-----------------------------------------------------------------------------------------------------------------------------------+------+\n\n"
     ]
    }
   ],
   "source": [
    "result.select('text', 'T5.result').show(truncate=150)"
   ]
  },
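  {
   "source": [
    "Since `T5.result` is an array column, it can also be flattened with `pyspark.sql.functions` (imported above as `F`); a small addition for illustration:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Flatten the array column so each prediction gets its own row.\n",
    "result.select('text', F.explode('T5.result').alias('prediction')).show(truncate=80)"
   ]
  },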
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5-final"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
