
Commit e0a1237

Fix image display and clean up in ethics/NLP tutorial (#114)
1 parent 7c7e97a commit e0a1237

1 file changed: content/tutorial-nlp-from-scratch.md (+53 −52)
@@ -6,7 +6,7 @@ jupyter:
       extension: .md
       format_name: markdown
       format_version: '1.3'
-      jupytext_version: 1.11.4
+      jupytext_version: 1.11.5
   kernelspec:
     display_name: Python 3 (ipykernel)
     language: python
@@ -23,14 +23,14 @@ Your deep learning model (the LSTM) is a form of a Recurrent Neural Network and
 Today, Deep Learning is getting adopted in everyday life and now it is more important to ensure that decisions that have been taken using AI are not reflecting discriminatory behavior towards a set of populations. It is important to take fairness into consideration while consuming the output from AI. Throughout the tutorial we'll try to question all the steps in our pipeline from an ethics point of view.
 
 
-## Prerequisites
+## Prerequisites
 
 You are expected to be familiar with the Python programming language and array manipulation with NumPy. In addition, some understanding of Linear Algebra and Calculus is recommended. You should also be familiar with how Neural Networks work. For reference, you can visit the [Python](https://docs.python.org/dev/tutorial/index.html), [Linear algebra on n-dimensional arrays](https://numpy.org/doc/stable/user/tutorial-svd.html) and [Calculus](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/multivariable-calculus.html) tutorials.
 
 To get a refresher on Deep Learning basics, You should consider reading [the d2l.ai book](https://d2l.ai/chapter_recurrent-neural-networks/index.html), which is an interactive deep learning book with multi-framework code, math, and discussions. You can also go through the [Deep learning on MNIST from scratch tutorial](https://numpy.org/numpy-tutorials/content/tutorial-deep-learning-on-mnist.html) to understand how a basic neural network is implemented from scratch.
 
 In addition to NumPy, you will be utilizing the following Python standard modules for data loading and processing:
-- [`pandas`](https://pandas.pydata.org/docs/) for handling dataframes
+- [`pandas`](https://pandas.pydata.org/docs/) for handling dataframes
 - [`Matplotlib`](https://matplotlib.org/) for data visualization
 - [`pooch`](https://www.fatiando.org/pooch/latest/) to download and cache datasets
 
@@ -39,13 +39,13 @@ This tutorial can be run locally in an isolated environment, such as [Virtualenv
 
 ## Table of contents
 
-1. Data Collection
+1. Data Collection
 
 2. Preprocess the datasets
 
 3. Build and train a LSTM network from scratch
 
-4. Perform sentiment analysis on collected speeches
+4. Perform sentiment analysis on collected speeches
 
 5. Next steps
 
@@ -105,24 +105,26 @@ We made sure to include different demographics in our data and included a range
 >The GloVe word embeddings include sets that were trained on billions of tokens, some up to 840 billion tokens. These algorithms exhibit stereotypical biases, such as gender bias which can be traced back to the original training data. For example certain occupations seem to be more biased towards a particular gender, reinforcing problematic stereotypes. The nearest solution to this problem are some de-biasing algorithms as the one presented in https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/reports/6835575.pdf which one can use on embeddings of their choice to mitigate bias, if present.
 <!-- #endregion -->
 
-You'll start with importing the necessary packages to build our Deep Learning network
+You'll start with importing the necessary packages to build our Deep Learning network.
 
-```python tags=[]
-# Importing the necessary packages
-import numpy as np
-import pandas as pd
-import matplotlib.pyplot as plt
+```python
+# Importing the necessary packages
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
 import pooch
 import string
-import re
-import zipfile
+import re
+import zipfile
 import os
 ```
 
-```python tags=["hide-input"]
+Next, you'll define a set of text preprocessing helper functions.
+
+```python
 class TextPreprocess:
     """Text Preprocessing for a Natural Language Processing model."""
-
+
     def txt_to_df(self, file):
         """Function to convert a txt file to pandas dataframe.
 
@@ -133,7 +135,7 @@ class TextPreprocess:
 
         Returns
         -------
-        Pandas dataframe
+        Pandas dataframe
             txt file converted to a dataframe.
 
         """
@@ -145,22 +147,22 @@ class TextPreprocess:
                 reviews[lines[1]] = float(lines[0])
         df = pd.DataFrame(reviews.items(), columns=['review', 'sentiment'])
         df = df.sample(frac=1).reset_index(drop=True)
-        return df
-
+        return df
+
     def unzipper(self, zipped, to_extract):
         """Function to extract a file from a zipped folder.
 
         Parameters
         ----------
         zipped : str
             Path to the zipped folder.
-
+
         to_extract: str
             Path to the file to be extracted from the zipped folder
 
         Returns
         -------
-        str
+        str
             Path to the extracted file.
 
         """
@@ -266,7 +268,7 @@ class TextPreprocess:
 
         Returns
         -------
-        list
+        list
             sentences with punctuation removed.
 
         """
@@ -299,7 +301,7 @@ class TextPreprocess:
 
         Returns
         -------
-        Dict
+        Dict
             mapping from word to corresponding word embedding.
 
         """
@@ -328,7 +330,7 @@ class TextPreprocess:
 
         Returns
         -------
-        list
+        list
             paragraphs of specified length.
 
         """
@@ -350,14 +352,14 @@ class TextPreprocess:
 
 ```python
 data = pooch.create(
-    # folder where the data will be stored in the
-    # default cache folder of your Operating System
+    # folder where the data will be stored in the
+    # default cache folder of your Operating System
     path=pooch.os_cache("numpy-nlp-tutorial"),
     # Base URL of the remote data store
     base_url="",
     # The cache file registry. A dictionary with all files managed by this pooch.
     # The keys are the file names and values are their respective hash codes which
-    # ensure we download the same, uncorrupted file each time.
+    # ensure we download the same, uncorrupted file each time.
     registry={
         "imdb_train.txt": "6a38ea6ab5e1902cc03f6b9294ceea5e8ab985af991f35bcabd301a08ea5b3f0",
         "imdb_test.txt": "7363ef08ad996bf4233b115008d6d7f9814b7cc0f4d13ab570b938701eadefeb",
@@ -444,12 +446,11 @@ Unlike an MLP, the RNN was designed to work with sequence prediction problems.RN
 The problem with an RNN however, is that it cannot retain long-term memory because the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network’s recurrent connections. This shortcoming is referred to as the vanishing gradient problem. Long Short-Term Memory (LSTM) is an RNN architecture specifically designed to address the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem).
 
 
-### Overview of the Model Architecture
+### Overview of the Model Architecture
 
-<img src="_static/lstm.gif" width="900" align="center">
+![Overview of the model architecture, showing a series of animated boxes. There are five identical boxes labeled A and receiving as input one of the words in the phrase "life's a box of chocolates". Each box is highlighted in turn, representing the memory blocks of the LSTM network as information passes through them, ultimately reaching a "Positive" output value.](_static/lstm.gif)
 
-
-In the above gif, The rectangles labeled $A$ are called `Cells` and they are the **Memory Blocks** of our LSTM network. They are responsible for choosing what to remember in a sequence and pass on that information to the next cell via two states called the `hidden state` $H_{t}$ and the `cell state` $C_{t}$ where $t$ indicates the time-step. Each `Cell` has dedicated gates which are responsible for storing, writing or reading the information passed to an LSTM. You will now look closely at the architecture of the network by implementing each mechanism happening inside of it.
+In the above gif, the rectangles labeled $A$ are called `Cells` and they are the **Memory Blocks** of our LSTM network. They are responsible for choosing what to remember in a sequence and pass on that information to the next cell via two states called the `hidden state` $H_{t}$ and the `cell state` $C_{t}$ where $t$ indicates the time-step. Each `Cell` has dedicated gates which are responsible for storing, writing or reading the information passed to an LSTM. You will now look closely at the architecture of the network by implementing each mechanism happening inside of it.
 
 
 Lets start with writing a function to randomly initialize the parameters which will be learned while our model trains
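
The initialization function referenced in the last context line above is untouched by this commit, so it never appears in the diff. As rough orientation, a random initialization consistent with the parameter names used later in the hunks (`Wf`, `bf`, `Wi`, `Wcm`, `Wo`, `W2`, ...) might look like the sketch below; the name `initialise_params` and its argument names are assumptions for illustration, not the tutorial's exact code.

```python
import numpy as np

def initialise_params(hidden_dim, input_dim):
    # Each gate receives the previous hidden state concatenated with the
    # current word embedding, hence (hidden_dim + input_dim) columns.
    rng = np.random.default_rng()
    concat_dim = hidden_dim + input_dim
    parameters = {}
    for gate in ("f", "i", "cm", "o"):  # forget, input, candidate memory, output
        parameters["W" + gate] = rng.standard_normal((hidden_dim, concat_dim)) * 0.01
        parameters["b" + gate] = np.zeros((hidden_dim, 1))
    # Fully connected layer mapping the last hidden state to a single score
    parameters["W2"] = rng.standard_normal((1, hidden_dim)) * 0.01
    parameters["b2"] = np.zeros((1, 1))
    return parameters
```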
@@ -513,7 +514,7 @@ The **Forget Gate** takes the current word embedding and the previous hidden sta
 def fp_forget_gate(concat, parameters):
     ft = sigmoid(np.dot(parameters['Wf'], concat)
                  + parameters['bf'])
-    return ft
+    return ft
 ```
 
 The **Input Gate** takes the current word embedding and the previous hidden state concatenated together as input, and governs how much of the new data we take into account via the **Candidate Memory Gate** which utilizes the [Tanh](https://d2l.ai/chapter_multilayer-perceptrons/mlp.html?highlight=tanh#tanh-function) to regulate the values flowing through the network.
@@ -524,7 +525,7 @@ def fp_input_gate(concat, parameters):
                  + parameters['bi'])
     cmt = np.tanh(np.dot(parameters['Wcm'], concat)
                   + parameters['bcm'])
-    return it, cmt
+    return it, cmt
 ```
 
 Finally we have the **Output Gate** which takes information from the current word embedding, previous hidden state and the cell state which has been updated with information from the forget and input gates to update the value of the hidden state.
@@ -540,18 +541,18 @@ def fp_output_gate(concat, next_cs, parameters):
 The following image summarizes each gate mechanism in the memory block of a LSTM network:
 >Image has been modified from [this](https://link.springer.com/chapter/10.1007%2F978-3-030-14524-8_11) source
 
-<img src="_static/mem_block.png" width="800" align="center">
-
+![Diagram showing three sections of a memory block, labeled "Forget gate", "Input gate" and "Output gate". Each gate contains several subparts, representing the operations performed at that stage of the process.](_static/mem_block.png)
 
 ### But how do you obtain sentiment from the LSTM's output?
+
 The hidden state you obtain from the output gate of the last memory block in a sequence is considered to be a representation of all the information contained in a sequence. To classify this information into various classes (2 in our case, positive and negative) we use a **Fully Connected layer** which firstly maps this information to a predefined output size (1 in our case). Then, an activation function such as the sigmoid converts this output to a value between 0 and 1. We'll consider values greater than 0.5 to be indicative of a positive sentiment.
 
 ```python
 def fp_fc_layer(last_hs, parameters):
     z2 = (np.dot(parameters['W2'], last_hs)
           + parameters['b2'])
     a2 = sigmoid(z2)
-    return a2
+    return a2
 ```
 
 Now you will put all these functions together to summarize the **Forward Propagation** step in our model architecture:
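
The gate and fully connected layer functions in these hunks all call a `sigmoid` helper that is defined elsewhere in the tutorial and is untouched by this commit, so it never appears in the diff. For context, a minimal NumPy version of such a helper might be sketched as below (an assumption, not the committed code); the last two lines illustrate the 0.5 threshold for positive sentiment described above.

```python
import numpy as np

def sigmoid(x):
    # Logistic function computed in a numerically stable way: use
    # 1 / (1 + exp(-x)) for non-negative inputs and the equivalent
    # exp(x) / (1 + exp(x)) for negative inputs to avoid overflow.
    out = np.empty_like(x, dtype=float)
    positive = x >= 0
    out[positive] = 1.0 / (1.0 + np.exp(-x[positive]))
    out[~positive] = np.exp(x[~positive]) / (1.0 + np.exp(x[~positive]))
    return out

# A fully connected output above 0.5 is read as positive sentiment.
a2 = sigmoid(np.array([[0.3]]))
prediction = "positive" if a2.item() > 0.5 else "negative"
```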
@@ -579,22 +580,22 @@ def forward_prop(X_vec, parameters, input_dim):
 
         # Input to the gates is concatenated previous hidden state and current word embedding
         concat = np.vstack((prev_hs, xt))
-
+
         # Calculate output of the forget gate
         ft = fp_forget_gate(concat, parameters)
 
         # Calculate output of the input gate
         it, cmt = fp_input_gate(concat, parameters)
-        io = it * cmt
-
-        # Update the cell state
+        io = it * cmt
+
+        # Update the cell state
         next_cs = (ft * prev_cs) + io
-
+
         # Calculate output of the output gate
         ot, next_hs = fp_output_gate(concat, next_cs, parameters)
 
         # store all the values used and calculated by
-        # the LSTM in a cache for backward propagation.
+        # the LSTM in a cache for backward propagation.
         lstm_cache = {
             "next_hs": next_hs,
             "next_cs": next_cs,
@@ -612,12 +613,12 @@ def forward_prop(X_vec, parameters, input_dim):
         prev_hs = next_hs
         prev_cs = next_cs
 
-    # Pass the LSTM output through a fully connected layer to
-    # obtain probability of the sequence being positive
+    # Pass the LSTM output through a fully connected layer to
+    # obtain probability of the sequence being positive
     a2 = fp_fc_layer(next_hs, parameters)
 
     # store all the values used and calculated by the
-    # fully connected layer in a cache for backward propagation.
+    # fully connected layer in a cache for backward propagation.
     fc_cache = {
         "a2" : a2,
         "W2" : parameters['W2']
@@ -642,7 +643,7 @@ def initialize_grads(parameters):
     return grads
 ```
 
-Now, for each gate and the fully connected layer, we define a function to calculate the gradient of the loss with respect to the input passed and the parameters used. To understand the mathematics behind how the derivatives were calculated we suggest you to follow this helpful [blog](https://christinakouridi.blog/2019/06/19/backpropagation-lstm/) by Christina Kouridi
+Now, for each gate and the fully connected layer, we define a function to calculate the gradient of the loss with respect to the input passed and the parameters used. To understand the mathematics behind how the derivatives were calculated we suggest you to follow this helpful [blog](https://christinakouridi.blog/2019/06/19/backpropagation-lstm/) by Christina Kouridi.
 
 
 Define a function to calculate the gradients in the **Forget Gate**:
@@ -659,7 +660,7 @@ def bp_forget_gate(hidden_dim, concat, dh_prev, dc_prev, cache, gradients, param
     gradients['dbf'] += np.sum(dft, axis=1, keepdims=True)
     # dh_f = dft * dft/dh_prev
     dh_f = np.dot(parameters["Wf"][:, :hidden_dim].T, dft)
-    return dh_f, gradients
+    return dh_f, gradients
 ```
 
 Define a function to calculate the gradients in the **Input Gate** and **Candidate Memory Gate**:
@@ -686,7 +687,7 @@ def bp_input_gate(hidden_dim, concat, dh_prev, dc_prev, cache, gradients, parame
     dh_i = np.dot(parameters["Wi"][:, :hidden_dim].T, dit)
     # dhcm = dcmt * dcmt/dh_prev
     dh_cm = np.dot(parameters["Wcm"][:, :hidden_dim].T, dcmt)
-    return dh_i, dh_cm, gradients
+    return dh_i, dh_cm, gradients
 ```
 
 Define a function to calculate the gradients for the **Output Gate**:
@@ -702,7 +703,7 @@ def bp_output_gate(hidden_dim, concat, dh_prev, dc_prev, cache, gradients, param
     gradients['dbo'] += np.sum(dot, axis=1, keepdims=True)
     # dho = dot * dot/dho
     dh_o = np.dot(parameters["Wo"][:, :hidden_dim].T, dot)
-    return dh_o, gradients
+    return dh_o, gradients
 ```
 
 Define a function to calculate the gradients for the **Fully Connected Layer**:
@@ -721,14 +722,14 @@ def bp_fc_layer (target, caches, gradients):
     # dh_last = dZ2 * W2
     W2 = caches['fc_values'][0]["W2"]
     dh_last = np.dot(W2.T, dZ2)
-    return dh_last, gradients
+    return dh_last, gradients
 ```
 
 Put all these functions together to summarize the **Backpropagation** step for our model:
 
 ```python
 def backprop(y, caches, hidden_dim, input_dim, time_steps, parameters):
-
+
     # Initialize gradients
     gradients = initialize_grads(parameters)
 
@@ -742,7 +743,7 @@ def backprop(y, caches, hidden_dim, input_dim, time_steps, parameters):
     # loop back over the whole sequence
     for t in reversed(range(time_steps)):
         cache = caches['lstm_values'][t]
-
+
         # Input to the gates is concatenated previous hidden state and current word embedding
         concat = np.concatenate((cache["prev_hs"], cache["xt"]), axis=0)
 
@@ -765,7 +766,7 @@ def backprop(y, caches, hidden_dim, input_dim, time_steps, parameters):
     return gradients
 ```
 
-### Updating the Parameters
+### Updating the Parameters
 
 We update the parameters through an optimization algorithm called [Adam](https://optimization.cbe.cornell.edu/index.php?title=Adam) which is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing. Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters `beta1` and `beta2` control the decay rates of these moving averages. Adam has shown increased convergence and robustness over other gradient descent algorithms and is often recommended as the default optimizer for training.
 
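The tutorial's own Adam implementation follows this paragraph in the file but falls outside the hunks of this commit. To make the description concrete, a single Adam step over the gradient dictionary produced by `backprop` could be sketched roughly as below; the names `update_parameters`, `v`, and `s` (the two moving-average dictionaries) are illustrative assumptions rather than the committed code.

```python
import numpy as np

def update_parameters(parameters, gradients, v, s, t,
                      learning_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam step: v holds the exponential moving average of the gradients,
    # s holds the moving average of the squared gradients, and both are
    # bias-corrected with the step count t before updating each parameter.
    for key in parameters:
        grad = gradients["d" + key]
        v[key] = beta1 * v[key] + (1 - beta1) * grad
        s[key] = beta2 * s[key] + (1 - beta2) * grad ** 2
        v_hat = v[key] / (1 - beta1 ** t)
        s_hat = s[key] / (1 - beta2 ** t)
        parameters[key] = parameters[key] - learning_rate * v_hat / (np.sqrt(s_hat) + eps)
    return parameters, v, s
```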